A Tour of Python NLP Libraries


Image generated with DALL·E 3

 

NLP, or Natural Language Processing, is a field within Artificial Intelligence that focuses on the interaction between human language and computers. It explores and applies text data so that computers can understand text in a meaningful way.

As research in the NLP field progresses, the way we process text data with computers has evolved. In modern times, we use Python to help explore and process that data easily.

With Python becoming the go-to language for exploring text data, many libraries have been developed specifically for the NLP field. In this article, we will explore several incredible and useful NLP libraries.

So, let’s get into it.
 

NLTK

 
NLTK, or Natural Language Toolkit, is an NLP Python library with many text-processing APIs and industrial-grade wrappers. It's one of the biggest NLP Python libraries, used by researchers, data scientists, engineers, and others, and it is a standard choice for NLP tasks.

Let's explore what NLTK can do. First, we need to install the library with the following command.

pip install -U nltk

Next, let's see what NLTK can do. First, NLTK can perform tokenization using the following code:

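import nltk
from nltk.tokenize import word_tokenize

# Download the required resources
# (if a LookupError names a different resource, such as 'punkt_tab'
# in some newer NLTK releases, download that one instead)
nltk.download('punkt')

text = "The fruit in the table is a banana"
tokens = word_tokenize(text)

print(tokens)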

 

Output>> 
['The', 'fruit', 'in', 'the', 'table', 'is', 'a', 'banana']

 

Tokenization splits a sentence into individual tokens, here one per word.
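As a rough illustration of what tokenization does, here is a minimal sketch that uses only Python's standard library. It is not NLTK's algorithm: the `simple_tokenize` name and the regular expression are assumptions made for this example, and `word_tokenize` handles many cases (contractions, abbreviations) that this pattern does not.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("The fruit in the table is a banana.")
print(tokens)
```

Each word becomes its own list item, and the final period is split off as a separate token.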

With NLTK, we can also perform Part-of-Speech (POS) tagging on the text sample.

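from nltk.tag import pos_tag

# Download the tagger model (newer NLTK releases may name it differently)
nltk.download('averaged_perceptron_tagger')

text = "The fruit in the table is a banana"
# tokens comes from the tokenization snippet above
pos_tags = pos_tag(tokens)

print(pos_tags)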

 

Output>>
[('The', 'DT'), ('fruit', 'NN'), ('in', 'IN'), ('the', 'DT'), ('table', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('banana', 'NN')]

 

The output of the NLTK POS tagger is each token paired with its predicted POS tag. For example, the word 'fruit' is a noun (NN), and the word 'a' is a determiner (DT).
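Once the text is tagged, the (token, tag) pairs are ordinary Python tuples, so they are easy to filter. The sketch below copies the tagged list from the output above and keeps only the nouns. The `nouns` variable is just a name chosen for this example.

```python
# Tagged tokens taken from the POS tagging output above
pos_tags = [('The', 'DT'), ('fruit', 'NN'), ('in', 'IN'), ('the', 'DT'),
            ('table', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('banana', 'NN')]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in pos_tags if tag.startswith("NN")]
print(nouns)
```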

It's also possible to perform Stemming and Lemmatization with NLTK. Stemming reduces a word to its base form by stripping affixes (typically suffixes), while Lemmatization also transforms a word to its base form but does so by considering the word's POS and morphological analysis.

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')
nltk.download('punkt')

text = "The striped bats are hanging on their feet for best"
tokens = word_tokenize(text)

# Stemming
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]
print("Stems:", stems)

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(token) for token in tokens]
print("Lemmas:", lemmas)

 

Output>> 
Stems: ['the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'feet', 'for', 'best']
Lemmas: ['The', 'striped', 'bat', 'are', 'hanging', 'on', 'their', 'foot', 'for', 'best']

 

You can see that the stemming and lemmatization processes produce slightly different results for the same words.
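To make the difference concrete, here is a toy sketch of the two ideas. It is not how `PorterStemmer` or `WordNetLemmatizer` work internally: the suffix list, the lookup table, and the function names are all hypothetical and only illustrate "rule-based stripping" versus "dictionary lookup".

```python
# Hypothetical suffix list and lookup table, for illustration only
SUFFIXES = ("ing", "ed", "s")
LEMMA_TABLE = {"feet": "foot", "bats": "bat"}

def toy_stem(word):
    # Rule-based: strip the first matching suffix, no dictionary involved
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Dictionary-based: look the word up, fall back to the word itself
    return LEMMA_TABLE.get(word, word)

for word in ["bats", "hanging", "feet"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer cannot turn 'feet' into 'foot' because no suffix rule applies, while the lookup can. In the NLTK output above, 'hanging' stays unchanged after lemmatization because `lemmatize` treats every word as a noun unless a `pos` argument is passed.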

That's the basic usage of NLTK. You can still do many more things with it, but the APIs above are the most commonly used.
 

SpaCy

 
SpaCy is an NLP Python library designed specifically for production use. It's an advanced library known for its performance and its ability to handle large amounts of text data, which makes it a preferred choice for industry use in many NLP cases.

To install SpaCy, you can look at their usage page. Depending on your requirements, there are many combinations to choose from.

Let's try using SpaCy for NLP tasks. First, we'll perform Named Entity Recognition (NER) with the library. NER is the process of identifying named entities in text and classifying them into predefined categories, such as person, address, location, and more.

import spacy

nlp = spacy.load("en_core_web_sm")

text = "Brad is working in the U.K. Startup called AIForLife for 7 Months."
doc = nlp(text)

# Perform the NER
for ent in doc.ents:
    print(ent.text, ent.label_)

 

Output>>
Brad PERSON
the U.K. Startup ORG
7 Months DATE

 

As you can see, the SpaCy pre-trained model recognizes which spans within the document can be labeled as entities.

Next, we can use SpaCy to perform Dependency Parsing and visualize the result. Dependency Parsing is the process of understanding how each word relates to the others by forming a tree structure.

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

# Same sentence as the NER example, so the output below matches
text = "Brad is working in the U.K. Startup called AIForLife for 7 Months."
doc = nlp(text)

# Print each token with its dependency label and its head token
for token in doc:
    print(f"{token.text}: {token.dep_}, {token.head.text}")

displacy.render(doc, jupyter=True)

 

Output>> 
Brad: nsubj, working
is: aux, working
working: ROOT, working
in: prep, working
the: det, Startup
U.K.: compound, Startup
Startup: pobj, in
called: advcl, working
AIForLife: oprd, called
for: prep, called
7: nummod, Months
Months: pobj, for
.: punct, working

 

The output lists every word with its dependency label and the head word it is attached to. The code above also shows a tree visualization in your Jupyter Notebook.
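If you are not working in a notebook, the same tree can be rebuilt from the printed triples. The sketch below is only an illustration: it takes the (token, label, head) rows from the output above and matches heads by their text, which works here because every word in the sentence is unique. With real SpaCy tokens you would rather rely on token positions than on text.

```python
from collections import defaultdict

# (token, dependency label, head) triples taken from the output above
parsed = [
    ("Brad", "nsubj", "working"), ("is", "aux", "working"),
    ("working", "ROOT", "working"), ("in", "prep", "working"),
    ("the", "det", "Startup"), ("U.K.", "compound", "Startup"),
    ("Startup", "pobj", "in"), ("called", "advcl", "working"),
    ("AIForLife", "oprd", "called"), ("for", "prep", "called"),
    ("7", "nummod", "Months"), ("Months", "pobj", "for"),
    (".", "punct", "working"),
]

children = defaultdict(list)
root = None
for token, dep, head in parsed:
    if dep == "ROOT":
        root = token  # the root is listed as its own head
    else:
        children[head].append((token, dep))

def print_tree(token, dep="ROOT", depth=0):
    # Indent each word by its depth in the dependency tree
    print("  " * depth + f"{token} ({dep})")
    for child, child_dep in children[token]:
        print_tree(child, child_dep, depth + 1)

print_tree(root)
```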

Lastly, let's try measuring text similarity with SpaCy. Text similarity measures how similar or related two pieces of text are. There are many methods and metrics for it, but we will try the simplest one.

import spacy

nlp = spacy.load("en_core_web_sm")

doc1 = nlp("I like pizza")
doc2 = nlp("I really like hamburger")

# Calculate similarity
similarity = doc1.similarity(doc2)
print("Similarity:", similarity)

 

Output>>
Similarity: 0.6159097609586724

 

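The similarity method returns a score, usually between 0 and 1. The closer the score is to 1, the more similar the two texts are. Keep in mind that en_core_web_sm does not ship with static word vectors, so its similarity scores are only a rough estimate; a larger model such as en_core_web_md generally gives more meaningful results.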
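SpaCy's documentation describes the default similarity as a cosine similarity over averaged word vectors. A minimal sketch of that measure in plain Python follows. The `cosine_similarity` helper and the three-dimensional vectors are made up for illustration; real document vectors come from the loaded model and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical document vectors, for illustration only
pizza = [0.9, 0.1, 0.3]
burger = [0.8, 0.2, 0.4]
print(cosine_similarity(pizza, burger))
```

Vectors pointing in the same direction score 1, unrelated (orthogonal) vectors score 0, and opposite vectors score -1, which is why the score is usually, but not always, between 0 and 1.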

There are still many things you can do with SpaCy. Explore the documentation to find something useful for your work.
 

TextBlob

 
TextBlob is an NLP Python library for processing textual data, built on top of NLTK. It simplifies much of NLTK's usage and can streamline text-processing tasks.

You can install TextBlob using the following commands:

pip install -U textblob
python -m textblob.download_corpora

 

Let's use TextBlob for some NLP tasks. The first one we'll try is sentiment analysis. We can do that with the code below.

from textblob import TextBlob

text = "I'm in the top of the world"
blob = TextBlob(text)
sentiment = blob.sentiment

print(sentiment)

 

Output>>
Sentiment(polarity=0.5, subjectivity=0.5)

 

The output is a polarity score and a subjectivity score. Polarity is the sentiment of the text, ranging from -1 (negative) to 1 (positive), while the subjectivity score ranges from 0 (objective) to 1 (subjective).
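To turn the two scores into a readable label, one option is a small helper like the sketch below. The `describe_sentiment` function, the 0.1 neutral band, and the 0.5 subjectivity cut-off are all assumptions made for this example, not part of TextBlob. Tune them to your own data.

```python
def describe_sentiment(polarity, subjectivity, threshold=0.1):
    # threshold is a hypothetical cut-off for treating a score as neutral
    if polarity > threshold:
        tone = "positive"
    elif polarity < -threshold:
        tone = "negative"
    else:
        tone = "neutral"
    # 0.5 is an arbitrary midpoint between objective (0) and subjective (1)
    stance = "subjective" if subjectivity >= 0.5 else "objective"
    return f"{tone}, {stance}"

# Scores taken from the TextBlob output above
print(describe_sentiment(0.5, 0.5))
```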

We can also use TextBlob for text correction tasks. You can do that with the following code.

from textblob import TextBlob

text = "I havv goood speling."
blob = TextBlob(text)

# Spelling Correction
corrected_blob = blob.correct()
print("Corrected Text:", corrected_blob)

 

Output>>
Corrected Text: I have good spelling.

 

Explore the TextBlob package to find the APIs that fit your text tasks.
 

Gensim

 
Gensim is an open-source Python NLP library specializing in topic modeling and document similarity analysis, especially for big and streaming data. It focuses more on industrial, real-time applications.

Let's try the library. First, we can install it using the following command:

pip install gensim

After the installation is finished, we can try out Gensim's capabilities. Let's do topic modeling with LDA using Gensim.

import gensim
from gensim import corpora
from gensim.models import LdaModel

# Sample documents
documents = [
    "Tennis is my favorite sport to play.",
    "Football is a popular competition in certain country.",
    "There are many athletes currently training for the olympic."
]

# Preprocess documents: lowercase and split on whitespace
texts = [[word for word in document.lower().split()] for document in documents]

# Map each word to an id, then turn each document into a bag of words
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]


# The LDA model
lda_model = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=15)

topics = lda_model.print_topics()
for topic in topics:
    print(topic)

 

Output>>
(0, '0.073*"there" + 0.073*"currently" + 0.073*"olympic." + 0.073*"the" + 0.073*"athletes" + 0.073*"for" + 0.073*"training" + 0.073*"many" + 0.073*"are" + 0.025*"is"')
(1, '0.094*"is" + 0.057*"football" + 0.057*"certain" + 0.057*"popular" + 0.057*"a" + 0.057*"competition" + 0.057*"country." + 0.057*"in" + 0.057*"favorite" + 0.057*"tennis"')

 

The output is a weighted combination of words from the sample documents that together form each topic. You can evaluate whether the result makes sense or not.
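Before LDA runs, each document is converted into a bag of words: a list of (word id, count) pairs. The sketch below reproduces that idea in plain Python so the format is easier to picture. The `build_vocab` and `to_bow` helpers are hypothetical names for this example, and the ids Gensim assigns may differ from the ones produced here.

```python
from collections import Counter

def build_vocab(texts):
    # Assign each distinct token an integer id in order of first appearance
    vocab = {}
    for tokens in texts:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    # Count known tokens and return sorted (token_id, count) pairs
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

texts = [["tennis", "is", "my", "favorite", "sport"],
         ["football", "is", "a", "popular", "sport"]]
vocab = build_vocab(texts)
print(to_bow(texts[1], vocab))
```

Word order is discarded in this representation. Only which words occur, and how often, is passed on to the topic model.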

Gensim also provides a way for users to embed content. For example, we can use Word2Vec to create embeddings from words.

import gensim
from gensim.models import Word2Vec

# Sample sentences (already tokenized)
sentences = [
    ['machine', 'learning'],
    ['deep', 'learning', 'models'],
    ['natural', 'language', 'processing']
]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=20, window=5, min_count=1, workers=4)

# Look up the 20-dimensional vector learned for the word 'machine'
vector = model.wv['machine']
print(vector)

 


Output>>
[ 0.01174188 -0.02259516  0.04194366 -0.04929082  0.0338232   0.01457208
 -0.02466416  0.02199094 -0.00869787  0.03355692  0.04982425 -0.02181222
 -0.00299669 -0.02847819  0.01925411  0.01393313  0.03445538  0.03050548
  0.04769249  0.04636709]

 

There are still many applications you can build with Gensim. Take a look at the documentation and evaluate your needs.
 

Conclusion

 

In this article, we explored several Python NLP libraries that are essential for many text tasks. All of these libraries can be useful for your work, from text tokenization to word embedding. The libraries we discussed are:

  1. NLTK
  2. SpaCy
  3. TextBlob
  4. Gensim

I hope it helps!
 
 

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he likes to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.
