nltk Archives - Turbolab Technologies

Entity Linking & Disambiguation using REL

Vasista Reddy — Tue, 12 Jul 2022 07:02:27 +0000

Entity extraction, also known as Named Entity Recognition (NER), is an information extraction process that finds entities in unstructured text and classifies them into predefined categories such as people, organizations, places, products, dates, times, money, phone numbers and so on. Terabytes of unstructured text from documents, web pages, and social media can thus be transformed into structured entities that help analysts query the data and generate insightful reports.

spaCy provides models in various languages for NER and other NLP tasks. Building a custom NER model with spaCy is explained in one of our earlier blog posts.

Now, let’s look into the entity extraction from a random news article using spaCy and Flair:

Defending champion Novak Djokovic battled back from two sets to love down to defeat Jannik Sinner and reach his 11th Wimbledon semi-final on Tuesday. Djokovic triumphed 5-7, 2-6, 6-3, 6-2, 6-2 and will face Britain’s Cameron Norrie of Belgium for a place in Sunday’s final. It was the seventh time in the Serb’s career that he had recovered from two sets to love at the Slams. “Huge congrats to Jannik for a big fight, he’s so mature for his age, he has plenty of time ahead of him,” said Djokovic.

Entity Extraction using spaCy:

import spacy

nlp = spacy.load('en_core_web_lg')  # load the large English pipeline

# one bucket per entity label of the model, keyed in lowercase
ner_ent = {'person': [], 'norp': [], 'fac': [], 'org': [], 'gpe': [], 'loc': [], 'product': [], 'event': [], 'work_of_art': [], 'law': [], 'language': [], 'date': [], 'time': [], 'percent': [], 'money': [], 'quantity': [], 'ordinal': [], 'cardinal': []}

doc = nlp(content)  # content holds the article text shown above
for entity in doc.ents:
    label = entity.label_.lower()
    if label in ner_ent:
        ner_ent[label].append(entity.text)

print(ner_ent)

# output

{'person': ['Novak Djokovic', 'Jannik Sinner', 'Cameron Norrie', 'Jannik', 'Djokovic', 'Novak Djokovic', 'Jannik Sinner', 'Cameron Norrie', 'Jannik', 'Djokovic'], 'norp': ['Serb', 'Serb'], 'fac': [], 'org': [], 'gpe': ['Britain', 'Belgium', 'Britain', 'Belgium'], 'loc': [], 'product': [], 'event': ['Wimbledon', 'Wimbledon'], 'work_of_art': [], 'law': [], 'language': [], 'date': ['Tuesday', 'Sunday', 'Tuesday', 'Sunday'], 'time': [], 'percent': [], 'money': [], 'quantity': [], 'ordinal': ['11th', 'seventh', '11th', 'seventh'], 'cardinal': ['two', '5', '2-6', '6-3', '6', '6-2', 'two', 'two', '5', '2-6', '6-3', '6', '6-2', 'two']}
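Each entity in the output above is listed more than once. If only the distinct mentions per label are needed, an order-preserving de-duplication step is enough. The helper name dedupe_entities below is our own and not part of spaCy; treat it as a minimal sketch.

```python
def dedupe_entities(ner_ent):
    """Return a copy of the label -> mentions dict with repeated mentions
    removed, keeping the order in which each mention first appeared."""
    # dict.fromkeys keeps insertion order (Python 3.7+), so it acts as an ordered set
    return {label: list(dict.fromkeys(mentions)) for label, mentions in ner_ent.items()}


# Shortened sample in the same shape as the spaCy output above
sample = {
    "person": ["Novak Djokovic", "Jannik Sinner", "Novak Djokovic", "Jannik Sinner"],
    "date": ["Tuesday", "Sunday", "Tuesday", "Sunday"],
    "org": [],
}
print(dedupe_entities(sample))
```

The same helper works unchanged on the Flair dictionary, since both snippets use the label -> list-of-mentions shape.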

Entity Extraction using Flair:

from flair.data import Sentence
from flair.models import SequenceTagger

ner_ent = {'per': [], 'org': [], 'loc': [], 'misc': []}

# make a sentence
sentence = Sentence(content)

# load the NER tagger
tagger = SequenceTagger.load('ner')

# run NER over sentence
tagger.predict(sentence)

print('The following NER tags are found:')
# iterate over each entity
for entity in sentence.get_spans('ner'):
    label = entity.labels[0].value.lower()  # e.g. 'PER' -> 'per'
    if label in ner_ent:
        ner_ent[label].append(entity.text)

print(ner_ent)

# output

The following NER tags are found:

{'per': ['George Washington', 'Novak Djokovic', 'Jannik Sinner', 'Djokovic', 'Cameron Norrie', 'Jannik', 'Djokovic'], 'org': [], 'loc': ['Washington', 'Britain', 'Belgium'], 'misc': ['Wimbledon', 'Serb', 'Slams']}

The Flair NER model used here distinguishes only 4 entity types, whereas the spaCy model distinguishes 18.
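To compare the two outputs side by side, the 18 spaCy labels can be collapsed into the four coarse types. The grouping below is our own assumption (for instance, treating norp and event as misc); neither library prescribes it, so adjust it to your needs.

```python
# Assumed mapping from spaCy's labels to the four coarse types used by the
# Flair model above (per, org, loc, misc). The grouping is a judgment call.
SPACY_TO_COARSE = {
    "person": "per",
    "org": "org",
    "gpe": "loc",
    "loc": "loc",
    "fac": "loc",
    "norp": "misc",
    "event": "misc",
    "product": "misc",
    "work_of_art": "misc",
    "law": "misc",
    "language": "misc",
}


def to_coarse(spacy_ents):
    """Collapse a spaCy-style {label: [mentions]} dict into the four coarse types.
    Labels without a mapping (dates, numbers, money, ...) are dropped."""
    coarse = {"per": [], "org": [], "loc": [], "misc": []}
    for label, mentions in spacy_ents.items():
        target = SPACY_TO_COARSE.get(label)
        if target is not None:
            coarse[target].extend(mentions)
    return coarse


print(to_coarse({"person": ["Novak Djokovic"], "gpe": ["Britain"],
                 "event": ["Wimbledon"], "date": ["Tuesday"]}))
```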

Entity Linking & Disambiguation

Entity linking is the process of linking entities to a target knowledge base. Here, we map the entities to Wikipedia links or page titles, which is why the process is also called Wikification. Entity linking can also be viewed as entity validation: the entities extracted by the spaCy or Flair models are validated against a third-party knowledge base.

However, entity linking is intricate because of entity ambiguity and name variants. For example, the word Amazon can refer to an organization or to a rainforest.

Let’s have a detailed discussion on Entity Linking & Entity Disambiguation.

News Article Clip:

Deforestation in Brazil’s Amazon rainforest reached a record high for the first six months of the year, as an area five times the size of New York City was destroyed, preliminary government data showed on Friday.

spaCy Output:

'org': ['Amazon'], 'gpe': ['Brazil', 'New York City']

Here, Amazon is detected as an organization.

Flair Output:

'loc': ['Brazil', 'Amazon', 'New York City']

Here, Amazon is detected as a location. The ambiguity problem is clearly visible here, and it can be addressed with the Radboud Entity Linker (REL).

REL Output:

[Image: REL output]

Radboud Entity Linker (REL) handles both entity linking and entity disambiguation. You can use the public API provided by REL, or install it via Docker or from source by following the instructions in its documentation. By default, REL uses Flair to extract entities, but Flair can be replaced with spaCy. REL also provides pre-trained case-sensitive and case-insensitive models with an F1 score of almost 93%.

The wikimapper Python library maps Wikipedia page titles to Wikidata IDs and vice versa; we use it to fetch the wikidata_id for a Wikipedia title.

BLINK, Facebook Research’s entity linking Python library, uses Wikipedia as the target knowledge base, similar to REL. However, the BLINK documentation does not say anything about entity disambiguation.

OpenTapioca is a simple and fast Named Entity Linking system for Wikidata. A spaCy wrapper of OpenTapioca called spaCyOpenTapioca is also available for the entity linking process. However, its results are not as good as REL’s.

spaCy also includes a pipeline component called entity_linker (the EntityLinker class) for named entity linking and disambiguation.
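As a rough sketch of what a call to the public API can look like: the endpoint URL and the text/spans payload below follow the REL README as we understand it, and both are assumptions that may have changed, so check the REL documentation before relying on them. The function names are our own. Only the payload builder runs here; the network call is left commented out.

```python
import json
import urllib.request

# Assumption: public endpoint as listed in the REL README; verify before use.
REL_API_URL = "https://rel.cs.ru.nl/api"


def build_rel_payload(text):
    """Build the JSON body for a REL request.
    An empty "spans" list asks REL to run its own mention detection."""
    return {"text": text, "spans": []}


def link_entities(text, api_url=REL_API_URL, timeout=30):
    """POST the text to a REL server and return the decoded JSON response."""
    body = json.dumps(build_rel_payload(text)).encode("utf-8")
    request = urllib.request.Request(
        api_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


clip = "Deforestation in Brazil's Amazon rainforest reached a record high."
print(build_rel_payload(clip))
# Uncomment to call the live service (needs network access):
# print(link_entities(clip))
```

Passing api_url explicitly makes the same helper usable against a local Docker or source installation.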

Dealing with Disambiguation

Japan began the defence of their title with a lucky 2-1 win against Syria in a championship match on Friday.

Using the above statement, we will discuss the different approaches to choosing the appropriate entity in the case of Entity Disambiguation.

Let’s see how Wikifier deals with the disambiguation:

Wikifier does not use an NER model to find entities; it relies on part-of-speech (POS) tags instead.

The entities Syria and Japan are linked to their respective countries’ Wikipedia pages, Syria and Japan. In the context of the above statement, however, Japan and Syria actually refer to football teams. Wikifier fetches all the Wikipedia pages related to the mention and maps it to the one with the most link targets.

Wikifier considers the minLinkFrequency parameter to evaluate the score.
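The "most link targets" idea can be rendered as a toy baseline. This is not Wikifier's actual scoring: the candidate pages and link counts below are invented for illustration, and the function name is hypothetical.

```python
# Hypothetical candidate pages with invented inbound-link counts.
CANDIDATES = {
    "Japan": {"Japan": 120000, "Japan national football team": 9000},
    "Syria": {"Syria": 60000, "Syria national football team": 2500},
}


def link_by_popularity(mention, candidates=CANDIDATES):
    """Pick the candidate page with the highest link count, ignoring context."""
    pages = candidates.get(mention)
    if not pages:
        return None  # unknown mention: nothing to link
    return max(pages, key=pages.get)


for mention in ("Japan", "Syria"):
    print(mention, "->", link_by_popularity(mention))
```

Because the sentence is never consulted, a popularity-only linker picks the country page no matter what the surrounding text is about.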

Let’s see how REL deals with the disambiguation:

In REL, entity linking decisions depend on the contextual similarity and on coherence with the other entity linking decisions in the document. In other words, one entity mapping depends on the other entities found in the document. The details are described in the REL paper.

On this example the approach has little effect, since the content is a single sentence and only two entities are found. Had we passed POS-based mentions instead of the NER output, the result might have been different.

When the entire article is passed to REL, the results are considerably better. The REL model can now use the context and relate more entities from the whole article.

Brazil and Dutch are mapped to their respective football teams’ wiki pages. Why Japan is still not mapped to its football team remains a mystery, though. LOL.
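To make the role of context concrete, here is a toy scorer that adds a context-overlap term to a popularity prior. It is not REL's model, which is learned and far more involved, and as noted above REL's real output for Japan differed. The priors, context words, and function name are all invented for illustration.

```python
# Hypothetical candidates: each page has an invented popularity prior and a
# hand-picked set of words that tend to appear near mentions of that page.
CANDIDATES = {
    "Japan": [
        {"page": "Japan", "prior": 0.9,
         "context": {"tokyo", "economy", "government"}},
        {"page": "Japan national football team", "prior": 0.1,
         "context": {"win", "match", "championship", "title", "defence"}},
    ],
}


def link_with_context(mention, sentence, candidates=CANDIDATES, context_weight=1.0):
    """Score = prior + context_weight * (share of the candidate's context words
    that occur in the sentence). Return the best-scoring page, or None."""
    words = {w.strip(".,").lower() for w in sentence.split()}
    best_page, best_score = None, float("-inf")
    for cand in candidates.get(mention, []):
        overlap = len(cand["context"] & words) / len(cand["context"])
        score = cand["prior"] + context_weight * overlap
        if score > best_score:
            best_page, best_score = cand["page"], score
    return best_page


sentence = ("Japan began the defence of their title with a lucky 2-1 win "
            "against Syria in a championship match on Friday.")
print(link_with_context("Japan", sentence))                      # context used
print(link_with_context("Japan", sentence, context_weight=0.0))  # prior only
```

Setting context_weight to 0 reduces the scorer to the popularity-only behaviour, which makes the effect of the context term easy to see.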

Conclusion

Instead of going with the score of the most link targets, REL considers the context and the relationships between the entities detected in the document. With better mention detection, REL could serve as a very good entity disambiguation tool.

Last but not least, there is a tool called ExtEnD (Extractive Entity Disambiguation) that still needs to be explored. It can be added to the spaCy NLP pipeline.

The output documented by ExtEnD looks much better than the REL-generated output, but as mentioned above, the tool needs to be explored before drawing a conclusion.

The post Entity Linking & Disambiguation using REL appeared first on Turbolab Technologies.

Stemming Vs. Lemmatization with Python NLTK

Vasista Reddy — Fri, 29 Oct 2021 17:00:31 +0000

Stemming and Lemmatization are text/word normalization techniques widely used in text pre-processing. They basically reduce the words to their root form. Here is an example:

Let’s say you have to prepare data for training a classifier and you choose a vectorizer to transform it. The vectorizer creates a vocabulary (the set of unique words) from the data corpus. By applying stemming or lemmatization, we can reduce the vocabulary size by converting words to their base forms. Variants of the same word then share one vocabulary entry, which reduces ambiguity for the model and can yield better results.

In this post, we will discuss practical examples of how stemming and lemmatization can be done on words and sentences using the Python NLTK package.

Stemming

Stemming is a rule-based normalization approach: it slices off a word’s prefix and suffix to reduce it to its root form. Stemming is faster than lemmatization because it cuts prefixes (pre-, extra-, in-, im-, ir-, etc.) and suffixes (-ed, -ing, -es, -ity, -ty, -ship, -ness, etc.) without considering the context of the words. Because of this aggressiveness, the outcome of a stemming algorithm may not be a valid word.

In the above example, you can see that the outcomes of badly and pharmacies are invalid words.
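A toy suffix stripper shows how purely rule-based cutting ends up with non-words. The rule list below is invented for illustration and is far cruder than any real stemmer; the function name is our own.

```python
# Toy rules: (suffix to cut, replacement). Checked in order, first match wins.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]
VOWELS = "aeiou"


def toy_stem(word):
    """Apply the first matching suffix rule, then turn a trailing 'y' into 'i'
    when it follows a consonant. No dictionary or context is consulted."""
    for suffix, replacement in SUFFIX_RULES:
        # Require at least 3 leftover characters so short words stay intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: len(word) - len(suffix)] + replacement
            break
    if len(word) > 3 and word.endswith("y") and word[-2] not in VOWELS:
        word = word[:-1] + "i"
    return word


words = ["plays", "playing", "played", "player", "pharmacies", "badly"]
print([toy_stem(w) for w in words])
```

The rules never check whether the result exists in a dictionary, which is why "pharmacies" and "badly" come out as non-words.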

Porter Stemmer

The Porter stemming algorithm (or “Porter stemmer”) strips suffixes to produce stems. Here is Python code that uses nltk to create a stemmer object and generate results.

Code Snippet to perform Porter Stemming:

In:

from nltk.stem import PorterStemmer
words = ["plays", "playing", "played", "player", "pharmacies", "badly"]
ps = PorterStemmer()
print([ps.stem(w) for w in words])

Out:

['play', 'play', 'play', 'player', 'pharmaci', 'badli']

The Snowball stemming algorithm was introduced to address the drawbacks of the Porter stemmer.

Snowball Stemmer

The Snowball stemming algorithm is also known as the Porter2 stemmer. It is an improved version of the Porter stemmer in which a few of the stemming issues discussed above are resolved.

Code Snippet to perform Snowball Stemming:

In:

from nltk.stem.snowball import SnowballStemmer
words = ["plays", "playing", "played", "player", "pharmacies", "badly"]
ss = SnowballStemmer(language='english')
print([ss.stem(w) for w in words])

Out:

['play', 'play', 'play', 'player', 'pharmaci', 'bad']

Here, we can see that “badly” now gets a valid stem (“bad”), but the stem of “pharmacies” is still an invalid word.

Lancaster Stemmer

Compared to Snowball and Porter stemming, Lancaster is the most aggressive stemming algorithm: it tries to reduce each word to the shortest stem possible and therefore tends to over-stem. Here is an example:

“salty” -> “sal”

“sales” -> “sal”

Code Snippet to perform Lancaster Stemming:

In:

from nltk.stem import LancasterStemmer
words = ["plays", "playing", "played", "player", "pharmacies", "badly"]
ls = LancasterStemmer()
print([ls.stem(w) for w in words])

Out:

['play', 'play', 'play', 'play', 'pharm', 'bad']

As mentioned in the beginning, stemming lets us reduce the vocabulary size, because several inflected forms collapse into a single stem.
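The vocabulary reduction can be checked directly by comparing the set of tokens with the set of their stems. This is a minimal sketch with a hand-picked token list, using the same PorterStemmer class as above.

```python
from nltk.stem import PorterStemmer

# Hand-picked tokens: five surface forms built on the same root
tokens = ["play", "plays", "playing", "played", "player"]
ps = PorterStemmer()

vocabulary = set(tokens)                             # unique surface forms
stemmed_vocabulary = {ps.stem(t) for t in tokens}    # unique stems

print(len(vocabulary), "->", len(stemmed_vocabulary))
print(sorted(stemmed_vocabulary))
```

A vectorizer fitted on the stemmed tokens would therefore need fewer columns for the same text.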

Code snippet to perform tokenization and stemming on a paragraph:

content = "China’s Huawei overtook Samsung Electronics as the world’s biggest seller of mobile phones in the second quarter of 2020, shipping 55.8 million devices compared to Samsung’s 53.7 million, according to data from research firm Canalys. While Huawei’s sales fell 5 per cent from the same quarter a year earlier, South Korea’s Samsung posted a bigger drop of 30 per cent, owing to disruption from the coronavirus in key markets such as Brazil, the United States and Europe, Canalys said. Huawei’s overseas shipments fell 27 per cent in Q2 from a year earlier, but the company increased its dominance of the China market which has been faster to recover from COVID-19 and where it now sells over 70 per cent of its phones. “Our business has demonstrated exceptional resilience in these difficult times,” a Huawei spokesman said. “Amidst a period of unprecedented global economic slowdown and challenges, we’re continued to grow and further our leadership position.” Nevertheless, Huawei’s position as number one seller may prove short-lived once other markets recover given it is mainly due to economic disruption, a senior Huawei employee with knowledge of the matter told Reuters. Apple is due to release its Q2 iPhone shipment data on Friday."

The above content will hereafter be used as the input to the code snippets.

In:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

# Porter Stemmed version

porteredContent = [ps.stem(word) for word in word_tokenize(content)]

Try testing the above code snippet by replacing the Porter stemmer with Snowball and Lancaster stemmers.

Let us look at some statistics to compare these three stemming algorithms.

Length of the content is 1041 (without spaces)
Length of the content after Porter Stemmer is 943, which took around 0.00499 seconds to process
Length of the content after Snowball Stemmer is 944, which took around 0.00399 seconds to process
Length of the content after Lancaster Stemmer is 835, which took around 0.00399 seconds to process

As expected, the Lancaster stemmer produces the shortest content because of its aggressive over-stemming. None of the three stemmers discussed above gave us the root word of “pharmacies”. Since stemming does not produce a valid word in all cases, we now move on to lemmatization. While stemming is fast, it is not 100% accurate.
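Figures like the ones above can be gathered with a small helper along these lines. It is a sketch under assumptions: the helper name stem_stats is ours, the sample is a short word list rather than the article, and tokens are split on whitespace instead of with word_tokenize, so the numbers will not match those reported above.

```python
import time

from nltk.stem import LancasterStemmer, PorterStemmer
from nltk.stem.snowball import SnowballStemmer


def stem_stats(stemmer, tokens):
    """Return (total characters after stemming, elapsed seconds)."""
    start = time.perf_counter()
    stems = [stemmer.stem(token) for token in tokens]
    elapsed = time.perf_counter() - start
    return sum(len(stem) for stem in stems), elapsed


# Whitespace splitting keeps the sketch free of tokenizer data downloads
sample = "plays playing played player pharmacies badly"
tokens = sample.split()

stemmers = {
    "Porter": PorterStemmer(),
    "Snowball": SnowballStemmer(language="english"),
    "Lancaster": LancasterStemmer(),
}

print("Length of the content is", sum(len(t) for t in tokens))
for name, stemmer in stemmers.items():
    length, seconds = stem_stats(stemmer, tokens)
    print(f"{name}: length {length}, {seconds:.5f} seconds")
```

Timings on inputs this small are dominated by noise, so averaging over several runs gives a fairer comparison.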

Lemmatization

In lemmatization, the part of speech (POS) of a word is determined first, unlike stemming, which reduces the word to its root form without considering the context. Lemmatization takes the context into account and converts the word to its meaningful root/dictionary (WordNet) form, called the lemma.

WordNet Lemmatizer

WordNet is a lexical database (a collection of words) that has been used by major search engines and IR research projects for many years. It offers lemmatization capabilities as well and is one of the earliest and most commonly used lemmatizers.

In:

from nltk.stem import WordNetLemmatizer
words = ["plays", "playing", "played", "player", "pharmacies", "badly"]
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(word) for word in words])

Out:

['play', 'playing', 'played', 'player', 'pharmacy', 'badly']

Here, we can see that only “plays” and the much-anticipated “pharmacies” have been converted to their root forms, while the remaining words have not. Without a POS tag, the WordNet lemmatizer treats every word as a noun. We need to pass the appropriate POS tag along with the word.

WordNet Lemmatizer with POS tag:

In:

word = "better"
print(lemmatizer.lemmatize(word, pos="n")) # n for noun and it is default
print(lemmatizer.lemmatize(word, pos="a")) # a for adjective
print(lemmatizer.lemmatize(word, pos="v")) # v for verb
print(lemmatizer.lemmatize(word, pos="r")) # r for adverb

Out:

better | good | better | well

For the word “better”, the output differs depending on whether the POS is an adjective (“good”) or an adverb (“well”).
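One way to picture why the POS changes the answer is to think of the lemma lookup as keyed by the pair (word, POS). The table below is a tiny hand-written stand-in, not WordNet data, and toy_lemmatize is a hypothetical name.

```python
# Tiny hand-written lemma table keyed by (word, pos); not WordNet data.
LEMMAS = {
    ("better", "a"): "good",
    ("better", "r"): "well",
    ("playing", "v"): "play",
    ("played", "v"): "play",
}


def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return LEMMAS.get((word, pos), word)


for pos in ("n", "a", "v", "r"):
    print(pos, toy_lemmatize("better", pos))
```

With the default pos="n", words such as "playing" have no entry and come back unchanged, mirroring the noun-only behaviour seen earlier.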

Now, determining the POS for the word will be an extra task for the lemmatization process. When we are converting a large number of text chunks, it will be difficult to pass a POS tag for each word – we need to automate the fetching of POS tags for each word we lemmatize. Here is a function for that:

In:

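import nltk
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    # Tag the single word and keep the first letter of its Penn Treebank tag
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tagDict = {"J": wordnet.ADJ,
               "N": wordnet.NOUN,
               "V": wordnet.VERB,
               "R": wordnet.ADV}
    # Fall back to noun, which is also the lemmatizer's default
    return tagDict.get(tag, wordnet.NOUN)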

Out:

get_wordnet_pos("better") -> "r"

get_wordnet_pos("play") -> "n"

get_wordnet_pos("bad") -> "a"
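The function above tags each word in isolation, so the tagger never sees the surrounding sentence. An alternative is to tag a whole sentence once and map each resulting tag. The sketch below covers only the mapping step: the (token, tag) pairs are written by hand as a stand-in for tagger output, and penn_to_wordnet is our own name.

```python
# Single-letter WordNet POS codes (the values of wordnet.ADJ, NOUN, VERB, ADV in nltk)
ADJ, NOUN, VERB, ADV = "a", "n", "v", "r"


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG', 'JJR') to a WordNet POS code.
    Anything that is not an adjective, noun, verb or adverb falls back to noun."""
    return {"J": ADJ, "N": NOUN, "V": VERB, "R": ADV}.get(tag[:1].upper(), NOUN)


# Hand-written (token, tag) pairs standing in for the result of tagging a
# tokenized sentence in one call; the tags are illustrative.
tagged = [("Huawei", "NNP"), ("overtook", "VBD"), ("Samsung", "NNP"),
          ("quickly", "RB"), ("in", "IN"), ("key", "JJ"), ("markets", "NNS")]

print([(token, penn_to_wordnet(tag)) for token, tag in tagged])
```

Tagging once per sentence also avoids calling the tagger separately for every word.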

Code Snippet to perform WordNet Lemmatization with POS:

In:

from nltk.stem import WordNetLemmatizer
words = ["plays", "playing", "played", "player", "pharmacies", "badly"]
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words])

Out:

['play', 'play', 'played', 'player', 'pharmacy', 'badly']

spaCy, TextBlob, Stanford CoreNLP, and Gensim offer other lemmatizers that can be tried. With the spaCy lemmatizer, lemmatization can be done without passing a POS tag explicitly.

Code snippet to perform lemmatization on a paragraph:

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
wordnetContent = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in word_tokenize(content)] # content and get_wordnet_pos defined earlier

The WordNet lemmatizer takes 0.2234 seconds to process this content, which is a lot slower than stemming.

Conclusion

Stemming and lemmatization both generate the root/base form of a word. The main difference is that a stem may not be an actual word, whereas a lemma is always a meaningful word.

Compared to stemming, lemmatization is slow, but it helps to train a more accurate ML model. If your data is huge, the Snowball stemmer (Porter2) is a better alternative. If your ML model uses a count vectorizer and does not depend on the context of the words/sentences, stemming is the better choice.

For deep learning models that use word embeddings, lemmatization is the better choice, because you will not find word embeddings for invalid stems.

We recommend you try the other lemmatization methods provided by spaCy, TextBlob, Gensim, and Stanford CoreNLP.

The post Stemming Vs. Lemmatization with Python NLTK appeared first on Turbolab Technologies.