Entity extraction, also known as Named Entity Recognition (NER), is an information extraction process that identifies entities in unstructured text and classifies them into predefined categories such as people, organizations, places, products, dates, times, money, phone numbers, and so on. Terabytes of unstructured text from documents, web pages, and social media can thus be transformed into structured entities that help analysts query the data and generate insightful reports.
spaCy provides different models in various languages to perform NER and NLP-related tasks. Building a custom NER model using spaCy has been explained in one of our blogs. You can check out the link here.
Now, let’s look into the entity extraction from a random news article using spaCy and Flair:
Defending champion Novak Djokovic battled back from two sets to love down to defeat Jannik Sinner and reach his 11th Wimbledon semi-final on Tuesday. Djokovic triumphed 5-7, 2-6, 6-3, 6-2, 6-2 and will face Britain’s Cameron Norrie of Belgium for a place in Sunday’s final. It was the seventh time in the Serb’s career that he had recovered from two sets to love at the Slams. “Huge congrats to Jannik for a big fight, he’s so mature for his age, he has plenty of time ahead of him,” said Djokovic.
Entity Extraction using spaCy:
import spacy

# `content` is assumed to hold the news article text shown above
nlp = spacy.load('en_core_web_lg')  # load the large English model
ner_ent = {'person': [], 'norp': [], 'fac': [], 'org': [], 'gpe': [], 'loc': [], 'product': [], 'event': [], 'work_of_art': [], 'law': [], 'language': [], 'date': [], 'time': [], 'percent': [], 'money': [], 'quantity': [], 'ordinal': [], 'cardinal': []}
doc = nlp(content)
for entity in doc.ents:
    label = entity.label_.lower()  # spaCy labels are upper case, e.g. PERSON
    if label in ner_ent:
        ner_ent[label].append(entity.text)
print(ner_ent)
# output
{'person': ['Novak Djokovic', 'Jannik Sinner', 'Cameron Norrie', 'Jannik', 'Djokovic', 'Novak Djokovic', 'Jannik Sinner', 'Cameron Norrie', 'Jannik', 'Djokovic'], 'norp': ['Serb', 'Serb'], 'fac': [], 'org': [], 'gpe': ['Britain', 'Belgium', 'Britain', 'Belgium'], 'loc': [], 'product': [], 'event': ['Wimbledon', 'Wimbledon'], 'work_of_art': [], 'law': [], 'language': [], 'date': ['Tuesday', 'Sunday', 'Tuesday', 'Sunday'], 'time': [], 'percent': [], 'money': [], 'quantity': [], 'ordinal': ['11th', 'seventh', '11th', 'seventh'], 'cardinal': ['two', '5', '2-6', '6-3', '6', '6-2', 'two', 'two', '5', '2-6', '6-3', '6', '6-2', 'two']}
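The output above lists each entity twice. If only the distinct mentions per category are needed, the lists can be de-duplicated after extraction. The following is a minimal sketch; the helper name dedupe_entities and the sample dictionary are made up for illustration and are not part of spaCy.

```python
def dedupe_entities(ner_ent):
    """Return a copy of ner_ent with duplicates removed, keeping first-seen order."""
    # dict.fromkeys keeps insertion order (Python 3.7+), so the first occurrence wins
    return {label: list(dict.fromkeys(texts)) for label, texts in ner_ent.items()}


# A shortened sample in the same shape as the dictionary printed above
sample = {
    'person': ['Novak Djokovic', 'Jannik Sinner', 'Novak Djokovic', 'Jannik Sinner'],
    'norp': ['Serb', 'Serb'],
    'org': [],
}
print(dedupe_entities(sample))
# {'person': ['Novak Djokovic', 'Jannik Sinner'], 'norp': ['Serb'], 'org': []}
```

Note that this only removes exact string repeats; it does not merge variants such as 'Djokovic' and 'Novak Djokovic', which is a job for the entity linking step.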
Entity Extraction using Flair:
from flair.data import Sentence
from flair.models import SequenceTagger

# `content` is assumed to hold the news article text shown above
ner_ent = {'per': [], 'org': [], 'loc': [], 'misc': []}
# make a sentence
sentence = Sentence(content)
# load the NER tagger
tagger = SequenceTagger.load('ner')
# run NER over sentence
tagger.predict(sentence)
print('The following NER tags are found:')
# iterate over each entity
for entity in sentence.get_spans('ner'):
    # read the tag (e.g. PER) from the label object instead of parsing its string form
    label = entity.labels[0].value.lower()
    if label in ner_ent:
        ner_ent[label].append(entity.text)
print(ner_ent)
# output
The following NER tags are found:
{'per': ['George Washington', 'Novak Djokovic', 'Jannik Sinner', 'Djokovic', 'Cameron Norrie', 'Jannik', 'Djokovic'], 'org': [], 'loc': ['Washington', 'Britain', 'Belgium'], 'misc': ['Wimbledon', 'Serb', 'Slams']}
Flair’s default ner model gives us only 4 entity types (PER, ORG, LOC, MISC), whereas spaCy’s en_core_web_lg model gives 18 entity types.
Entity Linking & Disambiguation
Entity linking is the process of linking extracted entities to entries in a target knowledge base. Here, we map the entities to Wikipedia links or Wikipedia page titles, which is why the process is also called Wikification. Entity linking can also be seen as entity validation: the entities extracted by the spaCy or Flair models are validated against a third-party knowledge base.
However, entity linking is intricate because of entity ambiguity and name variants. For example, the word Amazon can refer to an organization or to a rainforest.
Let’s have a detailed discussion on Entity Linking & Entity Disambiguation.
News Article Clip:
Deforestation in Brazil’s Amazon rainforest reached a record high for the first six months of the year, as an area five times the size of New York City was destroyed, preliminary government data showed on Friday.
spaCy Output:
'org': ['Amazon'], 'gpe': ['Brazil', 'New York City']
Here, Amazon is detected as an organization.
Flair Output:
'loc': ['Brazil', 'Amazon', 'New York City']
Here, Amazon is detected as a location. The ambiguity problem is clearly visible here and can be solved by Radboud Entity Linker (REL).
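Disagreements like this one can be found automatically by comparing the two dictionaries. The sketch below is one possible way to do it, not something either library provides: the function names and the SPACY_TO_COARSE mapping from spaCy's labels to Flair's four labels are assumptions made for this example.

```python
# Hypothetical mapping from spaCy's fine-grained labels to Flair's 4 coarse labels;
# any spaCy label not listed here is treated as 'misc'.
SPACY_TO_COARSE = {'person': 'per', 'org': 'org', 'gpe': 'loc', 'loc': 'loc', 'fac': 'loc'}


def invert(ner_ent, label_map=None):
    """Turn {label: [texts]} into {text: set(labels)}, optionally remapping labels."""
    by_text = {}
    for label, texts in ner_ent.items():
        coarse = label_map.get(label, 'misc') if label_map else label
        for text in texts:
            by_text.setdefault(text, set()).add(coarse)
    return by_text


def find_conflicts(spacy_ent, flair_ent):
    """Return {text: (spacy_labels, flair_labels)} for mentions the models type differently."""
    spacy_by_text = invert(spacy_ent, SPACY_TO_COARSE)
    flair_by_text = invert(flair_ent)
    return {
        text: (sorted(spacy_by_text[text]), sorted(flair_by_text[text]))
        for text in spacy_by_text.keys() & flair_by_text.keys()
        if spacy_by_text[text] != flair_by_text[text]
    }


spacy_out = {'org': ['Amazon'], 'gpe': ['Brazil', 'New York City']}
flair_out = {'loc': ['Brazil', 'Amazon', 'New York City']}
print(find_conflicts(spacy_out, flair_out))
# {'Amazon': (['org'], ['loc'])}
```

Mentions flagged this way are good candidates to pass on to an entity linker for disambiguation.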
REL Output:
Radboud Entity Linker (REL) handles both entity linking and entity disambiguation. You can use the public API provided by REL, or install it with Docker or from source by following the instructions in the documentation. By default, REL uses Flair to extract entities; you can replace Flair with spaCy. REL also provides pre-trained case-sensitive and case-insensitive models, with an F1 score of almost 93%.
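As a rough sketch of how the public API can be called using only the Python standard library: the endpoint URL and the "text"/"spans" request fields below follow the example in the REL documentation as we understand it, so treat them as assumptions and verify them against the current docs. The function names are hypothetical.

```python
import json
import urllib.request

# Assumption: the public endpoint listed in the REL documentation; check it is still current.
REL_API_URL = "https://rel.cs.ru.nl/api"


def build_rel_payload(text, spans=None):
    """Build the JSON body: the text plus optional mention spans.

    An empty span list is assumed to ask REL to run its own mention detection.
    """
    return {"text": text, "spans": spans or []}


def link_entities(text, api_url=REL_API_URL, timeout=30):
    """POST the text to REL and return the decoded JSON response."""
    body = json.dumps(build_rel_payload(text)).encode("utf-8")
    request = urllib.request.Request(
        api_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires network access, so it is left commented out):
# for linked in link_entities("Deforestation in Brazil's Amazon rainforest reached a record high."):
#     print(linked)
```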
The Wikimapper Python library is used to fetch the wikidata_id for a Wikipedia title. Have a look at the project, which helps you map Wikipedia page titles to Wikidata IDs and vice versa.
BLINK, the Facebook Research entity linking Python library, uses Wikipedia as the target knowledge base, similar to REL. However, the BLINK documentation does not describe how it handles entity disambiguation.
OpenTapioca is a simple and fast Named Entity Linking system for Wikidata. A spaCy wrapper of OpenTapioca called spaCyOpenTapioca is also available for the entity linking process. However, its results are not as good as REL’s.
spaCy includes a pipeline component called entity_linker (the EntityLinker class) for Named Entity Linking and Disambiguation.
Dealing with Disambiguation
Japan began the defence of their title with a lucky 2-1 win against Syria in a championship match on Friday.
Using the above statement, we will discuss the different approaches to choosing the appropriate entity in the case of Entity Disambiguation.
Let’s see how Wikifier deals with the disambiguation:
Wikifier doesn’t use any entity extraction method for extracting entities; it relies on part-of-speech (POS) tags instead.
The entities Syria and Japan are linked to their respective countries’ Wikipedia pages, Syria and Japan. In the context of the above statement, however, Japan and Syria actually refer to their football teams. Wikifier fetches all the Wikipedia pages related to a mention and maps the mention to the page with the most link targets.
Wikifier considers the minLinkFrequency parameter to evaluate the score.
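The behaviour described above can be pictured as a most-frequent-link-target baseline. The sketch below is a simplified reading of that description, not Wikifier's actual scoring; the function name, the min_link_frequency argument (named after the minLinkFrequency parameter), and the link counts are all made up for illustration.

```python
def pick_most_linked(candidates, min_link_frequency=1):
    """Pick the candidate page that a mention's anchor text links to most often.

    candidates: {page_title: number of links using this anchor text for that page}
    Candidates below min_link_frequency are ignored; returns None if none remain.
    """
    eligible = {page: n for page, n in candidates.items() if n >= min_link_frequency}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


# Made-up link counts for the mention "Japan", for illustration only
japan_candidates = {
    'Japan': 9000,
    'Japan national football team': 400,
    'Japan (band)': 3,
}
print(pick_most_linked(japan_candidates, min_link_frequency=5))
# Japan
```

Because the choice ignores the surrounding sentence, the country page wins even when the text is about a football match, which matches the result described above.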
Let’s see how REL deals with the disambiguation:
In REL, entity linking decisions depend on the contextual similarity and coherence with the other entity linking decisions in the document. One entity mapping is dependent on the other entities found in the document. You can read the paper here.
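To make the idea of coherence concrete, here is a toy sketch in which each mention has candidate pages with a local context score, and pairs of candidates have a relatedness score. It picks the joint assignment with the highest total. This is not REL's actual model; the function name, the exhaustive search, and every number are assumptions chosen for illustration, and the result says nothing about what REL returns for this sentence.

```python
from itertools import product


def link_jointly(local_scores, relatedness):
    """Choose one candidate per mention, maximizing local fit plus pairwise coherence.

    local_scores: {mention: {candidate: context-similarity score}}
    relatedness:  {frozenset({cand_a, cand_b}): coherence score}; missing pairs count as 0
    """
    mentions = list(local_scores)
    best_assignment, best_total = None, float('-inf')
    # Exhaustive search is fine for a toy example; it does not scale to real documents
    for combo in product(*(local_scores[m] for m in mentions)):
        total = sum(local_scores[m][c] for m, c in zip(mentions, combo))
        for i in range(len(combo)):
            for j in range(i + 1, len(combo)):
                total += relatedness.get(frozenset((combo[i], combo[j])), 0.0)
        if total > best_total:
            best_assignment, best_total = dict(zip(mentions, combo)), total
    return best_assignment


# Made-up scores for illustration only
local = {
    'Japan': {'Japan': 0.6, 'Japan national football team': 0.4},
    'Syria': {'Syria': 0.6, 'Syria national football team': 0.4},
}
related = {
    frozenset(('Japan national football team', 'Syria national football team')): 0.9,
    frozenset(('Japan', 'Syria')): 0.2,
}
print(link_jointly(local, related))
# {'Japan': 'Japan national football team', 'Syria': 'Syria national football team'}
```

With these numbers the country pages score higher locally, but the strong relatedness between the two football team pages tips the joint decision, which is the kind of dependency between linking decisions described above.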
For this example, REL’s context-based approach makes no difference, since only two entities are found and the content is a single sentence. Had we passed POS-based mentions instead of the output of the entity detection method, the result might have been different.
When the entire article is passed to REL, the results are noticeably better. The REL model can now understand the context and relate more entities from the entire article.
Brazil and Dutch are mapped to their respective football team wiki pages. Why Japan is still not mapped to its football team remains a mystery, though.
Conclusion
Instead of going with the score of the most link targets, REL considers the context and the relationships between the entities detected in the document. With improved mention detection, REL can serve as a very good entity disambiguation tool.
Last but not least, there is a tool called ExtEnD (Extractive Entity Disambiguation) that we have yet to explore. It can be added to the spaCy NLP pipeline.
The output documented by ExtEnD looks much better than the REL-generated output. As mentioned above, however, this tool needs further exploration before we can draw a conclusion.