spaCy is an open-source Python library for Natural Language Processing (NLP), written in Python and Cython. Unlike NLTK, which is widely used in research, spaCy focuses on production use and describes itself as industrial-strength NLP. It is one of the most widely used NLP libraries available today.
spaCy provides ready-to-use, language-specific pre-trained pipelines. Their components (tok2vec, tagger, parser, attribute_ruler, lemmatizer, ner) handle tagging, parsing, lemmatization, NER, and other NLP tasks. At the time of writing, it supports 18 languages and 1 multi-language pipeline. The full list of supported languages is available in the spaCy documentation.
spaCy provides the following four pre-trained English pipelines under the MIT license:
- en_core_web_sm (12 MB)
- en_core_web_md (43 MB)
- en_core_web_lg (741 MB)
- en_core_web_trf (438 MB)
Support for transformers, along with the transformer-based pre-trained pipeline (en_core_web_trf), was introduced in spaCy 3.0.
Named Entity Recognition (NER) is the NLP task of recognizing entities in a given text. An NER model performs two tasks: detection and categorization. It has to detect the entities in the text (India, America, Abdul Kalam) and assign each one a category (LOCATION, LOCATION, PERSON). This helps retrieve information from large volumes of uncategorized text.
Load a spaCy model and check whether it has an ner component
In:
!python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names
Out:
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
ner is in the pipeline. Let’s test how entity detection works on a sentence.
In:
from spacy import displacy
sentence = "Daniil Medvedev and Novak Djokovic have built an intriguing rivalry since the Australian Open decider, which the Serb won comprehensively."
doc = nlp(sentence)
displacy.render(doc, style="ent", jupyter=True)
Let’s observe the doc to see how entities are being identified/tagged by the model.
In:
[(X, X.ent_iob_, X.ent_type_) for X in doc if X.ent_type_]
Out:
[(Daniil, 'B', 'PERSON'),
(Medvedev, 'I', 'PERSON'),
(Novak, 'B', 'PERSON'),
(Djokovic, 'I', 'PERSON'),
(Australian, 'B', 'NORP'), # LOCATION
(Serb, 'B', 'NORP')]
Novak and Djokovic are each correctly tagged as PERSON, but as separate tokens. displaCy nevertheless displays them as a single entity. IOB tagging is what allows consecutive tokens that belong together to be combined into one entity.
Inside-Outside-Beginning(IOB) Tagging
IOB is a common format for tagging entities/chunks in text.
- I stands for Inside and it indicates that the token is inside a chunk.
- B stands for Beginning and it indicates that the token is the beginning of a chunk.
- O stands for Outside and it indicates that the token doesn’t belong to any chunk.
In the above output, Daniil is tagged as B, which marks the beginning of an entity chunk, and Medvedev is tagged as I, which marks it as a continuation of the chunk begun by the previous token, Daniil. These two tokens combine to form one PERSON entity. The same applies to Novak and Djokovic.
The tokens tagged as O do not belong to any entity, so the model assigns them no label:
[(and, 'O', ''),
(have, 'O', ''),
(built, 'O', ''),
(an, 'O', ''),
(intriguing, 'O', ''),
(rivalry, 'O', ''),
(since, 'O', ''),
(the, 'O', ''),
(Open, 'O', ''),
(decider, 'O', '')]
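The B/I/O rules described above can be sketched in plain Python. This is a minimal illustration, not spaCy's own implementation: the function name merge_iob is hypothetical, and the input is a hand-written list of (text, IOB tag, label) triples copied from the outputs shown earlier.

```python
def merge_iob(tagged_tokens):
    """Group (text, iob, label) triples into (entity_text, label) pairs."""
    entities = []
    current_words, current_label = [], None
    for text, iob, label in tagged_tokens:
        starts_new = iob == "B" or (iob == "I" and label != current_label)
        if starts_new or iob == "O":
            # Close the entity that was being built, if any.
            if current_words:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
        if starts_new:
            current_words, current_label = [text], label
        elif iob == "I":
            current_words.append(text)  # continue the open entity
    if current_words:
        entities.append((" ".join(current_words), current_label))
    return entities


# Triples taken from the outputs above (O tokens carry an empty label).
tagged = [
    ("Daniil", "B", "PERSON"), ("Medvedev", "I", "PERSON"), ("and", "O", ""),
    ("Novak", "B", "PERSON"), ("Djokovic", "I", "PERSON"), ("have", "O", ""),
]
print(merge_iob(tagged))
```

Each B token opens a new entity, each following I token extends it, and an O token closes it, which is why Daniil and Medvedev end up as one PERSON entity.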
CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
These are the entity labels provided by the pre-trained NER model. We can run the command below to get a description of any label.
In:
spacy.explain("NORP")
Out:
Nationalities or religious or political groups
Why do we need a Custom NER?
spaCy’s pre-trained models detect and categorize text chunks into the 18 entity types listed above. If the requirement is to extract information from job postings, such as the job role, the pre-trained model will not help, because none of its labels covers it. Let’s see an example:
In:
from spacy import displacy
sentence = """As a Full Stack Developer, you will develop applications in a very passionate environment being responsible for Front-end and Back-end development. You will perform development and day-to-day maintenance on large applications. You have multiple opportunities to work on cross-system single-page applications."""
doc = nlp(sentence)
displacy.render(doc, style="ent", jupyter=True)
Out:
UserWarning: [W006] No entities to visualize found in Doc object. If this is surprising to you, make sure the Doc was processed using a model that supports named entity recognition, and check the `doc.ents` property manually if necessary.
The warning says that no entities were found in the Doc object.
This is where a custom NER model comes into the picture for our problem statement, i.e., detecting the job_role in job postings.
Steps to build the custom NER model for detecting the job role in job postings in spaCy 3.0:
- Annotate the data to train the model.
- Convert the annotated data into the spaCy bin object.
- Generate the config file from the spaCy website.
- Train the model in the command line.
- Load and test the saved model.
We will discuss the above steps in detail.
SpaCy NER annotation tool by agateteam
The agateteam provides a lightweight annotation tool to generate the spaCy-supported annotated data format.
The gif above shows the annotation of a sentence. Only job_role tagging is shown; you can add labels such as work_experience, work_location, and experience to the entity list. Here is the sample annotated data:
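The article's own sample is not reproduced here, so the following is only a sketch of the record format the conversion code below appears to expect: a list of (text, {"entities": [(start, end, label)]}) pairs with character offsets. The helper make_record and the example sentence are hypothetical and are not taken from the original training set.

```python
def make_record(text, phrase, label):
    """Build one (text, {"entities": [(start, end, label)]}) training record."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    end = start + len(phrase)  # the end offset is exclusive
    return (text, {"entities": [(start, end, label)]})


# Hypothetical example sentence, used for illustration only.
record = make_record(
    "As a Full Stack Developer, you will develop applications.",
    "Full Stack Developer",
    "job_role",
)
trainData = [record]

# Slicing the text with the stored offsets should give back the phrase.
text, annot = record
for start, end, label in annot["entities"]:
    print(label, repr(text[start:end]), (start, end))
```

Checking that text[start:end] returns the intended phrase is a cheap way to catch off-by-one offsets before training.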
Convert the annotated data into the spaCy bin object
In spaCy 2.x, this raw data could be used directly to train a model. In spaCy 3.x, it first has to be converted to a DocBin object. Assume the annotated data above is assigned to a variable called trainData. We can convert it using the code below:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank("en")  # create a blank English pipeline
db = DocBin()  # create a DocBin object

for text, annot in tqdm(trainData):  # data in the previous format
    doc = nlp.make_doc(text)  # create a Doc object from the text
    ents = []
    for start, end, label in annot["entities"]:  # character offsets
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    try:
        doc.ents = ents  # label the text with the entities
        db.add(doc)
    except ValueError:  # raised when the spans cannot be set, e.g. overlaps
        print(text, annot)

db.to_disk("./train.spacy")  # save the DocBin object
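One common reason a record ends up in the except branch of the conversion code is that two annotated spans overlap, which doc.ents does not accept. A minimal pre-check in plain Python, assuming the (start, end, label) tuple format used above, could look like this. The function name find_overlaps and the offsets are hypothetical.

```python
def find_overlaps(entities):
    """Return pairs of (start, end, label) spans that share at least one character."""
    ordered = sorted(entities, key=lambda ent: (ent[0], ent[1]))
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[0] >= first[1]:
                break  # sorted by start, so no later span can overlap `first`
            overlaps.append((first, second))
    return overlaps


# Hypothetical offsets for illustration only.
clean = [(5, 25, "job_role"), (40, 52, "job_role")]
clashing = [(5, 25, "job_role"), (10, 15, "job_role")]
print(find_overlaps(clean))
print(find_overlaps(clashing))
```

Running this over every record before building the DocBin makes it easier to fix the annotations than to debug skipped documents afterwards.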
Generate the config file to train via Command line
The spacy train command is the recommended way to train spaCy pipelines. The config.cfg file includes all settings and hyperparameters; if necessary, individual settings can be overridden on the command line.
Go to the training quickstart on the spaCy website and follow the steps below:
Select the preferred language and ner as the component. Depending on your system, choose CPU or GPU. Save this configuration as base_config.cfg, then run the following command to fill in the remaining default settings and produce config.cfg:
python -m spacy init fill-config base_config.cfg config.cfg
Training the model using the command line
[paths]
train = ./train.spacy
dev = ./dev.spacy
You can specify the train and dev file paths in the config file, as shown above; the output path is passed on the command line. The batch size, max steps, epochs, patience, etc. can also be set in the config file.
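The config refers to a dev.spacy file, while the conversion step earlier only writes train.spacy. One possible way to obtain both, sketched here under assumptions, is to split the annotated records first and run the same conversion on each part. The function name split_records, the 20% dev fraction, and the fixed seed are illustrative choices, not part of the original workflow.

```python
import random


def split_records(records, dev_fraction=0.2, seed=0):
    """Shuffle a copy of `records` and split it into (train, dev) lists."""
    if not 0 < dev_fraction < 1:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Placeholder records standing in for the annotated data.
records = [(f"text {i}", {"entities": []}) for i in range(10)]
train_records, dev_records = split_records(records)
print(len(train_records), len(dev_records))
```

The train part would then be saved as ./train.spacy and the dev part as ./dev.spacy, matching the [paths] section above.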
Now that we have the config file and train data, let’s train the model using the command line.
The trained model is saved in the folder passed as the output argument on the command line.
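The training command itself is not shown above. As a sketch, the helper below assembles a spacy train invocation from Python, assuming the config is named config.cfg, the data files are ./train.spacy and ./dev.spacy, and the output folder is ./output (chosen to match the output/model-last path loaded in the next section). The helper name build_train_command is hypothetical.

```python
import shlex
import sys


def build_train_command(config_path, output_dir, train_path, dev_path):
    """Return the argument list for `python -m spacy train` with path overrides."""
    return [
        sys.executable, "-m", "spacy", "train", config_path,
        "--output", output_dir,
        "--paths.train", train_path,
        "--paths.dev", dev_path,
    ]


cmd = build_train_command("config.cfg", "./output", "./train.spacy", "./dev.spacy")
# Show the command with a generic "python" in place of the full interpreter path.
print(shlex.join(["python"] + cmd[1:]))
# To actually start training (requires spaCy and the data files):
# import subprocess; subprocess.run(cmd, check=True)
```

The printed line can also be pasted into a terminal directly; the --paths.train and --paths.dev options override the corresponding [paths] entries in the config.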
Load & Test the model
- Load the model.
import spacy
nlp = spacy.load("output/model-last/")  # load the trained model
- Take the unseen data to test the model prediction.
sentence = """We are looking for a Backend Developer who has 4-6 years of experience in designing, developing and implementing backend services using Python and Django."""
doc = nlp(sentence)
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)
Out:
Backend Developer is predicted as a job_role by the model.
Applications of NER:
- Powering recommendation systems.
- Simplifying customer support.
- Classifying content from news sources.
- Optimizing search engine algorithms.
EndNote:
We used just 10 records to train the model. For better accuracy and precision, a much larger amount of annotated data is needed.
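To make "accuracy and precision" concrete, entity predictions are often scored by comparing predicted spans with gold annotations. The sketch below computes span-level precision, recall, and F1 in plain Python under the assumption that a prediction only counts when start, end, and label all match. The function name span_scores and the offsets are hypothetical, and this is not spaCy's own scorer.

```python
def span_scores(gold, predicted):
    """Return (precision, recall, f1) for exact-match (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)  # spans found in both sets
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and predicted spans for one document.
gold = [(21, 38, "job_role"), (60, 70, "job_role")]
predicted = [(21, 38, "job_role"), (80, 90, "job_role")]
print(span_scores(gold, predicted))
```

Tracking these numbers on a held-out dev set while adding annotated records shows whether the extra data is actually helping.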