<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Technology Archives - Turbolab Technologies</title>
	<atom:link href="https://turbolab.in/category/tech/feed/" rel="self" type="application/rss+xml" />
	<link>https://turbolab.in/category/tech/</link>
	<description>Big Data and News Analysis Startup in Kochi</description>
	<lastBuildDate>Fri, 26 Jul 2024 10:00:38 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://i0.wp.com/turbolab.in/wp-content/uploads/2018/03/turbo_black_trans-space.png?fit=32%2C32&#038;ssl=1</url>
	<title>Technology Archives - Turbolab Technologies</title>
	<link>https://turbolab.in/category/tech/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">98237731</site>	<item>
		<title>Entity Linking &#038; Disambiguation using REL</title>
		<link>https://turbolab.in/entity-linking-disambiguation-using-rel/</link>
					<comments>https://turbolab.in/entity-linking-disambiguation-using-rel/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Tue, 12 Jul 2022 07:02:27 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[entity linking]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[rel]]></category>
		<category><![CDATA[spacy]]></category>
		<category><![CDATA[wikifier]]></category>
		<category><![CDATA[wikipedia]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=907</guid>

					<description><![CDATA[<p>Entity extraction, also known as Named Entity Recognition (NER), is an information extraction process that extracts entities from unstructured text and classifies them into predefined categories such as people, organizations, places, products, dates, times, money, phone numbers and so on. The terabytes of unstructured text data that come from documents, web pages, and social [&#8230;]</p>
<p>The post <a href="https://turbolab.in/entity-linking-disambiguation-using-rel/">Entity Linking &amp; Disambiguation using REL</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400">Entity extraction, also known as </span><em><b>Named Entity Recognition (NER)</b></em><span style="font-weight: 400">, is an information extraction process that extracts entities from unstructured text and classifies them into predefined categories such as people, organizations, places, products, dates, times, money, phone numbers and so on. The terabytes of unstructured text data that come from documents, web pages, and social media can be transformed into structured entities that help analysts query the data and generate insightful reports.</span></p>
<p><span style="font-weight: 400">spaCy provides different models in various languages to perform NER and other NLP tasks. Building a custom NER model using spaCy is explained in one of our earlier blog posts; you can check it out</span> <strong><a href="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/">here</a></strong>.</p>
<p><span style="font-weight: 400">Now, let’s look into the entity extraction from a random news article using spaCy and Flair:</span></p>
<blockquote><p><em>Defending champion Novak Djokovic battled back from two sets to love down to defeat Jannik Sinner and reach his 11th Wimbledon semi-final on Tuesday. Djokovic triumphed 5-7, 2-6, 6-3, 6-2, 6-2 and will face Britain&#8217;s Cameron Norrie of Belgium for a place in Sunday&#8217;s final. It was the seventh time in the Serb&#8217;s career that he had recovered from two sets to love at the Slams. &#8220;Huge congrats to Jannik for a big fight, he&#8217;s so mature for his age, he has plenty of time ahead of him,&#8221; said Djokovic.</em></p></blockquote>
<h5>Entity Extraction using spaCy:</h5>
<blockquote><p><em><strong>import spacy</strong></em></p>
<p><em><strong>nlp = spacy.load(&#8216;en_core_web_lg&#8217;) # load the large English model</strong></em></p>
<p><em><strong>content = &#8220;Defending champion Novak Djokovic battled back &#8230;&#8221; # the news article clip quoted above</strong></em></p>
<p><em><strong>ner_ent = {&#8216;person&#8217;: [], &#8216;norp&#8217;: [], &#8216;fac&#8217;: [], &#8216;org&#8217;: [], &#8216;gpe&#8217;: [], &#8216;loc&#8217;: [], &#8216;product&#8217;: [], &#8216;event&#8217;: [], &#8216;work_of_art&#8217;: [], &#8216;law&#8217;: [], &#8216;language&#8217;: [], &#8216;date&#8217;: [], &#8216;time&#8217;: [], &#8216;percent&#8217;: [], &#8216;money&#8217;: [], &#8216;quantity&#8217;: [], &#8216;ordinal&#8217;: [], &#8216;cardinal&#8217;: []}</strong></em></p>
<p><em><strong>doc = nlp(content)</strong></em><br />
<em><strong>for entity in doc.ents:</strong></em><br />
<em><strong>    if entity.label_.lower() in ner_ent:</strong></em><br />
<em><strong>        ner_ent[entity.label_.lower()].append(entity.text)</strong></em></p>
<p><em><strong>print(ner_ent)</strong></em></p>
<p><em><strong># output</strong></em></p>
<p><em><strong>{&#8216;person&#8217;: [&#8216;Novak Djokovic&#8217;, &#8216;Jannik Sinner&#8217;, &#8216;Cameron Norrie&#8217;, &#8216;Jannik&#8217;, &#8216;Djokovic&#8217;, &#8216;Novak Djokovic&#8217;, &#8216;Jannik Sinner&#8217;, &#8216;Cameron Norrie&#8217;, &#8216;Jannik&#8217;, &#8216;Djokovic&#8217;], &#8216;norp&#8217;: [&#8216;Serb&#8217;, &#8216;Serb&#8217;], &#8216;fac&#8217;: [], &#8216;org&#8217;: [], &#8216;gpe&#8217;: [&#8216;Britain&#8217;, &#8216;Belgium&#8217;, &#8216;Britain&#8217;, &#8216;Belgium&#8217;], &#8216;loc&#8217;: [], &#8216;product&#8217;: [], &#8216;event&#8217;: [&#8216;Wimbledon&#8217;, &#8216;Wimbledon&#8217;], &#8216;work_of_art&#8217;: [], &#8216;law&#8217;: [], &#8216;language&#8217;: [], &#8216;date&#8217;: [&#8216;Tuesday&#8217;, &#8216;Sunday&#8217;, &#8216;Tuesday&#8217;, &#8216;Sunday&#8217;], &#8216;time&#8217;: [], &#8216;percent&#8217;: [], &#8216;money&#8217;: [], &#8216;quantity&#8217;: [], &#8216;ordinal&#8217;: [&#8217;11th&#8217;, &#8216;seventh&#8217;, &#8217;11th&#8217;, &#8216;seventh&#8217;], &#8216;cardinal&#8217;: [&#8216;two&#8217;, &#8216;5&#8217;, &#8216;2-6&#8217;, &#8216;6-3&#8217;, &#8216;6&#8217;, &#8216;6-2&#8217;, &#8216;two&#8217;, &#8216;two&#8217;, &#8216;5&#8217;, &#8216;2-6&#8217;, &#8216;6-3&#8217;, &#8216;6&#8217;, &#8216;6-2&#8217;, &#8216;two&#8217;]}</strong></em></p></blockquote>
<h5>Entity Extraction using Flair:</h5>
<blockquote><p><em><strong>from flair.data import Sentence</strong></em><br />
<em><strong>from flair.models import SequenceTagger</strong></em></p>
<p><em><strong>ner_ent = {&#8216;per&#8217;: [], &#8216;org&#8217;: [], &#8216;loc&#8217;: [], &#8216;misc&#8217;: []}</strong></em></p>
<p><em><strong># make a sentence</strong></em><br />
<em><strong>sentence = Sentence(content)</strong></em></p>
<p><em><strong># load the NER tagger</strong></em><br />
<em><strong>tagger = SequenceTagger.load(&#8216;ner&#8217;)</strong></em></p>
<p><em><strong># run NER over sentence</strong></em><br />
<em><strong>tagger.predict(sentence)</strong></em></p>
<p><em><strong>print(&#8216;The following NER tags are found:&#8217;)</strong></em><br />
<em><strong># iterate over each entity</strong></em><br />
<em><strong>for entity in sentence.get_spans(&#8216;ner&#8217;):</strong></em><br />
<em><strong>    if str(entity.labels[0]).split()[0].lower() in ner_ent:</strong></em><br />
<em><strong>        ner_ent[str(entity.labels[0]).split()[0].lower()].append(entity.text)</strong></em></p>
<p><em><strong>print(ner_ent)</strong></em></p>
<p><em><strong># output</strong></em></p>
<p><em><strong>The following NER tags are found:</strong></em></p>
<p><em><strong>{&#8216;per&#8217;: [&#8216;Novak Djokovic&#8217;, &#8216;Jannik Sinner&#8217;, &#8216;Djokovic&#8217;, &#8216;Cameron Norrie&#8217;, &#8216;Jannik&#8217;, &#8216;Djokovic&#8217;], &#8216;org&#8217;: [], &#8216;loc&#8217;: [&#8216;Britain&#8217;, &#8216;Belgium&#8217;], &#8216;misc&#8217;: [&#8216;Wimbledon&#8217;, &#8216;Serb&#8217;, &#8216;Slams&#8217;]}</strong></em></p></blockquote>
<p>Flair NER models give us only 4 entity types whereas spaCy gives 18 entity types.</p>
<h2>Entity Linking &amp; Disambiguation</h2>
<p>Entity Linking is the process of linking entities to a target knowledge base. Here, we map the entities to wiki links or wiki page titles, which is why the process is also called Wikification. Entity linking can also be seen as entity validation: the entities extracted by the spaCy or Flair models get validated against a third-party knowledge base.</p>
<p>However, entity linking is intricate due to entity ambiguity and name variants. For example, the word <strong>Amazon</strong> can refer to an organization or a rainforest.</p>
<p>Let&#8217;s have a detailed discussion on Entity Linking &amp; Entity Disambiguation.</p>
<h5>News Article Clip:</h5>
<blockquote><p>Deforestation in Brazil&#8217;s Amazon rainforest reached a record high for the first six months of the year, as an area five times the size of New York City was destroyed, preliminary government data showed on Friday.</p></blockquote>
<h5>Spacy Output:</h5>
<blockquote><p>&#8216;org&#8217;: [&#8216;Amazon&#8217;], &#8216;gpe&#8217;: [&#8216;Brazil&#8217;, &#8216;New York City&#8217;]</p></blockquote>
<p>Here, <strong>Amazon</strong> is detected as the organization.</p>
<h5>Flair Output:</h5>
<blockquote><p>&#8216;loc&#8217;: [&#8216;Brazil&#8217;, &#8216;Amazon&#8217;, &#8216;New York City&#8217;]</p></blockquote>
<p><span style="font-weight: 400">Here, </span><b>Amazon</b><span style="font-weight: 400"> is detected as the location/GPE. The ambiguity problem is clearly visible here and can be solved by Radboud Entity Linker (REL).</span></p>
<h5><strong>REL</strong> <strong>Output</strong>:</h5>
<p><img data-recalc-dims="1" fetchpriority="high" decoding="async" data-attachment-id="908" data-permalink="https://turbolab.in/entity-linking-disambiguation-using-rel/rel/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?fit=1430%2C266&amp;ssl=1" data-orig-size="1430,266" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="rel" data-image-description="" data-image-caption="&lt;p&gt;REL&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?fit=300%2C56&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?fit=800%2C148&amp;ssl=1" class="size-full wp-image-908" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=800%2C149&#038;ssl=1" alt="" width="800" height="149" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?w=1430&amp;ssl=1 1430w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=300%2C56&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=768%2C143&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=1024%2C190&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=1080%2C201&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=1280%2C238&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=980%2C182&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=480%2C89&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p><a href="https://github.com/informagi/REL"><strong>Radboud Entity Linker (REL)</strong></a> deals <span style="font-weight: 400">with the tasks of Entity Linking and Entity Disambiguation. One can use the public API provided by REL, or install it via Docker or from source following the instructions in the documentation. By default, </span><b>REL</b><span style="font-weight: 400"> uses Flair to extract entities; you can replace Flair with spaCy. REL also provides pre-trained case-sensitive and case-insensitive models, with an F1 score of almost 93%.</span></p>
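<p>As a minimal sketch of querying the public API with nothing but the Python standard library (the endpoint URL and payload shape below are taken from the REL README at the time of writing, so verify them against the current documentation before relying on this):</p>

```python
# Sketch of calling the public REL API. Endpoint and payload format are
# assumptions based on the REL README -- check the current docs.
import json
import urllib.request

REL_API = "https://rel.cs.ru.nl/api"  # public endpoint per the REL README

def build_payload(text, spans=None):
    # An empty "spans" list asks REL to run its own mention detection;
    # pass (start, length) pairs to link pre-detected mentions instead.
    return {"text": text, "spans": spans or []}

def link_entities(text):
    # POST the document and decode the JSON response; each result row
    # holds mention offsets, the surface form, and the linked Wikipedia
    # page title (e.g. Amazon_rainforest rather than the company).
    req = urllib.request.Request(
        REL_API,
        data=json.dumps(build_payload(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

clip = ("Deforestation in Brazil's Amazon rainforest reached a record "
        "high for the first six months of the year.")
# Requires network access, so not executed here:
# for row in link_entities(clip):
#     print(row)
```

<p>The same payload works whether you hit the hosted endpoint or a local Docker deployment of REL.</p>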
<p>The <a href="https://pypi.org/project/wikimapper/"><strong>Wikimapper</strong></a> Python <span style="font-weight: 400">library is used to fetch the wikidata_id for a given Wikipedia title. Have a look at the project: it helps you map Wikipedia page titles to WikiData IDs and vice versa.</span></p>
<p><a href="https://github.com/facebookresearch/BLINK"><b>BLINK</b></a><span style="font-weight: 400">, the Facebook Research entity linking Python library, uses Wikipedia as the target knowledge base, similar to </span><b>REL</b><span style="font-weight: 400">. However, the BLINK documentation doesn&#8217;t reveal any information regarding entity disambiguation.</span></p>
<p><a href="https://github.com/wetneb/opentapioca"><b>OpenTapioca</b></a><span style="font-weight: 400"> is a simple and fast Named Entity Linking system for Wikidata. A spaCy wrapper of OpenTapioca called</span><a href="https://spacy.io/universe/project/spacyopentapioca"> <b>spaCyOpenTapioca</b></a><span style="font-weight: 400"> is also available for entity linking. However, its results are not as good as REL&#8217;s.</span></p>
<p><span style="font-weight: 400">spaCy includes a pipeline component called</span><a href="https://spacy.io/api/entitylinker"> <b>entitylinker</b></a><span style="font-weight: 400"> for Named Entity Linking and Disambiguation.</span></p>
<h2>Dealing with Disambiguation</h2>
<blockquote><p>Japan began the defence of their title with a lucky 2-1 win against Syria in a championship match on Friday.</p></blockquote>
<p><span style="font-weight: 400">Using the above statement, we will discuss the different approaches to choosing the appropriate entity in the case of Entity Disambiguation.</span></p>
<h5>Let&#8217;s see how <a href="https://wikifier.org/"><strong>wikifier</strong></a> deals with the disambiguation:</h5>
<p><a href="https://wikifier.org/"><strong>Wikifier</strong></a> <span style="font-weight: 400">doesn&#8217;t use an entity extraction model to find entities; it relies on part-of-speech (POS) tagging instead.</span></p>
<p><img data-recalc-dims="1" decoding="async" data-attachment-id="911" data-permalink="https://turbolab.in/entity-linking-disambiguation-using-rel/wikifier1/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?fit=1891%2C381&amp;ssl=1" data-orig-size="1891,381" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="wikifier1" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?fit=300%2C60&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?fit=800%2C161&amp;ssl=1" class="size-full wp-image-911 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=800%2C161&#038;ssl=1" alt="" width="800" height="161" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?w=1891&amp;ssl=1 1891w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=300%2C60&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=768%2C155&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=1024%2C206&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=1080%2C218&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=1280%2C258&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=980%2C197&amp;ssl=1 980w, 
https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=480%2C97&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?w=1600&amp;ssl=1 1600w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p><span style="font-weight: 400">The entities Syria and Japan are linked to their respective countries’ Wikipedia pages,</span><a href="https://en.wikipedia.org/wiki/Syria"> <b>Syria</b></a><span style="font-weight: 400"> and</span><a href="https://en.wikipedia.org/wiki/Japan"> <b>Japan</b></a><span style="font-weight: 400">. In the context of the above statement, however, Japan and Syria actually refer to their football teams. Wikifier fetches all the candidate Wikipedia pages for a mention and maps the mention to the page with the most link targets.</span></p>
<p><img data-recalc-dims="1" decoding="async" data-attachment-id="912" data-permalink="https://turbolab.in/entity-linking-disambiguation-using-rel/wikifier2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?fit=483%2C671&amp;ssl=1" data-orig-size="483,671" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="wikifier2" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?fit=216%2C300&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?fit=483%2C671&amp;ssl=1" class="size-full wp-image-912 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?resize=483%2C671&#038;ssl=1" alt="" width="483" height="671" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?w=483&amp;ssl=1 483w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?resize=216%2C300&amp;ssl=1 216w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?resize=480%2C667&amp;ssl=1 480w" sizes="(max-width: 483px) 100vw, 483px" /></p>
<p>Wikifier considers the minLinkFrequency parameter to evaluate the score.</p>
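<p>The toy sketch below (all link counts are invented for illustration) shows why this most-link-targets strategy favors the dominant sense of a name, regardless of context:</p>

```python
# Toy illustration of Wikifier-style candidate selection: each mention
# has candidate Wikipedia pages, and the candidate that the mention's
# anchor text links to most often wins, subject to a minimum
# link-frequency cutoff (cf. Wikifier's minLinkFrequency parameter).
# All counts here are made up for illustration.
def pick_candidate(candidates, min_link_frequency=1):
    """candidates: dict mapping page title -> number of times the
    mention's anchor text links to that page. Returns the best title,
    or None if nothing clears the cutoff."""
    eligible = {t: n for t, n in candidates.items() if n >= min_link_frequency}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)

# Hypothetical link counts for the mention "Japan":
japan_candidates = {
    "Japan": 1500,                        # the country page dominates
    "Japan_national_football_team": 40,   # the contextually correct page
}
print(pick_candidate(japan_candidates))  # -> Japan
```

<p>This is exactly why frequency-based linking picks the country page even when the sentence is about football, and why REL&#8217;s context-aware disambiguation, discussed next, behaves differently.</p>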
<h5>Let&#8217;s see how REL deals with the disambiguation:</h5>
<p>In REL, entity linking decisions depend on the contextual similarity and coherence with the other entity linking decisions in the document. One entity mapping is dependent on the other entities found in the document. You can read the paper <a href="https://arxiv.org/pdf/2006.01969.pdf"><strong>here</strong></a>.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="913" data-permalink="https://turbolab.in/entity-linking-disambiguation-using-rel/rel2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?fit=1435%2C215&amp;ssl=1" data-orig-size="1435,215" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="rel2" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?fit=300%2C45&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?fit=800%2C120&amp;ssl=1" class="size-full wp-image-913 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=800%2C120&#038;ssl=1" alt="" width="800" height="120" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?w=1435&amp;ssl=1 1435w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=300%2C45&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=768%2C115&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=1024%2C153&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=1080%2C162&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=1280%2C192&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=980%2C147&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=480%2C72&amp;ssl=1 480w" 
sizes="(max-width: 800px) 100vw, 800px" /></p>
<p><span style="font-weight: 400">In this example the coherence signal has little impact, since only two entities are found and the content is a one-liner. If we had passed the POS output instead of the entity detection output, the result might have been different.</span></p>
<p>When the entire <a href="https://www.firstpost.com/sports/fifa-world-cup-qualifiers-2022-syria-japan-secure-victories-to-make-it-to-next-round-9694971.html"><strong>article</strong></a> is passed to REL, the results are considerably better. The REL model can now understand the context and relate more entities across the entire article.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="914" data-permalink="https://turbolab.in/entity-linking-disambiguation-using-rel/rel3/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?fit=1135%2C300&amp;ssl=1" data-orig-size="1135,300" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="rel3" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?fit=300%2C79&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?fit=800%2C212&amp;ssl=1" class="size-full wp-image-914 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=800%2C211&#038;ssl=1" alt="" width="800" height="211" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?w=1135&amp;ssl=1 1135w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=300%2C79&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=768%2C203&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=1024%2C271&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=1080%2C285&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=980%2C259&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=480%2C127&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p><strong>Brazil</strong> and <strong>Dutch</strong> are mapped to their respective football team wiki pages. Why <strong>Japan</strong> is still not mapped to its football team remains a mystery, though.</p>
<h2>Conclusion</h2>
<p><span style="font-weight: 400">Instead of going with the score of the most link targets, REL considers the context and the relationships between the entities detected in the document. With improved mention detection, REL can serve as an excellent Entity Disambiguation tool.</span></p>
<p>Last but not least, there is a tool called <a href="https://github.com/SapienzaNLP/extend"><strong>ExtEnD</strong></a> (Extractive Entity Disambiguation) that still needs to be explored. We can add this tool to the spaCy NLP pipeline.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="915" data-permalink="https://turbolab.in/entity-linking-disambiguation-using-rel/extend/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?fit=665%2C178&amp;ssl=1" data-orig-size="665,178" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="extend" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?fit=300%2C80&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?fit=665%2C178&amp;ssl=1" class="size-full wp-image-915 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?resize=665%2C178&#038;ssl=1" alt="" width="665" height="178" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?w=665&amp;ssl=1 665w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?resize=300%2C80&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?resize=480%2C128&amp;ssl=1 480w" sizes="(max-width: 665px) 100vw, 665px" /></p>
<p>The output documented by <strong>ExtEnD</strong> looks much better than the REL-generated output, but as mentioned above, the tool still needs to be explored before drawing conclusions.</p>
<p>The post <a href="https://turbolab.in/entity-linking-disambiguation-using-rel/">Entity Linking &amp; Disambiguation using REL</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/entity-linking-disambiguation-using-rel/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">907</post-id>	</item>
		<item>
		<title>Incremental/Online/Continuous Model Training using Creme</title>
		<link>https://turbolab.in/machine-learning-model-retraining-approach-incremental-online-continuous-model-training-using-creme/</link>
					<comments>https://turbolab.in/machine-learning-model-retraining-approach-incremental-online-continuous-model-training-using-creme/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Fri, 25 Feb 2022 09:01:02 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[creme]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[model retraining]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=889</guid>

					<description><![CDATA[<p>Have you noticed that a trained ML model&#8217;s performance degrades over time? Why does the model performance degrade? Let&#8217;s say we have a model that takes a person&#8217;s data as input and detects the face. Now with the Covid situation, almost 90% of people wear masks and the model will not be able to detect [&#8230;]</p>
<p>The post <a href="https://turbolab.in/machine-learning-model-retraining-approach-incremental-online-continuous-model-training-using-creme/">Incremental/Online/Continuous Model Training using Creme</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Have you noticed that a trained ML model&#8217;s performance degrades over time? Why does the performance degrade? Let&#8217;s say we have a model that takes a person&#8217;s data as input and detects the face. Now with the Covid situation, almost 90% of people wear masks, so the model will not be able to detect faces, which results in low performance and low accuracy. What is this phenomenon called? It is called Model Drift, and it is categorized into Concept Drift and Data Drift.</p>
<p>Concept Drift is when the properties of the dependent variable, i.e., the output/prediction of the model, change.</p>
<p>Data Drift is when the properties of the independent variable, i.e., the input to the model, change.</p>
<blockquote><p><strong>y = a + bx</strong></p></blockquote>
<p>The change in the dependent variable <strong>y</strong> leads to Concept Drift and the change in the independent variable <strong>x</strong> leads to Data Drift.</p>
<blockquote><p><strong>Change is inevitable</strong></p></blockquote>
<p>The world, and the parameters on which we train the model, are going to change over time. Consider another example: a travel agency has a model that takes a person&#8217;s average salary, the season, and the weather as input to predict the number of people traveling to some country X. With the Covid regulations of border closures and flying restrictions, plus job losses, inflation, and the change in people&#8217;s mindset, the model would go for a toss.</p>
<h2>How to detect model drift?</h2>
<p><strong>Monitoring</strong> the model in production is the only way to detect model drift: set thresholds on metrics such as precision, recall, and F1-score, and trigger alerts through monitoring tools when the metrics fall below them. <a href="https://evidentlyai.com/"><strong>Evidently AI</strong></a> is one such monitoring tool.</p>
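<p>The alerting logic itself can be sketched in plain Python (the labels and the 0.8 threshold are illustrative; in practice a monitoring tool computes these metrics for you):</p>

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical ground truth vs. model predictions gathered in production.
metrics = precision_recall_f1([1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
                              [1, 0, 0, 1, 0, 1, 1, 0, 0, 1])
THRESHOLD = 0.8
alerts = [name for name, value in metrics.items() if value < THRESHOLD]
print(alerts)  # metrics that fell below the threshold and should raise an alert
```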
<h2>How to avoid model drift?</h2>
<p>Either we train the model continuously on the streaming data as it arrives, or we retrain it on a schedule (weekly, monthly, etc.) with the updated data. Retraining from scratch with the latest data is not an efficient way to handle models already deployed in production; online/incremental model training is the more efficient approach.</p>
<h2>Model Retraining Approach</h2>
<p>Let us explore the Python library <strong><a href="https://pypi.org/project/creme/">creme</a></strong> to train an incremental ML model on streaming data, one record at a time.</p>
<p>With creme, we encourage a different approach, which is to continuously learn from a stream of data. This means that the model processes one observation at a time, and can therefore be updated on the fly. This allows learning from massive datasets that don&#8217;t fit in main memory. Online machine learning also integrates nicely in cases where new data is constantly arriving. It shines in many use cases, such as time series forecasting, spam filtering, recommender systems, CTR prediction, and IoT applications. If you&#8217;re bored with retraining models and want to instead build dynamic models, then online machine learning (and therefore creme!) might be what you&#8217;re looking for.</p>
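<p>The core idea of updating a model from a single observation and then discarding it can be illustrated with one online stochastic-gradient step for logistic regression (plain Python, independent of creme; the feature values and learning rate are illustrative):</p>

```python
import math

def sgd_step(weights, x, y, lr=0.1):
    """Update logistic-regression weights from a single (x, y) observation."""
    z = sum(w * xi for w, xi in zip(weights, x))
    pred = 1 / (1 + math.exp(-z))          # current probability estimate
    return [w + lr * (y - pred) * xi for w, xi in zip(weights, x)]

weights = [0.0, 0.0]
stream = [([1.0, 2.0], 1), ([1.0, -1.5], 0), ([1.0, 3.0], 1)]
for x, y in stream:                         # one record at a time
    weights = sgd_step(weights, x, y)       # model is updated on the fly
print(weights)
```

<p>creme&#8217;s <code>fit_one</code> applies the same one-observation-at-a-time pattern behind a scikit-learn-like API.</p>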
<h3>Install from PyPI</h3>
<blockquote><p><em><strong>pip install creme</strong></em></p></blockquote>
<h3>Dataset to train the model</h3>
<blockquote><p><em><strong>docs = [</strong></em><br />
<em><strong>(&#8220;Cricket news: England James Anderson determined to revive international career despite West Indies axing&#8221;, &#8220;Cricket&#8221;),</strong></em><br />
<em><strong>(&#8220;Well Have Just One Head Coach For All Cricket Formats: CA Chairman&#8221;, &#8220;Cricket&#8221;),</strong></em><br />
<em><strong>(&#8220;Rod Marsh: Australian cricket legend in critical condition after suffering heart attack&#8221;, &#8220;Cricket&#8221;),</strong></em><br />
<em><strong>(&#8220;Facebook, Twitter highlight security steps for users in Ukraine&#8221;, &#8220;Technology&#8221;),</strong></em><br />
<em><strong>(&#8220;Apple launching new series of iphone&#8221;, &#8220;Technology&#8221;),</strong></em><br />
<em><strong>(&#8220;Galaxy S22 preorder sales indicate the phone is already a huge success&#8221;, &#8220;Technology&#8221;),</strong></em><br />
<em><strong>]</strong></em></p></blockquote>
<h3>Setting up the model pipeline</h3>
<blockquote><p><em><strong>from creme import compose</strong></em><br />
<em><strong>from creme import feature_extraction</strong></em><br />
<em><strong>from creme import naive_bayes</strong></em></p>
<p><em><strong>model = compose.Pipeline(</strong></em><br />
<em><strong>    ('tokenize', feature_extraction.TFIDF(lowercase=True)),</strong></em><br />
<em><strong>    ('nb', naive_bayes.MultinomialNB(alpha=1))</strong></em><br />
<em><strong>)</strong></em></p></blockquote>
<p>Here, we are using TF-IDF as the feature-extraction method and Naive Bayes as the ML algorithm.</p>
<p>These are other feature-extraction methods we can try:</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="892" data-permalink="https://turbolab.in/machine-learning-model-retraining-approach-incremental-online-continuous-model-training-using-creme/feature-extraction/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/02/feature-extraction.png?fit=591%2C207&amp;ssl=1" data-orig-size="591,207" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="feature extraction" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/02/feature-extraction.png?fit=300%2C105&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/02/feature-extraction.png?fit=591%2C207&amp;ssl=1" class="aligncenter wp-image-892 size-medium" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/02/feature-extraction-300x105.png?resize=300%2C105&#038;ssl=1" alt="" width="300" height="105" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/02/feature-extraction.png?resize=300%2C105&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/02/feature-extraction.png?resize=480%2C168&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/02/feature-extraction.png?w=591&amp;ssl=1 591w" sizes="(max-width: 300px) 100vw, 300px" /></p>
<h3>Fitting the data to the model, one record at a time</h3>
<blockquote><p><em><strong>%%time</strong></em><br />
<em><strong>for sentence, label in docs:</strong></em><br />
<em><strong>    model = model.fit_one(sentence, label)</strong></em></p>
<p>Wall time: 998 µs</p></blockquote>
<h3>Predictions &#8211; Testing the model</h3>
<blockquote><p><em><strong>model.predict_one(&#8220;Traffic arrangements for Australian cricket team’s visit, Pakistan Day events reviewed&#8221;)</strong></em><br />
Out: &#8216;Cricket&#8217;</p>
<p><em><strong>model.predict_one(&#8220;Launching Facebook Reels Globally and New Ways for Creators to Make Money&#8221;)</strong></em><br />
Out: &#8216;Technology&#8217;</p>
<p><em><strong>test = &#8220;Footballer Koo Ja-cheol to Return to K-League After More Than Decade Abroad&#8221;</strong></em><br />
<em><strong>model.predict_one(test)</strong></em><br />
Out: &#8216;Cricket&#8217;</p></blockquote>
<p>As we can see in the tests above, the last record, a piece of football news, is predicted as Cricket. Both are related to the sports category, but we can simply train our model on a new Football category.</p>
<h3 id="Training-on-a-new-data-and-new-category">Training on new data and a new category</h3>
<blockquote><p><em><strong>newDocs = [&#8220;Footballer took out insurance policy on BMW minutes after smashing into parked cars&#8221;, &#8220;Russian footballer Fedor Smolov, a 32-year-old striker currently playing for his country, became one of the first Russian sportsmen to express his heartbreak at the invasion of Ukraine by his country.&#8221;, &#8220;Ukraine’s international footballer Roman Yaremchuk scored the equalizer for Benfica in a Champions League match&#8221;]</strong></em></p>
<p><em><strong>for doc_ in newDocs:</strong></em><br />
<em><strong>    model.fit_one(doc_, &#8220;Football&#8221;)</strong></em></p></blockquote>
<h3>Retesting the model</h3>
<blockquote><p><em><strong>test = &#8220;Footballer Koo Ja-cheol to Return to K-League After More Than Decade Abroad&#8221;</strong></em><br />
<em><strong>model.predict_one(test)</strong></em><br />
Out: &#8216;Football&#8217;</p></blockquote>
<p>We can update the model with new data for an existing category, or with new data for an entirely new category.</p>
<p>Some benefits of using creme (and online machine learning in general):</p>
<ol>
<li>Incremental: models can update themselves in real-time.</li>
<li>Adaptive: models can adapt to concept drift.</li>
<li>Production-ready: working with data streams makes it simple to replicate production scenarios during model development.</li>
<li>Efficient: models don&#8217;t have to be retrained and require little compute power, which lowers their carbon footprint.</li>
<li>Fast: when the goal is to learn and predict with a single instance at a time, then creme is an order of magnitude faster than PyTorch, Tensorflow, and scikit-learn.</li>
</ol>
<p>The post <a href="https://turbolab.in/machine-learning-model-retraining-approach-incremental-online-continuous-model-training-using-creme/">Incremental/Online/Continuous Model Training using Creme</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/machine-learning-model-retraining-approach-incremental-online-continuous-model-training-using-creme/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">889</post-id>	</item>
		<item>
		<title>Lazy Predict &#8211; Find the best suitable ML model</title>
		<link>https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/</link>
					<comments>https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/#respond</comments>
		
		<dc:creator><![CDATA[Anthony]]></dc:creator>
		<pubDate>Tue, 18 Jan 2022 06:38:11 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[prediction]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regression]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=869</guid>

					<description><![CDATA[<p>As in the earlier blog “text classification using machine learning”, we saw how difficult and time-consuming it is to select the best ML model and tune its parameters to achieve better accuracy. To overcome this problem, we will discuss here an awesome Python library, “Lazy Predict”. This module helps us [&#8230;]</p>
<p>The post <a href="https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/">Lazy Predict &#8211; Find the best suitable ML model</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400">As in the earlier blog “<a href="https://turbolab.in/text-classification-using-machine-learning/">text classification using machine learning</a>”, we saw how difficult and time-consuming it is to select the best ML model and tune its parameters to achieve better accuracy. To overcome this problem, we will discuss here an awesome Python library, “<a href="https://lazypredict.readthedocs.io/en/latest/">Lazy Predict</a>”. This module helps us find the best model for classification and regression based on our data.</span></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400">It provides a LazyClassifier for classification problems and a LazyRegressor for regression problems. </span></p>
<ul>
<li><strong>Note: </strong>Lazy Predict requires significant computational power, and it was a little time-consuming for me to run it on high-dimensional data with multiple features.</li>
</ul>
<p>&nbsp;</p>
<p><b>Let us see how it works:</b></p>
<p><span style="font-weight: 400">First, install this library in your local system:</span></p>
<blockquote><p><i><span style="font-weight: 400">pip install lazypredict</span></i></p></blockquote>
<p>&nbsp;</p>
<p>&nbsp;</p>
<h3><b>Dataset</b></h3>
<p>&nbsp;</p>
<p><span style="font-weight: 400">Here we are not concentrating more on the dataset or its feature extraction and transformation steps, as it has been shown in the previous blog on “</span><a href="https://turbolab.in/text-classification-using-machine-learning/"><span style="font-weight: 400">text classification using machine learning</span></a><span style="font-weight: 400">”. </span></p>
<p><span style="font-weight: 400">To demonstrate Lazy Predict on classification and regression problems, we are using the &#8220;Drug Type&#8221; and &#8220;Wine Quality&#8221; datasets, both taken from kaggle.com.</span></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<h3><b>Code</b></h3>
<p>&nbsp;</p>
<h4><b>Importing required libraries</b></h4>
<p>&nbsp;</p>
<blockquote><p><i><span style="font-weight: 400">import lazypredict</span></i></p>
<p><i><span style="font-weight: 400">import pandas as pd </span></i></p>
<p><i><span style="font-weight: 400">from sklearn.model_selection import train_test_split </span></i></p>
<p><i><span style="font-weight: 400">from lazypredict.Supervised import LazyClassifier, LazyRegressor</span></i></p></blockquote>
<p>&nbsp;</p>
<h4><b>Importing data and LazyClassifier model fitting</b></h4>
<p>&nbsp;</p>
<blockquote><p><i><span style="font-weight: 400">classificationData = pd.read_csv(&#8220;drugType.csv&#8221;)</span></i></p>
<p><i><span style="font-weight: 400">classificationData.head()</span></i></p></blockquote>
<p>&nbsp;</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="870" data-permalink="https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/screenshot-from-2022-01-10-20-37-38/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-37-38.png?fit=406%2C185&amp;ssl=1" data-orig-size="406,185" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot from 2022-01-10 20-37-38" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-37-38.png?fit=300%2C137&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-37-38.png?fit=406%2C185&amp;ssl=1" class="wp-image-870 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-37-38.png?resize=443%2C203&#038;ssl=1" alt="" width="443" height="203" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-37-38.png?resize=300%2C137&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-37-38.png?w=406&amp;ssl=1 406w" sizes="(max-width: 443px) 100vw, 443px" /></p>
<p>&nbsp;</p>
<blockquote><p><i><span style="font-weight: 400">X = classificationData.drop(columns="Drug")</span></i></p>
<p><i><span style="font-weight: 400">y = classificationData["Drug"]</span></i></p>
<p><i><span style="font-weight: 400"># Splitting our data into a train and test set</span></i></p>
<p><i><span style="font-weight: 400">X_train, X_test, y_train, y_test = train_test_split(X, y,</span></i></p>
<p><i><span style="font-weight: 400">                                                    test_size=0.2,</span></i></p>
<p><i><span style="font-weight: 400">                                                    random_state=42)</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400">classifiers = LazyClassifier(ignore_warnings=True, custom_metric=None)</span></i></p>
<p><i><span style="font-weight: 400">models,predictions = classifiers.fit(X_train, X_test, y_train, y_test)</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400">print(models)</span></i></p></blockquote>
<p>&nbsp;</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="871" data-permalink="https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/screenshot-from-2022-01-10-20-53-05/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?fit=767%2C568&amp;ssl=1" data-orig-size="767,568" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot from 2022-01-10 20-53-05" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?fit=300%2C222&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?fit=767%2C568&amp;ssl=1" class="wp-image-871 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?resize=534%2C395&#038;ssl=1" alt="" width="534" height="395" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?resize=300%2C222&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?resize=480%2C355&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?w=767&amp;ssl=1 767w" sizes="(max-width: 534px) 100vw, 534px" /></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400">Here the classifier&#8217;s fit call returns two values: a table of model names with their accuracy metrics, and each model&#8217;s predictions. </span></p>
<p>&nbsp;</p>
<h4><b>Importing data and LazyRegressor model fitting</b></h4>
<p>&nbsp;</p>
<blockquote><p><i><span style="font-weight: 400">regressionData = pd.read_csv(&#8220;winequality.csv&#8221;)</span></i></p>
<p><i><span style="font-weight: 400">regressionData.head()</span></i></p></blockquote>
<p>&nbsp;</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="875" data-permalink="https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/screenshot-from-2022-01-11-17-38-16/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?fit=1063%2C188&amp;ssl=1" data-orig-size="1063,188" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot from 2022-01-11 17-38-16" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?fit=300%2C53&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?fit=800%2C141&amp;ssl=1" class=" wp-image-875 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?resize=663%2C117&#038;ssl=1" alt="" width="663" height="117" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?resize=300%2C53&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?resize=768%2C136&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?resize=1024%2C181&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?resize=980%2C173&amp;ssl=1 980w, 
https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?resize=480%2C85&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?w=1063&amp;ssl=1 1063w" sizes="(max-width: 663px) 100vw, 663px" /></p>
<p>&nbsp;</p>
<blockquote><p><i><span style="font-weight: 400">X = regressionData.drop(columns="quality")</span></i></p>
<p><i><span style="font-weight: 400">y = regressionData["quality"]</span></i></p>
<p><i><span style="font-weight: 400"># Splitting our data into a train and test set</span></i></p>
<p><i><span style="font-weight: 400">X_train, X_test, y_train, y_test = train_test_split(X, y,</span></i></p>
<p><i><span style="font-weight: 400">                                                    test_size=0.2, random_state = 42)</span></i></p>
<p><i><span style="font-weight: 400">regressors = LazyRegressor(ignore_warnings=True, custom_metric=None)</span></i></p>
<p><i><span style="font-weight: 400">models, predictions = regressors.fit(X_train, X_test, y_train, y_test)</span></i></p>
<p><i><span style="font-weight: 400">print(models)</span></i></p></blockquote>
<p>&nbsp;</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="882" data-permalink="https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/screenshot-from-2022-01-11-17-43-05-2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?fit=629%2C696&amp;ssl=1" data-orig-size="629,696" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot from 2022-01-11 17-43-05" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?fit=271%2C300&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?fit=629%2C696&amp;ssl=1" class=" wp-image-882 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?resize=515%2C570&#038;ssl=1" alt="" width="515" height="570" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?resize=271%2C300&amp;ssl=1 271w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?resize=480%2C531&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?w=629&amp;ssl=1 629w" sizes="(max-width: 515px) 100vw, 515px" /></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<h3><b>Conclusion</b></h3>
<p>&nbsp;</p>
<p><span style="font-weight: 400">Here, when we use the &#8220;Lazy Predict&#8221; library, many different models are fitted on our data, and the results provide accuracy metrics for each of them. Observing the results, we can then select the top 5 base models with the best accuracy. </span></p>
<p><span style="font-weight: 400">Later we can tune the parameters of those top models to get better accuracy. </span></p>
<p><span style="font-weight: 400">As this library runs many different models at once, it takes a lot of computational power. If you have limited computational power, I would suggest using Google Colab.</span></p>
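<p>Once Lazy Predict has surfaced the strongest candidates, those models can be tuned with a standard grid search. A minimal sketch using scikit-learn (the synthetic data and the parameter grid are illustrative; substitute the model Lazy Predict ranked highest and your own data):</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Cross-validated search over a small, illustrative parameter grid.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```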
<p>&nbsp;</p>
<p>The post <a href="https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/">Lazy Predict &#8211; Find the best suitable ML model</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">869</post-id>	</item>
		<item>
		<title>Text Classification with Keras and GloVe Word Embeddings</title>
		<link>https://turbolab.in/text-classification-with-keras-and-glove-word-embeddings/</link>
					<comments>https://turbolab.in/text-classification-with-keras-and-glove-word-embeddings/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Fri, 31 Dec 2021 13:25:16 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[deep learning]]></category>
		<category><![CDATA[keras]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[text classification]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=603</guid>

					<description><![CDATA[<p>Deep Learning (DL) is a subset of Machine Learning. It is a method of statistical learning that extracts features or attributes from raw data. DL uses a network of algorithms called artificial neural networks, which imitate the function of the neural networks present in the human brain. DL takes the data into a network of layers (Input, Hidden [&#8230;]</p>
<p>The post <a href="https://turbolab.in/text-classification-with-keras-and-glove-word-embeddings/">Text Classification with Keras and GloVe Word Embeddings</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p class="graf graf--p"><strong class="markup--strong markup--p-strong">Deep Learning (DL)</strong> is a subset of <strong class="markup--strong markup--p-strong">Machine Learning</strong>. It is a method of statistical learning that extracts features or attributes from raw data. <strong class="markup--strong markup--p-strong">DL</strong> uses a network of algorithms called artificial neural networks, which imitate the function of the neural networks present in the human brain. <strong class="markup--strong markup--p-strong">DL</strong> passes the data through a network of layers (Input, Hidden &amp; Output) to extract features and to learn from the data.</p>
<p>In this blog, we will learn how to train a supervised text classification model using the DL python module called <em><strong>Keras</strong></em> and pre-trained <em><strong>GloVe</strong></em> word embeddings to transform the text data into a machine-understandable numerical representation. We will be using <strong>Convolutional Neural Networks</strong>(CNN) architecture to train the classification model.</p>
<p>The dataset and its category labels are discussed in the <a href="https://turbolab.in/text-classification-using-machine-learning/"><strong>Text Classification using Machine Learning</strong></a> blog. Please refer to that blog; we will be using the same dataset here to train our CNN model to predict the category of a given text.</p>
<p>Assume the dataset is loaded as a pandas dataframe called <strong>df</strong> in the code snippets below.</p>
<figure id="attachment_684" aria-describedby="caption-attachment-684" style="width: 512px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="684" data-permalink="https://turbolab.in/text-classification-using-machine-learning/datasample/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?fit=512%2C442&amp;ssl=1" data-orig-size="512,442" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1634309405&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="datasample" data-image-description="" data-image-caption="&lt;p&gt;dataset&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?fit=300%2C259&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?fit=512%2C442&amp;ssl=1" class="size-full wp-image-684" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?resize=512%2C442&#038;ssl=1" alt="dataset" width="512" height="442" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?w=512&amp;ssl=1 512w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?resize=300%2C259&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?resize=480%2C414&amp;ssl=1 480w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption id="caption-attachment-684" class="wp-caption-text">dataset</figcaption></figure>
<h2>Cleaning the dataset</h2>
<p>The data cleaning part is also discussed in the <a href="https://turbolab.in/text-classification-using-machine-learning/"><strong>blog</strong></a>.</p>
<figure id="attachment_849" aria-describedby="caption-attachment-849" style="width: 752px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="849" data-permalink="https://turbolab.in/text-classification-with-keras-and-glove-word-embeddings/dl_cleaning-2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/DL_cleaning-1.jpg?fit=752%2C332&amp;ssl=1" data-orig-size="752,332" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1640861599&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="DL_cleaning" data-image-description="" data-image-caption="&lt;p&gt;Cleaned Dataset&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/DL_cleaning-1.jpg?fit=300%2C132&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/DL_cleaning-1.jpg?fit=752%2C332&amp;ssl=1" class="size-full wp-image-849" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/DL_cleaning-1.jpg?resize=752%2C332&#038;ssl=1" alt="" width="752" height="332" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/DL_cleaning-1.jpg?w=752&amp;ssl=1 752w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/DL_cleaning-1.jpg?resize=300%2C132&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/DL_cleaning-1.jpg?resize=480%2C212&amp;ssl=1 480w" sizes="(max-width: 752px) 100vw, 752px" /><figcaption id="caption-attachment-849" class="wp-caption-text">Cleaned Dataset</figcaption></figure>
<p>We used stemming and stopword removal on the dataset content there. We could replace stemming with lemmatization; check out this blog on <a href="https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/"><strong>stemming vs lemmatization</strong></a> for the differences. For this DL blog, however, we skip stemming, because it can produce meaningless tokens that have no GloVe embedding vector.</p>
<p>As a part of data preparation, we are going to perform these operations on the dataset <strong>df</strong></p>
<ol>
<li>Lowercasing the content, because the GloVe embedding vectors are generated for lowercase words.</li>
<li>Stripping whitespace and keeping only alphanumerics and spaces, i.e., removing special characters.</li>
<li>Dropping the null and empty rows from the df.</li>
<li>Dropping the duplicates from the df.</li>
</ol>
<blockquote><p><em><strong>df = df[['content', 'label']]</strong></em><br />
<em><strong>df = df.dropna()</strong></em><br />
<em><strong>df = df.astype('str').applymap(str.lower)</strong></em><br />
<em><strong>df = df.applymap(str.strip).replace(r"[^a-z0-9 ]+", "", regex=True)</strong></em><br />
<em><strong>df = df.drop_duplicates()</strong></em></p></blockquote>
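<p>The cleaning steps can be sketched end-to-end on a toy DataFrame (the rows and labels below are invented for illustration). Two details matter in recent pandas: <strong>dropna()</strong> should run before casting to <strong>str</strong>, otherwise <strong>None</strong> becomes the literal string 'none' and survives the null check, and <strong>replace()</strong> needs <strong>regex=True</strong> to act as a regular-expression substitution rather than a whole-cell match.</p>

```python
import pandas as pd

# Toy stand-in for the blog's df; the rows and labels are invented.
df = pd.DataFrame({
    "content": ["  Hello, World!  ", "Keras & GloVe", "Keras & GloVe", None],
    "label": ["greet", "tech", "tech", "tech"],
})

df = df[["content", "label"]]
df = df.dropna()                     # drop null rows before casting to str
df = df.astype("str").applymap(str.lower)
df = df.applymap(str.strip).replace(r"[^a-z0-9 ]+", "", regex=True)
df = df.drop_duplicates().reset_index(drop=True)

print(df["content"].tolist())        # ['hello world', 'keras  glove']
```

<p>Note the double space left behind by the removed ampersand: some pipelines also collapse repeated whitespace after this step.</p>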
<h2>Loading the Glove Embeddings</h2>
<p><a href="https://nlp.stanford.edu/projects/glove/"><strong>GloVe</strong></a> is an unsupervised learning algorithm for obtaining vector representations for words. The models were trained on Wikipedia, Twitter, and Common Crawl data, yielding pre-trained word vectors that differ in download size, token count, and vocabulary size. For this blog, we will use the <strong>glove.6B.100d.txt</strong> pre-trained word vectors.</p>
<figure id="attachment_852" aria-describedby="caption-attachment-852" style="width: 500px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="852" data-permalink="https://turbolab.in/text-classification-with-keras-and-glove-word-embeddings/glove/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?fit=1315%2C942&amp;ssl=1" data-orig-size="1315,942" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1640944657&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="glove" data-image-description="" data-image-caption="&lt;p&gt;Glove Embeddings&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?fit=300%2C215&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?fit=800%2C573&amp;ssl=1" class="wp-image-852" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?resize=500%2C358&#038;ssl=1" alt="" width="500" height="358" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?resize=300%2C215&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?resize=768%2C550&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?resize=1024%2C734&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?resize=1080%2C774&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?resize=1280%2C917&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?resize=980%2C702&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?resize=480%2C344&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?w=1315&amp;ssl=1 1315w" sizes="(max-width: 500px) 100vw, 500px" /><figcaption id="caption-attachment-852" class="wp-caption-text">Glove Embeddings</figcaption></figure>
<p>In the above image, we can see that the words <strong>that</strong>, <strong>on</strong>, <strong>is</strong>, and <strong>was</strong> are each represented by a vector of coefficients.</p>
<blockquote><p><em><strong>def loading_embeddings():</strong></em><br />
<em><strong>    """ loading glove embeddings """</strong></em><br />
<em><strong>    embeddings_index = {}</strong></em><br />
<em><strong>    f = open(glove_path + 'glove.6B.100d.txt', encoding="utf8") # loading the file</strong></em><br />
<em><strong>    for line in f:</strong></em><br />
<em><strong>        values = line.split()</strong></em><br />
<em><strong>        word = values[0]</strong></em><br />
<em><strong>        coefs = np.asarray(values[1:], dtype='float32')</strong></em><br />
<em><strong>        embeddings_index[word] = coefs</strong></em><br />
<em><strong>    f.close()</strong></em><br />
<em><strong>    return embeddings_index</strong></em></p></blockquote>
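<p>Each line of <strong>glove.6B.100d.txt</strong> is a word followed by its vector coefficients, separated by spaces. As a minimal sketch, the same parsing logic can be run on two invented lines in that format (real vectors have 100 floats; these use 4 for brevity), followed by a cosine-similarity check between the two parsed vectors:</p>

```python
import numpy as np

# Two invented lines in the GloVe file format: word, then its coefficients.
sample = "king 0.1 0.3 0.5 0.7\nqueen 0.1 0.2 0.5 0.8"

embeddings_index = {}
for line in sample.splitlines():
    values = line.split()
    embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine(embeddings_index["king"], embeddings_index["queen"])
print(round(sim, 3))  # 0.99 -- the invented vectors point in similar directions
```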
<h2>Preparing the Embedding Matrix</h2>
<div>
<blockquote>
<div><em><strong><span class="pl-v">MAX_NB_WORDS</span> <span class="pl-c1">=</span> <span class="pl-c1">100000</span></strong></em></div>
<div></div>
<div><em><strong>def prepare_embedding_matrix(word_index):</strong></em></div>
<div><em><strong>    &#8220;&#8221;&#8221; preparing embedding matrix with our data set &#8220;&#8221;&#8221;</strong></em></div>
<div></div>
<div><em><strong>    embeddings_index = loading_embeddings()</strong></em></div>
<div><em><strong>    num_words = min(MAX_NB_WORDS, len(word_index))</strong></em></div>
<div><em><strong>    embedding_matrix = np.zeros((num_words + 1, EMBEDDING_DIM))</strong></em></div>
<div></div>
<div><em><strong>    for word, i in word_index.items():</strong></em></div>
<div><em><strong>        if i &gt;= MAX_NB_WORDS:</strong></em></div>
<div><em><strong>            continue</strong></em></div>
<div><em><strong>        embedding_vector = embeddings_index.get(word)</strong></em></div>
<div><em><strong>        if embedding_vector is not None:</strong></em></div>
<div><em><strong>            # </strong>words not found in embedding index will be all-zeros.</em></div>
<div><em><strong>            embedding_matrix[i] = embedding_vector</strong></em></div>
<div><em><strong>    return embedding_matrix, num_words</strong></em></div>
</blockquote>
<div><em><strong>MAX_NB_WORDS </strong></em>is the maximum number of words the tokenizer considers as features.</div>
<div><em><strong>word_index </strong></em>is the tokenizer&#8217;s dictionary of unique words, built by fitting the tokenizer on our dataset content.</div>
</div>
<div></div>
<div>The minimum of these two values is <em><strong>num_words</strong></em>; we pass <em><strong>num_words + 1</strong></em> as the <em><strong>input_dim</strong></em> of the Keras Embedding layer, since word indices start at 1 and index 0 is used for padding.</div>
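<p>A minimal sketch of the matrix preparation with a toy <strong>word_index</strong> and invented vectors (here <strong>EMBEDDING_DIM</strong> is 4 instead of 100). The Keras tokenizer numbers words from 1, so the matrix gets <strong>num_words + 1</strong> rows and row 0 is left as zeros for the padding index; rows for words with no GloVe vector also stay all-zero:</p>

```python
import numpy as np

MAX_NB_WORDS = 100000
EMBEDDING_DIM = 4                     # 100 for glove.6B.100d.txt

# Toy inputs -- the words and vectors below are invented for illustration.
word_index = {"the": 1, "cat": 2, "xyzzy": 3}      # "xyzzy" has no vector
embeddings_index = {
    "the": np.ones(4, dtype="float32"),
    "cat": np.full(4, 2.0, dtype="float32"),
}

num_words = min(MAX_NB_WORDS, len(word_index))
embedding_matrix = np.zeros((num_words + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NB_WORDS:
        continue
    vec = embeddings_index.get(word)
    if vec is not None:
        embedding_matrix[i] = vec      # out-of-vocabulary rows stay all-zero

print(embedding_matrix[3])  # [0. 0. 0. 0.] -- no vector found for "xyzzy"
```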
<div></div>
<h2>Preparing the dataset for the model to train</h2>
<div>
<blockquote>
<div><em><strong><span class="pl-v">MAX_SEQUENCE_LENGTH</span> <span class="pl-c1">=</span> <span class="pl-c1">1000</span></strong></em></div>
<div><em><strong><span class="pl-v">VALIDATION_SPLIT</span> <span class="pl-c1">=</span> <span class="pl-c1">0.1</span></strong></em></div>
<div></div>
<div><em><strong>def vectorizing_data(df):</strong></em></div>
<div><em><strong>    &#8220;&#8221;&#8221; vectorizing and splitting the data for training, testing, validating &#8220;&#8221;&#8221;</strong></em></div>
<div><em><strong>    label_s = df[&#8216;label&#8217;].tolist()</strong></em></div>
<div><em><strong>    l = list(set(label_s))</strong></em></div>
<div><em><strong>    l.sort()</strong></em></div>
<div><em><strong>    labels_index = dict([(j,i) for i, j in enumerate(l)])</strong></em></div>
<div><em><strong>    labels = [labels_index[i] for i in label_s]</strong></em></div>
<div><em><strong>    print(&#8216;Found %s texts.&#8217; % len(df[&#8216;content&#8217;]))</strong></em></div>
<div><em><strong>    print(&#8216;labels_index &#8212; &#8216;, labels_index)</strong></em></div>
<div></div>
<div><em><strong>    tokenizer = Tokenizer(num_words=MAX_NB_WORDS)</strong></em></div>
<div><em><strong>    tokenizer.fit_on_texts(df[&#8216;content&#8217;])</strong></em></div>
<div><em><strong>    sequences = tokenizer.texts_to_sequences(df[&#8216;content&#8217;])</strong></em></div>
<div><em><strong>    word_index = tokenizer.word_index</strong></em></div>
<div><em><strong>    print(&#8216;Found %s unique tokens.&#8217; % len(word_index))</strong></em></div>
<div></div>
<div><em><strong>    df = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)</strong></em></div>
<div><em><strong>    labels = to_categorical(np.asarray(labels))</strong></em></div>
<div></div>
<div><em><strong>    # randomizing and splitting the df into a training set, test set and a validation set</strong></em></div>
<div><em><strong>    indices = np.arange(df.shape[0])</strong></em></div>
<div><em><strong>    np.random.shuffle(indices)</strong></em></div>
<div><em><strong>    df = df[indices]</strong></em></div>
<div><em><strong>    labels = labels[indices]</strong></em></div>
<div><em><strong>    num_validation_samples = int(VALIDATION_SPLIT * df.shape[0])</strong></em></div>
<div><em><strong>    x_val = df[-num_validation_samples:]</strong></em></div>
<div><em><strong>    y_val = labels[-num_validation_samples:]</strong></em></div>
<div><em><strong>    # carve the test set out of a separate slice so it stays disjoint from the training data</strong></em></div>
<div><em><strong>    x_test = df[-2 * num_validation_samples:-num_validation_samples]</strong></em></div>
<div><em><strong>    y_test = labels[-2 * num_validation_samples:-num_validation_samples]</strong></em></div>
<div><em><strong>    x_train = df[:-2 * num_validation_samples]</strong></em></div>
<div><em><strong>    y_train = labels[:-2 * num_validation_samples]</strong></em></div>
<div><em><strong>    return x_train, y_train, x_test, y_test, x_val, y_val, word_index</strong></em></div>
</blockquote>
<div>We split the dataset into training, test, and validation sets, create a tokenizer, and generate the word_index by fitting it on the dataset content. The sequences are padded to a maximum length of <em><strong>MAX_SEQUENCE_LENGTH</strong></em>.</div>
</div>
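<p>With default arguments, Keras' <strong>pad_sequences</strong> left-pads shorter sequences with zeros and truncates longer ones from the front ('pre' padding and truncation). A minimal pure-NumPy stand-in, written here as a sketch rather than the real Keras implementation, makes that behaviour concrete:</p>

```python
import numpy as np

def pad_sequences_sketch(sequences, maxlen):
    """Minimal stand-in for keras pad_sequences defaults: left-pad with 0
    and keep only the last `maxlen` tokens of over-long sequences."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]                       # 'pre' truncation
        out[row, maxlen - len(trimmed):] = trimmed    # 'pre' padding
    return out

print(pad_sequences_sketch([[5, 2], [7, 1, 9, 4, 3]], maxlen=4))
# [[0 0 5 2]
#  [1 9 4 3]]
```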
<h2>Model construction</h2>
<div>
<blockquote>
<div><em><strong><span class="pl-v">EMBEDDING_DIM</span> <span class="pl-c1">=</span> <span class="pl-c1">100</span></strong></em></div>
<div></div>
<div><em><strong>def model_generation(embedding_matrix, num_words):</strong></em></div>
<div><em><strong>    &#8220;&#8221;&#8221; model generation &#8220;&#8221;&#8221;</strong></em></div>
<div><em><strong>    embedding_layer = Embedding(num_words + 1,</strong></em></div>
<div><em><strong>                                EMBEDDING_DIM,</strong></em></div>
<div><em><strong>                                weights=[embedding_matrix],</strong></em></div>
<div><em><strong>                                input_length=MAX_SEQUENCE_LENGTH,</strong></em></div>
<div><em><strong>                                trainable=False)</strong></em></div>
<div><em><strong>    convs = []</strong></em></div>
<div><em><strong>    filter_sizes = [3,4,5]</strong></em></div>
<div><em><strong>    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype=&#8217;int32&#8242;)</strong></em></div>
<div><em><strong>    embedded_sequences = embedding_layer(sequence_input)</strong></em></div>
<div><em><strong>    for fsz in filter_sizes:</strong></em></div>
<div><em><strong>        l_conv = Conv1D(filters=128, kernel_size=fsz, activation=&#8217;relu&#8217;)(embedded_sequences)</strong></em></div>
<div><em><strong>        l_pool = MaxPooling1D(5)(l_conv)</strong></em></div>
<div><em><strong>        convs.append(l_pool)</strong></em></div>
<div><em><strong>    l_merge = Concatenate(axis=1)(convs)</strong></em></div>
<div><em><strong>    l_cov1= Conv1D(filters=128, kernel_size=5, activation=&#8217;relu&#8217;)(l_merge)</strong></em></div>
<div><em><strong>    l_cov1 = Dropout(0.2)(l_cov1)</strong></em></div>
<div><em><strong>    l_pool1 = MaxPooling1D(5)(l_cov1)</strong></em></div>
<div><em><strong>    l_cov2 = Conv1D(filters=128, kernel_size=5, activation=&#8217;relu&#8217;)(l_pool1)</strong></em></div>
<div><em><strong>    l_cov2 = Dropout(0.2)(l_cov2)</strong></em></div>
<div><em><strong>    l_pool2 = MaxPooling1D(30)(l_cov2)</strong></em></div>
<div><em><strong>    l_flat = Flatten()(l_pool2)</strong></em></div>
<div><em><strong>    l_dense = Dense(128, activation=&#8217;relu&#8217;)(l_flat)</strong></em></div>
<div><em><strong>    preds = Dense(label_count, activation=&#8217;softmax&#8217;)(l_dense)</strong></em></div>
<div><em><strong>    model = Model(sequence_input, preds)</strong></em></div>
<div><em><strong>    return model</strong></em></div>
</blockquote>
<div>The model summary looks like this</div>
<div></div>
</div>
<div>
<div>
<figure id="attachment_856" aria-describedby="caption-attachment-856" style="width: 819px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="856" data-permalink="https://turbolab.in/text-classification-with-keras-and-glove-word-embeddings/model_summary/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/model_summary.jpg?fit=819%2C815&amp;ssl=1" data-orig-size="819,815" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1640970904&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="model_summary" data-image-description="" data-image-caption="&lt;p&gt;Model Summary&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/model_summary.jpg?fit=300%2C300&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/model_summary.jpg?fit=800%2C796&amp;ssl=1" class="size-full wp-image-856" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/model_summary.jpg?resize=800%2C796&#038;ssl=1" alt="" width="800" height="796" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/model_summary.jpg?w=819&amp;ssl=1 819w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/model_summary.jpg?resize=150%2C150&amp;ssl=1 150w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/model_summary.jpg?resize=300%2C300&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/model_summary.jpg?resize=768%2C764&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/model_summary.jpg?resize=480%2C478&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /><figcaption id="caption-attachment-856" class="wp-caption-text">Model Summary</figcaption></figure>
</div>
<p>The model is represented by the embedding layer followed by convolutional layers, pooling layers, and dropout layers. The final layer is the dense layer with the output size of labels/category count.</p>
<p class="graf graf--p"><strong class="markup--strong markup--p-strong">Dropout</strong> randomly drops neurons during training to prevent over-fitting in neural networks. It is a regularization approach that reduces interdependent learning among the neurons. In machine learning, regularization prevents over-fitting by adding a penalty to the loss function.</p>
<p class="graf graf--p"><strong class="markup--strong markup--p-strong">Batch normalization</strong> is another method to regularize a convolutional network.</p>
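<p>The dropout described above is usually implemented as "inverted dropout", the formulation Keras applies at training time. The sketch below illustrates the idea in NumPy: each unit is zeroed with probability <strong>rate</strong>, and the survivors are scaled by <strong>1 / (1 - rate)</strong> so the expected activation stays unchanged and no rescaling is needed at inference:</p>

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed so the example is repeatable

def dropout_sketch(x, rate):
    """Inverted dropout: zero units with probability `rate`,
    scale the survivors by 1/(1 - rate)."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

x = np.ones((4, 5))
y = dropout_sketch(x, rate=0.2)
# Every surviving entry is exactly 1/(1 - 0.2) = 1.25; dropped entries are 0.
```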
<h2>f1-score, precision, and recall</h2>
<div>
<div>
<blockquote>
<div><em><strong>def recall_m(y_true, y_pred):</strong></em></div>
<div><em><strong>    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))</strong></em></div>
<div><em><strong>    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))</strong></em></div>
<div><em><strong>    recall = true_positives / (possible_positives + K.epsilon())</strong></em></div>
<div><em><strong>    return recall</strong></em></div>
<div><em><strong>def precision_m(y_true, y_pred):</strong></em></div>
<div><em><strong>    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))</strong></em></div>
<div><em><strong>    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))</strong></em></div>
<div><em><strong>    precision = true_positives / (predicted_positives + K.epsilon())</strong></em></div>
<div><em><strong>    return precision</strong></em></div>
<div><em><strong>def f1_m(y_true, y_pred):</strong></em></div>
<div><em><strong>    precision = precision_m(y_true, y_pred)</strong></em></div>
<div><em><strong>    recall = recall_m(y_true, y_pred)</strong></em></div>
<div><em><strong>    return 2*((precision*recall)/(precision+recall+K.epsilon()))</strong></em></div>
</blockquote>
<div>These metrics evaluate the trained model; the f1-score is the harmonic mean of precision and recall.</div>
<div></div>
</div>
<div>Precision is calculated as the number of true positives divided by the total number of true positives and false positives.</div>
</div>
</div>
<div></div>
<div>Recall is calculated as the number of true positives divided by the total number of true positives and false negatives.</div>
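<p>With invented counts for a single class, the precision, recall, and f1 definitions above work out as follows:</p>

```python
# Invented counts for one class, for illustration only.
tp, fp, fn = 8, 2, 4           # true positives, false positives, false negatives

precision = tp / (tp + fp)     # 8 / 10 = 0.8
recall = tp / (tp + fn)        # 8 / 12 = 0.666...
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```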
<h2>Model training and evaluation</h2>
<div>
<blockquote>
<div><em><strong>def training_evaluating_model(model, x_train, y_train, x_test, y_test, x_val, y_val):</strong></em></div>
<div><em><strong>    &#8220;&#8221;&#8221; training the model with the train and validation data</strong></em></div>
<div><em><strong>    and evaluating the model with the test data &#8220;&#8221;&#8221;</strong></em></div>
<div><em><strong>    model.compile(loss=&#8217;categorical_crossentropy&#8217;,</strong></em></div>
<div><em><strong>                  optimizer=&#8217;rmsprop&#8217;,</strong></em></div>
<div><em><strong>                  metrics=[&#8216;acc&#8217;, f1_m, precision_m, recall_m])</strong></em></div>
<div><em><strong>    # Displays the network structure</strong></em></div>
<div><em><strong>    model.summary()</strong></em></div>
<div><em><strong>    # fitting the model</strong></em></div>
<div><em><strong>    model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=epochs, batch_size=batch_size)</strong></em></div>
<div><em><strong>    &#8220;&#8221;&#8221;</strong></em></div>
<div><em><strong>    model.save_weights(home_path + &#8216;model_trained&#8217;) # Saving the model</strong></em></div>
<div><em><strong>    &#8220;&#8221;&#8221;</strong></em></div>
<div><em><strong>    # evaluating the model</strong></em></div>
<div><em><strong>    loss, accuracy, f1_score, precision, recall = model.evaluate(x_test, y_test, verbose=0)</strong></em></div>
<div><em><strong>    return loss, accuracy, f1_score, precision, recall</strong></em></div>
</blockquote>
<div>
<figure id="attachment_859" aria-describedby="caption-attachment-859" style="width: 1598px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="859" data-permalink="https://turbolab.in/text-classification-with-keras-and-glove-word-embeddings/evaluation/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?fit=1598%2C607&amp;ssl=1" data-orig-size="1598,607" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1640972511&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="evaluation" data-image-description="" data-image-caption="&lt;p&gt;Model Evaluation&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?fit=300%2C114&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?fit=800%2C304&amp;ssl=1" class="size-full wp-image-859" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?resize=800%2C304&#038;ssl=1" alt="" width="800" height="304" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?w=1598&amp;ssl=1 1598w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?resize=300%2C114&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?resize=768%2C292&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?resize=1024%2C389&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?resize=1080%2C410&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?resize=1280%2C486&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?resize=980%2C372&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?resize=480%2C182&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /><figcaption id="caption-attachment-859" class="wp-caption-text">Model Evaluation</figcaption></figure>
<p>The model&#8217;s training accuracy is around 99.33% and its validation accuracy around 90.8%. The validation loss is higher than the training loss, which accounts for the lower validation accuracy. We trained on a sample of only 10,000 rows; training on the complete dataset with more epochs would likely give much better results.</p>
<p class="graf graf--p">The complete code discussed above can be found <em><strong><a class="markup--anchor markup--p-anchor" href="https://github.com/Vasistareddy/text_classification_DL_vs_ML/blob/master/model_training_tutorial_dl.py" target="_blank" rel="noopener">here</a></strong></em>.</p>
</div>
</div>
<p>The post <a href="https://turbolab.in/text-classification-with-keras-and-glove-word-embeddings/">Text Classification with Keras and GloVe Word Embeddings</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/text-classification-with-keras-and-glove-word-embeddings/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">603</post-id>	</item>
		<item>
		<title>How to monitor work-flow of scraping project with Apache-Airflow</title>
		<link>https://turbolab.in/how-to-monitor-work-flow-of-scraping-project-with-apache-airflow/</link>
					<comments>https://turbolab.in/how-to-monitor-work-flow-of-scraping-project-with-apache-airflow/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Wed, 22 Dec 2021 08:16:05 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[airflow]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[monitor]]></category>
		<category><![CDATA[scraping]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=823</guid>

					<description><![CDATA[<p>Apache Airflow is a platform to programmatically monitor workflows, schedule, and authorize projects. In this blog, we will discuss handling the workflow of scraping yelp.com with Apache Airflow. Quick setup of Airflow on ubuntu 20.04 LTS # make sure your system is up-to-date sudo apt update sudo apt upgrade # install airflow dependencies  sudo apt-get install libmysqlclient-dev [&#8230;]</p>
<p>The post <a href="https://turbolab.in/how-to-monitor-work-flow-of-scraping-project-with-apache-airflow/">How to monitor work-flow of scraping project with Apache-Airflow</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p class="graf graf--p">Apache Airflow is a platform to programmatically monitor workflows, schedule, and authorize projects.</p>
<p class="graf graf--p">In this blog, we will discuss handling the workflow of scraping <strong><a class="markup--anchor markup--p-anchor" href="https://www.yelp.com/" target="_blank" rel="noopener">yelp.com</a></strong> with Apache Airflow.</p>
<h2 class="graf graf--h3">Quick setup of Airflow on ubuntu 20.04 LTS</h2>
<p># make sure your system is up-to-date</p>
<blockquote>
<pre class=" prettyprinted"><span class="pln">sudo apt update
sudo apt upgrade</span></pre>
</blockquote>
<p><em># install airflow dependencies </em></p>
<blockquote>
<pre class=" prettyprinted"><span class="pln">sudo apt-get install libmysqlclient-dev
sudo apt-get install libssl-dev
sudo apt-get install libkrb5-dev</span></pre>
</blockquote>
<p class="graf graf--h3"><em># create the virtual env and install the airflow using pip</em></p>
<blockquote>
<pre class=" prettyprinted"><span class="pln">sudo apt install python3</span><span class="pun">-</span><span class="pln">virtualenv
virtualenv airflow_test
cd airflow_test</span><span class="pun">/</span><span class="pln">
source bin/activate
</span><span class="kwd">export</span><span class="pln"> AIRFLOW_HOME</span><span class="pun">=~/</span><span class="pln">airflow # set Airflow home
pip3 install apache</span><span class="pun">-</span><span class="pln">airflow
pip3 install typing_extensions
airflow db init # initialize the db</span></pre>
</blockquote>
<p class="graf graf--p">The <strong class="markup--strong markup--p-strong">db, unittests, logs, and configuration (cfg)</strong> files will be generated inside <strong class="markup--strong markup--p-strong">AIRFLOW_HOME</strong>.</p>
<p class="graf graf--h4"># <em>Start a WebServer &amp; Scheduler</em></p>
<blockquote>
<pre class="graf graf--pre"><em>airflow webserver -p 8080 # start the webserver</em></pre>
<pre class="graf graf--pre"><em>airflow scheduler # start the scheduler
</em></pre>
</blockquote>
<p class="graf graf--p">By default the webserver listens on localhost. If you wish to bind a different host, run the command like this:</p>
<blockquote>
<p>airflow webserver -H xxx.xxx.xxx.xxx -p 9005</p>
</blockquote>
<p class="graf graf--p">Check the quick installation guide <a href="https://airflow.apache.org/docs/apache-airflow/1.10.12/start.html"><strong>here</strong></a>.</p>
<p class="graf graf--p">If everything goes well, we can see the apache airflow web interface</p>
<pre class="graf graf--pre"><em><strong><a class="markup--anchor markup--pre-anchor" href="http://localhost:8080/admin/" target="_blank" rel="nofollow noopener">http://localhost:8080/admin/</a> # web-server</strong></em></pre>
<figure class="graf graf--figure graf--layoutOutsetCenter">
<figure style="width: 1500px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2AOgwTYIIY1G4QghIE9TPl_Q.png?resize=800%2C411&#038;ssl=1" alt="" width="800" height="411" /><figcaption class="wp-caption-text">Airflow WebServer</figcaption></figure>
<figcaption class="imageCaption"></figcaption>
</figure>
<p class="graf graf--p">Everything in Airflow runs as <strong class="markup--strong markup--p-strong">DAGs</strong> (Directed Acyclic Graphs). We create a DAG with a unique dag_id and nest tasks under it. Simply put, a DAG is the collection of tasks we want to run. Parameters like <strong>schedule_interval</strong>, <strong>start_date</strong>, <strong>owner</strong>, and others can also be passed to the DAG object.</p>
<p class="graf graf--p">Create a folder named <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">dags</strong></code> inside <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">AIRFLOW_HOME</strong></code>; the Scheduler checks for new DAGs every 300 seconds, and any new DAGs it finds show up in the web-server UI.</p>
<figure id="attachment_832" aria-describedby="caption-attachment-832" style="width: 1077px" class="wp-caption alignnone"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="832" data-permalink="https://turbolab.in/how-to-monitor-work-flow-of-scraping-project-with-apache-airflow/screenshot-from-2021-12-22-14-12-05/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?fit=1077%2C194&amp;ssl=1" data-orig-size="1077,194" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot from 2021-12-22 14-12-05" data-image-description="" data-image-caption="&lt;p&gt;Airflow Scheduler&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?fit=300%2C54&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?fit=800%2C144&amp;ssl=1" class="size-full wp-image-832" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?resize=800%2C144&#038;ssl=1" alt="" width="800" height="144" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?w=1077&amp;ssl=1 1077w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?resize=300%2C54&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?resize=768%2C138&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?resize=1024%2C184&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?resize=980%2C177&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?resize=480%2C86&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /><figcaption id="caption-attachment-832" class="wp-caption-text">Airflow Scheduler</figcaption></figure>
<figure class="graf graf--figure">
<figcaption class="imageCaption"></figcaption>
</figure>
<p class="graf graf--p">We are going to create a workflow to scrape <strong class="markup--strong markup--p-strong">yelp.com</strong> for business listings &amp; save the data to MongoDB.</p>
<p class="graf graf--p">The code used in this tutorial to scrape <strong class="markup--strong markup--p-strong">yelp.com</strong> can be found <a href="https://gist.githubusercontent.com/Vasistareddy/26a37b841e93756ab3256022e6daa09d/raw/a75b6b277ed64c953e09094e60e5f18d1789573a/yelp_search.py"><em><strong>here</strong></em></a>.</p>
<h2 class="graf graf--h3">Creation of DAG</h2>
<blockquote>
<pre class="graf graf--pre"><em>from airflow import DAG
from datetime import datetime</em></pre>
<pre class="graf graf--pre"><em># dag creation
default_args = {'owner': 'turbolab', 'start_date': datetime(2019, 1, 1), 'depends_on_past': False}
_yelp_workflow = DAG('_yelp_workflow', catchup=False, schedule_interval=None, default_args=default_args) # creating a DAG</em></pre>
</blockquote>
<figure class="graf graf--figure graf--layoutOutsetCenter">
<figure style="width: 1500px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2AoS8FS0p8EM1o3XYfnAxrNA.png?resize=800%2C218&#038;ssl=1" alt="" width="800" height="218" /><figcaption class="wp-caption-text">DAG Created</figcaption></figure>
<figcaption class="imageCaption"></figcaption>
</figure>
<p class="graf graf--p"><code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">_yelp_workflow</strong></code> DAG is created. <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">schedule_interval=None</strong></code> means the DAG is triggered manually. Other options are <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">@daily, @weekly,</strong></code> or a cron schedule such as <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">&#8220;* * * */2 1&#8221;</strong></code>. Read about <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">catchup</strong></code> and <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">depends_on_past</strong></code> in the Airflow documentation <a class="markup--anchor markup--p-anchor" href="https://airflow.apache.org/scheduler.html" target="_blank" rel="noopener"><strong class="markup--strong markup--p-strong">here</strong></a><strong class="markup--strong markup--p-strong">.</strong></p>
<h2 class="graf graf--h3">Task Creation</h2>
<p class="graf graf--p">With the Airflow set of <a class="markup--anchor markup--p-anchor" href="https://airflow.apache.org/concepts.html#operators" target="_blank" rel="noopener"><strong class="markup--strong markup--p-strong">operators</strong></a>, we can define the tasks of the DAG workflow. An operator describes a single task in a workflow: while DAGs describe how to run a workflow, <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">Operators</strong></code> determine what actually gets done. To call a Python function, use a <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">PythonOperator</strong></code>; to send an email, an <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">EmailOperator</strong></code>; to run a Bash command, a <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">BashOperator</strong></code>; to execute a SQL instruction, a <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">MySqlOperator</strong></code>; and so on.</p>
<p class="graf graf--p">Generally, operators run independently with no sharing of information in the order specified. If it absolutely can’t be avoided, Airflow does have a feature for operator cross-communication called <strong class="markup--strong markup--p-strong"><em class="markup--em markup--p-em">XCom.</em></strong></p>
<blockquote>
<pre class="graf graf--pre"><em>from airflow.models import Variable

def url_generator(**kwargs):
    """
    generate the yelp url to find the business listings with place and search_query
    {'place': 'Location | Address | zip code'}
    {'search_query': "Restaurants | Breakfast &amp; Brunch | Coffee &amp; Tea | Delivery | Reservations"}
    """
    place = Variable.get("place")
    search_query = Variable.get("search_query")
    yelp_url = "https://www.yelp.com/search?find_desc={0}&amp;find_loc={1}".format(search_query, place)
    return yelp_url</em></pre>
<pre class="graf graf--pre"><em><strong class="markup--strong markup--pre-strong">"""defining a task"""
yelp_url_generator = PythonOperator(
    task_id='url_generator',
    python_callable=url_generator,
    provide_context=True,
    dag=_yelp_workflow)</strong></em></pre>
</blockquote>
<p class="graf graf--p">Likewise, six tasks were created, and concepts like <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong"><em class="markup--em markup--p-em">variables</em></strong></code> and <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong"><em class="markup--em markup--p-em">xcom</em></strong></code> are used to share data among the tasks.</p>
<h2 class="graf graf--h3">Concept of xcom</h2>
<blockquote>
<pre class="graf graf--pre"><em>import requests

def get_response(**kwargs):
    """
    validate the url and forward the response
    """
    ti = kwargs['ti']
    url = ti.xcom_pull(task_ids='url_generator')
    print('url generated: ', url)
    headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chrome/70.0.3538.77 Safari/537.36'}
    success = False

    for retry in range(10):
        response = requests.get(url, verify=False, headers=headers)
        if response.status_code == 200:
            success = True
            break
        print("Response received: %s. Retrying : %s" % (response.status_code, url))

    if not success:
        print("Failed to process the URL: ", url)
        raise ValueError("Failed to process the URL: %s" % url)
    return response</em></pre>
<pre class="graf graf--pre"><em><strong class="markup--strong markup--pre-strong">response_generator = PythonOperator(
    task_id='response_generator',
    python_callable=get_response,
    provide_context=True,
    dag=_yelp_workflow)</strong></em></pre>
</blockquote>
<p class="graf graf--p">The <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">yelp_url</strong></code> returned by the <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">url_generator</strong></code> task has to be passed to the <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">response_generator</strong></code> task, where we check the response of the URL. If the status_code of the response is <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">200</strong></code>, we return the response; otherwise we raise a <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">ValueError</strong></code> to stop the pipeline.</p>
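<p class="graf graf--p">The retry loop above re-requests immediately after each failure. A common refinement (not part of the original DAG; the helper and names below are made up for illustration) is to wait with exponential backoff between attempts. A minimal, dependency-free sketch:</p>

```python
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Call fetch() until it returns a result, sleeping base_delay * 2**attempt between tries."""
    for attempt in range(max_retries):
        result = fetch()
        if result is not None:  # treat None as a failed attempt
            return result
        time.sleep(base_delay * (2 ** attempt))
    raise ValueError("All %d attempts failed" % max_retries)

# Hypothetical fetcher that fails twice before succeeding
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    return "response-body" if calls["n"] >= 3 else None

print(fetch_with_backoff(flaky, base_delay=0.01))  # -> response-body
```

<p class="graf graf--p">Inside <code class="markup--code markup--p-code">get_response</code>, the same idea would amount to a <code class="markup--code markup--p-code">time.sleep</code> call before the next iteration of the retry loop.</p>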
<p class="graf graf--p"><strong class="markup--strong markup--p-strong">xcom</strong>’s can be viewed at the admin page after the successful task runs.</p>
<figure class="graf graf--figure"><img data-recalc-dims="1" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1000/1%2ADFuCAJ27zH6GGOE0_vrlcg.gif?w=800&#038;ssl=1" /></figure>
<h2 class="graf graf--h3">Concept of variable</h2>
<p class="graf graf--p">This concept is used when the user has to input values (like command-line arguments in Python) to the tasks created.</p>
<blockquote>
<pre class="graf graf--pre"><em><strong class="markup--strong markup--pre-strong">place = Variable.get("place")
search_query = Variable.get("search_query")</strong></em></pre>
</blockquote>
<p class="graf graf--p">These variables <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">place</strong></code> and <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">search_query</strong></code> are used in the <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">url_generator</strong></code> python function of <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">yelp_url_generator</strong></code> task.</p>
<figure class="graf graf--figure">
<figure style="width: 1178px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1000/1%2ARTGpoXRfHbC1VD5kvUsH2A.gif?resize=800%2C535&#038;ssl=1" alt="" width="800" height="535" /><figcaption class="wp-caption-text">Variables Creation</figcaption></figure>
<figcaption class="imageCaption"></figcaption>
</figure>
<h2 class="graf graf--h3">Tasks Relationship/Arrangement</h2>
<p class="graf graf--p">The DAG will make sure that operators run in the correct order. Check <a class="markup--anchor markup--p-anchor" href="https://airflow.apache.org/concepts.html#dag-assignment" target="_blank" rel="noopener"><strong class="markup--strong markup--p-strong">here</strong></a><strong class="markup--strong markup--p-strong">.</strong></p>
<figure class="graf graf--figure graf--layoutOutsetCenter"><img data-recalc-dims="1" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2A8DGPBXdRHVFeiXhk0LbG1Q.gif?w=800&#038;ssl=1" /></figure>
<blockquote>
<pre class="graf graf--pre"><strong class="markup--strong markup--pre-strong">end_task &lt;&lt; validate_db &lt;&lt; writing_to_db &lt;&lt; validate_data &lt;&lt; get_data &lt;&lt; response_generator &lt;&lt; yelp_url_generator &lt;&lt; start_task</strong></pre>
</blockquote>
<p class="graf graf--p">This is the Airflow upstream arrangement of the tasks. <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">start_task</strong></code> and <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">end_task</strong></code> are dummy tasks (<strong class="markup--strong markup--p-strong"><em class="markup--em markup--p-em">optional</em></strong>). The others, <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">yelp_url_generator →response_generator →get_data →validate_data →writing_to_db →validate_db</strong></code>, are Python tasks.</p>
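<p class="graf graf--p">This chaining works because Airflow tasks overload Python&#8217;s bitshift operators. The toy class below only illustrates the idea (it is a stand-in, not Airflow&#8217;s actual implementation):</p>

```python
class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = []  # tasks that must run before this one

    def __lshift__(self, other):
        # "self << other" records other as upstream of self
        self.upstream.append(other)
        return other  # returning other is what lets chains like a << b << c work

a, b, c = Task("end"), Task("middle"), Task("start")
a << b << c  # same shape as end_task << ... << start_task

print([t.task_id for t in a.upstream])  # -> ['middle']
print([t.task_id for t in b.upstream])  # -> ['start']
```

<p class="graf graf--p">Airflow&#8217;s real operators return the right-hand operand from <code class="markup--code markup--p-code">&lt;&lt;</code> in the same way, which is why one long chained line can express the whole pipeline order.</p>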
<p class="graf graf--p"><a class="markup--anchor markup--p-anchor" href="https://gist.githubusercontent.com/Vasistareddy/f0b5f7d73efc900f269e0aa81d04e81b/raw/8cc88cd5bd31fe219368b522b3dea3945e21caf4/yelp_business_listings.py" target="_blank" rel="noopener"><strong class="markup--strong markup--p-strong">Check the complete code here</strong></a><strong class="markup--strong markup--p-strong"> </strong></p>
<h2 class="graf graf--h3">Triggering the DAG</h2>
<p class="graf graf--p">Since we kept <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">schedule_interval=None,</strong></code> we have to manually trigger the DAG. Let’s see how to do that →</p>
<figure class="graf graf--figure graf--layoutOutsetCenter"><img data-recalc-dims="1" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2AsUBKsJsFllw4_xwpfA2EMg.gif?w=800&#038;ssl=1" /></figure>
<figure class="graf graf--figure graf--layoutOutsetCenter">
<figure style="width: 1500px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2ADsa6AxsvP_YrNdarurz1rA.png?resize=800%2C219&#038;ssl=1" alt="" width="800" height="219" /><figcaption class="wp-caption-text">MongoDB data</figcaption></figure>
<figure style="width: 1000px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1000/1%2A7L589dgoMSEiOumFTh5adQ.png?resize=800%2C204&#038;ssl=1" alt="" width="800" height="204" /><figcaption class="wp-caption-text">Tasks Successfully Completed</figcaption></figure>
</figure>
<h2>Tree View of each DAG run</h2>
<figure class="graf graf--figure graf--layoutOutsetCenter">
<figure style="width: 1500px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2AedTrxUAskSUHKaVRyMNNlg.png?resize=800%2C257&#038;ssl=1" alt="" width="800" height="257" /><figcaption class="wp-caption-text">Tree View of each DAG run</figcaption></figure>
</figure>
<h2 class="graf graf--h3">Handling Cases</h2>
<p class="graf graf--p">You must be wondering why we use this Airflow setup for simple scraping. The reasons are:</p>
<ol>
<li class="graf graf--p">We can break the whole job into multiple tasks and have control over each task at any point.</li>
<li class="graf graf--p">We get clear logs at every level.</li>
<li class="graf graf--p">We can easily connect to other servers with Airflow operators to execute the script.</li>
</ol>
<h3 class="graf graf--p">Here are a few cases handled in the workflow</h3>
<ul class="postList">
<li class="graf graf--li">When we are trying to write the same set of data into the Database with multiple DAG runs.</li>
</ul>
<figure class="graf graf--figure graf--layoutOutsetCenter">
<figure style="width: 1344px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2AjOghgSV1knE290aFknHQIQ.png?resize=800%2C209&#038;ssl=1" alt="" width="800" height="209" /><figcaption class="wp-caption-text">Duplicate Key Error</figcaption></figure>
<figcaption class="imageCaption"></figcaption>
</figure>
<p class="graf graf--p"><code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">task_id=writing_to_db</strong></code> will be handling this case.</p>
<ul class="postList">
<li class="graf graf--li">When the scraped data and the data pushed to the database don&#8217;t match.</li>
</ul>
<figure class="graf graf--figure graf--layoutOutsetCenter"><img data-recalc-dims="1" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2AcoOCTtnVRr_svwadeJIDPA.png?w=800&#038;ssl=1" /></figure>
<p class="graf graf--p"><code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">task_id=validate_db</strong></code> will be handling this case. In case an anomaly is detected, we raise a <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">ValueError</strong></code>.</p>


<p></p>
<p>The post <a href="https://turbolab.in/how-to-monitor-work-flow-of-scraping-project-with-apache-airflow/">How to monitor work-flow of scraping project with Apache-Airflow</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/how-to-monitor-work-flow-of-scraping-project-with-apache-airflow/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">823</post-id>	</item>
		<item>
		<title>Text Similarity using fastText Word Embeddings in Python</title>
		<link>https://turbolab.in/text-similarity-using-fasttext-word-embeddings-in-python/</link>
					<comments>https://turbolab.in/text-similarity-using-fasttext-word-embeddings-in-python/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Thu, 09 Dec 2021 09:41:31 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[fasttext]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[word2vec]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=803</guid>

					<description><![CDATA[<p>Text Similarity is one of the essential techniques of NLP which is used to find similarities between two chunks of text. In order to perform text similarity, word embedding techniques are used to convert chunks of text to certain dimension vectors. We also perform some mathematical operations on these vectors to find the similarity between [&#8230;]</p>
<p>The post <a href="https://turbolab.in/text-similarity-using-fasttext-word-embeddings-in-python/">Text Similarity using fastText Word Embeddings in Python</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Text Similarity is one of the essential techniques of NLP which is used to find similarities between two chunks of text. In order to perform text similarity, word embedding techniques are used to convert chunks of text to certain dimension vectors. We also perform some mathematical operations on these vectors to find the similarity between the text chunks. Recommendation System, Text Summarization, Information Retrieval, and Text Categorization are some of the main applications of text similarity.</p>
<p>In this tutorial, we will discuss how sentence similarity can be achieved with the fastText module and also the use-case of generating related news articles.</p>
<h3>Dataset</h3>
<p>Here, we have a science and technology news dataset with a sample of 6188 titles.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="809" data-permalink="https://turbolab.in/text-similarity-using-fasttext-word-embeddings-in-python/dataset-2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/dataset.jpg?fit=478%2C449&amp;ssl=1" data-orig-size="478,449" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1639053610&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="dataset" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/dataset.jpg?fit=300%2C282&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/dataset.jpg?fit=478%2C449&amp;ssl=1" class="size-full wp-image-809 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/dataset.jpg?resize=478%2C449&#038;ssl=1" alt="" width="478" height="449" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/dataset.jpg?w=478&amp;ssl=1 478w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/dataset.jpg?resize=300%2C282&amp;ssl=1 300w" sizes="(max-width: 478px) 100vw, 478px" /></p>
<h3>Problem Statement</h3>
<p>From the above dataset, we are going to pick one article title i.e.,</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>ROI = &#8220;Samsung to spend whopping $22B on artificial intelligence, cars&#8221;</strong></em></p></blockquote>
<p>Henceforth, we are going to call it <strong>ROI </strong>in the tutorial. We will be fetching related articles of <strong>ROI</strong> from the dataset using a fastText sentence vector.</p>
<blockquote><p><em><strong>Out:</strong></em></p>
<p>&nbsp;</p></blockquote>
<h3>Generate sentence vectors</h3>
<ol>
<li>Import the fastText module and load the model (300-dimension vectors).<br />
<blockquote><p><em><strong>import fasttext</strong></em><br />
<em><strong>modelPath = "D://" # user-defined path</strong></em><br />
<em><strong>ft = fasttext.load_model(modelPath + 'cc.en.300.bin')</strong></em></p></blockquote>
</li>
<li>Generate a sentence vector for the <em><strong>ROI</strong></em> and call it <em><strong>vector1</strong></em>.<br />
<blockquote><p><em><strong>def generateVector(sentence):</strong></em><br />
<em><strong>    return ft.get_sentence_vector(sentence)<br />
</strong></em><br />
<strong><strong>vector1 = generateVector('Samsung to spend whopping $22B on artificial intelligence, cars')</strong></strong></p>
<p>Out:</p>
<pre><strong>array([-6.36741472e-03,  1.08614033e-02,  9.33997519e-03, -2.33159624e-02,
       -9.58340534e-04,  1.86185073e-02,  2.20048483e-02, -2.02285256e-02,
       -1.13004427e-02, -1.38842128e-02, -6.33053621e-03,  1.18326535e-02,
       -2.36112420e-02,  9.13483184e-03,  5.59101533e-03,  1.09400013e-02,
        4.77387244e-03, -1.54347951e-02, -1.35055669e-02, -2.90185958e-02,
        1.35819204e-02,  2.80883280e-03,  3.43523137e-02, -2.22271457e-02,
</strong></pre>
<p><strong>        &#8230;&#8230;&#8230;&#8230;..</strong></p></blockquote>
</li>
<li>Generate sentence vectors for the entire dataset.<br />
<blockquote><p><em><strong>df["vector"] = df["title"].apply(generateVector)</strong></em></p>
<p>Out:</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="810" data-permalink="https://turbolab.in/text-similarity-using-fasttext-word-embeddings-in-python/vector/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/vector.jpg?fit=793%2C440&amp;ssl=1" data-orig-size="793,440" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="vector" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/vector.jpg?fit=300%2C166&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/vector.jpg?fit=793%2C440&amp;ssl=1" class="alignnone size-full wp-image-810" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/vector.jpg?resize=793%2C440&#038;ssl=1" alt="" width="793" height="440" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/vector.jpg?w=793&amp;ssl=1 793w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/vector.jpg?resize=300%2C166&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/vector.jpg?resize=768%2C426&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/vector.jpg?resize=480%2C266&amp;ssl=1 480w" sizes="(max-width: 793px) 100vw, 793px" /></p></blockquote>
</li>
</ol>
<h3>Calculate Spatial Distance</h3>
<p>Calculate the spatial distance between the <em><strong>ROI</strong></em> and the rest of the dataframe titles to determine the related articles of <em><strong>ROI</strong></em>. The smaller the distance, the more closely related the content.</p>
<blockquote><p><em><strong>from scipy import spatial</strong></em></p>
<p><em><strong>def spatialDistance(vector1, vector2):</strong></em><br />
<em><strong>    return spatial.distance.euclidean(vector1, vector2)</strong></em></p></blockquote>
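<p>scipy&#8217;s euclidean distance is just the square root of the summed squared element-wise differences. A dependency-free sketch, with short toy vectors standing in for the 300-dimension fastText embeddings:</p>

```python
import math

def euclidean(v1, v2):
    # sqrt of the sum of squared element-wise differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

print(euclidean([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # -> 0.0 (identical vectors)
print(euclidean([0.0, 0.0], [3.0, 4.0]))            # -> 5.0
```

<p>Identical titles therefore score 0.0, and the score grows as the sentence vectors drift apart.</p>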
<p><strong>vector1</strong> is the vector of <em><strong>ROI</strong></em> generated above. <strong>vector2</strong> is the vector column of each title of the dataframe.</p>
<p>Generating distance as a <strong>score</strong> between the static <em><strong>ROI</strong></em> and the rest of the dataframe titles.</p>
<blockquote><p><em><strong>df["score"] = df.apply(lambda x: spatialDistance(vector1, x['vector']), axis=1)</strong></em></p></blockquote>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="811" data-permalink="https://turbolab.in/text-similarity-using-fasttext-word-embeddings-in-python/score/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/score.jpg?fit=907%2C207&amp;ssl=1" data-orig-size="907,207" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1639059705&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="score" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/score.jpg?fit=300%2C68&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/score.jpg?fit=800%2C183&amp;ssl=1" class="alignnone size-full wp-image-811" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/score.jpg?resize=800%2C183&#038;ssl=1" alt="" width="800" height="183" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/score.jpg?w=907&amp;ssl=1 907w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/score.jpg?resize=300%2C68&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/score.jpg?resize=768%2C175&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/score.jpg?resize=480%2C110&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p>Sorting the <strong>score</strong> column of the dataframe to determine the closest related titles.</p>
<blockquote><p><em><strong>df.drop_duplicates(subset=["score"]).sort_values(by=['score'])</strong></em></p></blockquote>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="812" data-permalink="https://turbolab.in/text-similarity-using-fasttext-word-embeddings-in-python/sort/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/sort.jpg?fit=889%2C199&amp;ssl=1" data-orig-size="889,199" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1639060052&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="sort" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/sort.jpg?fit=300%2C67&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/sort.jpg?fit=800%2C179&amp;ssl=1" class="alignnone size-full wp-image-812" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/sort.jpg?resize=800%2C179&#038;ssl=1" alt="" width="800" height="179" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/sort.jpg?w=889&amp;ssl=1 889w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/sort.jpg?resize=300%2C67&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/sort.jpg?resize=768%2C172&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/sort.jpg?resize=480%2C107&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p>From the dataset, the top 10 article titles related to the <em><strong>ROI</strong></em> are:</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>outputs = df.drop_duplicates(subset=["score"]).sort_values(by=['score'])[0:10]["title"].tolist()</strong></em></p>
<p>&nbsp;</p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>[&#8216;OnePlus phones go cheaper on Amazon up to Rs 10,000, lots of EMI and exchange offers on latest models&#8217;,</strong></em><br />
<em><strong>&#8216;Samsung to invest nearly $500 mn to set up display factory in India&#8217;,</strong></em><br />
<em><strong>&#8216;Samsung Galaxy S20+ gets listed on Geekbench, revealed to bring 120Hz display, 8K video and more&#8217;,</strong></em><br />
<em><strong>&#8216;Worldwide spend on robotics systems, drones to hit $128.7 billion in 2020&#8217;,</strong></em><br />
<em><strong>&#8216;This is the pitch deck that the CEO of AI startup Directly used to convince its top customers Microsoft and Samsung to invest in a $20 million round&#8217;,</strong></em><br />
<em><strong>&#8216;Dell is working on a software to let users control iPhones from their laptops&#8217;,</strong></em><br />
<em><strong>&#8220;Here&#8217;s an exclusive look at the pitch deck AI privacy startup Mine used to raise $3 million to help people ask companies to delete their data&#8221;,</strong></em><br />
<em><strong>&#8216;Samsung offering instant cashback of up to Rs 20,000 on Galaxy S10 series&#8217;,</strong></em><br />
<em><strong>&#8216;Google exec reveals how its cloud is helping retailers to keep their sites from crashing on their biggest shopping days of the year&#8217;,</strong></em><br />
<em><strong>&#8220;Here&#8217;s the pitch deck that email startup Front used to get get top tech execs like Zoom CEO Eric Yuan to invest in its $59 million Series C round&#8221;]</strong></em></p></blockquote>
<h3>Another Example:</h3>
<p>If the <em><strong>ROI</strong></em> is <strong>&#8220;</strong><em><strong>SpaceX launches third batch of 60 Starlink mini satellites&#8221;,</strong></em></p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>SpaceX launches third batch of 60 Starlink mini satellites</strong></em></p>
<p><em><strong>Out:</strong></em><br />
<em><strong>[&#8216;SpaceX launches third batch of Starlink satellites&#8217;,</strong></em><br />
<em><strong>&#8220;ISRO&#8217;s GSAT 30 satellite successfully rides the Ariane 5 rocket into orbit abroad the first launch of 2020&#8221;,</strong></em><br />
<em><strong>&#8216;SpaceX launch LIVE stream: Watch Elon Musk blast next Starlink satellites into orbit today&#8217;,</strong></em><br />
<em><strong>&#8216;ISRO targets to launch 19 satellites within a period of 7 months&#8217;,</strong></em><br />
<em><strong>&#8216;Asteroid alert: NASA tracks four large space rocks racing towards Earth in next 48 hours&#8217;,</strong></em><br />
<em><strong>&#8216;Huawei launches Mate 30 Pro 5G outside of China for first time, enters UAE&#8217;,</strong></em><br />
<em><strong>&#8216;ISRO’s first mission of the decade on this date! Ariane rocket to launch GSAT-30 satellite&#8217;,</strong></em><br />
<em><strong>&#8216;Samsung teases launch of new Galaxy phone in 11 Feb event announcement&#8217;,</strong></em><br />
<em><strong>&#8216;SpaceX launch LIVE stream: Watch Elon Musk’s first launch of 2020 online HERE&#8217;,</strong></em><br />
<em><strong>&#8216;NASA news: Space agency outlines goals for 2020 including a launch to Mars&#8217;]</strong></em></p></blockquote>
<h3>Conclusion</h3>
<p>In this tutorial, we have discussed generating related content using fastText sentence embeddings and a mathematical operation called spatial distance. We can also try replacing the spatial distance with the cosine similarity between the vectors to find the related content. Pre-processing techniques like <a href="https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/">lemmatization</a>, stemming, and removal of stopwords can also be applied to the dataset before vector generation to improve the accuracy of the result. This specific use-case of generating related content can be extended into a recommendation system that considers the user&#8217;s interests.</p>
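<p>The cosine-similarity alternative mentioned above can be sketched without extra libraries; the short vectors here stand in for the 300-dimension fastText embeddings:</p>

```python
import math

def cosine_similarity(v1, v2):
    # dot(v1, v2) / (||v1|| * ||v2||); values near 1.0 mean similar direction
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # ≈ 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # -> 0.0 (orthogonal)
```

<p>Unlike euclidean distance, a <em>higher</em> cosine value means more related content, so the dataframe would be sorted in descending order of score.</p>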
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>The post <a href="https://turbolab.in/text-similarity-using-fasttext-word-embeddings-in-python/">Text Similarity using fastText Word Embeddings in Python</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/text-similarity-using-fasttext-word-embeddings-in-python/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">803</post-id>	</item>
		<item>
		<title>Data Cleaning using Regular Expression</title>
		<link>https://turbolab.in/data-cleaning-using-regular-expression/</link>
					<comments>https://turbolab.in/data-cleaning-using-regular-expression/#respond</comments>
		
		<dc:creator><![CDATA[Anthony]]></dc:creator>
		<pubDate>Tue, 30 Nov 2021 12:06:01 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[data cleaning]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[text cleaning]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=779</guid>

					<description><![CDATA[<p>Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. The data format is not always tabular. As we are entering the era of big data, the data comes in an extensively diverse format, including images, texts, graphs, and many more. Because the format is pretty [&#8230;]</p>
<p>The post <a href="https://turbolab.in/data-cleaning-using-regular-expression/">Data Cleaning using Regular Expression</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400">Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.</span></p>
<p><span style="font-weight: 400">Data is not always tabular. In the era of big data, it arrives in widely varied formats, including images, text, graphs, and more. Because the format varies so much from one dataset to another, it&#8217;s essential to preprocess the data into a form computers can read.</span></p>
<p><span style="font-weight: 400">In this blog, we will go over some Regex (Regular Expression) techniques that you can use in your data cleaning process.</span></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400">Regular Expression is a sequence of characters used to match strings of text such as particular characters, words, or patterns of characters.</span></p>
<p><span style="font-weight: 400">In Python, Regular Expressions (REs, regexes, or regex patterns) are provided by the &#8216;re&#8217; module, which is built into Python, so you don&#8217;t need to install it separately.</span></p>
<p><span style="font-weight: 400">The re module offers a set of functions that allow us to search a string for a match.</span></p>
<p><span style="font-weight: 400">The most commonly used methods provided by the &#8216;re&#8217; module are:</span></p>
<p>&nbsp;</p>
<ul>
<li><strong>re.match()</strong></li>
<li><strong>re.search()</strong></li>
<li><strong>re.findall()</strong></li>
<li><strong>re.split()</strong></li>
<li><strong>re.sub()</strong></li>
<li><strong>re.compile()</strong></li>
</ul>
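<p>As a quick, runnable sketch of what each of these functions does (the sample string below is our own illustration, not from the original post):</p>

```python
import re

text = "data cleaning 101: clean, transform, analyze"

# re.match: the pattern must match at the very start of the string
print(re.match(r"data", text))    # a match object
print(re.match(r"clean", text))   # None -- "clean" is not at the start

# re.search: first occurrence anywhere in the string
print(re.search(r"clean", text).group())  # 'clean'

# re.findall: every non-overlapping occurrence
print(re.findall(r"clean\w*", text))      # ['cleaning', 'clean']

# re.split: split the string wherever the pattern matches
print(re.split(r"[:,]\s*", text))  # ['data cleaning 101', 'clean', 'transform', 'analyze']

# re.sub: replace every match with a new substring
print(re.sub(r"\d+", "#", text))   # 'data cleaning #: clean, transform, analyze'

# re.compile: pre-build a pattern object for reuse
digits = re.compile(r"\d+")
print(digits.findall(text))        # ['101']
```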
<p>&nbsp;</p>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Replacing Multi-Spaces</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>Removing extra white spaces from data is an important step as it makes your data look well structured.</strong></p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "if       you hold an empty gatorade bottle up to your ear   you can hear      the sports"</span></i></p>
<p><i><span style="font-weight: 400">x = re.sub(r'\s+', ' ', tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: 'if you hold an empty gatorade bottle up to your ear you can hear the sports'</span></i></p>
<p>&nbsp;</p></blockquote>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Dealing with Special Characters</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>If you are working on an NLP project, you will need to clean your text thoroughly and strip out special characters that do not alter the meaning of the text. For instance:</strong></p>
<p>&nbsp;</p>
<h4><b>1.   Removing special characters and keeping only alphabets and numbers</b></h4>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "if you hold an empty # gatorade # bottle up to your ear @@ you can hear the sports 100 %%"</span></i></p>
<p><i><span style="font-weight: 400">x = re.sub(' +', ' ', re.sub('[^a-zA-Z0-9 ]+', ' ', tweet)).strip()</span></i></p>
<p><i><span style="font-weight: 400">Output: 'if you hold an empty gatorade bottle up to your ear you can hear the sports 100'</span></i></p></blockquote>
<p>&nbsp;</p>
<h4><b>2. Keeping only alphabets or only numbers</b></h4>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "if you hold an empty # gatorade # bottle up to your ear @@ you can hear the sports 100 %%"</span></i></p>
<p><i><span style="font-weight: 400">x = re.sub(' +', ' ', re.sub('[^a-zA-Z ]+', ' ', tweet)).strip()</span></i></p>
<p><i><span style="font-weight: 400">Output: 'if you hold an empty gatorade bottle up to your ear you can hear the sports'</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400">tweet = "if you hold an empty # gatorade # bottle up to your ear @@ you can hear the sports 100 %%"</span></i></p>
<p><i><span style="font-weight: 400">x = re.sub(' +', '', re.sub('[^0-9 ]+', '', tweet))</span></i></p>
<p><i><span style="font-weight: 400">Output: '100'</span></i></p></blockquote>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect and Remove URLs</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>Here we are using “re.compile” to generate a regex pattern and use that saved pattern later for substitution, if needed.</strong></p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = 'follow this website for more details www.knowmore.com and login to http://login.com'</span></i></p>
<p><i><span style="font-weight: 400">pattern = re.compile(r"https?://\S+|www\.\S+")</span></i></p>
<p><i><span style="font-weight: 400">x = re.findall(pattern, tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: ['www.knowmore.com', 'http://login.com']</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400"># remove urls</span></i></p>
<p><i><span style="font-weight: 400">z = re.sub(pattern, '', tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: z</span></i></p>
<p><i><span style="font-weight: 400">Output: 'follow this website for more details  and login to '</span></i></p></blockquote>
<p>&nbsp;</p>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect and Remove HTML Tags<br />
</b></h3>
</li>
</ul>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = '&lt;p&gt;follow this &lt;b&gt;website&lt;/b&gt; for more details. &lt;/p&gt;'</span></i></p>
<p><i><span style="font-weight: 400">x = re.findall('&lt;.*?&gt;', tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: ['&lt;p&gt;', '&lt;b&gt;', '&lt;/b&gt;', '&lt;/p&gt;']</span></i></p>
<p><i><span style="font-weight: 400"># remove html tags</span></i></p>
<p><i><span style="font-weight: 400">z = re.sub('&lt;.*?&gt;', '', tweet).strip()</span></i></p>
<p><i><span style="font-weight: 400">Input: z</span></i></p>
<p><i><span style="font-weight: 400">Output: 'follow this website for more details.'</span></i></p></blockquote>
<p>&nbsp;</p>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect and Remove Email IDs </b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>Here we&#8217;ll use &#8220;re.search&#8221; to find the e-mail ID. re.search() returns only the first match of the specified pattern, whereas re.findall() scans the entire string and returns all non-overlapping matches in a single step.</strong></p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "please send your feedback to myemail@gmail.com"</span></i></p>
<p><i><span style="font-weight: 400">x = re.search(r"[\w.-]+@[\w.-]+\.\w+", tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: &lt;re.Match object; span=(29, 46), match='myemail@gmail.com'&gt;</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400">tweet = "please send your feedback to myemail@gmail.com"</span></i></p>
<p><i><span style="font-weight: 400">z = re.sub(r"[\w.-]+@[\w.-]+\.\w+", '', tweet).strip()</span></i></p>
<p><i><span style="font-weight: 400">Output: 'please send your feedback to'</span></i></p></blockquote>
<p>&nbsp;</p>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect and Remove the Hashtag</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "love to explore. #nature #traveller"</span></i></p>
<p><i><span style="font-weight: 400">x = re.findall('#[_]*[a-z]+', tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: ['#nature', '#traveller']</span></i></p>
<p><i><span style="font-weight: 400"># remove hashtags</span></i></p>
<p><i><span style="font-weight: 400">z = re.sub('#[_]*[a-z]+', '', tweet).strip()</span></i></p>
<p><i><span style="font-weight: 400">Input: z</span></i></p>
<p><i><span style="font-weight: 400">Output: 'love to explore.'</span></i></p></blockquote>
<p>&nbsp;</p>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect Mentions using re.match() and re.findall()</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>Here we&#8217;ll use re.match and re.findall to detect mentions. </strong></p>
<p><strong>re.match matches the pattern from the start of the string whereas re.findall searches for occurrences of the pattern anywhere in the string.</strong></p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "@Bryan appointed as the new team captain"</span></i></p>
<p><i><span style="font-weight: 400">x = re.match(r"(@\w+)", tweet)</span></i></p>
<p><i><span style="font-weight: 400">Output: &lt;re.Match object; span=(0, 6), match='@Bryan'&gt;</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400">tweet = "@Bryan appointed as the new team captain announced in @SportsLive"</span></i></p>
<p><i><span style="font-weight: 400">x = re.findall(r"@\S+", tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: ['@Bryan', '@SportsLive']</span></i></p></blockquote>
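<p>The individual patterns above can be folded into one reusable cleaning helper. This is a hedged sketch of our own; the helper name, the pattern set, and the order of substitutions are our choices, not from the post:</p>

```python
import re

# patterns mirroring the sections above
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
HTML_PATTERN = re.compile(r"<.*?>")
EMAIL_PATTERN = re.compile(r"[\w.-]+@[\w.-]+\.\w+")
HASHTAG_PATTERN = re.compile(r"#\w+")
MENTION_PATTERN = re.compile(r"@\w+")

def clean_tweet(text):
    # substitute each pattern with a space, then collapse the leftovers
    for pattern in (URL_PATTERN, HTML_PATTERN, EMAIL_PATTERN,
                    HASHTAG_PATTERN, MENTION_PATTERN):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("<p>follow @Bryan at www.knowmore.com #sports</p>"))
# prints: follow at
```

Note that the order matters: URLs are removed before HTML tags and mentions so that a pattern never matches inside text another pattern should have consumed.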
<p>&nbsp;</p>
<h3><b>Conclusion</b></h3>
<p>&nbsp;</p>
<p><span style="font-weight: 400">Regular Expressions are very useful for text manipulation in the text-cleaning phase of Natural Language Processing (NLP). In this post, we have used the &#8220;re.findall&#8221;, &#8220;re.sub&#8221;, &#8220;re.search&#8221;, &#8220;re.match&#8221;, and &#8220;re.compile&#8221; functions, but there are many other functions in the re library that can help with data processing and manipulation. For a deeper understanding of Regular Expressions, we recommend Python&#8217;s official documentation on <a href="https://docs.python.org/3/library/re.html">regex</a>.</span></p>
<p>The post <a href="https://turbolab.in/data-cleaning-using-regular-expression/">Data Cleaning using Regular Expression</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/data-cleaning-using-regular-expression/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">779</post-id>	</item>
		<item>
		<title>Build a Custom NER model using spaCy 3.0</title>
		<link>https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/</link>
					<comments>https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Thu, 11 Nov 2021 12:45:37 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[customNER]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[spacy]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=727</guid>

					<description><![CDATA[<p>SpaCy is an open-source python library used for Natural Language Processing(NLP). Unlike NLTK, which is widely used in research, spaCy focuses on production usage. Industrial-strength NLP spaCy is a library for advanced NLP in Python and Cython. As of now, this is the best NLP tool available in the market. SpaCy provides ready-to-use language-specific pre-trained models to perform [&#8230;]</p>
<p>The post <a href="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/">Build a Custom NER model using spaCy 3.0</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>SpaCy is an open-source Python library for <em>Natural Language Processing (NLP)</em>. Unlike <em>NLTK</em>, which is widely used in research, spaCy focuses on production usage. Billed as industrial-strength <em>NLP</em>, <strong><em>spaCy</em></strong> is a library for advanced <em>NLP</em> in Python and Cython, and is currently among the most capable NLP tools available.</p>
<p>SpaCy provides ready-to-use language-specific pre-trained models to perform <em>parsing</em>, <em>tagging</em>, <em>NER</em>, <em><a href="https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/">lemmatizer</a></em>, <em>tok2vec</em>, <em>attribute_ruler</em>, and other NLP tasks. It supports 18 languages and 1 multi-language pipeline. Check the supported language list <strong><a href="https://spacy.io/usage/models#languages">here</a></strong>.</p>
<p><span style="font-weight: 400;">SpaCy provides the following four </span><a href="https://spacy.io/models/en"><b>pre-trained models</b></a><span style="font-weight: 400;"> with MIT license for the English language:</span></p>
<ol>
<li><em><strong>en_core_web_sm</strong></em>(12 mb)</li>
<li><em><strong>en_core_web_md</strong></em>(43 mb)</li>
<li><em><strong>en_core_web_lg</strong></em>(741 mb)</li>
<li><em><strong>en_core_web_trf</strong></em>(438 mb)</li>
</ol>
<p>Support for transformers and the pretrained transformer pipeline (<strong>en_core_web_trf</strong>) was introduced in spaCy 3.0.</p>
<p>Named Entity Recognition (NER) is the NLP task of recognizing entities in a given text. A NER model performs two sub-tasks: <strong>detect</strong> and <strong>categorize</strong>. It has to detect the entities (<strong>India</strong>, <strong>America</strong>, <strong>Abdul Kalam</strong>) in the text and categorize (<strong>LOCATION</strong>, <strong>LOCATION</strong>, <strong>PERSON</strong>) the entities it detected. This helps in information retrieval from bulk uncategorized text.</p>
<h2>Load a spaCy model and check if it has ner pipeline</h2>
<blockquote><p>In:</p>
<p><em><strong>!python -m spacy download en_core_web_sm</strong></em></p>
<p><em><strong>import spacy </strong></em></p>
<p><em><strong>nlp = spacy.load(&#8220;en_core_web_sm&#8221;)</strong></em><br />
<em><strong>nlp.pipe_names</strong></em></p>
<p>&nbsp;</p>
<p>Out:</p>
<p><strong><em>[&#8216;tok2vec&#8217;, &#8216;tagger&#8217;, &#8216;parser&#8217;, &#8216;attribute_ruler&#8217;, &#8216;lemmatizer&#8217;, &#8216;ner&#8217;]</em></strong></p></blockquote>
<p>Since <strong>ner</strong> is in the pipeline, let&#8217;s test how entity detection works on a sentence.</p>
<blockquote><p>In:</p>
<p><em><strong>sentence = &#8220;Daniil Medvedev and Novak Djokovic have built an intriguing rivalry since the Australian Open decider, which the Serb won comprehensively.&#8221;</strong></em><br />
<em><strong>doc = nlp(sentence)</strong></em></p>
<p><em><strong>from spacy import displacy</strong></em><br />
<em><strong>displacy.render(doc, style=&#8221;ent&#8221;, jupyter=True)</strong></em></p></blockquote>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="738" data-permalink="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/entitydetection/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?fit=1252%2C99&amp;ssl=1" data-orig-size="1252,99" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1636586218&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="entityDetection" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?fit=300%2C24&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?fit=800%2C63&amp;ssl=1" class="alignnone size-full wp-image-738" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?resize=800%2C63&#038;ssl=1" alt="" width="800" height="63" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?w=1252&amp;ssl=1 1252w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?resize=300%2C24&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?resize=768%2C61&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?resize=1024%2C81&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?resize=1080%2C85&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?resize=980%2C77&amp;ssl=1 980w, 
https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?resize=480%2C38&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p>Let&#8217;s observe the doc to see how entities are being identified/tagged by the model.</p>
<blockquote><p>In:</p>
<p><em><strong>[(X, X.ent_iob_, X.ent_type_) for X in doc if X.ent_type_]</strong></em></p>
<p>Out:</p>
<p><em><strong>[(Daniil, &#8216;B&#8217;, &#8216;PERSON&#8217;),</strong></em><br />
<em><strong>(Medvedev, &#8216;I&#8217;, &#8216;PERSON&#8217;),</strong></em><br />
<em><strong>(Novak, &#8216;B&#8217;, &#8216;PERSON&#8217;),</strong></em><br />
<em><strong>(Djokovic, &#8216;I&#8217;, &#8216;PERSON&#8217;),</strong></em><br />
<em><strong>(Australian, &#8216;B&#8217;, &#8216;NORP&#8217;), # LOCATION</strong></em><br />
<em><strong>(Serb, &#8216;B&#8217;, &#8216;NORP&#8217;)]</strong></em></p></blockquote>
<p><strong>Novak</strong> and <strong>Djokovic</strong> are correctly identified as <strong>PERSON</strong>, but as two separate tokens; <strong>displaCy</strong> still renders them as a single entity. <strong>IOB tagging</strong> is what combines tokens that belong to the same entity.</p>
<h2>Inside-Outside-Beginning(IOB) Tagging</h2>
<p><strong>IOB</strong> is the common tagging format for tagging the entities/chunks in the text.</p>
<ul>
<li><em><strong>I</strong></em> stands for Inside and it indicates that the token is an insider of a chunk.</li>
<li><em><strong>B</strong></em> stands for Beginning and it indicates that the token is the beginning of a chunk.</li>
<li><em><strong>O</strong></em> stands for Outside and it indicates that the token doesn&#8217;t belong to any chunk.</li>
</ul>
<p>In the above output, <strong>Daniil</strong> is tagged as <em><strong>B</strong></em>, the beginning of the entity chunk, and <strong>Medvedev</strong> is tagged as <em><strong>I</strong></em>, an inside token continuing the previous token <strong>Daniil</strong>. These two tokens combine to form one <strong>PERSON</strong> entity. The same applies to <strong>Novak</strong> and <strong>Djokovic</strong>.</p>
<p>The tokens tagged as <strong>O</strong> are not classified as an entity type and we can see that no label has been assigned by the model.</p>
<blockquote><p><em><strong>[(and, &#8216;O&#8217;, &#8221;),</strong></em><br />
<em><strong>(have, &#8216;O&#8217;, &#8221;),</strong></em><br />
<em><strong>(built, &#8216;O&#8217;, &#8221;),</strong></em><br />
<em><strong>(an, &#8216;O&#8217;, &#8221;),</strong></em><br />
<em><strong>(intriguing, &#8216;O&#8217;, &#8221;),</strong></em><br />
<em><strong>(rivalry, &#8216;O&#8217;, &#8221;),</strong></em><br />
<em><strong>(since, &#8216;O&#8217;, &#8221;),</strong></em><br />
<em><strong>(the, &#8216;O&#8217;, &#8221;),</strong></em><br />
<em><strong>(Open, &#8216;O&#8217;, &#8221;),</strong></em><br />
<em><strong>(decider, &#8216;O&#8217;, &#8221;)]</strong></em></p></blockquote>
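<p>The way B/I/O tags combine consecutive tokens into entity chunks can be shown with a small, self-contained sketch (the <em>merge_iob</em> helper below is our own illustration of the idea, not a spaCy API):</p>

```python
# (token, IOB tag, entity type) triples, as in the output above
tagged = [("Daniil", "B", "PERSON"), ("Medvedev", "I", "PERSON"),
          ("and", "O", ""), ("Novak", "B", "PERSON"),
          ("Djokovic", "I", "PERSON")]

def merge_iob(tokens):
    entities, current = [], None
    for text, iob, label in tokens:
        if iob == "B":                # beginning: open a new chunk
            if current:
                entities.append(tuple(current))
            current = [text, label]
        elif iob == "I" and current:  # inside: extend the open chunk
            current[0] += " " + text
        else:                         # outside: close any open chunk
            if current:
                entities.append(tuple(current))
            current = None
    if current:
        entities.append(tuple(current))
    return entities

print(merge_iob(tagged))
# [('Daniil Medvedev', 'PERSON'), ('Novak Djokovic', 'PERSON')]
```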
<p><em><strong>CARDINAL</strong></em>, <em><strong>DATE</strong></em>, <em><strong>EVENT</strong></em>, <em><strong>FAC</strong></em>, <em><strong>GPE</strong></em>, <em><strong>LANGUAGE</strong></em>, <em><strong>LAW</strong></em>, <em><strong>LOC</strong></em>, <em><strong>MONEY</strong></em>, <em><strong>NORP</strong></em>, <em><strong>ORDINAL</strong></em>, <em><strong>ORG</strong></em>, <em><strong>PERCENT</strong></em>, <em><strong>PERSON</strong></em>, <em><strong>PRODUCT</strong></em>, <em><strong>QUANTITY</strong></em>, <em><strong>TIME</strong></em>, <em><strong>WORK_OF_ART</strong></em></p>
<p>These are the entity labels provided by the NER pre-trained model. <span style="font-weight: 400;">We can execute the command given below to understand each label.</span></p>
<blockquote><p>In:</p>
<p><em><strong>spacy.explain(&#8220;NORP&#8221;)</strong></em></p>
<p>Out:</p>
<p><em><strong>Nationalities or religious or political groups</strong></em></p></blockquote>
<h2><span style="font-weight: 400;">Why do we need a Custom NER?</span></h2>
<p>SpaCy pre-trained models detect and categorize text chunks into 18 entity types. If the requirement is to extract custom information from job postings, the pre-trained model above provides no support. Let&#8217;s see an example:</p>
<blockquote><p>In:</p>
<p><em><strong>sentence = &#8220;&#8221;&#8221;As a Full Stack Developer, you will develop applications in a very passionate environment being responsible for Front-end and Back-end development. You will perform development and day-to-day maintenance on large applications. You have multiple opportunities to work on cross-system single-page applications.&#8221;&#8221;&#8221;</strong></em><br />
<em><strong>doc = nlp(sentence)</strong></em></p>
<p><em><strong>from spacy import displacy</strong></em><br />
<em><strong>displacy.render(doc, style=&#8221;ent&#8221;, jupyter=True)</strong></em></p>
<p>Out:</p>
<p><strong>UserWarning</strong>: <em>[W006] No entities to visualize found in Doc object. If this is surprising to you, make sure the Doc was processed using a model that supports named entity recognition, and check the `doc.ents` property manually if necessary.</em></p></blockquote>
<p><span style="font-weight: 400;">The warning says that no entities were found in the Doc object.</span></p>
<p><span style="font-weight: 400;">This is where the custom NER model comes into the picture for our custom problem statement i.e., detecting the </span><b>job_role</b><span style="font-weight: 400;"> from the job posts.</span></p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="740" data-permalink="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/jobtitle/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?fit=1401%2C266&amp;ssl=1" data-orig-size="1401,266" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1636630992&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="jobtitle" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?fit=300%2C57&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?fit=800%2C152&amp;ssl=1" class="alignnone size-full wp-image-740" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?resize=800%2C152&#038;ssl=1" alt="" width="800" height="152" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?w=1401&amp;ssl=1 1401w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?resize=300%2C57&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?resize=768%2C146&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?resize=1024%2C194&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?resize=1080%2C205&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?resize=1280%2C243&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?resize=980%2C186&amp;ssl=1 980w, 
https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?resize=480%2C91&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" />Steps to build the custom NER model for detecting the job role in job postings in spaCy 3.0:</p>
<ol>
<li>Annotate the data to train the model.</li>
<li>Convert the annotated data into the spaCy bin object.</li>
<li>Generate the config file from the spaCy website.</li>
<li>Train the model in the command line.</li>
<li>Load and test the saved model.</li>
</ol>
<p>We will discuss the above steps in detail.</p>
<h3>SpaCy NER annotation tool by agateteam</h3>
<p>The agateteam provides a lightweight <a href="http://agateteam.org/spacynerannotate/"><em><strong>annotation tool</strong></em></a> to generate the spaCy-supported annotated data format.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="744" data-permalink="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/tool/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/tool.gif?fit=1905%2C975&amp;ssl=1" data-orig-size="1905,975" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="tool" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/tool.gif?fit=300%2C154&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/tool.gif?fit=800%2C409&amp;ssl=1" class="alignnone size-full wp-image-744" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/tool.gif?resize=800%2C409&#038;ssl=1" alt="" width="800" height="409" /></p>
<p>Annotation of a sentence is shown in the above gif. We have shown the <strong>job_role</strong> tagging; you can add <strong>work_experience</strong>, <strong>work_location</strong>, <strong>experience</strong> to the entity list. Here is the sample annotated data:</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="745" data-permalink="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/datasample-2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?fit=1333%2C502&amp;ssl=1" data-orig-size="1333,502" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1636637140&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="datasample" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?fit=300%2C113&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?fit=800%2C302&amp;ssl=1" class="alignnone size-full wp-image-745" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?resize=800%2C301&#038;ssl=1" alt="" width="800" height="301" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?w=1333&amp;ssl=1 1333w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?resize=300%2C113&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?resize=768%2C289&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?resize=1024%2C386&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?resize=1080%2C407&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?resize=1280%2C482&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?resize=980%2C369&amp;ssl=1 980w, 
https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?resize=480%2C181&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
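<p>For readers who cannot copy from the image, the annotated data follows spaCy&#8217;s training tuple format: a list of <em>(text, {"entities": [(start, end, label)]})</em> pairs, where start and end are character offsets into the text. The sentences and offsets below are illustrative, not the actual dataset:</p>

```python
# illustrative sample of the annotated data format (offsets are character indexes)
trainData = [
    ("Looking for a Full Stack Developer with 3+ years of experience.",
     {"entities": [(14, 34, "job_role")]}),
    ("We are hiring a Data Engineer in Kochi.",
     {"entities": [(16, 29, "job_role")]}),
]

# sanity-check that each span covers the intended text
for text, annot in trainData:
    for start, end, label in annot["entities"]:
        print(label, "->", text[start:end])
# job_role -> Full Stack Developer
# job_role -> Data Engineer
```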
<h3>Convert the annotated data into the spaCy bin object</h3>
<p>In spaCy 2.x, this raw data could be used directly to train a model. In spaCy 3.x, however, we need to convert it to a <strong>DocBin</strong> object. Assume the above-annotated data is assigned to a variable called <strong>trainData</strong>; we can convert it using the snippet below:</p>
<blockquote>
<div>
<div><em><strong>import spacy</strong></em></div>
<div><em><strong>from spacy.tokens import DocBin</strong></em></div>
<div><em><strong>from tqdm import tqdm</strong></em></div>
<div></div>
<div><em><strong>nlp = spacy.blank("en") # load a blank English model</strong></em></div>
<div><em><strong>db = DocBin() # create a DocBin object</strong></em></div>
<div></div>
<div><em><strong>for text, annot in tqdm(trainData): # data in the annotated format</strong></em></div>
<div><em><strong>    doc = nlp.make_doc(text) # create a doc object from the text</strong></em></div>
<div><em><strong>    ents = []</strong></em></div>
<div><em><strong>    for start, end, label in annot["entities"]: # character indexes</strong></em></div>
<div><em><strong>        span = doc.char_span(start, end, label=label, alignment_mode="contract")</strong></em></div>
<div><em><strong>        if span is None:</strong></em></div>
<div><em><strong>            print("Skipping entity")</strong></em></div>
<div><em><strong>        else:</strong></em></div>
<div><em><strong>            ents.append(span)</strong></em></div>
<div><em><strong>    try:</strong></em></div>
<div><em><strong>        doc.ents = ents # label the text with the ents</strong></em></div>
<div><em><strong>        db.add(doc)</strong></em></div>
<div><em><strong>    except ValueError: # e.g. overlapping entity spans</strong></em></div>
<div><em><strong>        print(text, annot)</strong></em></div>
<div></div>
<div><em><strong>db.to_disk("./train.spacy") # save the DocBin object</strong></em></div>
</div>
</blockquote>
<div>Now, we have the trainData saved as <strong>train.spacy</strong>.</div>
<div></div>
<h3>Generate the config file to train via Command line</h3>
<p>Training from the command line with <em><strong>spacy train</strong></em> is the recommended way to train spaCy pipelines. <em><strong>config.cfg</strong></em> includes all settings and hyperparameters, which we can overwrite if necessary.</p>
<p>Go to the spaCy training <strong><a href="https://spacy.io/usage/training"><em>link </em></a></strong>and follow the steps below:</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="747" data-permalink="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/spacyconfig/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/spacyConfig.gif?fit=1288%2C868&amp;ssl=1" data-orig-size="1288,868" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="spacyConfig" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/spacyConfig.gif?fit=300%2C202&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/spacyConfig.gif?fit=800%2C539&amp;ssl=1" class="alignnone size-full wp-image-747" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/spacyConfig.gif?resize=800%2C539&#038;ssl=1" alt="" width="800" height="539" /></p>
<p>Select the preferred language and choose <strong>ner</strong> as the component. Depending on your system, you can choose CPU or GPU. Save this configuration as <strong>base_config.cfg</strong>.</p>
<div>To fill the remaining system defaults, run this command on the command line to generate the <em><strong>config.cfg </strong></em>file<em>.</em></div>
<blockquote>
<div><em><strong><span class="f93e7b95">python -m</span> spacy <span class="_89ba5f03 cea05330">init fill-config</span> <span class="_89ba5f03">base_config.cfg</span> <span class="_89ba5f03">config.cfg</span></strong></em></div>
</blockquote>
<h3>Training the model using the command line</h3>
<blockquote><p><em><strong><span class="token selector">[paths]</span></strong></em></p>
<p><em><strong><span class="token constant">train</span> <span class="token attr-value"><span class="token punctuation">=</span> ./train.spacy</span></strong></em></p>
<p><em><strong><span class="token constant">dev</span> <span class="token attr-value"><span class="token punctuation">=</span> ./dev.spacy</span></strong></em></p></blockquote>
<p>You can specify the train, dev, and output file paths in the config file. The batch size, max steps, epochs, patience, etc., can also be specified there.</p>
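<p>For illustration, a few of these settings might look like this in <em><strong>config.cfg</strong></em> (the section names follow spaCy&#8217;s config schema; the values here are only examples, not recommendations):</p>

```ini
[paths]
train = ./train.spacy
dev = ./dev.spacy

[training]
max_steps = 2000
patience = 400

[nlp]
batch_size = 128
```

<p>Any of these values can also be overridden at the command line when launching training, without editing the file.</p>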
<p><span style="font-weight: 400;">Now that we have the config file and train data, let’s train the model using the command line.</span></p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="750" data-permalink="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/train/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/train.gif?fit=1299%2C866&amp;ssl=1" data-orig-size="1299,866" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="train" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/train.gif?fit=300%2C200&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/train.gif?fit=800%2C534&amp;ssl=1" class="alignnone size-full wp-image-750" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/train.gif?resize=800%2C533&#038;ssl=1" alt="" width="800" height="533" /></p>
<p><span style="font-weight: 400;">The model output will be saved in the specified folder as an argument at the command line.</span></p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="749" data-permalink="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/modeloutput/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/modelOutput.jpg?fit=948%2C464&amp;ssl=1" data-orig-size="948,464" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1636651648&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="modelOutput" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/modelOutput.jpg?fit=300%2C147&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/modelOutput.jpg?fit=800%2C392&amp;ssl=1" class="alignnone size-full wp-image-749" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/modelOutput.jpg?resize=800%2C392&#038;ssl=1" alt="" width="800" height="392" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/modelOutput.jpg?w=948&amp;ssl=1 948w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/modelOutput.jpg?resize=300%2C147&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/modelOutput.jpg?resize=768%2C376&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/modelOutput.jpg?resize=480%2C235&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<h3>Load &amp; Test the model</h3>
<ul>
<li>Load the model.</li>
</ul>
<blockquote><p><em><strong>import spacy</strong></em></p>
<p><em><strong>nlp = spacy.load(&#8220;output/model-last/&#8221;) #load the model</strong></em></p></blockquote>
<ul>
<li>Take the unseen data to test the model prediction.</li>
</ul>
<blockquote><p><em><strong>sentence = &#8220;&#8221;&#8221;We are looking for a Backend Developer who has 4-6 years of experience in designing, developing and implementing backend services using Python and Django.&#8221;&#8221;&#8221;</strong></em></p>
<p><em><strong>doc = nlp(sentence)</strong></em></p>
<p><em><strong>from spacy import displacy</strong></em><br />
<em><strong>displacy.render(doc, style=&#8221;ent&#8221;, jupyter=True)</strong></em></p></blockquote>
<p><em><strong>Out:</strong></em></p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="751" data-permalink="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/final_output/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?fit=1239%2C96&amp;ssl=1" data-orig-size="1239,96" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1636652294&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="final_output" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?fit=300%2C23&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?fit=800%2C62&amp;ssl=1" class="alignnone size-full wp-image-751" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?resize=800%2C62&#038;ssl=1" alt="" width="800" height="62" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?w=1239&amp;ssl=1 1239w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?resize=300%2C23&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?resize=768%2C60&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?resize=1024%2C79&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?resize=1080%2C84&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?resize=980%2C76&amp;ssl=1 980w, 
https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?resize=480%2C37&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p><strong>Backend Developer</strong> is predicted as a <strong>job_role</strong> by the model.</p>
<h2>Applications of NER:</h2>
<ul>
<li>Enabling recommendation systems.</li>
<li>Simplifying customer support.</li>
<li>Classifying data from news sources.</li>
<li>Optimizing search engine algorithms.</li>
</ul>
<h2>EndNote:</h2>
<p>We have taken just 10 records to train the model. For better accuracy and precision, we need a much larger amount of annotated data to train a model.</p>
<p>The post <a href="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/">Build a Custom NER model using spaCy 3.0</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">727</post-id>	</item>
		<item>
		<title>Stemming Vs. Lemmatization with Python NLTK</title>
		<link>https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/</link>
					<comments>https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Fri, 29 Oct 2021 17:00:31 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[lemmatization]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[stemmer]]></category>
		<category><![CDATA[stemming]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=694</guid>

					<description><![CDATA[<p>Stemming and Lemmatization are text/word normalization techniques widely used in text pre-processing. They basically reduce the words to their root form. Here is an example: Let&#8217;s say you have to train the data for classification and you are choosing any vectorizer to transform your data. These vectorizers create a vocabulary(set of unique words) from our [&#8230;]</p>
<p>The post <a href="https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/">Stemming Vs. Lemmatization with Python NLTK</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><strong>Stemming</strong> and <strong>Lemmatization</strong> are text/word normalization techniques widely used in text pre-processing. They basically reduce the words to their root form. Here is an example:</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="707" data-permalink="https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/lemmvsstem2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem2.jpg?fit=408%2C244&amp;ssl=1" data-orig-size="408,244" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1635515383&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="lemmVsStem2" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem2.jpg?fit=300%2C179&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem2.jpg?fit=408%2C244&amp;ssl=1" class="size-full wp-image-707 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem2.jpg?resize=408%2C244&#038;ssl=1" alt="" width="408" height="244" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem2.jpg?w=408&amp;ssl=1 408w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem2.jpg?resize=300%2C179&amp;ssl=1 300w" sizes="(max-width: 408px) 100vw, 408px" /></p>
<p>Let&#8217;s say you are training data for classification and you choose a vectorizer to transform your data. These vectorizers create a vocabulary (a set of unique words) from the data corpus. <span style="font-weight: 400">By applying stemming/lemmatization techniques, we can reduce the vocabulary size by converting words to their base forms. </span><span style="font-weight: 400">This makes the vocabulary more compact, reduces ambiguity for the model during training, and yields better results.</span></p>
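<p>The effect on vocabulary size can be sketched with a toy example (the base-form mapping below is hand-written for illustration; in practice a stemmer or lemmatizer computes it):</p>

```python
# Toy illustration: collapsing inflected forms to a base form shrinks
# the vocabulary a vectorizer would build from the corpus.
tokens = ["plays", "playing", "played", "player", "play"]

# Hand-written base-form mapping (a real stemmer/lemmatizer would compute this):
base = {"plays": "play", "playing": "play", "played": "play"}

vocab_raw = set(tokens)                        # 5 unique words
vocab_norm = {base.get(t, t) for t in tokens}  # {"play", "player"}
print(len(vocab_raw), len(vocab_norm))         # 5 2
```

<p>Two entries now stand in for five surface forms, which is exactly the reduction stemming/lemmatization buys us before vectorization.</p>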
<p>In this post, we will discuss the practical examples of how stemming and lemmatization can be done on words and sentences using the python <strong>nltk</strong> package.</p>
<h1>Stemming</h1>
<p>Stemming is a rule-based normalization approach: it slices off a word&#8217;s prefix and suffix to reduce it to its root form. Stemming is faster than lemmatization because it cuts prefixes (pre-, extra-, in-, im-, ir-, etc.) and suffixes (-ed, -ing, -es, -ity, -ty, -ship, -ness, etc.) without considering the context of the words. <strong>Due to this aggressiveness, the outcome of a stemming algorithm may not be a valid word</strong>.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="710" data-permalink="https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/lemmvsstem3/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?fit=487%2C295&amp;ssl=1" data-orig-size="487,295" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1635522855&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="lemmVsStem3" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?fit=300%2C182&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?fit=487%2C295&amp;ssl=1" class="size-full wp-image-710 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?resize=487%2C295&#038;ssl=1" alt="" width="487" height="295" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?w=487&amp;ssl=1 487w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?resize=300%2C182&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?resize=480%2C291&amp;ssl=1 480w" sizes="(max-width: 487px) 100vw, 487px" /></p>
<p>In the above example, you can see that the outcomes of <strong>badly</strong> and <strong>pharmacies</strong> are invalid words.</p>
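<p>The rule-based idea can be sketched with a toy suffix stripper (this is not the actual Porter algorithm, just an illustration of why slicing suffixes can yield non-words):</p>

```python
# Toy rule-based stemmer: strip the first matching suffix, keeping at
# least three characters of the word. NOT the real Porter algorithm.
SUFFIXES = ("ies", "ing", "ed", "es", "s", "ly")

def toy_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["playing", "played", "pharmacies", "badly"]])
# ['play', 'play', 'pharmac', 'bad'] -- "pharmac" is not a valid word
```

<p>Because the rules never consult a dictionary, &#8220;pharmacies&#8221; is chopped to the non-word &#8220;pharmac&#8221;, mirroring what the real stemmers do below.</p>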
<h3>Porter Stemmer</h3>
<p>The Porter stemming algorithm (or &#8220;Porter stemmer&#8221;) uses suffix stripping to produce stems. Here is Python code using nltk to create a stemmer object and generate results.</p>
<p>Code Snippet to perform Porter Stemming:</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>from nltk.stem import PorterStemmer</strong></em><br />
<em><strong>words = [&#8220;plays&#8221;, &#8220;playing&#8221;, &#8220;played&#8221;, &#8220;player&#8221;, &#8220;pharmacies&#8221;, &#8220;badly&#8221;]</strong></em><br />
<em><strong>ps = PorterStemmer()</strong></em><br />
<em><strong>print([ps.stem(w) for w in words])</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>[&#8216;play&#8217;, &#8216;play&#8217;, &#8216;play&#8217;, &#8216;player&#8217;, &#8216;pharmaci&#8217;, &#8216;badli&#8217;]</strong></em></p></blockquote>
<p><strong>To address the drawbacks of the Porter stemmer, the Snowball stemming algorithm was introduced.</strong></p>
<h3>Snowball Stemmer</h3>
<p>The Snowball stemming algorithm is also known as the Porter2 stemmer. It is an improved version of the Porter stemmer in which a few of the stemming issues discussed above are resolved.</p>
<p>Code Snippet to perform Snowball Stemming:</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>from nltk.stem.snowball import SnowballStemmer</strong></em><br />
<em><strong>words = [&#8220;plays&#8221;, &#8220;playing&#8221;, &#8220;played&#8221;, &#8220;player&#8221;, &#8220;pharmacies&#8221;, &#8220;badly&#8221;]</strong></em><br />
<em><strong>ss = SnowballStemmer(language=&#8217;english&#8217;)</strong></em><br />
<em><strong>print([ss.stem(w) for w in words])</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>[&#8216;play&#8217;, &#8216;play&#8217;, &#8216;play&#8217;, &#8216;player&#8217;, &#8216;pharmaci&#8217;, &#8216;bad&#8217;]</strong></em></p></blockquote>
<p>Here, we can see that &#8220;<em><strong>badly</strong></em>&#8221; now yields a valid stem, but &#8220;<em><strong>pharmacies</strong></em>&#8221; still yields an invalid one.</p>
<h3>Lancaster Stemmer</h3>
<p><span style="font-weight: 400">Compared to the Snowball and Porter stemmers, Lancaster is the most aggressive stemming algorithm, as it tends to over-stem many words. It tries to reduce the word to the shortest stem possible. Here is an example:</span></p>
<blockquote><p><em><strong>&#8220;salty&#8221; &#8212;- &#8220;sal&#8221;</strong></em></p>
<p><em><strong>&#8220;sales&#8221; &#8212;- &#8220;sal&#8221;</strong></em></p></blockquote>
<p>Code Snippet to perform Lancaster Stemming:</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>from nltk.stem import LancasterStemmer</strong></em><br />
<em><strong>words = [&#8220;plays&#8221;, &#8220;playing&#8221;, &#8220;played&#8221;, &#8220;player&#8221;, &#8220;pharmacies&#8221;, &#8220;badly&#8221;]</strong></em><br />
<em><strong>ls = LancasterStemmer()</strong></em><br />
<em><strong>print([ls.stem(w) for w in words])</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>[&#8216;play&#8217;, &#8216;play&#8217;, &#8216;play&#8217;, &#8216;play&#8217;, &#8216;pharm&#8217;, &#8216;bad&#8217;]</strong></em></p></blockquote>
<p>As mentioned at the beginning, stemming reduces the vocabulary size by collapsing inflected forms into a single stem.</p>
<p>Code snippet to perform tokenization and stemming on a paragraph:</p>
<blockquote><p><em><strong>content = &#8220;China’s Huawei overtook Samsung Electronics as the world’s biggest seller of mobile phones in the second quarter of 2020, shipping 55.8 million devices compared to Samsung’s 53.7 million, according to data from research firm Canalys. While Huawei’s sales fell 5 per cent from the same quarter a year earlier, South Korea’s Samsung posted a bigger drop of 30 per cent, owing to disruption from the coronavirus in key markets such as Brazil, the United States and Europe, Canalys said. Huawei’s overseas shipments fell 27 per cent in Q2 from a year earlier, but the company increased its dominance of the China market which has been faster to recover from COVID-19 and where it now sells over 70 per cent of its phones. “Our business has demonstrated exceptional resilience in these difficult times,” a Huawei spokesman said. “Amidst a period of unprecedented global economic slowdown and challenges, we’re continued to grow and further our leadership position.” Nevertheless, Huawei’s position as number one seller may prove short-lived once other markets recover given it is mainly due to economic disruption, a senior Huawei employee with knowledge of the matter told Reuters. Apple is due to release its Q2 iPhone shipment data on Friday.&#8221;</strong></em></p></blockquote>
<p>The above content will hereafter be used as the input to the code snippets.</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>from nltk.stem import PorterStemmer</strong></em><br />
<em><strong>from nltk.tokenize import word_tokenize</strong></em></p>
<p><em><strong>ps = PorterStemmer()</strong></em></p>
<p><em><strong># Porter Stemmed version</strong></em></p>
<p><em><strong>porteredContent = [ps.stem(word) for word in word_tokenize(content)]</strong></em></p></blockquote>
<p><span style="font-weight: 400">Try testing the above code snippet by replacing the Porter stemmer with Snowball and Lancaster stemmers.</span></p>
<p>Let us look at some statistics to compare these three stemming algorithms.</p>
<ul>
<li><strong>The length of the content is 1041 (without spaces)</strong></li>
<li><strong>The length of the content after the Porter stemmer is 943, which took around 0.00499 seconds to process</strong></li>
<li><strong>The length of the content after the Snowball stemmer is 944, which took around 0.00399 seconds to process</strong></li>
<li><strong>The length of the content after the Lancaster stemmer is 835, which took around 0.00399 seconds to process</strong></li>
</ul>
<p>Obviously, the Lancaster stemmer produces the shortest content because of its aggressive over-stemming. With all three stemmers discussed above, we weren&#8217;t able to get the root word of &#8220;<strong>pharmacies</strong>&#8221;. Since stemming didn&#8217;t give us a valid stem in every case, we will now move on to lemmatization. While stemming is fast, it is not 100% accurate.</p>
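<p>Timings like these can be reproduced with a small helper (shown here with a stand-in normalizer; swap in <em><strong>ps.stem</strong></em>, <em><strong>ss.stem</strong></em>, or <em><strong>ls.stem</strong></em> from the snippets above, and note that absolute numbers will vary by machine):</p>

```python
import time

def timed_normalize(normalize, words):
    """Apply `normalize` to each word and report the elapsed seconds."""
    start = time.perf_counter()
    result = [normalize(w) for w in words]
    return result, time.perf_counter() - start

# Stand-in normalizer for demonstration; use ps.stem / ss.stem / ls.stem here.
result, seconds = timed_normalize(str.lower, ["Plays", "PLAYING"])
print(result, round(seconds, 5))
```

<p>Running the same word list through each stemmer with this helper gives a like-for-like speed comparison.</p>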
<h1>Lemmatization</h1>
<p>In lemmatization, the part of speech (POS) is determined first, unlike stemming, which reduces a word to its root form without considering the context. Lemmatization always considers the context and converts the word to its meaningful root/dictionary (WordNet) form, called the lemma.</p>
<h3>WordNet Lemmatizer</h3>
<p><b>WordNet</b> is a lexical database (a collection of words) that has been used by major search engines and IR research projects for many years. It offers lemmatization capabilities as well and is one of the earliest and most commonly used lemmatizers.</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>from nltk.stem import WordNetLemmatizer</strong></em><br />
<em><strong>words = [&#8220;plays&#8221;, &#8220;playing&#8221;, &#8220;played&#8221;, &#8220;player&#8221;, &#8220;pharmacies&#8221;, &#8220;badly&#8221;]</strong></em><br />
<em><strong>lemmatizer = WordNetLemmatizer()</strong></em><br />
<em><strong>print([lemmatizer.lemmatize(word) for word in words])</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>[&#8216;play&#8217;, &#8216;playing&#8217;, &#8216;played&#8217;, &#8216;player&#8217;, &#8216;pharmacy&#8217;, &#8216;badly&#8217;]</strong></em></p></blockquote>
<p>Here, we can see that only &#8220;<strong>plays</strong>&#8221; and the much-anticipated &#8220;<strong>pharmacies</strong>&#8221; have been converted to their root forms, while the remaining words have not. Without a POS tag, the WordNet Lemmatizer treats every word as a noun. We need to pass the respective POS tag along with each word to the WordNet Lemmatizer.</p>
<h3>WordNet Lemmatizer with POS tag:</h3>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>word = &#8220;better&#8221;</strong></em><br />
<em><strong>print(lemmatizer.lemmatize(word, pos=&#8221;n&#8221;)) # n for noun and it is default</strong></em><br />
<em><strong>print(lemmatizer.lemmatize(word, pos=&#8221;a&#8221;)) # a for adjective</strong></em><br />
<em><strong>print(lemmatizer.lemmatize(word, pos=&#8221;v&#8221;)) # v for verb</strong></em><br />
<em><strong>print(lemmatizer.lemmatize(word, pos=&#8221;r&#8221;)) # r for adverb</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>better | </strong></em><em><strong>good | </strong></em><em><strong>better | </strong></em><em><strong>well</strong></em></p></blockquote>
<p>For the word <span style="font-weight: 400">“</span><strong>better<span style="font-weight: 400">”</span></strong>, the output is not the same when the POS is an adjective and an adverb.</p>
<p>Now, determining the POS of each word is an extra task in the lemmatization process. When converting large chunks of text, it is impractical to pass a POS tag for each word by hand; we need to automate fetching the POS tag for each word we lemmatize. Here is a function for that:</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>import nltk</strong></em><br />
<em><strong>from nltk.corpus import wordnet</strong></em></p>
<p><em><strong>def get_wordnet_pos(word):</strong></em><br />
<em><strong>    tag = nltk.pos_tag([word])[0][1][0].upper()</strong></em><br />
<em><strong>    tagDict = {&#8220;J&#8221;: wordnet.ADJ,</strong></em><br />
<em><strong>               &#8220;N&#8221;: wordnet.NOUN,</strong></em><br />
<em><strong>               &#8220;V&#8221;: wordnet.VERB,</strong></em><br />
<em><strong>               &#8220;R&#8221;: wordnet.ADV}</strong></em><br />
<em><strong>    return tagDict.get(tag, wordnet.NOUN)</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<ul>
<li><em><strong>get_wordnet_pos(&#8220;better&#8221;) &#8212; &#8220;r&#8221;</strong></em></li>
<li><em><strong>get_wordnet_pos(&#8220;play&#8221;) &#8212; &#8220;n&#8221;</strong></em></li>
<li><em><strong>get_wordnet_pos(&#8220;bad&#8221;) &#8212; &#8220;a&#8221;</strong></em></li>
</ul>
</blockquote>
<p>Code Snippet to perform WordNet Lemmatization with POS:</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>from nltk.stem import WordNetLemmatizer</strong></em><br />
<em><strong>words = [&#8220;plays&#8221;, &#8220;playing&#8221;, &#8220;played&#8221;, &#8220;player&#8221;, &#8220;pharmacies&#8221;, &#8220;badly&#8221;]</strong></em><br />
<em><strong>lemmatizer = WordNetLemmatizer()</strong></em><br />
<em><strong>print([lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words])</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>[&#8216;play&#8217;, &#8216;play&#8217;, &#8216;played&#8217;, &#8216;player&#8217;, &#8216;pharmacy&#8217;, &#8216;badly&#8217;]</strong></em></p></blockquote>
<p>The spaCy, TextBlob, Stanford CoreNLP, and Gensim lemmatizers are other lemmatizers that can be tried. With the spaCy lemmatizer, lemmatization can be done without passing any POS tag.</p>
<p>Code snippet to perform lemmatization on a paragraph:</p>
<blockquote><p><em><strong>from nltk.tokenize import word_tokenize</strong></em><br />
<em><strong>from nltk.stem import WordNetLemmatizer</strong></em></p>
<p><em><strong>wordnetContent = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in word_tokenize(content)] # content defined earlier</strong></em></p></blockquote>
<p>The time taken to process this content with the WordNet Lemmatizer is <strong>0.2234</strong> seconds, which is much higher compared to stemming.</p>
<h1>Conclusion</h1>
<p>Stemming and lemmatization both generate the root/base form of a word. The difference is that a stem may not be an actual word, whereas a lemma is a meaningful word.</p>
<p>Compared to stemming, lemmatization is slow, but it helps train a more accurate ML model. If your data is huge, the Snowball stemmer (Porter2) is a better alternative. If your ML model uses a count vectorizer and doesn&#8217;t depend on the context of words/sentences, stemming is a good choice.</p>
<p>For deep learning models that use word embeddings, lemmatization is the better choice, because you will not find word embeddings for invalid stem words.</p>
<p>We recommend you try other methods of lemmatization provided by Spacy, Textblob, Gensim, and Stanford core NLP.</p>
<p>The post <a href="https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/">Stemming Vs. Lemmatization with Python NLTK</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">694</post-id>	</item>
		<item>
		<title>Text Classification using Machine Learning</title>
		<link>https://turbolab.in/text-classification-using-machine-learning/</link>
					<comments>https://turbolab.in/text-classification-using-machine-learning/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Fri, 15 Oct 2021 13:11:59 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[text classification]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=601</guid>

					<description><![CDATA[<p>Machine Learning, Deep Learning, Artificial Intelligence are the popular buzzwords in present trends. Artificial Intelligence(AI) is the branch of computer science which deals with developing intelligence artificially to the machines which are able to think, act and behave like humans. Machine Learning(ML) is a subset of AI and is the way to implement artificial intelligence. It [&#8230;]</p>
<p>The post <a href="https://turbolab.in/text-classification-using-machine-learning/">Text Classification using Machine Learning</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p class="graf graf--p"><strong class="markup--strong markup--p-strong">Machine Learning, Deep Learning</strong>, <strong class="markup--strong markup--p-strong">Artificial Intelligence</strong> are the popular buzzwords in present trends.</p>
<p class="graf graf--p"><strong class="markup--strong markup--p-strong">Artificial Intelligence (AI)</strong> is the branch of computer science that deals with building machines that can think, act, and behave like humans.</p>
<p class="graf graf--p"><strong class="markup--strong markup--p-strong">Machine Learning (ML)</strong> is a subset of <strong class="markup--strong markup--p-strong">AI</strong> and a way to implement artificial intelligence. It is a statistical approach in which each instance in a data-set is described by a set of features or attributes. Feature extraction is key in <strong class="markup--strong markup--p-strong">ML</strong>.</p>
<p class="graf graf--p"><strong class="markup--strong markup--p-strong">Deep Learning (DL)</strong> is the next evolution and a subset of <strong class="markup--strong markup--p-strong">ML</strong>. It is a method of statistical learning that extracts features or attributes from raw data. <strong class="markup--strong markup--p-strong">DL</strong> uses a network of algorithms called artificial neural networks, which imitate the neural networks of the human brain. <strong class="markup--strong markup--p-strong">DL</strong> passes the data through a network of layers (input, hidden &amp; output) to extract features and learn from the data. Let&#8217;s stop with <strong>DL</strong> here; we will discuss it more in the coming blogs.</p>
<p><figure id="attachment_675" aria-describedby="caption-attachment-675" style="width: 450px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="675" data-permalink="https://turbolab.in/text-classification-using-machine-learning/1_wvgsubijsbt5ls_5y-vshq/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?fit=1000%2C1204&amp;ssl=1" data-orig-size="1000,1204" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="outline" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?fit=249%2C300&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?fit=800%2C964&amp;ssl=1" class="wp-image-675" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?resize=450%2C542&#038;ssl=1" alt="" width="450" height="542" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?resize=249%2C300&amp;ssl=1 249w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?resize=768%2C925&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?resize=850%2C1024&amp;ssl=1 850w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?resize=980%2C1180&amp;ssl=1 980w, 
https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?resize=480%2C578&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?w=1000&amp;ssl=1 1000w" sizes="(max-width: 450px) 100vw, 450px" /><figcaption id="caption-attachment-675" class="wp-caption-text">outline</figcaption></figure></p>
<p class="graf graf--p">In <b>ML/DL</b>, there are models that fall into different categories like supervised, unsupervised &amp; reinforcement learning. In this tutorial, we will discuss Supervised learning which involves an output label associated with each instance in the data-set.</p>
<p><figure id="attachment_674" aria-describedby="caption-attachment-674" style="width: 711px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="674" data-permalink="https://turbolab.in/text-classification-using-machine-learning/1_wmwkg_y6jvzu4sg3xhuazq/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wmWKg_y6jvZU4sg3XhUAZQ.png?fit=711%2C335&amp;ssl=1" data-orig-size="711,335" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Supervised Learning Model Flow Chart" data-image-description="" data-image-caption="&lt;p&gt;Supervised Learning Model Flow Chart&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wmWKg_y6jvZU4sg3XhUAZQ.png?fit=300%2C141&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wmWKg_y6jvZU4sg3XhUAZQ.png?fit=711%2C335&amp;ssl=1" class="size-full wp-image-674" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wmWKg_y6jvZU4sg3XhUAZQ.png?resize=711%2C335&#038;ssl=1" alt="" width="711" height="335" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wmWKg_y6jvZU4sg3XhUAZQ.png?w=711&amp;ssl=1 711w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wmWKg_y6jvZU4sg3XhUAZQ.png?resize=300%2C141&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wmWKg_y6jvZU4sg3XhUAZQ.png?resize=480%2C226&amp;ssl=1 480w" sizes="(max-width: 711px) 100vw, 711px" /><figcaption id="caption-attachment-674" class="wp-caption-text">Supervised Learning Model Flow Chart</figcaption></figure></p>
<p><strong>Text (Document) Classification</strong>/<strong>Text (Document) Categorization</strong> is one of the important and typical tasks in supervised <b>ML</b>. This technique allows machines to understand text and categorize it into known, organized groups.</p>
<p>In this post, we will look at how classification of a document dataset can be approached with supervised <strong>ML</strong> algorithms.</p>
<p class="graf graf--p">Some of the <strong>ML</strong> algorithms are:</p>
<ul class="postList">
<li class="graf graf--li"><strong>Naive Bayes.</strong></li>
<li class="graf graf--li"><strong>Decision Trees.</strong></li>
<li class="graf graf--li"><strong>Logistic Regression <em class="markup--em markup--li-em">(Linear Model)</em>.</strong></li>
<li class="graf graf--li"><strong>Support Vector Machines <em class="markup--em markup--li-em">(SVM)</em>.</strong></li>
<li class="graf graf--li"><strong>Random Forest.</strong></li>
<li class="graf graf--li"><strong>K-Means Clustering.</strong></li>
<li class="graf graf--li"><strong>K-Nearest Neighbour.</strong></li>
<li class="graf graf--li"><strong>Gaussian Mixture Model.</strong></li>
<li class="graf graf--li"><strong>Hidden Markov Model. </strong><em>et cetera</em></li>
</ul>
<p class="graf graf--p">Among these <strong>ML</strong> Algorithms, we will discuss how the <strong class="markup--strong markup--p-strong">Naive Bayes</strong>, <strong class="markup--strong markup--p-strong">Logistic Regression</strong> and <strong class="markup--strong markup--p-strong">SVM</strong> classifier models perform on the data-set feature vectors.</p>
<h2>Dataset</h2>
<p><figure id="attachment_681" aria-describedby="caption-attachment-681" style="width: 613px" class="wp-caption alignnone"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="681" data-permalink="https://turbolab.in/text-classification-using-machine-learning/categories/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/categories.jpg?fit=613%2C691&amp;ssl=1" data-orig-size="613,691" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1634302468&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="categories" data-image-description="" data-image-caption="&lt;p&gt;news dataset&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/categories.jpg?fit=266%2C300&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/categories.jpg?fit=613%2C691&amp;ssl=1" class="wp-image-681 size-full" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/categories.jpg?resize=613%2C691&#038;ssl=1" alt="news dataset" width="613" height="691" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/categories.jpg?w=613&amp;ssl=1 613w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/categories.jpg?resize=266%2C300&amp;ssl=1 266w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/categories.jpg?resize=480%2C541&amp;ssl=1 480w" sizes="(max-width: 613px) 100vw, 613px" /><figcaption id="caption-attachment-681" class="wp-caption-text">news dataset</figcaption></figure></p>
<p>The dataset is organized into the above 10 categories, each with 1000 entries, and has <strong>content</strong> and <strong>label</strong> as its two columns. We will refer to this dataset as a dataframe (<strong>df</strong>) in the following code snippets.</p>
<p><figure id="attachment_684" aria-describedby="caption-attachment-684" style="width: 512px" class="wp-caption alignnone"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="684" data-permalink="https://turbolab.in/text-classification-using-machine-learning/datasample/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?fit=512%2C442&amp;ssl=1" data-orig-size="512,442" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1634309405&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="datasample" data-image-description="" data-image-caption="&lt;p&gt;dataset&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?fit=300%2C259&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?fit=512%2C442&amp;ssl=1" class="size-full wp-image-684" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?resize=512%2C442&#038;ssl=1" alt="dataset" width="512" height="442" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?w=512&amp;ssl=1 512w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?resize=300%2C259&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?resize=480%2C414&amp;ssl=1 480w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption id="caption-attachment-684" class="wp-caption-text">dataset</figcaption></figure></p>
<h2 class="graf graf--h3">Data Cleaning</h2>
<p class="graf graf--p">Pre-processing of data will have an impact on the output ie., the accuracy, performance of the model. Some of the data-cleaning steps are as follows</p>
<ul class="postList">
<li class="graf graf--li">Removing Stop Words. (<strong class="markup--strong markup--li-strong">NLTK</strong>)</li>
<li class="graf graf--li">Performing Stemming on the text. (<strong class="markup--strong markup--li-strong">NLTK</strong>)</li>
<li class="graf graf--li">Removing special characters &amp; extra spaces or keeping only Alpha-Numeric characters in the text.</li>
</ul>
<pre class="graf graf--pre"><strong># NLTK python module for stemming and stopwords removal</strong>
<strong>from nltk.stem.snowball import SnowballStemmer</strong>
<strong>from nltk.corpus import stopwords</strong>
<strong>import string, re</strong></pre>
<pre class="graf graf--pre"><strong>stemmer = SnowballStemmer('english') # stemmer</strong>
<strong>t = str.maketrans(dict.fromkeys(string.punctuation)) # special char removal</strong></pre>
<pre class="graf graf--pre"><strong>def clean_text(text):  </strong>
<strong>    ## Remove Punctuation</strong>
<strong>    text = text.translate(t) </strong>
<strong>    text = text.split()</strong></pre>
<pre class="graf graf--pre"><strong>    ## Remove stop words</strong>
<strong>    stops = set(stopwords.words("english"))</strong>
<strong>    text = [stemmer.stem(w) for w in text if not w in stops]</strong>
    
<strong>    text = " ".join(text)</strong>
<strong>    text = re.sub(' +',' ', text) # extra consecutive space removal </strong>
<strong>    return text

df["content"] = df["content"].apply(clean_text)
</strong></pre>
<p>This data-cleaning part is optional &#8211; you can test the model accuracy with and without it. Note that removing stop words and performing stemming can strip contextual meaning from the data.</p>
<p><b>Stemming removes or stems the last few characters of a word</b>, often leading to meaningless words. Lemmatization considers the context and converts the word to its meaningful base form, which is called Lemma.</p>
<p><figure id="attachment_688" aria-describedby="caption-attachment-688" style="width: 380px" class="wp-caption alignnone"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="688" data-permalink="https://turbolab.in/text-classification-using-machine-learning/lemmvsstem/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem.jpg?fit=380%2C82&amp;ssl=1" data-orig-size="380,82" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1634315532&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="lemmVsStem" data-image-description="" data-image-caption="&lt;p&gt;lemmatization Vs Stemming&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem.jpg?fit=300%2C65&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem.jpg?fit=380%2C82&amp;ssl=1" class="size-full wp-image-688" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem.jpg?resize=380%2C82&#038;ssl=1" alt="" width="380" height="82" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem.jpg?w=380&amp;ssl=1 380w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem.jpg?resize=300%2C65&amp;ssl=1 300w" sizes="(max-width: 380px) 100vw, 380px" /><figcaption id="caption-attachment-688" class="wp-caption-text">lemmatization Vs Stemming</figcaption></figure></p>
<p>We have performed the stemming and stop word removal on the <strong>df</strong> before the data transformation process.</p>
<h2>Data Transformation</h2>
<p class="graf graf--p">Transforming the data into feature vectors with the following methods</p>
<ul class="postList">
<li class="graf graf--li">Count Vectorization.</li>
<li class="graf graf--li">TF-IDF Word Vectorization.</li>
<li class="graf graf--li">TF-IDF N-Gram Vectorization.</li>
</ul>
<p>We recommend you go through these feature extraction methods which are explained in detail in one of our <strong><a href="https://turbolab.in/feature-extraction-in-natural-language-processing-nlp/">blogs.</a></strong></p>
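<p>As a quick, hypothetical illustration of what these vectorizers produce (a toy corpus, not the article&#8217;s dataset): each document becomes one row of a sparse matrix, with one column per word or n-gram.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat", "the dog sat", "the dog barked"]

# count vectorization: one column per unique word
count_vect = CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")
counts = count_vect.fit_transform(corpus)
print(sorted(count_vect.vocabulary_))  # 5 unique words -> 5 columns
print(counts.shape)                    # (3, 5): 3 documents x 5 features

# n-gram TF-IDF: columns are 2- and 3-word sequences instead of single words
tfidf_ngram = TfidfVectorizer(analyzer="word", ngram_range=(2, 3))
X = tfidf_ngram.fit_transform(corpus)
print(X.shape)
```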
<p>The dataset (<strong>df</strong>) is split into train and validation samples in a 75/25% ratio by sklearn&#8217;s <strong>train_test_split</strong> function.</p>
<p>Code snippet to transform the data into vectors using the <strong>scikit-learn</strong> (sklearn) module:</p>
<pre class="graf graf--pre"><strong>from sklearn import model_selection, preprocessing</strong>
<strong>from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer</strong></pre>
<pre class="graf graf--pre"><strong>'''Assume df is the dataset with columns "content" and "label"'''</strong></pre>
<pre class="graf graf--pre"><strong># split the data into training and validation</strong>
<strong>train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['content'], df['label'])</strong></pre>
<pre class="graf graf--pre"><strong># label encode the target variable </strong>
<strong>encoder = preprocessing.LabelEncoder()</strong>
<strong>train_y = encoder.fit_transform(train_y)</strong>
<strong>valid_y = encoder.transform(valid_y) # transform, not fit_transform, so validation labels reuse the training encoding</strong></pre>
<pre class="graf graf--pre"><strong># count vectorization </strong>
<strong>count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')</strong>
<strong>count_vect.fit(df['content'])</strong>
<strong>xtrain_count = count_vect.transform(train_x)</strong>
<strong>xvalid_count = count_vect.transform(valid_x)</strong></pre>
<pre class="graf graf--pre"><strong># word level tf-idf vectorization</strong>
<strong>tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)</strong>
<strong>tfidf_vect.fit(df['content'])</strong>
<strong>xtrain_tfidf = tfidf_vect.transform(train_x)</strong>
<strong>xvalid_tfidf = tfidf_vect.transform(valid_x)</strong></pre>
<pre class="graf graf--pre"><strong># ngram level tf-idf vectorization</strong>
<strong>tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000)</strong>
<strong>tfidf_vect_ngram.fit(df['content'])</strong>
<strong>xtrain_tfidf_ngram = tfidf_vect_ngram.transform(train_x)</strong>
<strong>xvalid_tfidf_ngram = tfidf_vect_ngram.transform(valid_x)</strong></pre>
<h2>Training with Naive Bayes, Logistic Regression, and SVM</h2>
<p class="graf graf--p">We use Naive Bayes, Logistic Regression, and SVM algorithms to train the data-set feature vectors to form classifier models that are used for prediction.</p>
<p>From the data transformation snippet above, we have the train and validation feature matrices along with their encoded labels. We will use them here to fit each classifier and evaluate it on the validation sample.</p>
<p>The report_generation function below is shared by the three ML algorithms: it fits the classifier, predicts on the validation data, prints the model&#8217;s accuracy and returns a classification report.</p>
<pre class="graf graf--pre"><strong>from sklearn import </strong><strong class="markup--strong markup--pre-strong">linear_model, naive_bayes, svm, metrics
</strong><strong>from sklearn.metrics import classification_report

<span class="pl-s1">target_names</span> <span class="pl-c1">=</span> <span class="pl-en">list</span>(<span class="pl-s1">encoder</span>.<span class="pl-s1">classes_</span>) <span class="pl-c"># output labels for report generation</span>
</strong></pre>
<pre class="graf graf--pre"><strong>def report_generation(classifier, train_data, valid_data, train_y, valid_y):</strong>
<strong>   classifier.fit(train_data, train_y)</strong>
<strong>   predictions = classifier.predict(valid_data)</strong>
<strong>   print("Accuracy :", metrics.accuracy_score(predictions, valid_y))</strong>
<strong>   report = classification_report(valid_y, predictions, output_dict=True, target_names=target_names)</strong>
<strong>   return report</strong></pre>
<h3>Naive Bayes</h3>
<pre class="graf graf--pre"><strong class="markup--strong markup--pre-strong"># Naive Bayes</strong>
<strong>classifier = naive_bayes.MultinomialNB()
report = report_generation(classifier, xtrain_count, xvalid_count, train_y, valid_y)</strong>
<strong class="markup--strong markup--pre-strong">print("NB Count Vectorizer Report :", report['weighted avg'])</strong>

<strong>#</strong> <strong class="markup--strong markup--pre-strong">Results</strong>
<strong class="markup--strong markup--pre-strong">Accuracy</strong> : 0.9436
<strong class="markup--strong markup--pre-strong">NB Count Vectorizer Report</strong> : {'precision': 0.9448178637882411, 'recall': 0.9436, 'f1-score': 0.9434664656369504, 'support': 2500}

<strong>report = report_generation(classifier, xtrain_tfidf, xvalid_tfidf, train_y, valid_y)
print("NB TFIDF-Word Report :", report['weighted avg'])</strong>

<strong>#</strong> <strong class="markup--strong markup--pre-strong">Results</strong>
<strong class="markup--strong markup--pre-strong">Accuracy</strong> : 0.9416
<strong class="markup--strong markup--pre-strong">NB Count Vectorizer Report</strong> : {'precision': 0.9430346010709252, 'recall': 0.9416, 'f1-score': 0.9416037073783431, 'support': 2500}</pre>
<pre class="graf graf--pre"><strong>report = report_generation(classifier, xtrain_tfidf_ngram, xvalid_tfidf_ngram, train_y, valid_y)
print("NB TFIDF-NGram Report :", report['weighted avg'])</strong>

<strong>#</strong> <strong class="markup--strong markup--pre-strong">Results</strong>
<strong class="markup--strong markup--pre-strong">Accuracy</strong> : 0.9208
<strong class="markup--strong markup--pre-strong">NB Count Vectorizer Report</strong> : {'precision': 0.9233051466162964, 'recall': 0.9208, 'f1-score': 0.9206511260527037, 'support': 2500}</pre>
<h3>Logistic Regression</h3>
<pre class="graf graf--pre"><strong class="markup--strong markup--pre-strong"># Logistic Regression </strong></pre>
<pre class="graf graf--pre"><strong>classifier = linear_model.LogisticRegression()    
report = report_generation(classifier, xtrain_count, xvalid_count, train_y, valid_y)</strong>    
<strong class="markup--strong markup--pre-strong">print("LogisticRegression Count Vectorizer Report :", report['weighted avg'])

</strong><strong># Results</strong>
<strong>Accuracy</strong> : 0.9804
<strong>LogisticRegression Count Vectorizer Report</strong> : {'precision': 0.9806682334322502, 'recall': 0.9804, 'f1-score': 0.9804527264151257, 'support': 2500}

<strong>report = report_generation(classifier, xtrain_tfidf, xvalid_tfidf, train_y, valid_y)</strong> 
<strong class="markup--strong markup--pre-strong">print("LogisticRegression TFIDF-Word Report :", report['weighted avg'])

# Results
</strong><strong>Accuracy</strong> : 0.9792
<strong>LogisticRegression TFIDF-Word Report</strong> : {'precision': 0.9794911617869886, 'recall': 0.9792, 'f1-score': 0.9792657461379974, 'support': 2500}</pre>
<pre class="graf graf--pre"><strong>report = report_generation(classifier, xtrain_tfidf_ngram, xvalid_tfidf_ngram, train_y, valid_y)</strong>    
<strong class="markup--strong markup--pre-strong">print("LogisticRegression TFIDF-NGram Report :", report['weighted avg'])

# Results
</strong><strong>Accuracy</strong> : 0.932
<strong>LogisticRegression TFIDF-NGram Report</strong> : {'precision': 0.9329064009056843, 'recall': 0.932, 'f1-score': 0.9320786137751711, 'support': 2500}</pre>
<h3>SVM</h3>
<pre class="graf graf--pre"><strong class="markup--strong markup--pre-strong"># Support Vector Machines</strong>
    
<strong>classifier = svm.SVC(gamma="scale")    
report = report_generation(classifier, xtrain_count, xvalid_count, train_y, valid_y)</strong>    
<strong class="markup--strong markup--pre-strong">print("SVM Count Vectorizer Report :", report['weighted avg'])</strong> 
<strong class="markup--strong markup--pre-strong">
# Results</strong>
<strong class="markup--strong markup--pre-strong">Accuracy</strong> : 0.9668
<strong class="markup--strong markup--pre-strong">SVM Count Vectorizer Report</strong> : {'precision': 0.9687847838287942, 'recall': 0.9668, 'f1-score': 0.9672306318670637, 'support': 2500}</pre>
<pre class="graf graf--pre"><strong>report = report_generation(classifier, xtrain_tfidf, xvalid_tfidf, train_y, valid_y)  </strong>  
<strong class="markup--strong markup--pre-strong">print("SVM TFIDF-Word Report :", report['weighted avg'])</strong> 
<strong class="markup--strong markup--pre-strong">
# Results
Accuracy</strong> : 0.9804
<strong class="markup--strong markup--pre-strong">SVM TFIDF-Word Report</strong> : {'precision': 0.980766234757573, 'recall': 0.9804, 'f1-score': 0.9804795388691244, 'support': 2500}</pre>
<pre class="graf graf--pre"><strong>report = report_generation(classifier, xtrain_tfidf_ngram, xvalid_tfidf_ngram, train_y, valid_y)</strong> 
<strong class="markup--strong markup--pre-strong">print("SVM TFIDF-NGram Report :", report['weighted avg'])

</strong><strong class="markup--strong markup--pre-strong"># Results</strong>
<strong class="markup--strong markup--pre-strong">Accuracy</strong> : 0.9304
<strong class="markup--strong markup--pre-strong">SVM TFIDF-NGram Report</strong> : {'precision': 0.9324797933370057, 'recall': 0.9304, 'f1-score': 0.9306949638900389, 'support': 2500}</pre>
<h2>Conclusion</h2>
<p>SVM with the TF-IDF word vectorizer and Logistic Regression with the count vectorizer give better accuracy than the other ML algorithms tested.</p>
<p>We also recommend training the models without the data-cleaning step, following the same approach shown above, to check which ML algorithm then works better.</p>
<p><strong>Disclaimer:</strong> We cannot say which model is best here &#8211; it all depends on your data. So how do we decide which ML algorithm suits our data?</p>
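<p>Before reaching for a dedicated library, one simple way to compare candidate models on your own data is k-fold cross-validation. The sketch below uses a synthetic dataset purely for illustration, and swaps in GaussianNB for MultinomialNB because the synthetic features can be negative:</p>

```python
# Compare candidate classifiers with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

results = {}
for name, model in [("NaiveBayes", GaussianNB()),
                    ("LogisticRegression", LogisticRegression(max_iter=1000)),
                    ("SVM", SVC(gamma="scale"))]:
    scores = cross_val_score(model, X, y, cv=5)   # 5 held-out accuracy scores
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

The model with the highest mean cross-validated score on your own data is usually the safer choice than any single train/validation split.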
<p>Selecting the best model for your <strong>ML</strong> problem is definitely a difficult task. There is an awesome python library called <strong>Lazy Predict</strong> which helps to understand which models work better for your data without any parameter tuning. Check out the documentation <strong><a href="https://lazypredict.readthedocs.io/en/latest/">here</a></strong>. In the coming posts, we will discuss the <strong>Lazy Predict</strong> python module with some examples.</p>
<p>The post <a href="https://turbolab.in/text-classification-using-machine-learning/">Text Classification using Machine Learning</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/text-classification-using-machine-learning/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">601</post-id>	</item>
	</channel>
</rss>
