Data Science Archives - Turbolab Technologies

Lazy Predict – Find the best suitable ML model

Anthony — Tue, 18 Jan 2022 06:38:11 +0000

In the earlier post, “text classification using machine learning”, we saw how difficult it is to choose the best ML model and how time-consuming it is to tune each model’s parameters for better accuracy. To address this, we look here at a Python library called “Lazy Predict”. It fits a wide range of models on our data and reports their scores, which helps us shortlist suitable models for classification and regression.

It provides LazyClassifier for classification problems and LazyRegressor for regression problems.

Note: Lazy Predict needs a lot of computational power; in my experience it was slow on high-dimensional data with many features.

Let us see how it works:

First, install this library in your local system

pip install lazypredict

Dataset

We do not focus here on the datasets or on feature extraction and transformation, as those steps were covered in the previous post on “text classification using machine learning”.

To demonstrate Lazy Predict on classification and regression problems, we use the “Drug type” and “Wine quality” datasets, both taken from kaggle.com.

Code

Importing required libraries

import lazypredict

import pandas as pd

from sklearn.model_selection import train_test_split

from lazypredict.Supervised import LazyClassifier, LazyRegressor

Importing data and LazyClassifier model fitting

classificationData = pd.read_csv("drugType.csv")

classificationData.head()

X = classificationData.drop(columns="Drug")

y = classificationData["Drug"]

# Splitting our data into a train and test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

classifiers = LazyClassifier(ignore_warnings=True, custom_metric=None)

models, predictions = classifiers.fit(X_train, X_test, y_train, y_test)

print(models)

Here fit returns two values. The first, models, is a table that lists each fitted model together with its evaluation metrics, such as accuracy.

Importing data and LazyRegressor model fitting

regressionData = pd.read_csv("winequality.csv")

regressionData.head()

X = regressionData.drop(columns="quality")

y = regressionData["quality"]

# Splitting our data into a train and test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

regressors = LazyRegressor(ignore_warnings=True, custom_metric=None)

models, predictions = regressors.fit(X_train, X_test, y_train, y_test)

print(models)

Conclusion

With the “Lazy Predict” library, many different models are fitted on our data, and the results table gives us evaluation metrics for each of them. From that table we can shortlist, say, the top 5 base models by accuracy.

Later we can tune the parameters of those top models and get better accuracy.
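The shortlisting step can be sketched in plain Python. This is only an illustration: the model names and accuracy values below are invented placeholders, and `top_k` is a hypothetical helper, not part of Lazy Predict. In practice the scores would be read from the table that fit() returns.

```python
# Hypothetical accuracy scores, invented for illustration only;
# in practice they would be read from the table that fit() returns.
scores = {
    "model_a": 0.91,
    "model_b": 0.78,
    "model_c": 0.93,
    "model_d": 0.85,
    "model_e": 0.88,
    "model_f": 0.64,
    "model_g": 0.90,
}


def top_k(scores, k=5):
    """Return the k (name, score) pairs with the highest score, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Print the shortlist that would go on to hyperparameter tuning
for name, accuracy in top_k(scores):
    print(f"{name}: {accuracy:.2f}")
```

The shortlisted models are then the candidates worth tuning individually.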

As this library runs many different models at once it takes a lot of computational power. If you have low computational power I would suggest you use Google Colab.

The post Lazy Predict – Find the best suitable ML model appeared first on Turbolab Technologies.

Data Cleaning using Regular Expression

Anthony — Tue, 30 Nov 2021 12:06:01 +0000

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.

Data is not always tabular. In the era of big data it arrives in widely varying formats, including images, text, graphs, and many more. Because the format differs from one dataset to the next, it’s essential to preprocess the data into a form that computers can read.

In this blog, we will go over some Regex (Regular Expression) techniques that you can use in your data cleaning process.

A Regular Expression is a sequence of characters that defines a pattern used to match text, such as particular characters, words, or patterns of characters.

In Python, Regular Expressions (also called REs, regexes, or regex patterns) are available through the built-in ‘re’ module, so you don’t need to install anything separately.

The re module offers a set of functions that let us search a string for a match.

The most commonly used functions provided by the ‘re’ module are:

re.match()

re.search()

re.findall()

re.split()

re.sub()

re.compile()
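As a quick orientation before the cleaning recipes, here is a minimal sketch of what each of these functions does. The sample string is made up for this illustration.

```python
import re

text = "order 66 shipped, order 7 pending"  # made-up sample string

# re.match only matches at the very start of the string
print(re.match(r"order", text) is not None)   # True
print(re.match(r"shipped", text))             # None

# re.search returns the first match found anywhere in the string
print(re.search(r"\d+", text).group())        # 66

# re.findall returns every non-overlapping match as a list
print(re.findall(r"\d+", text))               # ['66', '7']

# re.split cuts the string wherever the pattern matches
print(re.split(r",\s*", text))                # ['order 66 shipped', 'order 7 pending']

# re.sub replaces every match with the given replacement
print(re.sub(r"\d+", "N", text))              # order N shipped, order N pending

# re.compile builds a reusable pattern object
digits = re.compile(r"\d+")
print(digits.findall(text))                   # ['66', '7']
```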

Replacing Multi-Spaces

Removing extra whitespace is an important step because it keeps your data consistently structured.

import re

tweet = "if you   hold an empty    gatorade bottle up to your ear   you can hear the sports"

x = re.sub(r'\s+', " ", tweet)

Input: x

Output: 'if you hold an empty gatorade bottle up to your ear you can hear the sports'

Dealing with Special Characters

If you are working on an NLP project, you will need very clean text, which means getting rid of special characters whose removal does not alter the meaning of the text.

1. Removing special characters and keeping only alphabets and numbers

import re

tweet = "if you hold an empty # gatorade # bottle up to your ear @@ you can hear the sports 100 %%"

# replace special characters with a space, then collapse the repeated spaces

x = re.sub(r"\s+", " ", re.sub("[^a-zA-Z0-9 ]+", " ", tweet)).strip()

Output: 'if you hold an empty gatorade bottle up to your ear you can hear the sports 100'

2. Keeping either of alphabets or numbers

import re

tweet = "if you hold an empty # gatorade # bottle up to your ear @@ you can hear the sports 100 %%"

# keep letters only, then collapse the repeated spaces

x = re.sub(r"\s+", " ", re.sub("[^a-zA-Z ]+", " ", tweet)).strip()

Output: 'if you hold an empty gatorade bottle up to your ear you can hear the sports'

tweet = "if you hold an empty # gatorade # bottle up to your ear @@ you can hear the sports 100 %%"

# keep digits only: drop everything except digits and spaces, then drop the spaces

x = re.sub(" +", "", re.sub("[^0-9 ]+", "", tweet))

Output: '100'

Detect and Remove URLs

Here we use “re.compile” to build a regex pattern once and reuse the compiled pattern later, for example for substitution.

import re

tweet = 'follow this website for more details www.knowmore.com and login to http://login.com'

pattern = re.compile(r"https?://\S+|www\.\S+")

x = re.findall(pattern, tweet)

Input: x

Output: ['www.knowmore.com', 'http://login.com']

# remove urls

z = re.sub(pattern, "", tweet)

Input: z

Output: 'follow this website for more details  and login to '

Detect and Remove HTML Tags

import re

# sample markup, assumed here for illustration

tweet = '<p><b>follow this website for more details.</b></p>'

x = re.findall('<.*?>', tweet)

Input: x

Output: ['<p>', '<b>', '</b>', '</p>']

# remove html tags

z = re.sub('<.*?>', "", tweet)

Input: z

Output: 'follow this website for more details.'

Detect and Remove Email IDs

Here we’ll use “re.search” to find an e-mail ID. re.search() returns only the first occurrence that matches the specified pattern. In contrast, re.findall() scans the whole string and returns all non-overlapping matches of the pattern in a single step.

import re

tweet = "please send your feedback to myemail@gmail.com "

x = re.search(r"[\w\.-]+@[\w\.-]+\.\w+", tweet)

Input: x

Output: <re.Match object; span=(29, 46), match='myemail@gmail.com'>

tweet = "please send your feedback to myemail@gmail.com "

z = re.sub(r"[\w\.-]+@[\w\.-]+\.\w+", "", tweet)

Output: 'please send your feedback to  '

Detect and Remove the Hashtag

import re

tweet = "love to explore. #nature #traveller"

x = re.findall('#[_]*[a-z]+', tweet)

Input: x

Output: ['#nature', '#traveller']

# remove hashtags

z = re.sub('#[_]*[a-z]+', '', tweet).strip()

Input: z

Output: 'love to explore.'

Detect Mentions using re.match() and re.findall()

Here we’ll use re.match and re.findall to detect mentions.

re.match matches the pattern from the start of the string whereas re.findall searches for occurrences of the pattern anywhere in the string.

import re

tweet = "@Bryan appointed as the new team captain"

x = re.match(r"(@\w+)", tweet)

Output: <re.Match object; span=(0, 6), match='@Bryan'>

tweet = "@Bryan appointed as the new team captain announced in @SportsLive"

x = re.findall(r"@\S+", tweet)

Input: x

Output: ['@Bryan', '@SportsLive']

Conclusion

Regular Expressions are very useful for text manipulation in the text cleaning phase of Natural Language Processing (NLP). In this post, we have used the “re.findall”, “re.sub”, “re.search”, “re.match”, and “re.compile” functions, but the re module has many other functions that can help with data processing and manipulation. If you are not yet comfortable with Regular Expressions, we recommend going through Python’s official documentation for the re module.
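The individual recipes from this post can be chained into one cleaning function. The sketch below is one possible arrangement rather than a definitive pipeline: `clean_tweet` is a hypothetical helper, the sample tweet is made up, and the hashtag and mention patterns are simplified to `#\w+` and `@\w+`.

```python
import re

HTML_RE = re.compile(r"<.*?>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+\.\w+")
HASHTAG_RE = re.compile(r"#\w+")   # simplified pattern (assumption)
MENTION_RE = re.compile(r"@\w+")   # simplified pattern (assumption)


def clean_tweet(text):
    """Strip tags, URLs, e-mails, hashtags, mentions and special characters."""
    # Order matters: e-mails are removed before mentions so that the
    # "@domain" part of an address is not mistaken for a mention.
    for pattern in (HTML_RE, URL_RE, EMAIL_RE, HASHTAG_RE, MENTION_RE):
        text = pattern.sub(" ", text)
    # Drop remaining special characters, then collapse repeated whitespace
    text = re.sub(r"[^a-zA-Z0-9 ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


sample = "<b>@Bryan</b> loves #nature, see www.knowmore.com or mail me@site.com !!"
print(clean_tweet(sample))  # loves see or mail
```

Replacing each match with a space and collapsing whitespace at the end avoids gluing neighbouring words together.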

The post Data Cleaning using Regular Expression appeared first on Turbolab Technologies.

Feature Extraction in Natural Language Processing

Anthony — Fri, 08 Oct 2021 10:49:26 +0000

In simple terms, Feature Extraction is transforming textual data into numerical data. In Natural Language Processing, Feature Extraction is an essential step for representing text so that a model can work with it. After cleaning and normalizing textual data, we need to transform it into features for modeling, because machines cannot compute directly on text. So we use a numerical representation for individual words, as numbers are easy for a computer to process.

In this blog, we will discuss various feature extraction methods with examples using sklearn and gensim.

Countvectorizer

TF-IDF Vectorizer

Word Embeddings

Countvectorizer

It is a simple and flexible way of extracting features from documents. A CountVectorizer model is a representation of text that describes the occurrence of words within a document. We just keep track of word counts and disregard grammatical details and word order. It is called a “bag of words” because any information about the order or structure of words in the document is discarded. The model is only concerned with how often known words occur in the document, not where in the document.

Here is a basic snippet of using count vectorization to get vectors

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["We become what we think about", "Happiness is not something readymade."]

# initialize count vectorizer object

vect = CountVectorizer()

# get counts of each token (word) in text data

X = vect.fit_transform(corpus)

# convert sparse matrix to numpy array to view

X.toarray()

# view token vocabulary and counts

print("vocabulary", vect.vocabulary_)

print("shape", X.shape)

print("vectors: ", X.toarray())

Output

Vocabulary : {'we': 8, 'become': 1, 'what': 9, 'think': 7, 'about': 0, 'happiness': 2, 'is': 3, 'not': 4, 'something': 6, 'readymade': 5}

Shape : (2, 10)

Vectors : [[1 1 0 0 0 0 0 1 2 1]

[0 0 1 1 1 1 1 0 0 0]]
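To make the bag-of-words idea concrete, the same counts can be reproduced in plain Python. This is a simplified sketch, not scikit-learn's implementation: `tokenize` and `vectorize` are names invented here, and the tokenizer simply lowercases the text and keeps tokens of two or more word characters, which mirrors CountVectorizer's default behaviour on this small example.

```python
import re
from collections import Counter

corpus = ["We become what we think about",
          "Happiness is not something readymade."]


def tokenize(text):
    # Lowercase, then keep tokens made of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


# Vocabulary: every distinct token, sorted, mapped to a column index
all_tokens = sorted({tok for doc in corpus for tok in tokenize(doc)})
vocabulary = {tok: idx for idx, tok in enumerate(all_tokens)}


def vectorize(text):
    # One count per vocabulary entry, in column order
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]


vectors = [vectorize(doc) for doc in corpus]
print(vocabulary)
print(vectors)  # [[1, 1, 0, 0, 0, 0, 0, 1, 2, 1], [0, 0, 1, 1, 1, 1, 1, 0, 0, 0]]
```

Word order plays no role here: shuffling the words of a sentence gives exactly the same vector.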

TF – IDF Vectorizer (Term Frequency – Inverse Document Frequency)

TF-IDF is short for term frequency-inverse document frequency. It’s designed to reflect how important a word is to a document in a collection or corpus.

The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.

As with CountVectorizer, scikit-learn provides a ready-made class for this in sklearn.feature_extraction.text: TfidfVectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["We become what we think about", "Happiness is not something readymade."]

# initialize tf-idf vectorizer object

vectorizer = TfidfVectorizer()

# compute bag of word counts and tf-idf values

tf = vectorizer.fit_transform(corpus)

# view the vocabulary, the idf weights, and the tf-idf vectors (sparse matrix converted to a numpy array)

print("Vocabulary", vectorizer.vocabulary_)

print("idf", vectorizer.idf_)

print("Vectors", tf.toarray())

Output:

Vocabulary : {'we': 8, 'become': 1, 'what': 9, 'think': 7, 'about': 0, 'happiness': 2, 'is': 3, 'not': 4, 'something': 6, 'readymade': 5}

idf : [1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511
 1.40546511 1.40546511 1.40546511 1.40546511]

Vectors : [[0.35355339 0.35355339 0.         0.         0.         0.
  0.         0.35355339 0.70710678 0.35355339]
 [0.         0.         0.4472136  0.4472136  0.4472136  0.4472136
  0.4472136  0.         0.         0.        ]]
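The numbers in this output can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings: a smoothed idf of ln((1 + n) / (1 + df)) + 1, raw term counts as tf, and L2 normalization of each row. The pre-tokenized documents and the `tfidf` helper are written for this illustration only.

```python
import math
from collections import Counter

# The two documents, already lowercased and tokenized
docs = [["we", "become", "what", "we", "think", "about"],
        ["happiness", "is", "not", "something", "readymade"]]

n_docs = len(docs)
vocab = sorted({tok for doc in docs for tok in doc})

# Document frequency: in how many documents each term appears
df = {tok: sum(tok in doc for doc in docs) for tok in vocab}

# Smoothed inverse document frequency (assumed default formula)
idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}


def tfidf(doc):
    counts = Counter(doc)
    raw = [counts[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(value * value for value in raw))  # L2 norm of the row
    return [value / norm for value in raw]


row = tfidf(docs[0])
print(round(idf["we"], 8))             # 1.40546511
print(round(row[vocab.index("we")], 8))     # 0.70710678
print(round(row[vocab.index("about")], 8))  # 0.35355339
```

Because every term occurs in exactly one of the two documents, all idf values are equal, so the only difference within a row comes from the term counts: "we" appears twice and gets twice the weight of the other words.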

Word Embeddings

Word embedding is a learned representation of text, where each word is represented as a real-valued vector in a lower-dimensional space.

In simple terms, word embeddings are the texts converted into numbers and there may be different numerical representations of the same text, but texts with similar context have similar representations.

Word embedding preserves contexts and relationships of words so that it detects similar words more accurately.

Word embedding has several different implementations such as word2vec, GloVe, FastText etc.

Here we will explain word2vec, as it is the most popular implementation.

Word2vec

Word2vec is widely used in NLP models. It transforms every word into a vector and captures the contextual meaning of words well: word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another.

There are two neural embedding algorithms:

Continuous Bag-of-Words (CBOW) – predicts target word from context
Skip-gram – predicts context from the target word
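The difference between the two algorithms shows up in the training examples they are built from. The sketch below only generates those examples for a toy sentence; it does not train anything, and `training_pairs` is a hypothetical helper written for this illustration.

```python
def training_pairs(tokens, window=2):
    """Build CBOW and skip-gram training examples from one tokenized sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Words within `window` positions to the left and right of the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # CBOW: the whole context is the input, the target word is the label
        cbow.append((context, target))
        # Skip-gram: the target word is the input, each context word is a label
        skipgram.extend((target, word) for word in context)
    return cbow, skipgram


cbow, skipgram = training_pairs(["survey", "computer", "system", "response"], window=1)
print(cbow[1])       # (['survey', 'system'], 'computer')
print(skipgram[:3])  # [('survey', 'computer'), ('computer', 'survey'), ('computer', 'system')]
```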

Here is an example of Word2vec using Gensim. Gensim is a python library for NLP.

from gensim.models import Word2Vec

# Get document data.

common_texts = [['interface', 'computer', 'technology'],
                ['survey', 'computer', 'system', 'response'],
                ['brother', 'boy', 'man', 'animal', 'human']]

# Initializing Model

model = Word2Vec(common_texts, window=5, min_count=1, workers=4)

Result 1 :

# Get most similar words of "computer"

model.wv.most_similar("computer")

Output :

[('technology', 0.21617145836353302),
 ('system', 0.09291724115610123),
 ('interface', 0.06285080313682556),
 ('survey', 0.027057476341724396),
 ('response', 0.016134709119796753),
 ('human', -0.010839173570275307),
 ('boy', -0.02775038219988346),
 ('animal', -0.052346907556056976),
 ('brother', -0.05987627059221268),
 ('man', -0.111670583486557)]

Result 2 :

# Get most similar words of "human"

model.wv.most_similar("human")

Output :

[('man', 0.0679759532213211),
 ('survey', 0.03364055976271629),
 ('brother', 0.00939119141548872),
 ('boy', 0.004503018222749233),
 ('computer', -0.010839177295565605),
 ('animal', -0.02365921437740326),
 ('technology', -0.09575347602367401),
 ('response', -0.11410721391439438),
 ('system', -0.11555543541908264),
 ('interface', -0.13429945707321167)]
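The scores returned by most_similar are cosine similarities between word vectors, which is why they fall between -1 and 1. A minimal pure-Python sketch of that measure follows; the two-dimensional vectors are toy values chosen for illustration, not real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (unrelated)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

With a corpus of only three short sentences the model has very little to learn from, so scores close to zero like the ones above are to be expected; a larger corpus is needed for meaningful neighbours.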

Conclusion

In this post, we have looked at different types of text Feature Extraction methods, moving from count-based methods (count vectorizer/BoW and TF-IDF) to context-preserving methods (Word Embeddings). We have explored these methods practically using the Scikit-learn (sklearn) and Gensim libraries.

There are other advanced techniques for Word Embeddings like Facebook’s FastText. We will discuss them in our coming blogs.

Apart from Word Embeddings, dimensionality reduction is also a Feature Extraction technique. It aims to reduce the number of features in a dataset by creating new features from the existing ones and then discarding the original features.

Techniques that you can explore for dimensionality reduction include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), t-distributed Stochastic Neighbor Embedding (t-SNE), and many more.

The post Feature Extraction in Natural Language Processing appeared first on Turbolab Technologies.

Abstractive Summarization Using Google’s T5

Vasista Reddy — Mon, 04 Oct 2021 04:04:00 +0000

In this article, we will discuss abstractive summarization using T5, and how it is different from BERT-based models.

T5 (Text-To-Text Transfer Transformer) is a transformer model that is trained in an end-to-end manner with text as input and modified text as output, in contrast to BERT-style models that can only output either a class label or a span of the input. This text-to-text formatting makes the T5 model fit for multiple NLP tasks like Summarization, Question-Answering, Machine Translation, and Classification problems.

How is T5 different from BERT?

Both T5 and BERT are trained with an MLM (Masked Language Model) objective.

What is MLM?

MLM is a fill-in-the-blank task: part of the input text is masked, and the model tries to predict what the masked words should be.

Example:

Masked input: “I like to eat peanut butter and [MASK] sandwiches,”

Prediction: “I like to eat peanut butter and jelly sandwiches,”

The main difference is that T5 replaces a run of consecutive masked tokens with a single mask token, unlike BERT, which uses one mask token for each word. This illustration is shown below.

Source: Journal of Machine Learning
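The two masking styles can be sketched on a toy sentence. This is a simplified illustration rather than either model's actual preprocessing: the functions are hypothetical, whole words stand in for tokens, the masked positions are chosen by hand, and the `<extra_id_N>` markers follow the sentinel naming used by the HuggingFace T5 tokenizer.

```python
def bert_style_mask(tokens, masked):
    """Replace every masked position with its own [MASK] token."""
    return ["[MASK]" if i in masked else tok for i, tok in enumerate(tokens)]


def t5_style_mask(tokens, masked):
    """Replace each run of consecutive masked positions with one sentinel token."""
    inputs, targets = [], []
    sentinel = 0
    previous_masked = False
    for i, tok in enumerate(tokens):
        if i in masked:
            if not previous_masked:
                # A new masked span starts: emit one sentinel on both sides
                marker = f"<extra_id_{sentinel}>"
                inputs.append(marker)
                targets.append(marker)
                sentinel += 1
            targets.append(tok)
            previous_masked = True
        else:
            inputs.append(tok)
            previous_masked = False
    return inputs, targets


tokens = "I like to eat peanut butter and jelly sandwiches".split()
masked = {4, 5, 7}  # "peanut butter" (one span) and "jelly" (another span)

print(" ".join(bert_style_mask(tokens, masked)))
# I like to eat [MASK] [MASK] and [MASK] sandwiches

t5_inputs, t5_targets = t5_style_mask(tokens, masked)
print(" ".join(t5_inputs))   # I like to eat <extra_id_0> and <extra_id_1> sandwiches
print(" ".join(t5_targets))  # <extra_id_0> peanut butter <extra_id_1> jelly
```

The two-word span "peanut butter" costs BERT two mask tokens but T5 only one sentinel, and T5 learns to generate the missing span as output text.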

About T5 Models

Google has released pre-trained T5 text-to-text models, which are trained on a large unlabelled text corpus called C4 (Colossal Clean Crawled Corpus). C4 is about 800 GB of cleaned text extracted from the web. The cleaning process involves deduplication, discarding incomplete sentences, and removing offensive or noisy content.

You can get these T5 pre-trained models from the HuggingFace website:

T5-small with 60 million parameters.
T5-base with 220 million parameters.
T5-large with 770 million parameters.
T5-3B with 3 billion parameters.
T5-11B with 11 billion parameters.

T5 expects a prefix before the input text so that it knows which task the user wants. For example, “summarize:” for summarization, “cola sentence:” for classification, and “translate English to Spanish:” for machine translation. You can have a look at the image below to see this illustrated.

Source: Google AI Blog

Every task we consider uses text as input to the model, which is trained to generate some target text. This allows us to use the same model, loss function, and hyperparameters across our diverse set of tasks including translation (green), linguistic acceptability (red), sentence similarity (yellow), and document summarization (blue).
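Since the task is selected purely by the text prefix, preparing an input is plain string concatenation. A minimal sketch follows, using the prefixes quoted in this article; `TASK_PREFIXES`, its keys, and `build_input` are hypothetical names introduced for this illustration.

```python
# Prefixes as quoted in this article; the dictionary keys are made-up labels
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "acceptability": "cola sentence: ",
    "translation_en_es": "translate English to Spanish: ",
}


def build_input(task, text):
    """Prepend the task prefix so the model knows what to do with the text."""
    return TASK_PREFIXES[task] + text.strip()


print(build_input("summarization", "  Huawei overtook Samsung in Q2 2020. "))
# summarize: Huawei overtook Samsung in Q2 2020.
```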

Besides the improved transformer architecture and massive unsupervised training data, better decoding methods have also played an important role. Currently, the most prominent decoding methods are Greedy Search, Beam Search, Top-K Sampling, and Top-p Sampling.

Visit this link to know the detailed information about these methods.
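To give a feel for how three of these methods differ, here is a toy sketch that works on a single next-token distribution. The probabilities and function names are invented for illustration; real decoders apply these rules at every generation step over the model's full vocabulary. Beam Search is left out because it tracks several partial sequences at once rather than filtering one distribution.

```python
def greedy(probs):
    """Greedy Search: always pick the single most likely token."""
    return max(probs, key=probs.get)


def top_k_filter(probs, k):
    """Top-K Sampling: keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p):
    """Top-p Sampling: keep the smallest set of tokens whose mass reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(q for _, q in kept)
    return {tok: q / total for tok, q in kept}


# Made-up distribution over the next token
next_token = {"phones": 0.5, "devices": 0.3, "tablets": 0.15, "cars": 0.05}

print(greedy(next_token))                    # phones
print(list(top_k_filter(next_token, 2)))     # ['phones', 'devices']
print(list(top_p_filter(next_token, 0.9)))   # ['phones', 'devices', 'tablets']
```

After filtering, the sampling methods draw the next token at random from the renormalized distribution, for example with random.choices.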

Using T5 through the HuggingFace transformers:

HuggingFace Transformers is an open-source NLP library that helps load pre-trained models, much as scikit-learn does for machine learning algorithms.

We define the content we are going to summarize.

content = """China's Huawei overtook Samsung Electronics as the world's biggest seller of mobile phones in the second quarter of 2020, shipping 55.8 million devices compared to Samsung's 53.7 million, according to data from research firm Canalys. While Huawei's sales fell 5 per cent from the same quarter a year earlier, South Korea's Samsung posted a bigger drop of 30 per cent, owing to disruption from the coronavirus in key markets such as Brazil, the United States and Europe, Canalys said. Huawei's overseas shipments fell 27 per cent in Q2 from a year earlier, but the company increased its dominance of the China market which has been faster to recover from COVID-19 and where it now sells over 70 per cent of its phones. "Our business has demonstrated exceptional resilience in these difficult times," a Huawei spokesman said. "Amidst a period of unprecedented global economic slowdown and challenges, we're continued to grow and further our leadership position." Nevertheless, Huawei's position as number one seller may prove short-lived once other markets recover given it is mainly due to economic disruption, a senior Huawei employee with knowledge of the matter told Reuters. Apple is due to release its Q2 iPhone shipment data on Friday."""

Importing the necessary packages

from transformers import T5Tokenizer, T5ForConditionalGeneration

Loading the tokenizer and model architecture with weights

T5_PATH = 't5-large' # T5 model name

# initialize the model architecture and weights

t5_model = T5ForConditionalGeneration.from_pretrained(T5_PATH)

# initialize the model tokenizer

t5_tokenizer = T5Tokenizer.from_pretrained(T5_PATH)

The pre-trained model used here is t5-large. Other pre-trained models of t5 are discussed above.

Encode the text

# encode the text into tensor of integers using the tokenizer

inputs = t5_tokenizer.encode("summarize: " + content, return_tensors="pt", max_length=512, padding='max_length', truncation=True)

Generate the summarized text and decode it

# summary length bounds in tokens (one of the settings tried below)

min_length = 50

max_length = 100

summary_ids = t5_model.generate(inputs,
                                num_beams=2,
                                no_repeat_ngram_size=3,
                                length_penalty=2.0,
                                min_length=min_length,
                                max_length=max_length,
                                early_stopping=True)

output = t5_tokenizer.decode(summary_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

The decoding method used here is Beam Search with num_beams value as 2.

With min_length 50 and max_length 50, the output is:

“Huawei overtakes Samsung as world’s biggest seller of mobile phones in second quarter of 2020. Company shipped 55.8 million devices compared to Samsung’s 53.7 million, Canalys says. Sales of Huawei’s”

and the time taken to generate the summary is 8.07 seconds on a 16-core CPU host.

With min_length 50 and max_length 100, the output is:

“Huawei overtakes Samsung as world’s biggest seller of mobile phones in second quarter of 2020. Company shipped 55.8 million devices compared to Samsung’s 53.7 million, Canalys says. Sales fell 5% from same quarter a year earlier, owing to disruption from coronavirus. But company increased its dominance of the china market which has been faster to recover from COVID-19.”

and the time taken to generate the summary is 14.32 seconds on a 16-core CPU host.

With min_length 100 and max_length 200, the output is:

“Huawei overtakes Samsung as world’s biggest seller of mobile phones in second quarter of 2020. Company shipped 55.8 million devices compared to Samsung’s 53.7 million, Canalys says. Sales fell 5% from same quarter a year earlier, owing to disruption from coronavirus. But Huawei increased its dominance of the china market which has been faster to recover from COVID-19.. Apple is due to release its Q2 iPhone shipment data on friday.”

and the time taken to generate the summary is 23.15 seconds on a 16-core CPU host.

As you increase any of the parameters num_beams, min_length, and max_length, the time taken to generate the summary increases.

Conclusion

In this article, we have used the Beam Search decoding method. For a better summary, we suggest increasing the number of beams and trying the other decoding methods mentioned above (Greedy Search, Top-K Sampling, and Top-p Sampling).

With Pegasus, we can only perform abstractive summarization, but T5 can perform various NLP tasks such as classification (e.g. sentiment analysis), question answering, machine translation, and document summarization. We recommend exploring T5 on these other NLP tasks.

The post Abstractive Summarization Using Google’s T5 appeared first on Turbolab Technologies.

Sentiment Analysis: Concepts, Models, and Examples

Anthony — Mon, 27 Sep 2021 06:30:32 +0000

Sentiment analysis is a subfield of Natural Language Processing (NLP) that identifies and extracts the emotions expressed in a given text. It uses machine learning and related techniques to determine the polarity of the text: positive, neutral, or negative.

This article will discuss what sentiment analysis is, where it is being used, and how to use a pre-trained model to analyze sentiments from texts.

We will also explore how Machine Learning models are used to build sentiment analysis tools.

Use cases of sentiment analysis:

Brand Monitoring
Customers Feedback
Product Analytics
Monitoring Market Research
Analyzing Movie Reviews

There are various pre-trained sentiment analysis tools available in Natural Language Processing (NLP) libraries, such as NLTK’s VADER sentiment analysis tool, TextBlob, and the Flair sentiment classifier based on an LSTM neural network.

Part 1- Sentiment analysis using a pre-trained model (TextBlob)

TextBlob is a Python library for Natural Language Processing (NLP). It helps you perform complex analysis and operations on textual data.

Steps to apply the TextBlob model to get sentiments are given here:

Before applying TextBlob, basic text cleaning should be done. You can check the NLTK or spaCy libraries for various text cleaning methods.

from textblob import TextBlob

def sentimental(text: str) -> str:

    sentiment = None

    if text:

        text = ' '.join(text.split()).strip() # collapsing extra whitespace

        # polarity is a float in the range [-1, 1]

        polarity = TextBlob(text).sentiment.polarity

        if polarity > 0:

            sentiment = 'Positive'

        elif polarity < 0:

            sentiment = 'Negative'

        else:

            sentiment = 'Neutral'

    return sentiment

Output Result:

sentimental(“This is what a true masterpiece looks like. Thrillingly played with a flawless ensemble cast.”)

Out: 'Positive'

TextBlob returns the ‘polarity’ of a sentence, a value in the range [-1, 1].

Values below 0 indicate negative sentiment, 0 indicates neutral, and values above 0 indicate positive sentiment.

Part 2 – Train a Machine Learning Model for sentiment analysis

In this part, we will train a supervised Machine Learning model, a Support Vector Machine (SVM).

Data Gathering:

Here we use the sentiment polarity dataset v2.0, a labelled dataset of movie reviews, converted into CSV files.

Data is divided into “trainData” and “testData”. The dataset contains “Content” and “Label” columns.

Data Vectorization

Before feeding our model with data, we need to extract features from our textual dataset, basically converting the text data into vectors. TF-IDF is one of many methods to extract features from text documents. TF-IDF stands for ‘Term Frequency – Inverse Document Frequency’.

from sklearn.feature_extraction.text import TfidfVectorizer

# Create feature vectors

vectorizer = TfidfVectorizer(min_df = 5,

                             max_df = 0.8,

                             sublinear_tf = True,

                             use_idf = True)

train_vectors = vectorizer.fit_transform(trainData['Content'])

test_vectors = vectorizer.transform(testData['Content'])

Model Building

After generating vectors for both train and test input sets, we can now feed the SVC model with this data and train it.

# importing libraries

from sklearn import svm

from sklearn.metrics import classification_report

# Initialising SVM classifier with linear kernel

svm_classifier = svm.SVC(kernel='linear')

# training the model with the train data

svm_classifier.fit(train_vectors, trainData['Label'])

# testing the model in test data content

predicted_result = svm_classifier.predict(test_vectors)

# results

report = classification_report(testData['Label'], predicted_result, output_dict=True)

print('Model accuracy: ', report['accuracy'])

Model Results and Statistics:

Model accuracy: 0.915

Model accuracy shows the ratio of the number of correctly predicted classes to the total number of input samples. Accuracy is one of many metrics used for evaluating classification problems.

Here the accuracy is 0.915 on a scale from 0 to 1, which shows that the model classifies the test data quite well.

Testing the Model to Predict on Movie Reviews:

# transform expects a list of documents, so the single review is wrapped in a list

svm_classifier.predict(vectorizer.transform(["This is what a true masterpiece looks like. Thrillingly played with a flawless ensemble cast."]))[0]

Out: 'Positive'

Classification accuracy alone can be misleading if the classes contain unequal numbers of observations. A confusion matrix gives a better idea of what the model is predicting correctly and where it goes wrong.

Here we have taken 200 test samples. The confusion matrix shows 9 false positives, where the model has falsely predicted a negative review as positive, and 8 false negatives, where it has falsely predicted a positive review as negative.
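These counts tie back to the reported accuracy. A short worked check follows, using only the numbers given in this article; the variable names are chosen for this sketch.

```python
total_samples = 200
false_positives = 9    # negative reviews predicted as positive
false_negatives = 8    # positive reviews predicted as negative

# Every sample that is not misclassified is a correct prediction
correct = total_samples - false_positives - false_negatives
accuracy = correct / total_samples

print(correct, accuracy)  # 183 0.915
```

So 183 of the 200 test reviews are classified correctly, which is exactly the 0.915 accuracy printed by the classification report.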

To reduce these errors we can train the model with a larger dataset.

Conclusion

In this article, we have used the TextBlob (pre-trained) Python package and an SVM (Machine Learning) model for sentiment analysis. Sentiment analysis remains an exciting research direction because of the large number of real-world applications where discovering people’s opinions is important for better decision-making.

Detecting sentiment using NLP is, however, a surprisingly difficult task, for example when sentences are phrased sarcastically. This kind of textual context can mislead NLP-based model predictions. We can also see that the two models do not give the same prediction for every sample. Here the TextBlob model does better at tagging ‘neutral’ articles. This is because TextBlob can return a polarity of 0, which we map to ‘Neutral’, whereas the SVM can only predict labels that appear in its training data.

To handle such difficult cases, we can use deep learning models such as RNNs and LSTMs. We can also make use of transformer-based models like GPT-3 and Google’s T5 for sentiment analysis.

The post Sentiment Analysis: Concepts, Models, and Examples appeared first on Turbolab Technologies.
