<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Data Science Archives - Turbolab Technologies</title>
	<atom:link href="https://turbolab.in/tag/data-science/feed/" rel="self" type="application/rss+xml" />
	<link>https://turbolab.in/tag/data-science/</link>
	<description>Big Data and News Analysis Startup in Kochi</description>
	<lastBuildDate>Tue, 18 Jan 2022 13:39:52 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://i0.wp.com/turbolab.in/wp-content/uploads/2018/03/turbo_black_trans-space.png?fit=32%2C32&#038;ssl=1</url>
	<title>Data Science Archives - Turbolab Technologies</title>
	<link>https://turbolab.in/tag/data-science/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">98237731</site>	<item>
		<title>Lazy Predict &#8211; Find the best suitable ML model</title>
		<link>https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/</link>
					<comments>https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/#respond</comments>
		
		<dc:creator><![CDATA[Anthony]]></dc:creator>
		<pubDate>Tue, 18 Jan 2022 06:38:11 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[prediction]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regression]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=869</guid>

					<description><![CDATA[<p>In the earlier post “text classification using machine learning”, we saw how difficult it is to select the best ML model and how time-consuming it is to tune model parameters for better accuracy. To address this, we will look at a Python library called “Lazy Predict”. This module helps us [&#8230;]</p>
<p>The post <a href="https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/">Lazy Predict &#8211; Find the best suitable ML model</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400">In the earlier post “<a href="https://turbolab.in/text-classification-using-machine-learning/">text classification using machine learning</a>”, we saw how difficult it is to select the best ML model and how time-consuming it is to tune model parameters for better accuracy. To address this, we will look at a Python library called “<a href="https://lazypredict.readthedocs.io/en/latest/">Lazy Predict</a>”. This module helps us find the best-suited models for classification and regression based on our data.</span></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400">It provides LazyClassifier for classification problems and LazyRegressor for regression problems. </span></p>
<ul>
<li><strong>Note: </strong>Lazy Predict needs a lot of computational power; in my experience it was slow on high-dimensional data with many features.</li>
</ul>
<p>&nbsp;</p>
<p><b>Let us see how it works:</b></p>
<p><span style="font-weight: 400">First, install the library on your local system:</span></p>
<blockquote><p><i><span style="font-weight: 400">pip install lazypredict</span></i></p></blockquote>
<p>&nbsp;</p>
<p>&nbsp;</p>
<h3><b>Dataset</b></h3>
<p>&nbsp;</p>
<p><span style="font-weight: 400">Here we do not focus on the dataset or its feature extraction and transformation steps, as these were covered in the previous post on “</span><a href="https://turbolab.in/text-classification-using-machine-learning/"><span style="font-weight: 400">text classification using machine learning</span></a><span style="font-weight: 400">”. </span></p>
<p><span style="font-weight: 400">To demonstrate Lazy Predict on classification and regression problems, we use the &#8220;Drug type&#8221; and &#8220;Wine quality&#8221; datasets, both taken from kaggle.com.</span></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<h3><b>Code</b></h3>
<p>&nbsp;</p>
<h4><b>Importing required libraries</b></h4>
<p>&nbsp;</p>
<blockquote><p><i><span style="font-weight: 400">import lazypredict</span></i></p>
<p><i><span style="font-weight: 400">import pandas as pd </span></i></p>
<p><i><span style="font-weight: 400">from sklearn.model_selection import train_test_split </span></i></p>
<p><i><span style="font-weight: 400">from lazypredict.Supervised import LazyClassifier, LazyRegressor</span></i></p></blockquote>
<p>&nbsp;</p>
<h4><b>Importing data and LazyClassifier model fitting</b></h4>
<p>&nbsp;</p>
<blockquote><p><i><span style="font-weight: 400">classificationData = pd.read_csv("drugType.csv")</span></i></p>
<p><i><span style="font-weight: 400">classificationData.head()</span></i></p></blockquote>
<p>&nbsp;</p>
<p><img data-recalc-dims="1" fetchpriority="high" decoding="async" data-attachment-id="870" data-permalink="https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/screenshot-from-2022-01-10-20-37-38/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-37-38.png?fit=406%2C185&amp;ssl=1" data-orig-size="406,185" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot from 2022-01-10 20-37-38" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-37-38.png?fit=406%2C185&amp;ssl=1" class="wp-image-870 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-37-38.png?resize=443%2C203&#038;ssl=1" alt="" width="443" height="203" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-37-38.png?resize=300%2C137&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-37-38.png?w=406&amp;ssl=1 406w" sizes="(max-width: 443px) 100vw, 443px" /></p>
<p>&nbsp;</p>
<blockquote><p><i><span style="font-weight: 400">X = classificationData.drop(columns="Drug")</span></i></p>
<p><i><span style="font-weight: 400">y = classificationData["Drug"]</span></i></p>
<p><i><span style="font-weight: 400"># Splitting our data into a train and test set</span></i></p>
<p><i><span style="font-weight: 400">X_train, X_test, y_train, y_test = train_test_split(X, y,</span></i></p>
<p><i><span style="font-weight: 400">                                                    test_size=0.2,</span></i></p>
<p><i><span style="font-weight: 400">                                                    random_state=42)</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400">classifiers = LazyClassifier(ignore_warnings=True, custom_metric=None)</span></i></p>
<p><i><span style="font-weight: 400">models,predictions = classifiers.fit(X_train, X_test, y_train, y_test)</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400">print(models)</span></i></p></blockquote>
<p>&nbsp;</p>
<p><img data-recalc-dims="1" decoding="async" data-attachment-id="871" data-permalink="https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/screenshot-from-2022-01-10-20-53-05/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?fit=767%2C568&amp;ssl=1" data-orig-size="767,568" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot from 2022-01-10 20-53-05" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?fit=767%2C568&amp;ssl=1" class="wp-image-871 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?resize=534%2C395&#038;ssl=1" alt="" width="534" height="395" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?resize=300%2C222&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?resize=480%2C355&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?w=767&amp;ssl=1 767w" sizes="(max-width: 534px) 100vw, 534px" /></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400">Here fit returns two values. The first, models, is a table that lists each model name together with its evaluation metrics, such as accuracy. </span></p>
<p>&nbsp;</p>
<h4><b>Importing data and LazyRegressor model fitting</b></h4>
<p>&nbsp;</p>
<blockquote><p><i><span style="font-weight: 400">regressionData = pd.read_csv("winequality.csv")</span></i></p>
<p><i><span style="font-weight: 400">regressionData.head()</span></i></p></blockquote>
<p>&nbsp;</p>
<p><img data-recalc-dims="1" decoding="async" data-attachment-id="875" data-permalink="https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/screenshot-from-2022-01-11-17-38-16/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?fit=1063%2C188&amp;ssl=1" data-orig-size="1063,188" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot from 2022-01-11 17-38-16" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?fit=800%2C141&amp;ssl=1" class=" wp-image-875 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?resize=663%2C117&#038;ssl=1" alt="" width="663" height="117" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?resize=300%2C53&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?resize=768%2C136&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?resize=1024%2C181&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?resize=980%2C173&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?resize=480%2C85&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?w=1063&amp;ssl=1 
1063w" sizes="(max-width: 663px) 100vw, 663px" /></p>
<p>&nbsp;</p>
<blockquote><p><i><span style="font-weight: 400">X = regressionData.drop(columns="quality")</span></i></p>
<p><i><span style="font-weight: 400">y = regressionData["quality"]</span></i></p>
<p><i><span style="font-weight: 400"># Splitting our data into a train and test set</span></i></p>
<p><i><span style="font-weight: 400">X_train, X_test, y_train, y_test = train_test_split(X, y,</span></i></p>
<p><i><span style="font-weight: 400">                                                    test_size=0.2, random_state = 42)</span></i></p>
<p><i><span style="font-weight: 400">regressors = LazyRegressor(ignore_warnings=True, custom_metric=None)</span></i></p>
<p><i><span style="font-weight: 400">models, predictions = regressors.fit(X_train, X_test, y_train, y_test)</span></i></p>
<p><i><span style="font-weight: 400">print(models)</span></i></p></blockquote>
<p>&nbsp;</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="882" data-permalink="https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/screenshot-from-2022-01-11-17-43-05-2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?fit=629%2C696&amp;ssl=1" data-orig-size="629,696" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot from 2022-01-11 17-43-05" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?fit=629%2C696&amp;ssl=1" class=" wp-image-882 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?resize=515%2C570&#038;ssl=1" alt="" width="515" height="570" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?resize=271%2C300&amp;ssl=1 271w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?resize=480%2C531&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?w=629&amp;ssl=1 629w" sizes="(max-width: 515px) 100vw, 515px" /></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<h3><b>Conclusion</b></h3>
<p>&nbsp;</p>
<p><span style="font-weight: 400">When we use the “Lazy Predict” library, many different models are fitted on our data, and the results give us evaluation metrics for each model. From these results we can select the top 5 base models by score. </span></p>
<p><span style="font-weight: 400">Later we can tune the parameters of those top models to get better accuracy. </span></p>
<p><span style="font-weight: 400">As this library runs many different models at once, it takes a lot of computational power. If your machine has limited computational power, I would suggest using Google Colab.</span></p>
<p>&nbsp;</p>
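<p>The selection step can be sketched in plain Python. This is only an illustration: the model names and scores below are made-up placeholders standing in for one metric column of the results table, and <i>top_models</i> is a hypothetical helper, not part of Lazy Predict.</p>

```python
# Placeholder scores (invented for illustration, not real results).
# In practice these would come from one metric column of the results table.
scores = {
    "model_a": 0.91,
    "model_b": 0.87,
    "model_c": 0.95,
    "model_d": 0.62,
    "model_e": 0.78,
    "model_f": 0.89,
}


def top_models(scores, n=5):
    """Return the n (name, score) pairs with the highest score, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


for name, score in top_models(scores):
    print(f"{name}: {score:.2f}")
```

<p>The short list produced this way is the set of candidates worth spending tuning time on.</p>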
<p>The post <a href="https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/">Lazy Predict &#8211; Find the best suitable ML model</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">869</post-id>	</item>
		<item>
		<title>Data Cleaning using Regular Expression</title>
		<link>https://turbolab.in/data-cleaning-using-regular-expression/</link>
					<comments>https://turbolab.in/data-cleaning-using-regular-expression/#respond</comments>
		
		<dc:creator><![CDATA[Anthony]]></dc:creator>
		<pubDate>Tue, 30 Nov 2021 12:06:01 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[data cleaning]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[text cleaning]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=779</guid>

					<description><![CDATA[<p>Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. The data format is not always tabular. As we are entering the era of big data, the data comes in an extensively diverse format, including images, texts, graphs, and many more. Because the format is pretty [&#8230;]</p>
<p>The post <a href="https://turbolab.in/data-cleaning-using-regular-expression/">Data Cleaning using Regular Expression</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400">Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.</span></p>
<p><span style="font-weight: 400">The data format is not always tabular. As we enter the era of big data, data comes in a wide variety of formats, including images, text, graphs, and many more. Because the format varies so much from one dataset to another, it’s essential to preprocess the data into a format that computers can read.</span></p>
<p><span style="font-weight: 400">In this blog, we will go over some Regex (Regular Expression) techniques that you can use in your data cleaning process.</span></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400">A Regular Expression is a sequence of characters used to match strings of text such as particular characters, words, or patterns of characters.</span></p>
<p><span style="font-weight: 400">In Python, regular expressions (REs, regexes, or regex patterns) are available through the &#8216;re&#8217; module, which is built into Python, so you don’t need to install it separately.</span></p>
<p><span style="font-weight: 400">The re module offers a set of functions that allow us to search a string for a match.</span></p>
<p><span style="font-weight: 400">The most commonly used functions provided by the ‘re’ module are:</span></p>
<p>&nbsp;</p>
<ul>
<li><strong>re.match()</strong></li>
<li><strong>re.search()</strong></li>
<li><strong>re.findall()</strong></li>
<li><strong>re.split()</strong></li>
<li><strong>re.sub()</strong></li>
<li><strong>re.compile()</strong></li>
</ul>
<p>&nbsp;</p>
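<p>As a quick orientation, here is a minimal sketch that runs each of the functions listed above on one sample string. The sample text and variable names are made up for illustration.</p>

```python
import re

text = "order 66 shipped, order 67 pending"

# re.match only succeeds if the pattern matches at the start of the string.
starts_with_order = re.match(r"order", text) is not None
starts_with_shipped = re.match(r"shipped", text) is not None

# re.search returns the first match found anywhere in the string.
first_number = re.search(r"\d+", text).group()

# re.findall returns every non-overlapping match as a list of strings.
all_numbers = re.findall(r"\d+", text)

# re.split cuts the string wherever the pattern matches.
parts = re.split(r",\s*", text)

# re.sub replaces every match with the replacement string.
masked = re.sub(r"\d+", "N", text)

# re.compile builds a reusable pattern object that offers the same methods.
digits = re.compile(r"\d+")

print(starts_with_order, starts_with_shipped)  # True False
print(first_number)                            # 66
print(all_numbers)                             # ['66', '67']
print(parts)                                   # ['order 66 shipped', 'order 67 pending']
print(masked)                                  # order N shipped, order N pending
print(digits.findall(text))                    # ['66', '67']
```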
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Replacing Multi-Spaces</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>Removing extra white spaces from data is an important step as it makes your data look well structured.</strong></p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = &#8220;if       you hold an empty gatorade bottle up to your ear   you can hear      the sports&#8221;</span></i></p>
<p><i><span style="font-weight: 400">x = re.sub(&#8216;\s+&#8217;, &#8221; &#8220;, tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: &#8216;if you hold an empty gatorade bottle up to your ear you can hear the sports</span></i></p>
<p>&nbsp;</p></blockquote>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Dealing with Special Characters</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>If you are working on an NLP project, you will often need to clean your text and remove special characters that do not contribute to its meaning, for instance:</strong></p>
<p>&nbsp;</p>
<h4><b>1.   Removing special characters and keeping only letters and numbers</b></h4>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "if you hold an empty # gatorade # bottle up to your ear @@ you can hear the sports 100 %%"</span></i></p>
<p><i><span style="font-weight: 400"># replace special characters with a space, then collapse repeated spaces</span></i></p>
<p><i><span style="font-weight: 400">x = re.sub(r"\s+", " ", re.sub(r"[^a-zA-Z0-9 ]+", " ", tweet)).strip()</span></i></p>
<p><i><span style="font-weight: 400">Output: 'if you hold an empty gatorade bottle up to your ear you can hear the sports 100'</span></i></p></blockquote>
<p>&nbsp;</p>
<h4><b>2. Keeping only letters or only numbers</b></h4>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "if you hold an empty # gatorade # bottle up to your ear @@ you can hear the sports 100 %%"</span></i></p>
<p><i><span style="font-weight: 400">x = re.sub(r"\s+", " ", re.sub(r"[^a-zA-Z ]+", " ", tweet)).strip()</span></i></p>
<p><i><span style="font-weight: 400">Output: 'if you hold an empty gatorade bottle up to your ear you can hear the sports'</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400">tweet = "if you hold an empty # gatorade # bottle up to your ear @@ you can hear the sports 100 %%"</span></i></p>
<p><i><span style="font-weight: 400">x = re.sub(" +", "", re.sub("[^0-9 ]+", "", tweet))</span></i></p>
<p><i><span style="font-weight: 400">Output: '100'</span></i></p></blockquote>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect and Remove URLs</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>Here we are using “re.compile” to generate a regex pattern and use that saved pattern later for substitution, if needed.</strong></p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = &#8216;follow this website for more details www.knowmore.com and login to http://login.com&#8217;</span></i></p>
<p><i><span style="font-weight: 400">pattern = re.compile(r&#8221;https?://\S+|www\.\S+&#8221;)</span></i></p>
<p><i><span style="font-weight: 400">x = re.findall(pattern, tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: [&#8216;www.knowmore.com&#8217;, &#8216;</span></i><a href="http://login.com"><i><span style="font-weight: 400">http://login.com</span></i></a><i><span style="font-weight: 400">&#8216;]</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400"># remove urls </span></i></p>
<p><i><span style="font-weight: 400">z = re.sub(pattern, “”, tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: z</span></i></p>
<p><i><span style="font-weight: 400">Output: follow this website for more details and login to</span></i></p></blockquote>
<p>&nbsp;</p>
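<p>The benefit of re.compile shows when the same pattern is applied to many strings. Below is a minimal sketch that assumes the URL pattern from this section; the helper name <i>strip_urls</i> and the sample tweets are hypothetical.</p>

```python
import re

# Compile once and reuse for every string (same URL pattern as in this section).
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def strip_urls(text):
    """Remove URLs, then collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()


tweets = [
    "docs at https://example.com/docs and more",
    "visit www.example.org today",
    "no links here",
]
cleaned = [strip_urls(t) for t in tweets]
print(cleaned)
# ['docs at and more', 'visit today', 'no links here']
```

<p>Note that \S+ consumes everything up to the next whitespace, so punctuation attached to the end of a URL is removed along with it.</p>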
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect and Remove HTML Tags<br />
</b></h3>
</li>
</ul>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = '&lt;p&gt;follow this &lt;b&gt;website&lt;/b&gt; for more details. &lt;/p&gt;'</span></i></p>
<p><i><span style="font-weight: 400">x = re.findall('&lt;.*?&gt;', tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: ['&lt;p&gt;', '&lt;b&gt;', '&lt;/b&gt;', '&lt;/p&gt;']</span></i></p>
<p><i><span style="font-weight: 400"># remove html tags and the trailing space left behind</span></i></p>
<p><i><span style="font-weight: 400">z = re.sub('&lt;.*?&gt;', "", tweet).strip()</span></i></p>
<p><i><span style="font-weight: 400">Input: z</span></i></p>
<p><i><span style="font-weight: 400">Output: 'follow this website for more details.'</span></i></p></blockquote>
<p>&nbsp;</p>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect and Remove Email IDs </b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>Here we&#8217;ll use “re.search” to find an e-mail ID. re.search() returns only the first occurrence that matches the specified pattern. In contrast, re.findall() scans the whole string and returns all non-overlapping matches of the pattern in a single step.</strong></p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = &#8220;please send your feedback to myemail@gmail.com &#8220;</span></i></p>
<p><i><span style="font-weight: 400">x = re.search(&#8220;[\w\.-]+@[\w\.-]+\.\w+&#8221;, tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: &lt;re.Match object; span=(29, 40), match=&#8217;</span></i><a href="mailto:my@gmal.com"><i><span style="font-weight: 400">myemail@gmail.com</span></i></a><i><span style="font-weight: 400">&#8216;&gt;</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400">tweet = &#8220;please send your feedback to myemail@gmail.com &#8220;</span></i></p>
<p><i><span style="font-weight: 400">z = re.sub(&#8220;[\w\.-]+@[\w\.-]+\.\w+&#8221;, ””, tweet)</span></i></p>
<p><i><span style="font-weight: 400">Output: please send your feedback to</span></i></p></blockquote>
<p>&nbsp;</p>
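<p>The difference between re.search and re.findall is easier to see when the text contains more than one address. This is a small sketch using the same e-mail pattern; the sample sentence and addresses are made up.</p>

```python
import re

EMAIL = r"[\w.-]+@[\w.-]+\.\w+"
text = "write to first@example.com or second@example.org"

# re.search stops at the first match and returns a match object.
first_match = re.search(EMAIL, text)
print(first_match.group())  # first@example.com
print(first_match.span())   # (9, 26)

# re.findall scans the whole string and returns all matches as strings.
all_matches = re.findall(EMAIL, text)
print(all_matches)  # ['first@example.com', 'second@example.org']
```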
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect and Remove the Hashtag</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = &#8220;love to explore. #nature #traveller&#8221;</span></i></p>
<p><i><span style="font-weight: 400">x = re.findall(&#8216;#[_]*[a-z]+&#8217;,tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: </span></i><i><span style="font-weight: 400">[&#8216;#nature&#8217;, &#8216;#traveller&#8217;]</span></i></p>
<p><i><span style="font-weight: 400"># remove html tags</span></i></p>
<p><i><span style="font-weight: 400">z = re.sub(</span></i><i><span style="font-weight: 400">&#8216;#[_]*[a-z]+&#8217;, ‘ ’, tweet</span></i><i><span style="font-weight: 400">)</span></i></p>
<p><i><span style="font-weight: 400">Input: z</span></i></p>
<p><i><span style="font-weight: 400">Output: </span></i><i><span style="font-weight: 400">&#8220;love to explore.”</span></i></p></blockquote>
<p>&nbsp;</p>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect Mentions using re.match() and re.findall()</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>Here we&#8217;ll use re.match and re.findall to detect mentions. </strong></p>
<p><strong>re.match matches the pattern from the start of the string whereas re.findall searches for occurrences of the pattern anywhere in the string.</strong></p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = &#8220;@Bryan appointed as the new team captain&#8221;</span></i></p>
<p><i><span style="font-weight: 400">x = re.match(&#8220;(@\w+)&#8221;, tweet)</span></i></p>
<p><i><span style="font-weight: 400">Output: &lt;re.Match object; span=(0, 6), match=&#8217;@Bryan&#8217;&gt;</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400">tweet = &#8220;@Bryan appointed as the new team captain announced in @SportsLive&#8221;</span></i></p>
<p><i><span style="font-weight: 400">x = re.findall(&#8220;@\S+&#8221;, tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: [ &#8216;@Bryan&#8217;, &#8216;@SportsLive&#8217;]</span></i></p></blockquote>
<p>&nbsp;</p>
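<p>To make the contrast concrete, here is a small sketch with a made-up tweet in which the mention is not at the start of the string. It also shows how \w+ and \S+ differ when punctuation follows a mention.</p>

```python
import re

tweet = "congrats to @Bryan, announced on @SportsLive!"

# re.match is anchored at position 0, so it finds nothing here.
at_start = re.match(r"@\w+", tweet)
print(at_start)  # None

# re.search finds the first mention wherever it occurs.
first_mention = re.search(r"@\w+", tweet).group()
print(first_mention)  # @Bryan

# \w+ stops at punctuation, while \S+ keeps the trailing "," and "!".
word_mentions = re.findall(r"@\w+", tweet)
greedy_mentions = re.findall(r"@\S+", tweet)
print(word_mentions)    # ['@Bryan', '@SportsLive']
print(greedy_mentions)  # ['@Bryan,', '@SportsLive!']
```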
<h3><b>Conclusion</b></h3>
<p>&nbsp;</p>
<p><span style="font-weight: 400">Regular expressions are very useful for text manipulation in the text cleaning phase of Natural Language Processing (NLP). In this post, we used the “re.findall”, “re.sub”, “re.search”, “re.match”, and “re.compile” functions, but the re module has many other functions that can help with data processing and manipulation. If you are not yet comfortable with regular expressions, we recommend going through Python’s official documentation on <a href="https://docs.python.org/3/library/re.html">regex</a>.</span></p>
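<p>The individual steps from this post can be chained into one cleaning function. The sketch below is one possible arrangement, not a definitive pipeline: <i>clean_text</i> and <i>STEPS</i> are hypothetical names, the hashtag pattern is simplified to #\w+ (an assumption, broader than the one used earlier), and HTML tag removal is left out for brevity.</p>

```python
import re

# Order matters: e-mail IDs are removed before mentions because both contain "@".
STEPS = [
    re.compile(r"https?://\S+|www\.\S+"),  # URLs
    re.compile(r"[\w.-]+@[\w.-]+\.\w+"),   # e-mail IDs
    re.compile(r"@\w+"),                   # mentions
    re.compile(r"#\w+"),                   # hashtags (simplified pattern)
    re.compile(r"[^a-zA-Z0-9 ]+"),         # remaining special characters
]


def clean_text(text):
    """Apply each removal step, then collapse repeated whitespace."""
    for pattern in STEPS:
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


raw = "@Bryan   loves #nature! Mail me@example.com or see www.example.com :)"
print(clean_text(raw))  # loves Mail or see
```

<p>Which steps to keep depends on the task; for example, hashtags or mentions may carry useful signal in some projects and should then be left in.</p>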
<p>The post <a href="https://turbolab.in/data-cleaning-using-regular-expression/">Data Cleaning using Regular Expression</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/data-cleaning-using-regular-expression/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">779</post-id>	</item>
		<item>
		<title>Feature Extraction in Natural Language Processing</title>
		<link>https://turbolab.in/feature-extraction-in-natural-language-processing-nlp/</link>
					<comments>https://turbolab.in/feature-extraction-in-natural-language-processing-nlp/#respond</comments>
		
		<dc:creator><![CDATA[Anthony]]></dc:creator>
		<pubDate>Fri, 08 Oct 2021 10:49:26 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[bag of words]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[feature extraction]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[tfidf]]></category>
		<category><![CDATA[word embeddings]]></category>
		<category><![CDATA[word2vec]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=629</guid>

					<description><![CDATA[<p>In simple terms, Feature Extraction is transforming textual data into numerical data. In Natural Language Processing, Feature Extraction is an essential step for better understanding the context. After cleaning and normalizing textual data, we need to transform it into features for modeling, because machine learning models cannot operate on raw text. [&#8230;]</p>
<p>The post <a href="https://turbolab.in/feature-extraction-in-natural-language-processing-nlp/">Feature Extraction in Natural Language Processing</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400;">In simple terms, Feature Extraction is transforming textual data into numerical data. In Natural Language Processing, Feature Extraction is an essential step for better understanding the context. After cleaning and normalizing textual data, we need to transform it into features for modeling, because machine learning models cannot operate on raw text. So we use a numerical representation for individual words, as it’s easy for the computer to process numbers.</span></p>
<p><span style="font-weight: 400;">In this blog, we will discuss various feature extraction methods with examples using sklearn and gensim.</span></p>
<p>&nbsp;</p>
<ul>
<li><strong>CountVectorizer</strong></li>
<li><strong>TF-IDF Vectorizer</strong></li>
<li><strong>Word Embeddings</strong></li>
</ul>
<p>&nbsp;</p>
<h2><strong>CountVectorizer</strong></h2>
<p>&nbsp;</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="632" data-permalink="https://turbolab.in/feature-extraction-in-natural-language-processing-nlp/countvectorizer-2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/countvectorizer-1.png?fit=1726%2C800&amp;ssl=1" data-orig-size="1726,800" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="countvectorizer" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/countvectorizer-1.png?fit=800%2C371&amp;ssl=1" class="alignnone wp-image-632" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/countvectorizer-1.png?resize=642%2C297&#038;ssl=1" alt="" width="642" height="297" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/countvectorizer-1.png?resize=300%2C139&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/countvectorizer-1.png?resize=768%2C356&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/countvectorizer-1.png?resize=1024%2C475&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/countvectorizer-1.png?resize=1080%2C501&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/countvectorizer-1.png?resize=1280%2C593&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/countvectorizer-1.png?resize=980%2C454&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/countvectorizer-1.png?resize=480%2C222&amp;ssl=1 480w, 
https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/countvectorizer-1.png?w=1726&amp;ssl=1 1726w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/countvectorizer-1.png?w=1600&amp;ssl=1 1600w" sizes="(max-width: 642px) 100vw, 642px" /></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400;">CountVectorizer is a simple and flexible way of extracting features from documents. It represents text by the number of times each word occurs within a document. We just keep track of word counts and disregard grammatical details and word order. This representation is called a “bag of words” because any information about the order or structure of words in the document is discarded. The model is only concerned with how often known words occur in the document, not where in the document they occur.</span></p>
<p><span style="font-weight: 400;">Here is a basic snippet that uses count vectorization to get vectors:</span></p>
<p>&nbsp;</p>
<blockquote><p><strong><i>from sklearn.feature_extraction.text import CountVectorizer</i></strong></p>
<p>&nbsp;</p>
<p><strong><i>corpus = ["We become what we think about", "Happiness is not something readymade."]</i></strong></p>
<p>&nbsp;</p>
<p><strong><i># initialize count vectorizer object</i></strong></p>
<p><strong><i>vect = CountVectorizer()</i></strong></p>
<p>&nbsp;</p>
<p><strong><i># get counts of each token (word) in text data</i></strong></p>
<p><strong><i>X = vect.fit_transform(corpus)</i></strong></p>
<p>&nbsp;</p>
<p><strong><i># convert sparse matrix to numpy array to view</i></strong></p>
<p><strong><i>X.toarray()</i></strong></p>
<p>&nbsp;</p>
<p><strong><i># view token vocabulary and counts</i></strong></p>
<p><strong><i>print("vocabulary", vect.vocabulary_)</i></strong></p>
<p><strong><i>print("shape", X.shape)</i></strong></p>
<p><strong><i>print('vectors: ', X.toarray())</i></strong></p></blockquote>
<p>&nbsp;</p>
<h4><strong>Output</strong></h4>
<p>&nbsp;</p>
<blockquote><p><em><span style="font-weight: 400;"><strong>Vocabulary</strong> :  {&#8216;we&#8217;: 8, &#8216;become&#8217;: 1, &#8216;what&#8217;: 9, &#8216;think&#8217;: 7, &#8216;about&#8217;: 0, &#8216;happiness&#8217;: 2, &#8216;is&#8217;: 3, &#8216;not&#8217;: 4, &#8216;something&#8217;: 6, &#8216;readymade&#8217;: 5}</span></em></p>
<p>&nbsp;</p>
<p><em><span style="font-weight: 400;"><strong>Shape</strong> :  (2, 10)</span></em></p>
<p>&nbsp;</p>
<p><em><span style="font-weight: 400;"><strong>Vectors</strong> :  [[1 1 0 0 0 0 0 1 2 1]</span></em></p>
<p><em><span style="font-weight: 400;">                 [0 0 1 1 1 1 1 0 0 0]]</span></em></p></blockquote>
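<p><span style="font-weight: 400;">To see where these numbers come from, the counting can be reproduced by hand. The following is a minimal sketch in plain Python, not scikit-learn's actual implementation. It assumes the two short sentences from the output above, lowercasing, tokens of two or more word characters, and an alphabetically sorted vocabulary, which is how CountVectorizer behaves with its default settings. The helper names <i>tokenize</i> and <i>to_vector</i> are made up for this illustration.</span></p>

```python
import re
from collections import Counter

corpus = ["We become what we think about",
          "Happiness is not something readymade."]

def tokenize(text):
    # Lowercase and keep runs of two or more word characters,
    # mirroring CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted alphabetically and numbered.
all_tokens = sorted({tok for doc in corpus for tok in tokenize(doc)})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

def to_vector(text):
    # One slot per vocabulary entry, filled with that word's count.
    counts = Counter(tokenize(text))
    vec = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # words outside the vocabulary are ignored
            vec[vocabulary[tok]] = n
    return vec

vectors = [to_vector(doc) for doc in corpus]
print(vocabulary)
print(vectors)
```

<p><span style="font-weight: 400;">Note that "we" appears twice in the first sentence, which is why its slot (index 8) holds a 2.</span></p>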
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<h3></h3>
<h2><b>TF &#8211; IDF Vectorizer (Term Frequency &#8211; Inverse Document Frequency)</b></h2>
<p>&nbsp;</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="642" data-permalink="https://turbolab.in/feature-extraction-in-natural-language-processing-nlp/tf-idf/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/tf-idf.png?fit=1200%2C399&amp;ssl=1" data-orig-size="1200,399" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="tf-idf" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/tf-idf.png?fit=800%2C266&amp;ssl=1" class="alignnone wp-image-642" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/tf-idf.png?resize=705%2C235&#038;ssl=1" alt="" width="705" height="235" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/tf-idf.png?resize=300%2C100&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/tf-idf.png?resize=768%2C255&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/tf-idf.png?resize=1024%2C340&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/tf-idf.png?resize=1080%2C359&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/tf-idf.png?resize=980%2C326&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/tf-idf.png?resize=480%2C160&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/tf-idf.png?w=1200&amp;ssl=1 1200w" sizes="(max-width: 705px) 100vw, 705px" /></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400;">TF-IDF is short for term frequency-inverse document frequency. It’s designed to reflect how important a word is to a document in a collection or corpus.</span></p>
<p><span style="font-weight: 400;">The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.</span></p>
<p><span style="font-weight: 400;">As with CountVectorizer, </span><i><span style="font-weight: 400;">sklearn.feature_extraction.text</span></i><span style="font-weight: 400;"> provides a ready-made class, TfidfVectorizer.</span></p>
<p>&nbsp;</p>
<blockquote><p><strong><i>from sklearn.feature_extraction.text import TfidfVectorizer</i></strong></p>
<p>&nbsp;</p>
<p><strong><i>corpus = ["We become what we think about", "Happiness is not something readymade."]</i></strong></p>
<p>&nbsp;</p>
<p><strong><i># initialize tf-idf vectorizer object</i></strong></p>
<p><strong><i>vectorizer = TfidfVectorizer()</i></strong></p>
<p>&nbsp;</p>
<p><strong><i># compute bag of word counts and tf-idf values</i></strong></p>
<p><strong><i>tf = vectorizer.fit_transform(corpus)</i></strong></p>
<p>&nbsp;</p>
<p><strong><i># view the vocabulary, the idf values, and the tf-idf vectors (sparse matrix converted to a numpy array)</i></strong></p>
<p><strong><i>print("Vocabulary", vectorizer.vocabulary_)</i></strong></p>
<p><strong><i>print("idf", vectorizer.idf_)</i></strong></p>
<p><strong><i>print("Vectors", tf.toarray())</i></strong></p></blockquote>
<p>&nbsp;</p>
<h4><b>Output:</b></h4>
<p>&nbsp;</p>
<blockquote><p><em><span style="font-weight: 400;"><strong>Vocabulary</strong> : {&#8216;we&#8217;: 8, &#8216;become&#8217;: 1, &#8216;what&#8217;: 9, &#8216;think&#8217;: 7, &#8216;about&#8217;: 0, &#8216;happiness&#8217;: 2, &#8216;is&#8217;: 3, &#8216;not&#8217;: 4, &#8216;something&#8217;: 6, &#8216;readymade&#8217;: 5}</span></em></p>
<p>&nbsp;</p>
<p><em><span style="font-weight: 400;"><strong>idf</strong> : [1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511</span></em></p>
<p><em><span style="font-weight: 400;"> 1.40546511 1.40546511 1.40546511 1.40546511]</span></em></p>
<p>&nbsp;</p>
<p><em><span style="font-weight: 400;"><strong>Vectors</strong> : [[0.35355339 0.35355339 0.         0.         0.         0.</span></em></p>
<p><em><span style="font-weight: 400;">       0.         0.35355339 0.70710678 0.35355339]</span></em></p>
<p><em><span style="font-weight: 400;">      [0.         0.         0.4472136  0.4472136  0.4472136  0.4472136</span></em></p>
<p><em><span style="font-weight: 400;">       0.4472136  0.         0.         0.        ]]</span></em></p></blockquote>
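<p><span style="font-weight: 400;">The values in this output can be checked by hand. The sketch below is a worked calculation rather than scikit-learn's code: it assumes the default settings of TfidfVectorizer, namely the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 and L2 normalization of each document vector. The function name <i>smooth_idf</i> is made up for this illustration.</span></p>

```python
import math

n_docs = 2
# Term counts for the first document ("We become what we think about").
counts = {"about": 1, "become": 1, "think": 1, "we": 2, "what": 1}
# Every term in this tiny corpus occurs in exactly one document.
doc_freq = {term: 1 for term in counts}

def smooth_idf(df, n):
    # idf = ln((1 + n) / (1 + df)) + 1, the default when smooth_idf=True.
    return math.log((1 + n) / (1 + df)) + 1

# Raw tf-idf weight = term count * idf.
weights = {t: c * smooth_idf(doc_freq[t], n_docs) for t, c in counts.items()}

# L2-normalize so the document vector has unit length.
norm = math.sqrt(sum(w * w for w in weights.values()))
tfidf = {t: w / norm for t, w in weights.items()}

print(round(smooth_idf(1, n_docs), 8))  # 1.40546511
print({t: round(v, 8) for t, v in tfidf.items()})
```

<p><span style="font-weight: 400;">Because every word has the same idf here, the normalized weights depend only on the counts: "we" (count 2) gets 2/√8 ≈ 0.7071 and the other words get 1/√8 ≈ 0.3536.</span></p>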
<p>&nbsp;</p>
<p>&nbsp;</p>
<h2><b>Word Embeddings</b></h2>
<p>&nbsp;</p>
<p><span style="font-weight: 400;">Word embedding is a learned representation of text, where each word is represented as a real-valued vector in a lower-dimensional space.</span></p>
<p><span style="font-weight: 400;">In simple terms, word embeddings convert text into numbers. The same text can have different numerical representations, but words that appear in similar contexts receive similar representations.</span></p>
<p><span style="font-weight: 400;">Because word embeddings preserve the contexts and relationships of words, they detect similar words more accurately than count-based methods.</span></p>
<p><span style="font-weight: 400;">Word embedding has several different implementations, such as word2vec, GloVe, and FastText.</span></p>
<p><span style="font-weight: 400;">Here we will explain word2vec, as it is one of the most popular implementations.</span></p>
<p>&nbsp;</p>
<h3><b>Word2vec</b></h3>
<p>&nbsp;</p>
<p><span style="font-weight: 400;">Word2vec is widely used in NLP models. It transforms every word into a vector and can capture the contextual meaning of words very well. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another.</span></p>
<p><span style="font-weight: 400;">Word2vec can be trained with either of two neural embedding algorithms:</span></p>
<p>&nbsp;</p>
<ul>
<li><b>Continuous Bag-of-Words (CBOW) &#8211; </b>predicts target word from context</li>
<li><b>Skip-gram &#8211; </b>predicts context from the target word</li>
</ul>
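<p><span style="font-weight: 400;">The difference between the two algorithms is easiest to see in the training examples they are built from. The following is a simplified sketch of how such examples could be generated from a sentence; it leaves out the neural network itself, and the function names and the <i>window</i> parameter are chosen for this illustration rather than taken from any library.</span></p>

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def cbow_examples(tokens, window=2):
    # CBOW: the whole context is the input, the target word is the label.
    return [(context, target)
            for target, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    # Skip-gram: the target word is the input, each context word is a label.
    return [(target, word)
            for target, context in context_windows(tokens, window)
            for word in context]

sentence = ["we", "become", "what", "we", "think", "about"]
print(cbow_examples(sentence, window=1)[:2])
print(skipgram_examples(sentence, window=1)[:3])
```

<p><span style="font-weight: 400;">CBOW produces one example per word position, while skip-gram produces one example per (target, context word) pair, so skip-gram sees more training examples for the same text.</span></p>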
<p>&nbsp;</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="638" data-permalink="https://turbolab.in/feature-extraction-in-natural-language-processing-nlp/word2vec_auto_x2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/word2vec_auto_x2.jpg?fit=3248%2C1228&amp;ssl=1" data-orig-size="3248,1228" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;1&quot;}" data-image-title="word2vec_auto_x2" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/word2vec_auto_x2.jpg?fit=800%2C302&amp;ssl=1" class="alignnone wp-image-638" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/word2vec_auto_x2.jpg?resize=669%2C252&#038;ssl=1" alt="" width="669" height="252" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/word2vec_auto_x2.jpg?resize=300%2C113&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/word2vec_auto_x2.jpg?resize=768%2C290&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/word2vec_auto_x2.jpg?resize=1024%2C387&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/word2vec_auto_x2.jpg?resize=1080%2C408&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/word2vec_auto_x2.jpg?resize=1280%2C484&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/word2vec_auto_x2.jpg?resize=980%2C371&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/word2vec_auto_x2.jpg?resize=480%2C181&amp;ssl=1 480w, 
https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/word2vec_auto_x2.jpg?w=1600&amp;ssl=1 1600w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/word2vec_auto_x2.jpg?w=2400&amp;ssl=1 2400w" sizes="(max-width: 669px) 100vw, 669px" /></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400;">Here is an example of Word2vec using Gensim. Gensim is a Python library for NLP.</span></p>
<p>&nbsp;</p>
<blockquote><p><strong><i>from gensim.models import Word2Vec</i></strong></p>
<p>&nbsp;</p>
<p><strong><i># Get document data.</i></strong></p>
<p><strong><i>common_texts = [['interface', 'computer', 'technology'],</i></strong></p>
<p><strong><i> ['survey', 'computer', 'system', 'response'],</i></strong></p>
<p><strong><i> ['brother', 'boy', 'man', 'animal', 'human']]</i></strong></p>
<p>&nbsp;</p>
<p><strong><i># Initializing Model</i></strong></p>
<p><strong><i>model = Word2Vec(common_texts, window=5, min_count=1, workers=4)</i></strong></p></blockquote>
<p>&nbsp;</p>
<h3><strong>Result 1 :</strong></h3>
<p>&nbsp;</p>
<p><b># Get most similar words of "computer"</b></p>
<p><b>model.wv.most_similar("computer")</b></p>
<p>&nbsp;</p>
<h4><b>Output :</b><span style="font-weight: 400;"> </span></h4>
<p>&nbsp;</p>
<blockquote><p><span style="font-weight: 400;">[(&#8216;technology&#8217;, 0.21617145836353302),</span></p>
<p><span style="font-weight: 400;"> (&#8216;system&#8217;, 0.09291724115610123),</span></p>
<p><span style="font-weight: 400;"> (&#8216;interface&#8217;, 0.06285080313682556),</span></p>
<p><span style="font-weight: 400;"> (&#8216;survey&#8217;, 0.027057476341724396),</span></p>
<p><span style="font-weight: 400;"> (&#8216;response&#8217;, 0.016134709119796753),</span></p>
<p><span style="font-weight: 400;"> (&#8216;human&#8217;, -0.010839173570275307),</span></p>
<p><span style="font-weight: 400;"> (&#8216;boy&#8217;, -0.02775038219988346),</span></p>
<p><span style="font-weight: 400;"> (&#8216;animal&#8217;, -0.052346907556056976),</span></p>
<p><span style="font-weight: 400;"> (&#8216;brother&#8217;, -0.05987627059221268),</span></p>
<p><span style="font-weight: 400;"> (&#8216;man&#8217;, -0.111670583486557)]</span></p></blockquote>
<p>&nbsp;</p>
<h3><b>Result 2 :</b></h3>
<p>&nbsp;</p>
<p><b># Get most similar words of "human"</b></p>
<p><b>model.wv.most_similar("human")</b></p>
<p>&nbsp;</p>
<h4><b>Output :</b></h4>
<p>&nbsp;</p>
<blockquote><p><span style="font-weight: 400;">[(&#8216;man&#8217;, 0.0679759532213211),</span></p>
<p><span style="font-weight: 400;"> (&#8216;survey&#8217;, 0.03364055976271629),</span></p>
<p><span style="font-weight: 400;"> (&#8216;brother&#8217;, 0.00939119141548872),</span></p>
<p><span style="font-weight: 400;"> (&#8216;boy&#8217;, 0.004503018222749233),</span></p>
<p><span style="font-weight: 400;"> (&#8216;computer&#8217;, -0.010839177295565605),</span></p>
<p><span style="font-weight: 400;"> (&#8216;animal&#8217;, -0.02365921437740326),</span></p>
<p><span style="font-weight: 400;"> (&#8216;technology&#8217;, -0.09575347602367401),</span></p>
<p><span style="font-weight: 400;"> (&#8216;response&#8217;, -0.11410721391439438),</span></p>
<p><span style="font-weight: 400;"> (&#8216;system&#8217;, -0.11555543541908264),</span></p>
<p><span style="font-weight: 400;"> (&#8216;interface&#8217;, -0.13429945707321167)]</span></p></blockquote>
<p>&nbsp;</p>
<h2><b>Conclusion</b></h2>
<p>&nbsp;</p>
<p><span style="font-weight: 400;">In this post, we looked at different text feature extraction methods, moving from count-based vectorization (CountVectorizer/bag of words and TF-IDF) to context-preserving word embeddings. We explored these methods in practice using the Scikit-learn (sklearn) and Gensim libraries.</span></p>
<p><span style="font-weight: 400;">There are other advanced techniques for Word Embeddings like Facebook&#8217;s FastText. We will discuss them in our coming blogs.</span></p>
<p><span style="font-weight: 400;">Apart from Word Embeddings, Dimensionality Reduction is also a Feature Extraction technique. It aims to reduce the number of features in a dataset by creating new features from the existing ones and then discarding the original features.</span></p>
<p><span style="font-weight: 400;">Dimensionality reduction techniques that you can explore include Principal Components Analysis (PCA), Linear Discriminant Analysis (LDA), t-distributed Stochastic Neighbor Embedding (t-SNE), and many more.</span></p>
<p>The post <a href="https://turbolab.in/feature-extraction-in-natural-language-processing-nlp/">Feature Extraction in Natural Language Processing</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/feature-extraction-in-natural-language-processing-nlp/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">629</post-id>	</item>
		<item>
		<title>Abstractive Summarization Using Google&#8217;s T5</title>
		<link>https://turbolab.in/abstractive-summarization-using-googles-t5/</link>
					<comments>https://turbolab.in/abstractive-summarization-using-googles-t5/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Mon, 04 Oct 2021 04:04:00 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[abstractive summarization]]></category>
		<category><![CDATA[bert]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[text summmarization]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=592</guid>

					<description><![CDATA[<p>In this article, we will discuss abstractive summarization using T5, and how it is different from BERT-based models. T5 (Text-To-Text Transfer Transformer) is a transformer model that is trained in an end-to-end manner with text as input and modified text as output, in contrast to BERT-style models that can only output either a class label [&#8230;]</p>
<p>The post <a href="https://turbolab.in/abstractive-summarization-using-googles-t5/">Abstractive Summarization Using Google&#8217;s T5</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400">In this article, we will discuss abstractive summarization using T5, and how it is different from BERT-based models.</span></p>
<p><span style="font-weight: 400">T5 (Text-To-Text Transfer Transformer) is a transformer model that is trained in an end-to-end manner with </span><b>text as input</b><span style="font-weight: 400"> and modified </span><b>text as output</b><span style="font-weight: 400">,</span><span style="font-weight: 400"> in contrast to BERT-style models that can only output either a class label or a span of the input. </span><span style="font-weight: 400">This text-to-text formatting makes the T5 model fit for multiple <strong>NLP</strong> tasks like <strong>Summarization</strong>, <strong>Question-Answering</strong>, <strong>Machine Translation</strong>, and <strong>Classification</strong> problems.</span></p>
<h2><span style="font-weight: 400">How is T5 different from BERT?</span></h2>
<p><span style="font-weight: 400">Both T5 and BERT are trained with a masked language modeling (MLM) approach. </span></p>
<p><strong>What is MLM? </strong></p>
<p><span style="font-weight: 400">MLM is a fill-in-the-blank task: part of the input text is masked, and the model tries to predict what the masked word should be.</span></p>
<p><span style="font-weight: 400">Example:</span></p>
<ul>
<li><b><i>“I like to eat peanut butter and &lt;MASK&gt; sandwiches,”</i></b></li>
</ul>
<ul>
<li><b><i>“I like to eat peanut butter and </i></b><b>jelly</b><b><i> sandwiches,”</i></b></li>
</ul>
<p><span style="font-weight: 400">The key difference is that T5 replaces a run of consecutive tokens with a single mask token, whereas BERT uses one mask token for each word. This is illustrated below.</span></p>
<figure id="attachment_593" aria-describedby="caption-attachment-593" style="width: 591px" class="wp-caption alignnone"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="593" data-permalink="https://turbolab.in/abstractive-summarization-using-googles-t5/mlm/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/mlm.png?fit=591%2C266&amp;ssl=1" data-orig-size="591,266" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="mlm" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/mlm.png?fit=591%2C266&amp;ssl=1" class="wp-image-593 size-full" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/mlm.png?resize=591%2C266&#038;ssl=1" alt="" width="591" height="266" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/mlm.png?w=591&amp;ssl=1 591w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/mlm.png?resize=300%2C135&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/mlm.png?resize=480%2C216&amp;ssl=1 480w" sizes="(max-width: 591px) 100vw, 591px" /><figcaption id="caption-attachment-593" class="wp-caption-text">Source: Journal of Machine Learning</figcaption></figure>
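<p><span style="font-weight: 400">The two masking schemes can be sketched in a few lines of Python. This is a simplified illustration, not the actual preprocessing code of either model: the positions to mask are picked by hand instead of at random, and the <i>[SPAN_n]</i> placeholders are hypothetical stand-ins for T5's real sentinel tokens.</span></p>

```python
def mask_per_token(tokens, positions, mask="[MASK]"):
    """BERT-style: each selected token gets its own mask token."""
    return [mask if i in positions else tok for i, tok in enumerate(tokens)]

def mask_spans(tokens, positions):
    """T5-style: each run of consecutive selected tokens collapses into one
    placeholder; the target lists each placeholder with the text it hides."""
    inputs, targets = [], []
    span_id = -1
    prev_masked = False
    for i, tok in enumerate(tokens):
        if i in positions:
            if not prev_masked:
                span_id += 1
                # Hypothetical stand-in for T5's sentinel tokens.
                placeholder = f"[SPAN_{span_id}]"
                inputs.append(placeholder)
                targets.append(placeholder)
            targets.append(tok)
            prev_masked = True
        else:
            inputs.append(tok)
            prev_masked = False
    return inputs, targets

tokens = "I like to eat peanut butter and jelly sandwiches".split()
masked = {4, 5, 7}  # "peanut", "butter", "jelly"

print(" ".join(mask_per_token(tokens, masked)))
inputs, targets = mask_spans(tokens, masked)
print(" ".join(inputs))
print(" ".join(targets))
```

<p><span style="font-weight: 400">In the per-token scheme the model fills in each blank in place, while in the span scheme it generates the missing text as a separate output sequence, which fits T5's text-to-text design.</span></p>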
<h2><span style="font-weight: 400">About T5 Models</span></h2>
<p><span style="font-weight: 400">Google has released pre-trained T5 text-to-text models, which are trained on a large unlabelled text corpus called C4 (Colossal Clean Crawled Corpus). C4 consists of 800 GB of cleaned text extracted from the web. The cleaning process involves deduplication, discarding incomplete sentences, and removing offensive or noisy content.</span></p>
<p><b>You can get these T5 pre-trained models from the </b><a href="https://huggingface.co/models?search=T5"><b>HuggingFace website</b></a><b>:</b></p>
<ol>
<li><span style="font-weight: 400">   T5-small with 60 million parameters.</span></li>
<li><span style="font-weight: 400">   T5-base with 220 million parameters.</span></li>
<li><span style="font-weight: 400">   T5-large with 770 million parameters.</span></li>
<li><span style="font-weight: 400">   T5-3B with 3 billion parameters.</span></li>
<li><span style="font-weight: 400">   T5-11B with 11 billion parameters.</span></li>
</ol>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="594" data-permalink="https://turbolab.in/abstractive-summarization-using-googles-t5/capture/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/Capture.jpg?fit=1318%2C491&amp;ssl=1" data-orig-size="1318,491" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1632147683&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="t5 models" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/Capture.jpg?fit=800%2C298&amp;ssl=1" class="alignnone size-full wp-image-594" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/Capture.jpg?resize=800%2C298&#038;ssl=1" alt="" width="800" height="298" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/Capture.jpg?w=1318&amp;ssl=1 1318w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/Capture.jpg?resize=300%2C112&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/Capture.jpg?resize=768%2C286&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/Capture.jpg?resize=1024%2C381&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/Capture.jpg?resize=1080%2C402&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/Capture.jpg?resize=1280%2C477&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/Capture.jpg?resize=980%2C365&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/Capture.jpg?resize=480%2C179&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p><span style="font-weight: 400">T5 expects a prefix before the input text so that it knows which task to perform. For example, &#8220;<strong>summarize</strong>:&#8221; for summarization, &#8220;<strong>cola sentence:</strong>&#8221; for classification (linguistic acceptability), and &#8220;<strong>translate</strong> English to German:&#8221; for machine translation. The image below illustrates this.</span></p>
<figure id="attachment_595" aria-describedby="caption-attachment-595" style="width: 744px" class="wp-caption alignnone"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="595" data-permalink="https://turbolab.in/abstractive-summarization-using-googles-t5/1_xch7mi0d_v3vvdipu-svkq-744x328/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/1_xCh7mi0D_V3vvdIpU-sVKQ-744x328.png?fit=744%2C328&amp;ssl=1" data-orig-size="744,328" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="1_xCh7mi0D_V3vvdIpU-sVKQ-744&amp;#215;328" data-image-description="" data-image-caption="&lt;p&gt;Source: Google AI Blog&lt;/p&gt;
" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/1_xCh7mi0D_V3vvdIpU-sVKQ-744x328.png?fit=744%2C328&amp;ssl=1" class="size-full wp-image-595" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/1_xCh7mi0D_V3vvdIpU-sVKQ-744x328.png?resize=744%2C328&#038;ssl=1" alt="" width="744" height="328" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/1_xCh7mi0D_V3vvdIpU-sVKQ-744x328.png?resize=744%2C328&amp;ssl=1 744w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/1_xCh7mi0D_V3vvdIpU-sVKQ-744x328.png?resize=300%2C132&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/1_xCh7mi0D_V3vvdIpU-sVKQ-744x328.png?resize=480%2C212&amp;ssl=1 480w" sizes="(max-width: 744px) 100vw, 744px" /><figcaption id="caption-attachment-595" class="wp-caption-text">Source: Google AI Blog</figcaption></figure>
<p>Every task we consider uses text as input to the model, which is trained to generate some target text. This allows us to use the same model, loss function, and hyperparameters across our diverse set of tasks including translation (green), linguistic acceptability (red), sentence similarity (yellow), and document summarization (blue).</p>
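<p><span style="font-weight: 400">In code, using a prefix amounts to simple string concatenation before tokenization. The helper below is a small sketch; the dictionary keys and the function name are made up for illustration, and the translation entry uses English to German, a language pair the released T5 checkpoints were trained on.</span></p>

```python
# Task prefixes; the dictionary keys are illustrative names only.
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "acceptability": "cola sentence: ",
    "translation_en_de": "translate English to German: ",
}

def build_input(task, text):
    """Prepend the task prefix so one model can tell the tasks apart."""
    return TASK_PREFIXES[task] + text.strip()

print(build_input("summarization", "Huawei overtook Samsung in Q2 2020."))
print(build_input("translation_en_de", "That is good."))
```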
<p><span style="font-weight: 400">Besides the improved transformer architecture and massive unsupervised training data, better decoding methods have also played an important role. Currently, the most prominent decoding methods are </span><b>Greedy Search</b><span style="font-weight: 400">, </span><b>Beam Search</b><span style="font-weight: 400">, </span><b>Top-K Sampling,</b><span style="font-weight: 400"> and </span><b>Top-p Sampling</b><span style="font-weight: 400">. </span></p>
<p><span style="font-weight: 400">See this </span><a href="https://huggingface.co/blog/how-to-generate">link</a><span style="font-weight: 400"> for detailed information about these methods.</span></p>
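<p><span style="font-weight: 400">As a rough intuition for three of these methods, here is a toy sketch of how each one chooses the next token from a probability distribution. The probabilities are invented for illustration, and the functions are simplified stand-ins, not the HuggingFace implementation.</span></p>

```python
import random

# Invented next-token probabilities, for illustration only.
next_token_probs = {"the": 0.5, "a": 0.2, "phones": 0.15, "sales": 0.1, "zebra": 0.05}

def greedy(probs):
    # Greedy search: always pick the single most likely token.
    return max(probs, key=probs.get)

def top_k_candidates(probs, k):
    # Top-K sampling: keep only the k most likely tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])

def top_p_candidates(probs, p):
    # Top-p sampling: keep the smallest set of most likely tokens
    # whose total probability reaches p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return kept

def sample(candidates, rng):
    # Draw one token in proportion to the remaining probabilities.
    tokens = list(candidates)
    return rng.choices(tokens, weights=[candidates[t] for t in tokens], k=1)[0]

rng = random.Random(0)
print(greedy(next_token_probs))
print(list(top_k_candidates(next_token_probs, k=2)))
print(list(top_p_candidates(next_token_probs, p=0.8)))
print(sample(top_k_candidates(next_token_probs, k=2), rng))
```

<p><span style="font-weight: 400">Greedy search is deterministic, while the two sampling methods trade some likelihood for variety by drawing at random from a trimmed list of candidates.</span></p>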
<h2><span style="font-weight: 400">Using T5 through the HuggingFace transformers:</span></h2>
<p><span style="font-weight: 400">HuggingFace Transformers is an open-source NLP library that helps load pre-trained models, much as scikit-learn does for classical machine learning algorithms.</span></p>
<p><span style="font-weight: 400">We define the content we are going to summarize. </span></p>
<blockquote><p><em><span style="font-weight: 400">content = </span><span style="font-weight: 400">&#8220;China’s Huawei overtook Samsung Electronics as the world’s biggest seller of mobile phones in the second quarter of 2020, shipping 55.8 million devices compared to Samsung’s 53.7 million, according to data from research firm Canalys. While Huawei’s sales fell 5 per cent from the same quarter a year earlier, South Korea’s Samsung posted a bigger drop of 30 per cent, owing to disruption from the coronavirus in key markets such as Brazil, the United States and Europe, Canalys said. Huawei’s overseas shipments fell 27 per cent in Q2 from a year earlier, but the company increased its dominance of the China market which has been faster to recover from COVID-19 and where it now sells over 70 per cent of its phones. “Our business has demonstrated exceptional resilience in these difficult times,” a Huawei spokesman said. “Amidst a period of unprecedented global economic slowdown and challenges, we’re continued to grow and further our leadership position.” Nevertheless, Huawei’s position as number one seller may prove short-lived once other markets recover given it is mainly due to economic disruption, a senior Huawei employee with knowledge of the matter told Reuters. Apple is due to release its Q2 iPhone shipment data on Friday.&#8221;</span></em></p></blockquote>
<h3>Importing the necessary packages</h3>
<blockquote><p>from transformers import T5Tokenizer, T5ForConditionalGeneration</p></blockquote>
<h3>Loading the tokenizer and model architecture with weights</h3>
<blockquote><p>T5_PATH = 't5-large' # T5 model name</p>
<p># initialize the model architecture and weights</p>
<p>t5_model = T5ForConditionalGeneration.from_pretrained(T5_PATH)</p>
<p># initialize the model tokenizer</p>
<p>t5_tokenizer = T5Tokenizer.from_pretrained(T5_PATH)</p></blockquote>
<p><span style="font-weight: 400">The pre-trained model used here is t5-large. The other pre-trained T5 models are listed above.</span></p>
<h3>Encode the text</h3>
<blockquote><p><span style="font-weight: 400"># encode the text into a tensor of integers using the tokenizer</span></p>
<p><em><span style="font-weight: 400">inputs = t5_tokenizer.encode("summarize: " + content, return_tensors="pt", max_length=512, padding="max_length", truncation=True)</span></em></p></blockquote>
<h3>Generate the summarized text and decode it</h3>
<blockquote><p><em><span style="font-weight: 400">summary_ids = t5_model.generate(inputs,</span></em></p>
<p><em><span style="font-weight: 400">                                    num_beams=2,</span></em></p>
<p><em><span style="font-weight: 400">                                    no_repeat_ngram_size=3,</span></em></p>
<p><em><span style="font-weight: 400">                                    length_penalty=2.0,</span></em></p>
<p><em><span style="font-weight: 400">                                    min_length=min_length,</span></em></p>
<p><em><span style="font-weight: 400">                                    max_length=max_length,</span></em></p>
<p><em><span style="font-weight: 400">                                    early_stopping=True)</span></em></p>
<p><em><span style="font-weight: 400">output = t5_tokenizer.decode(summary_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)</span></em></p></blockquote>
<p><span style="font-weight: 400">The decoding method used here is </span><b>Beam Search</b><span style="font-weight: 400"> with </span><b>num_beams </b><span style="font-weight: 400">value as 2.</span></p>
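<p><span style="font-weight: 400">To give a feel for what <i>num_beams</i> changes, here is a toy beam search over a hand-made table of next-token probabilities. Everything in it is an assumption made for illustration: the probabilities are invented, the "model" only looks at the previous token, and the code is far simpler than what <i>generate</i> does internally.</span></p>

```python
import math

# Invented next-token probabilities keyed by the previous token.
TOY_MODEL = {
    "START": {"Huawei": 0.6, "Samsung": 0.4},
    "Huawei": {"overtakes": 0.55, "sells": 0.45},
    "Samsung": {"drops": 0.9, "sells": 0.1},
    "overtakes": {"END": 1.0},
    "sells": {"END": 1.0},
    "drops": {"END": 1.0},
}

def beam_search(model, num_beams, max_steps=5):
    beams = [(0.0, ["START"])]  # (log-probability, tokens so far)
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "END":
                candidates.append((score, tokens))  # finished, carried over
                continue
            for token, prob in model[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the num_beams highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
        if all(tokens[-1] == "END" for _, tokens in beams):
            break
    best_tokens = beams[0][1]
    return [t for t in best_tokens if t not in ("START", "END")]

print(beam_search(TOY_MODEL, num_beams=1))  # same as greedy search
print(beam_search(TOY_MODEL, num_beams=2))
```

<p><span style="font-weight: 400">With one beam the search commits to "Huawei" because it is the likeliest first word, and ends with a sequence of probability 0.33. With two beams it also keeps "Samsung" alive and finds "Samsung drops", whose overall probability of 0.36 is higher. Keeping more hypotheses can find better sequences, at the cost of more computation.</span></p>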
<p><span style="font-weight: 400">With </span><b>min_length 50 </b><span style="font-weight: 400">and </span><b>max_length 50</b><span style="font-weight: 400">, the output is:</span></p>
<blockquote><p><i><span style="font-weight: 400">&#8220;Huawei overtakes Samsung as world&#8217;s biggest seller of mobile phones in second quarter of 2020. Company shipped 55.8 million devices compared to Samsung&#8217;s 53.7 million, Canalys says. Sales of Huawei&#8217;s&#8221;</span></i></p></blockquote>
<p><span style="font-weight: 400">The summary took 8.07 seconds to generate on a 16-core CPU host.</span></p>
<p><span style="font-weight: 400">With </span><b>min_length 50 </b><span style="font-weight: 400">and </span><b>max_length 100</b><span style="font-weight: 400">, the output is:</span></p>
<blockquote><p><i><span style="font-weight: 400">&#8220;Huawei overtakes Samsung as world&#8217;s biggest seller of mobile phones in second quarter of 2020. Company shipped 55.8 million devices compared to Samsung&#8217;s 53.7 million, Canalys says. Sales fell 5% from same quarter a year earlier, owing to disruption from coronavirus. But company increased its dominance of the china market which has been faster to recover from COVID-19.&#8221;</span></i></p></blockquote>
<p><span style="font-weight: 400">The summary took 14.32 seconds to generate on a 16-core CPU host.</span></p>
<p><span style="font-weight: 400">With </span><b>min_length 100 </b><span style="font-weight: 400">and </span><b>max_length 200</b><span style="font-weight: 400">, the output is:</span></p>
<blockquote><p><i><span style="font-weight: 400">&#8220;Huawei overtakes Samsung as world&#8217;s biggest seller of mobile phones in second quarter of 2020. Company shipped 55.8 million devices compared to Samsung&#8217;s 53.7 million, Canalys says. Sales fell 5% from same quarter a year earlier, owing to disruption from coronavirus. But Huawei increased its dominance of the china market which has been faster to recover from COVID-19.. Apple is due to release its Q2 iPhone shipment data on friday.&#8221;</span></i></p></blockquote>
<p><span style="font-weight: 400">The summary took 23.15 seconds to generate on a 16-core CPU host.</span></p>
<p><span style="font-weight: 400">As you increase any of the parameters </span><b>num_beams</b><span style="font-weight: 400">, </span><b>min_length, </b><span style="font-weight: 400">and </span><b>max_length</b><span style="font-weight: 400">, the time taken to generate the summary increases.</span></p>
<h2><span style="font-weight: 400">Conclusion</span></h2>
<p><span style="font-weight: 400">In this article, we used the Beam Search decoding method. For a better summary, we suggest increasing the number of beams or trying one of the other decoding methods mentioned (<b>Greedy Search</b>, <b>Top-K Sampling,</b> and <b>Top-p Sampling</b>). </span></p>
<p><span style="font-weight: 400">With Pegasus, we can only perform abstractive summarization, but T5 can perform various NLP tasks such as classification (e.g., sentiment analysis), question answering, machine translation, and document summarization. We recommend exploring T5 on these other NLP tasks.</span></p>
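<p><span style="font-weight: 400">As a rough illustration of the two sampling methods, the sketch below filters a next-token distribution the way Top-K and Top-p sampling do before a token is drawn. The function names and the probabilities in next_token_probs are hypothetical and chosen only for the example.</span></p>

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


# Hypothetical next-token distribution, for illustration only.
next_token_probs = {"phones": 0.5, "devices": 0.3, "tablets": 0.15, "cars": 0.05}

top2 = top_k_filter(next_token_probs, k=2)
nucleus = top_p_filter(next_token_probs, p=0.9)
print(sorted(top2))
print(sorted(nucleus))
```

<p><span style="font-weight: 400">After filtering, the next token is drawn at random from the remaining candidates according to their renormalized probabilities, for example with random.choices from the standard library.</span></p>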
<p>The post <a href="https://turbolab.in/abstractive-summarization-using-googles-t5/">Abstractive Summarization Using Google&#8217;s T5</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/abstractive-summarization-using-googles-t5/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">592</post-id>	</item>
		<item>
		<title>Sentiment Analysis: Concepts, Models, and Examples</title>
		<link>https://turbolab.in/sentiment-analysis-concepts-models-and-examples/</link>
					<comments>https://turbolab.in/sentiment-analysis-concepts-models-and-examples/#respond</comments>
		
		<dc:creator><![CDATA[Anthony]]></dc:creator>
		<pubDate>Mon, 27 Sep 2021 06:30:32 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[artificial intelligence]]></category>
		<category><![CDATA[data analysis]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[Sentiment Analysis]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=607</guid>

					<description><![CDATA[<p>Sentiment analysis is a subfield of Natural Language Processing (NLP) that identifies and extracts emotions expressed in given texts. It is a machine learning tool that understands the context and determines the polarity of text, whether it is positive, neutral, or negative. This article will discuss what sentiment analysis is, where it is being [&#8230;]</p>
<p>The post <a href="https://turbolab.in/sentiment-analysis-concepts-models-and-examples/">Sentiment Analysis: Concepts, Models, and Examples</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400">Sentiment analysis is a subfield of Natural Language Processing (NLP) that identifies and extracts emotions expressed in text. Sentiment analysis tools typically use machine learning to interpret context and determine the polarity of a text: positive, neutral, or negative.</span></p>
<p><span style="font-weight: 400">This article will discuss what sentiment analysis is, where it is being used, and how to use a pre-trained model to analyze sentiments from texts.</span></p>
<p><span style="font-weight: 400">We will also explore how Machine Learning models are used to build sentiment analysis tools.</span></p>
<h3><b>Use cases of sentiment analysis:</b></h3>
<ul>
<li style="font-weight: 400"><span style="font-weight: 400">Brand Monitoring</span></li>
<li style="font-weight: 400"><span style="font-weight: 400">Customer Feedback</span></li>
<li style="font-weight: 400"><span style="font-weight: 400">Product Analytics</span></li>
<li style="font-weight: 400"><span style="font-weight: 400">Monitoring Market Research</span></li>
<li style="font-weight: 400"><span style="font-weight: 400">Analyzing Movie Reviews</span></li>
</ul>
<p><span style="font-weight: 400">Various pre-trained sentiment analysis tools are available in Natural Language Processing (NLP) libraries, such as NLTK’s VADER sentiment analysis tool, TextBlob, and the Flair sentiment classifier based on an LSTM neural network.</span></p>
<h3><b>Part 1- Sentiment analysis using a pre-trained model (TextBlob)</b></h3>
<p><span style="font-weight: 400">TextBlob is a Python library for Natural Language Processing (NLP). It helps you perform complex analysis and operations on textual data.</span></p>
<h4><b>Steps to get sentiments with TextBlob:</b></h4>
<p><span style="font-weight: 400">Before applying TextBlob, basic text cleaning should be done. You can check the NLTK or spaCy libraries for various text cleaning methods.</span></p>
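<p><span style="font-weight: 400">As a minimal sketch of what basic cleaning could look like, the function below uses only the Python standard library. The name basic_clean and the choice of what to strip (URLs and symbols other than basic punctuation) are assumptions made for illustration; the right cleaning steps depend on your data.</span></p>

```python
import re

# Matches http:// and https:// links up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def basic_clean(text):
    """Hypothetical minimal cleaner: drop URLs and stray symbols, collapse whitespace."""
    text = URL_PATTERN.sub(" ", text)           # drop URLs
    text = re.sub(r"[^\w\s.,!?'-]", " ", text)  # drop symbols other than basic punctuation
    return " ".join(text.split())               # collapse runs of whitespace


cleaned = basic_clean("Great   movie!! Visit http://example.com  #mustwatch")
print(cleaned)
```

<p><span style="font-weight: 400">Punctuation such as exclamation marks is kept on purpose here, since it can carry sentiment.</span></p>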
<p>&nbsp;</p>
<blockquote><p><span style="font-weight: 400">from typing import Optional</span></p>
<p><span style="font-weight: 400">from textblob import TextBlob</span></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400">def sentimental(text: str) -&gt; Optional[str]:</span></p>
<p><span style="font-weight: 400">    # returns None when the input text is empty</span></p>
<p><span style="font-weight: 400">    sentiment = None</span></p>
<p><span style="font-weight: 400">    if text:</span></p>
<p><span style="font-weight: 400">        text = ' '.join(text.split()) # collapse runs of whitespace</span></p>
<p><span style="font-weight: 400">        polarity = TextBlob(text).sentiment.polarity</span></p>
<p><span style="font-weight: 400">        if polarity &gt; 0:</span></p>
<p><span style="font-weight: 400">            sentiment = 'Positive'</span></p>
<p><span style="font-weight: 400">        elif polarity &lt; 0:</span></p>
<p><span style="font-weight: 400">            sentiment = 'Negative'</span></p>
<p><span style="font-weight: 400">        else:</span></p>
<p><span style="font-weight: 400">            sentiment = 'Neutral'</span></p>
<p><span style="font-weight: 400">    return sentiment</span></p></blockquote>
<p>&nbsp;</p>
<h4><b>Output Result:</b></h4>
<p><em><span style="font-weight: 400">sentimental("This is what a true masterpiece looks like. Thrillingly played with a flawless ensemble cast.")</span></em></p>
<p><em><span style="font-weight: 400">Out: '<strong>Positive</strong>'</span></em></p>
<p><span style="font-weight: 400">TextBlob returns the ‘polarity’ of a sentence as a number in the range [-1, 1].</span></p>
<p><span style="font-weight: 400">Values below 0 indicate </span><b>Negative</b><span style="font-weight: 400"> sentiment, 0 indicates </span><b>Neutral,</b><span style="font-weight: 400"> and values above 0 indicate </span><b>Positive</b><span style="font-weight: 400">.</span></p>
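<p><span style="font-weight: 400">Because a polarity of exactly 0 is rare for longer texts, it can be useful to treat a small range around 0 as neutral. The sketch below shows one way to do that. The function label_from_polarity and its neutral_band parameter are hypothetical and not part of TextBlob; with the default band of 0 it behaves like the mapping used above.</span></p>

```python
def label_from_polarity(polarity, neutral_band=0.0):
    """Map a polarity score in [-1, 1] to a label.

    Scores whose absolute value does not exceed neutral_band count as neutral.
    """
    if polarity > neutral_band:
        return "Positive"
    if -polarity > neutral_band:
        return "Negative"
    return "Neutral"


print(label_from_polarity(0.5))
print(label_from_polarity(-0.2))
print(label_from_polarity(0.05, neutral_band=0.1))
```

<p><span style="font-weight: 400">A suitable band width is a judgment call and is best chosen by checking a sample of your own texts.</span></p>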
<p>&nbsp;</p>
<h3><b>Part 2 &#8211; Train a Machine Learning Model for sentiment analysis</b></h3>
<p><span style="font-weight: 400">In this part, we will train a supervised Machine Learning model, a Support Vector Machine (SVM), to classify sentiment.</span></p>
<h4><b>Data Gathering: </b></h4>
<p><span style="font-weight: 400">Here we use the</span><a href="http://www.cs.cornell.edu/people/pabo/movie-review-data/"> <b>sentiment polarity dataset v2.0</b></a><span style="font-weight: 400">, a movie review dataset with sentiment labels, which we converted into CSV files.</span></p>
<p><span style="font-weight: 400">Data is divided into </span><b>“trainData”</b><span style="font-weight: 400"> and </span><b>“testData”. </b><span style="font-weight: 400">The dataset contains “Content” and “Label” columns.</span></p>
<p>&nbsp;</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="609" data-permalink="https://turbolab.in/sentiment-analysis-concepts-models-and-examples/senti5/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/senti5.png?fit=888%2C548&amp;ssl=1" data-orig-size="888,548" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="senti5" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/senti5.png?fit=800%2C494&amp;ssl=1" class="wp-image-609 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/senti5.png?resize=576%2C355&#038;ssl=1" alt="" width="576" height="355" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/senti5.png?resize=300%2C185&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/senti5.png?resize=768%2C474&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/senti5.png?resize=480%2C296&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/senti5.png?w=888&amp;ssl=1 888w" sizes="(max-width: 576px) 100vw, 576px" /></p>
<p>&nbsp;</p>
<h4><b>Data Vectorization</b></h4>
<p><span style="font-weight: 400">Before feeding our model with data, we need to extract features from our textual dataset, which means converting the text into numeric vectors. TF-IDF is one of many methods for extracting features from text documents. TF-IDF stands for &#8216;Term Frequency &#8211; Inverse Document Frequency&#8217;.</span></p>
<blockquote><p><span style="font-weight: 400">from sklearn.feature_extraction.text import TfidfVectorizer</span></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400"># Create feature vectors</span></p>
<p><span style="font-weight: 400">vectorizer = TfidfVectorizer(min_df = 5,</span></p>
<p><span style="font-weight: 400">                             max_df = 0.8,</span></p>
<p><span style="font-weight: 400">                             sublinear_tf = True,</span></p>
<p><span style="font-weight: 400">                             use_idf = True)</span></p>
<p><span style="font-weight: 400">train_vectors = vectorizer.fit_transform(trainData['Content'])</span></p>
<p><span style="font-weight: 400">test_vectors = vectorizer.transform(testData['Content'])</span></p></blockquote>
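<p><span style="font-weight: 400">To see what the vectorizer computes for a single term, here is a small worked sketch in plain Python. It follows the formulas scikit-learn documents for sublinear_tf=True and the default smooth_idf=True, before the final L2 normalization of each document vector. The function name tfidf_weight and the document counts in the example are hypothetical.</span></p>

```python
import math


def tfidf_weight(term_count, docs_with_term, total_docs):
    """TF-IDF weight of one term in one document, before L2 normalization.

    term_count must be at least 1 (the term occurs in the document).
    """
    # sublinear_tf=True replaces the raw count tf with 1 + ln(tf)
    tf = 1.0 + math.log(term_count)
    # smooth_idf=True (the default) adds 1 to both document counts
    idf = math.log((1 + total_docs) / (1 + docs_with_term)) + 1.0
    return tf * idf


# Hypothetical counts: a term occurring 3 times in a review,
# once for a rare term and once for a common term, in 1000 reviews.
rare = tfidf_weight(term_count=3, docs_with_term=5, total_docs=1000)
common = tfidf_weight(term_count=3, docs_with_term=500, total_docs=1000)
print(round(rare, 3), round(common, 3))
```

<p><span style="font-weight: 400">The rare term gets the larger weight even though both occur equally often in the review, which is the point of the inverse document frequency factor.</span></p>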
<p>&nbsp;</p>
<h4><b>Model Building</b></h4>
<p><span style="font-weight: 400">After generating vectors for both train and test input sets, we can now feed the SVC model with this data and train it.</span></p>
<blockquote><p><span style="font-weight: 400"># importing libraries</span></p>
<p><span style="font-weight: 400">from sklearn import svm</span></p>
<p><span style="font-weight: 400">from sklearn.metrics import classification_report</span></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400"># Initialising SVM classifier with linear kernel</span></p>
<p><span style="font-weight: 400">svm_classifier = svm.SVC(kernel='linear')</span></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400"># training the model with the train data</span></p>
<p><span style="font-weight: 400">svm_classifier.fit(train_vectors, trainData['Label'])</span></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400"># testing the model in test data content</span></p>
<p><span style="font-weight: 400">predicted_result = svm_classifier.predict(test_vectors)</span></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400"># results</span></p>
<p><span style="font-weight: 400">report = classification_report(testData['Label'], predicted_result, output_dict=True)</span></p>
<p><span style="font-weight: 400">print('Model accuracy: ', report['accuracy'])</span></p></blockquote>
<p>&nbsp;</p>
<h4><b>Model Results and Statistics:</b></h4>
<p><em><span style="font-weight: 400">Model accuracy:  <strong>0.915</strong></span></em></p>
<p><span style="font-weight: 400">Accuracy is the ratio of correctly predicted samples to the total number of samples. It is one of many metrics used for evaluating classification models.</span></p>
<p><span style="font-weight: 400">Accuracy ranges from 0 to 1, so a value of 0.915 means the model classified 91.5% of the test samples correctly.</span></p>
<h4><b>Testing the Model to Predict on Movie Reviews:</b></h4>
<p><em><span style="font-weight: 400">svm_classifier.predict(vectorizer.transform(["This is what a true masterpiece looks like. Thrillingly played with a flawless ensemble cast."]))[0]</span></em></p>
<p><em><span style="font-weight: 400">Out: '<strong>Positive</strong>'</span></em></p>
<p>&nbsp;</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="610" data-permalink="https://turbolab.in/sentiment-analysis-concepts-models-and-examples/sent4/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/sent4.png?fit=996%2C644&amp;ssl=1" data-orig-size="996,644" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="sent4" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/sent4.png?fit=800%2C517&amp;ssl=1" class=" wp-image-610 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/sent4.png?resize=545%2C353&#038;ssl=1" alt="" width="545" height="353" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/sent4.png?resize=300%2C194&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/sent4.png?resize=768%2C497&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/sent4.png?resize=980%2C634&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/sent4.png?resize=480%2C310&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/sent4.png?w=996&amp;ssl=1 996w" sizes="(max-width: 545px) 100vw, 545px" /></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400">Classification accuracy alone can be misleading when the classes have unequal numbers of observations. A confusion matrix gives a better idea of what the model predicts correctly and where it makes errors.</span></p>
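<p><span style="font-weight: 400">The following sketch shows why accuracy can mislead on imbalanced data. The labels are made up for illustration: 95 negative and 5 positive samples, and a hypothetical model that always predicts the majority class.</span></p>

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of truly positive samples that were predicted as positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    hits = sum(1 for p in predictions_for_positives if p == positive)
    return hits / len(predictions_for_positives)


# Made-up imbalanced data: 95 negative and 5 positive samples.
y_true = ["Negative"] * 95 + ["Positive"] * 5
# A hypothetical model that always predicts the majority class.
y_pred = ["Negative"] * 100

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred, positive="Positive")
print(acc, rec)
```

<p><span style="font-weight: 400">The accuracy looks high, yet the model never finds a single positive sample. A confusion matrix exposes this immediately because it counts each kind of error separately.</span></p>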
<p><span style="font-weight: 400">Here we used 200 test samples. As shown in the matrix above, the model produced 9 false positives, meaning it predicted negative samples as positive, and 8 false negatives, meaning it predicted positive samples as negative.</span></p>
<p><span style="font-weight: 400">To reduce these errors, we can train the model on a larger dataset.</span></p>
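<p><span style="font-weight: 400">These counts are consistent with the accuracy reported earlier, as the short check below shows. The function name accuracy_from_errors is hypothetical; the numbers are the ones given above.</span></p>

```python
def accuracy_from_errors(total, false_positives, false_negatives):
    """Accuracy given the number of test samples and the two error counts."""
    correct = total - false_positives - false_negatives
    return correct / total


# 200 test samples, 9 false positives, 8 false negatives: 183 correct predictions.
reported = accuracy_from_errors(total=200, false_positives=9, false_negatives=8)
print(reported)
```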
<p>&nbsp;</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="611" data-permalink="https://turbolab.in/sentiment-analysis-concepts-models-and-examples/senti2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/senti2.png?fit=1338%2C916&amp;ssl=1" data-orig-size="1338,916" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="senti2" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/senti2.png?fit=800%2C548&amp;ssl=1" class="wp-image-611 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/senti2.png?resize=559%2C382&#038;ssl=1" alt="" width="559" height="382" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/senti2.png?resize=300%2C205&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/senti2.png?resize=768%2C526&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/senti2.png?resize=1024%2C701&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/senti2.png?resize=1080%2C739&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/senti2.png?resize=1280%2C876&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/senti2.png?resize=980%2C671&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/senti2.png?resize=480%2C329&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/09/senti2.png?w=1338&amp;ssl=1 1338w" sizes="(max-width: 559px) 100vw, 559px" /></p>
<h3><b>Conclusion</b></h3>
<p><span style="font-weight: 400">In this article, we used the TextBlob (pre-trained) Python package and an SVM (Machine Learning) model to perform sentiment analysis. Sentiment analysis is an exciting research direction because of the large number of real-world applications in which discovering people’s opinions supports better decision-making.</span></p>
<p><span style="font-weight: 400">Detecting sentiment with NLP is surprisingly difficult, for example when sentences are phrased sarcastically. This kind of context can mislead NLP-based model predictions. We can also see that the two models do not give the same prediction for every sample. Here TextBlob handles ‘neutral’ texts better. A likely reason is that the SVM was trained on reviews labeled only as positive or negative, so it cannot predict a neutral class, whereas TextBlob’s polarity score can be 0.</span></p>
<p><span style="font-weight: 400">To handle such difficult cases, we can use deep learning models such as RNNs and LSTMs. We can also use transformer-based models such as GPT-3 and Google’s T5 for sentiment analysis.</span></p>
<p>The post <a href="https://turbolab.in/sentiment-analysis-concepts-models-and-examples/">Sentiment Analysis: Concepts, Models, and Examples</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/sentiment-analysis-concepts-models-and-examples/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">607</post-id>	</item>
	</channel>
</rss>
