<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Technology Archives - Turbolab Technologies</title>
	<atom:link href="https://turbolab.in/category/tech/feed/" rel="self" type="application/rss+xml" />
	<link>https://turbolab.in/category/tech/</link>
	<description>Big Data and News Analysis Startup in Kochi</description>
	<lastBuildDate>Fri, 26 Jul 2024 10:00:38 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://i0.wp.com/turbolab.in/wp-content/uploads/2018/03/turbo_black_trans-space.png?fit=32%2C32&#038;ssl=1</url>
	<title>Technology Archives - Turbolab Technologies</title>
	<link>https://turbolab.in/category/tech/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">98237731</site>	<item>
		<title>Entity Linking &#038; Disambiguation using REL</title>
		<link>https://turbolab.in/entity-linking-disambiguation-using-rel/</link>
					<comments>https://turbolab.in/entity-linking-disambiguation-using-rel/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Tue, 12 Jul 2022 07:02:27 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[entity linking]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[rel]]></category>
		<category><![CDATA[spacy]]></category>
		<category><![CDATA[wikifier]]></category>
		<category><![CDATA[wikipedia]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=907</guid>

					<description><![CDATA[<p>Entity extraction, also known as Named Entity Recognition (NER), is an information extraction process that extracts entities from unstructured text and classifies them into predefined categories such as people, organizations, places, products, dates, times, money, phone numbers and so on. The terabytes of unstructured text data that come from documents, web pages, and social [&#8230;]</p>
<p>The post <a href="https://turbolab.in/entity-linking-disambiguation-using-rel/">Entity Linking &amp; Disambiguation using REL</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400">Entity extraction, also known as </span><em><b>Named Entity Recognition (NER)</b></em><span style="font-weight: 400">, is an information extraction process that extracts entities from unstructured text and classifies them into predefined categories such as people, organizations, places, products, dates, times, money, phone numbers and so on. The terabytes of unstructured text data that come from documents, web pages, and social media can be transformed into structured entities that help analysts query the data and generate insightful reports.</span></p>
<p><span style="font-weight: 400">spaCy provides different models in various languages to perform NER and other NLP tasks. Building a custom NER model using spaCy is explained in one of our earlier blog posts; you can check it out</span> <strong><a href="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/">here</a></strong>.</p>
<p><span style="font-weight: 400">Now, let’s look into the entity extraction from a random news article using spaCy and Flair:</span></p>
<blockquote><p><em>Defending champion Novak Djokovic battled back from two sets to love down to defeat Jannik Sinner and reach his 11th Wimbledon semi-final on Tuesday. Djokovic triumphed 5-7, 2-6, 6-3, 6-2, 6-2 and will face Britain&#8217;s Cameron Norrie of Belgium for a place in Sunday&#8217;s final. It was the seventh time in the Serb&#8217;s career that he had recovered from two sets to love at the Slams. &#8220;Huge congrats to Jannik for a big fight, he&#8217;s so mature for his age, he has plenty of time ahead of him,&#8221; said Djokovic.</em></p></blockquote>
<h5>Entity Extraction using spaCy:</h5>
<blockquote><p><em><strong>import spacy</strong></em></p>
<p><em><strong>nlp = spacy.load(&#8216;en_core_web_lg&#8217;) # load the large English model</strong></em></p>
<p><em><strong>content = &#8220;Defending champion Novak Djokovic battled back &#8230;&#8221; # the news article clip quoted above</strong></em></p>
<p><em><strong>ner_ent = {&#8216;person&#8217;: [], &#8216;norp&#8217;: [], &#8216;fac&#8217;: [], &#8216;org&#8217;: [], &#8216;gpe&#8217;: [], &#8216;loc&#8217;: [], &#8216;product&#8217;: [], &#8216;event&#8217;: [], &#8216;work_of_art&#8217;: [], &#8216;law&#8217;: [], &#8216;language&#8217;: [], &#8216;date&#8217;: [], &#8216;time&#8217;: [], &#8216;percent&#8217;: [], &#8216;money&#8217;: [], &#8216;quantity&#8217;: [], &#8216;ordinal&#8217;: [], &#8216;cardinal&#8217;: []}</strong></em></p>
<p><em><strong>doc = nlp(content)</strong></em><br />
<em><strong>for entity in doc.ents:</strong></em><br />
<em><strong>    if entity.label_.lower() in ner_ent:</strong></em><br />
<em><strong>        ner_ent[entity.label_.lower()].append(entity.text)</strong></em></p>
<p><em><strong>print(ner_ent)</strong></em></p>
<p><em><strong># output</strong></em></p>
<p><em><strong>{&#8216;person&#8217;: [&#8216;Novak Djokovic&#8217;, &#8216;Jannik Sinner&#8217;, &#8216;Cameron Norrie&#8217;, &#8216;Jannik&#8217;, &#8216;Djokovic&#8217;, &#8216;Novak Djokovic&#8217;, &#8216;Jannik Sinner&#8217;, &#8216;Cameron Norrie&#8217;, &#8216;Jannik&#8217;, &#8216;Djokovic&#8217;], &#8216;norp&#8217;: [&#8216;Serb&#8217;, &#8216;Serb&#8217;], &#8216;fac&#8217;: [], &#8216;org&#8217;: [], &#8216;gpe&#8217;: [&#8216;Britain&#8217;, &#8216;Belgium&#8217;, &#8216;Britain&#8217;, &#8216;Belgium&#8217;], &#8216;loc&#8217;: [], &#8216;product&#8217;: [], &#8216;event&#8217;: [&#8216;Wimbledon&#8217;, &#8216;Wimbledon&#8217;], &#8216;work_of_art&#8217;: [], &#8216;law&#8217;: [], &#8216;language&#8217;: [], &#8216;date&#8217;: [&#8216;Tuesday&#8217;, &#8216;Sunday&#8217;, &#8216;Tuesday&#8217;, &#8216;Sunday&#8217;], &#8216;time&#8217;: [], &#8216;percent&#8217;: [], &#8216;money&#8217;: [], &#8216;quantity&#8217;: [], &#8216;ordinal&#8217;: [&#8217;11th&#8217;, &#8216;seventh&#8217;, &#8217;11th&#8217;, &#8216;seventh&#8217;], &#8216;cardinal&#8217;: [&#8216;two&#8217;, &#8216;5&#8217;, &#8216;2-6&#8217;, &#8216;6-3&#8217;, &#8216;6&#8217;, &#8216;6-2&#8217;, &#8216;two&#8217;, &#8216;two&#8217;, &#8216;5&#8217;, &#8216;2-6&#8217;, &#8216;6-3&#8217;, &#8216;6&#8217;, &#8216;6-2&#8217;, &#8216;two&#8217;]}</strong></em></p></blockquote>
<h5>Entity Extraction using Flair:</h5>
<blockquote><p><em><strong>from flair.data import Sentence</strong></em><br />
<em><strong>from flair.models import SequenceTagger</strong></em></p>
<p><em><strong>ner_ent = {&#8216;per&#8217;: [], &#8216;org&#8217;: [], &#8216;loc&#8217;: [], &#8216;misc&#8217;: []}</strong></em></p>
<p><em><strong># make a sentence</strong></em><br />
<em><strong>sentence = Sentence(content)</strong></em></p>
<p><em><strong># load the NER tagger</strong></em><br />
<em><strong>tagger = SequenceTagger.load(&#8216;ner&#8217;)</strong></em></p>
<p><em><strong># run NER over sentence</strong></em><br />
<em><strong>tagger.predict(sentence)</strong></em></p>
<p><em><strong>print(&#8216;The following NER tags are found:&#8217;)</strong></em><br />
<em><strong># iterate over each entity</strong></em><br />
<em><strong>for entity in sentence.get_spans(&#8216;ner&#8217;):</strong></em><br />
<em><strong>    if str(entity.labels[0]).split()[0].lower() in ner_ent:</strong></em><br />
<em><strong>        ner_ent[str(entity.labels[0]).split()[0].lower()].append(entity.text)</strong></em></p>
<p><em><strong>print(ner_ent)</strong></em></p>
<p><em><strong># output</strong></em></p>
<p><em><strong>The following NER tags are found:</strong></em></p>
<p><em><strong>{&#8216;per&#8217;: [&#8216;Novak Djokovic&#8217;, &#8216;Jannik Sinner&#8217;, &#8216;Djokovic&#8217;, &#8216;Cameron Norrie&#8217;, &#8216;Jannik&#8217;, &#8216;Djokovic&#8217;], &#8216;org&#8217;: [], &#8216;loc&#8217;: [&#8216;Britain&#8217;, &#8216;Belgium&#8217;], &#8216;misc&#8217;: [&#8216;Wimbledon&#8217;, &#8216;Serb&#8217;, &#8216;Slams&#8217;]}</strong></em></p></blockquote>
<p>Flair NER models give us only 4 entity types whereas spaCy gives 18 entity types.</p>
<h2>Entity Linking &amp; Disambiguation</h2>
<p>Entity Linking is the process of linking entities to a target knowledge base. Here, we map the entities to wiki links or wiki page titles, which is why the process is also called Wikification. Entity linking can also be seen as entity validation: the entities extracted by the spaCy or Flair models get validated against a third-party knowledge base.</p>
<p>However, entity linking is intricate due to entity ambiguity and name variants. For example, the word <strong>Amazon</strong> can refer to an organization or a rainforest.</p>
<p>Let&#8217;s have a detailed discussion on Entity Linking &amp; Entity Disambiguation.</p>
<h5>News Article Clip:</h5>
<blockquote><p>Deforestation in Brazil&#8217;s Amazon rainforest reached a record high for the first six months of the year, as an area five times the size of New York City was destroyed, preliminary government data showed on Friday.</p></blockquote>
<h5>Spacy Output:</h5>
<blockquote><p>&#8216;org&#8217;: [&#8216;Amazon&#8217;], &#8216;gpe&#8217;: [&#8216;Brazil&#8217;, &#8216;New York City&#8217;]</p></blockquote>
<p>Here, <strong>Amazon</strong> is detected as the organization.</p>
<h5>Flair Output:</h5>
<blockquote><p>&#8216;loc&#8217;: [&#8216;Brazil&#8217;, &#8216;Amazon&#8217;, &#8216;New York City&#8217;]</p></blockquote>
<p><span style="font-weight: 400">Here, </span><b>Amazon</b><span style="font-weight: 400"> is detected as the location/GPE. The ambiguity problem is clearly visible here and can be solved by Radboud Entity Linker (REL).</span></p>
<h5><strong>REL</strong> <strong>Output</strong>:</h5>
<p><img data-recalc-dims="1" fetchpriority="high" decoding="async" data-attachment-id="908" data-permalink="https://turbolab.in/entity-linking-disambiguation-using-rel/rel/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?fit=1430%2C266&amp;ssl=1" data-orig-size="1430,266" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="rel" data-image-description="" data-image-caption="&lt;p&gt;REL&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?fit=300%2C56&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?fit=800%2C148&amp;ssl=1" class="size-full wp-image-908" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=800%2C149&#038;ssl=1" alt="" width="800" height="149" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?w=1430&amp;ssl=1 1430w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=300%2C56&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=768%2C143&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=1024%2C190&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=1080%2C201&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=1280%2C238&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=980%2C182&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=480%2C89&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p><a href="https://github.com/informagi/REL"><strong>Radboud Entity Linker (REL)</strong></a> deals <span style="font-weight: 400">with the tasks of Entity Linking and Entity Disambiguation. One can use the public API provided by REL, or install it via Docker or from source following the instructions in the documentation. By default, </span><b>REL</b><span style="font-weight: 400"> uses Flair to extract entities; you can replace Flair with spaCy. REL also provides pre-trained case-sensitive and case-insensitive models, with an F1 score of almost 93%.</span></p>
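<p>As a minimal sketch of querying the public API with nothing but the Python standard library (the endpoint URL and payload shape below are taken from the REL README at the time of writing, so verify them against the current documentation before relying on this):</p>

```python
# Sketch of calling the public REL API. Endpoint and payload format are
# assumptions based on the REL README -- check the current docs.
import json
import urllib.request

REL_API = "https://rel.cs.ru.nl/api"  # public endpoint per the REL README

def build_payload(text, spans=None):
    # An empty "spans" list asks REL to run its own mention detection;
    # pass (start, length) pairs to link pre-detected mentions instead.
    return {"text": text, "spans": spans or []}

def link_entities(text):
    # POST the document and decode the JSON response; each result row
    # holds mention offsets, the surface form, and the linked Wikipedia
    # page title (e.g. Amazon_rainforest rather than the company).
    req = urllib.request.Request(
        REL_API,
        data=json.dumps(build_payload(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

clip = ("Deforestation in Brazil's Amazon rainforest reached a record "
        "high for the first six months of the year.")
# Requires network access, so not executed here:
# for row in link_entities(clip):
#     print(row)
```

<p>The same payload works whether you hit the hosted endpoint or a local Docker deployment of REL.</p>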
<p>The <a href="https://pypi.org/project/wikimapper/"><strong>Wikimapper</strong></a> Python <span style="font-weight: 400">library is used to fetch the wikidata_id for a given Wikipedia title. Have a look at the project: it helps you map Wikipedia page titles to WikiData IDs and vice versa.</span></p>
<p><a href="https://github.com/facebookresearch/BLINK"><b>BLINK</b></a><span style="font-weight: 400">, the Facebook Research entity linking Python library, uses Wikipedia as the target knowledge base, similar to </span><b>REL</b><span style="font-weight: 400">. However, the BLINK documentation doesn&#8217;t reveal any information regarding entity disambiguation.</span></p>
<p><a href="https://github.com/wetneb/opentapioca"><b>OpenTapioca</b></a><span style="font-weight: 400"> is a simple and fast Named Entity Linking system for Wikidata. A spaCy wrapper of OpenTapioca called</span><a href="https://spacy.io/universe/project/spacyopentapioca"> <b>spaCyOpenTapioca</b></a><span style="font-weight: 400"> is also available for entity linking. However, its results are not as good as REL&#8217;s.</span></p>
<p><span style="font-weight: 400">spaCy includes a pipeline component called</span><a href="https://spacy.io/api/entitylinker"> <b>entitylinker</b></a><span style="font-weight: 400"> for Named Entity Linking and Disambiguation.</span></p>
<h2>Dealing with Disambiguation</h2>
<blockquote><p>Japan began the defence of their title with a lucky 2-1 win against Syria in a championship match on Friday.</p></blockquote>
<p><span style="font-weight: 400">Using the above statement, we will discuss the different approaches to choosing the appropriate entity in the case of Entity Disambiguation.</span></p>
<h5>Let&#8217;s see how <a href="https://wikifier.org/"><strong>wikifier</strong></a> deals with the disambiguation:</h5>
<p><a href="https://wikifier.org/"><strong>Wikifier</strong></a> <span style="font-weight: 400">doesn&#8217;t use an entity extraction model to find entities; it relies on part-of-speech (POS) tagging instead.</span></p>
<p><img data-recalc-dims="1" decoding="async" data-attachment-id="911" data-permalink="https://turbolab.in/entity-linking-disambiguation-using-rel/wikifier1/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?fit=1891%2C381&amp;ssl=1" data-orig-size="1891,381" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="wikifier1" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?fit=300%2C60&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?fit=800%2C161&amp;ssl=1" class="size-full wp-image-911 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=800%2C161&#038;ssl=1" alt="" width="800" height="161" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?w=1891&amp;ssl=1 1891w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=300%2C60&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=768%2C155&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=1024%2C206&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=1080%2C218&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=1280%2C258&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=980%2C197&amp;ssl=1 980w, 
https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=480%2C97&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?w=1600&amp;ssl=1 1600w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p><span style="font-weight: 400">The entities Syria and Japan are linked to their respective countries’ Wikipedia pages,</span><a href="https://en.wikipedia.org/wiki/Syria"> <b>Syria</b></a><span style="font-weight: 400"> and</span><a href="https://en.wikipedia.org/wiki/Japan"> <b>Japan</b></a><span style="font-weight: 400">. In the context of the above statement, however, Japan and Syria actually refer to their football teams. Wikifier fetches all the candidate Wikipedia pages for a mention and maps the mention to the page with the most link targets.</span></p>
<p><img data-recalc-dims="1" decoding="async" data-attachment-id="912" data-permalink="https://turbolab.in/entity-linking-disambiguation-using-rel/wikifier2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?fit=483%2C671&amp;ssl=1" data-orig-size="483,671" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="wikifier2" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?fit=216%2C300&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?fit=483%2C671&amp;ssl=1" class="size-full wp-image-912 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?resize=483%2C671&#038;ssl=1" alt="" width="483" height="671" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?w=483&amp;ssl=1 483w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?resize=216%2C300&amp;ssl=1 216w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?resize=480%2C667&amp;ssl=1 480w" sizes="(max-width: 483px) 100vw, 483px" /></p>
<p>Wikifier considers the minLinkFrequency parameter to evaluate the score.</p>
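<p>The toy sketch below (all link counts are invented for illustration) shows why this most-link-targets strategy favors the dominant sense of a name, regardless of context:</p>

```python
# Toy illustration of Wikifier-style candidate selection: each mention
# has candidate Wikipedia pages, and the candidate that the mention's
# anchor text links to most often wins, subject to a minimum
# link-frequency cutoff (cf. Wikifier's minLinkFrequency parameter).
# All counts here are made up for illustration.
def pick_candidate(candidates, min_link_frequency=1):
    """candidates: dict mapping page title -> number of times the
    mention's anchor text links to that page. Returns the best title,
    or None if nothing clears the cutoff."""
    eligible = {t: n for t, n in candidates.items() if n >= min_link_frequency}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)

# Hypothetical link counts for the mention "Japan":
japan_candidates = {
    "Japan": 1500,                        # the country page dominates
    "Japan_national_football_team": 40,   # the contextually correct page
}
print(pick_candidate(japan_candidates))  # -> Japan
```

<p>This is exactly why frequency-based linking picks the country page even when the sentence is about football, and why REL&#8217;s context-aware disambiguation, discussed next, behaves differently.</p>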
<h5>Let&#8217;s see how REL deals with the disambiguation:</h5>
<p>In REL, entity linking decisions depend on the contextual similarity and coherence with the other entity linking decisions in the document. One entity mapping is dependent on the other entities found in the document. You can read the paper <a href="https://arxiv.org/pdf/2006.01969.pdf"><strong>here</strong></a>.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="913" data-permalink="https://turbolab.in/entity-linking-disambiguation-using-rel/rel2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?fit=1435%2C215&amp;ssl=1" data-orig-size="1435,215" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="rel2" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?fit=300%2C45&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?fit=800%2C120&amp;ssl=1" class="size-full wp-image-913 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=800%2C120&#038;ssl=1" alt="" width="800" height="120" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?w=1435&amp;ssl=1 1435w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=300%2C45&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=768%2C115&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=1024%2C153&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=1080%2C162&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=1280%2C192&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=980%2C147&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=480%2C72&amp;ssl=1 480w" 
sizes="(max-width: 800px) 100vw, 800px" /></p>
<p><span style="font-weight: 400">In this example the coherence signal has little impact, since only two entities are found and the content is a one-liner. If we had passed the POS output instead of the entity detection output, the result might have been different.</span></p>
<p>When the entire <a href="https://www.firstpost.com/sports/fifa-world-cup-qualifiers-2022-syria-japan-secure-victories-to-make-it-to-next-round-9694971.html"><strong>article</strong></a> is passed to REL, the results are considerably better. The REL model can now understand the context and relate more entities across the entire article.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="914" data-permalink="https://turbolab.in/entity-linking-disambiguation-using-rel/rel3/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?fit=1135%2C300&amp;ssl=1" data-orig-size="1135,300" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="rel3" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?fit=300%2C79&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?fit=800%2C212&amp;ssl=1" class="size-full wp-image-914 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=800%2C211&#038;ssl=1" alt="" width="800" height="211" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?w=1135&amp;ssl=1 1135w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=300%2C79&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=768%2C203&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=1024%2C271&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=1080%2C285&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=980%2C259&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=480%2C127&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p><strong>Brazil</strong> and <strong>Dutch</strong> are mapped to their respective football team wiki pages. Why <strong>Japan</strong> is still not mapped to its football team remains a mystery, though.</p>
<h2>Conclusion</h2>
<p><span style="font-weight: 400">Instead of going with the score of the most link targets, REL considers the context and the relationships between the entities detected in the document. With improved mention detection, REL can serve as an excellent Entity Disambiguation tool.</span></p>
<p>Last but not least, there is a tool called <a href="https://github.com/SapienzaNLP/extend"><strong>ExtEnD</strong></a> (Extractive Entity Disambiguation) that still needs to be explored. We can add this tool to the spaCy NLP pipeline.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="915" data-permalink="https://turbolab.in/entity-linking-disambiguation-using-rel/extend/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?fit=665%2C178&amp;ssl=1" data-orig-size="665,178" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="extend" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?fit=300%2C80&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?fit=665%2C178&amp;ssl=1" class="size-full wp-image-915 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?resize=665%2C178&#038;ssl=1" alt="" width="665" height="178" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?w=665&amp;ssl=1 665w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?resize=300%2C80&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?resize=480%2C128&amp;ssl=1 480w" sizes="(max-width: 665px) 100vw, 665px" /></p>
<p>The output documented by <strong>ExtEnD</strong> looks much better than the REL-generated output, but as mentioned above, the tool still needs to be explored before drawing conclusions.</p>
<p>The post <a href="https://turbolab.in/entity-linking-disambiguation-using-rel/">Entity Linking &amp; Disambiguation using REL</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/entity-linking-disambiguation-using-rel/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">907</post-id>	</item>
		<item>
		<title>Incremental/Online/Continuous Model Training using Creme</title>
		<link>https://turbolab.in/machine-learning-model-retraining-approach-incremental-online-continuous-model-training-using-creme/</link>
					<comments>https://turbolab.in/machine-learning-model-retraining-approach-incremental-online-continuous-model-training-using-creme/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Fri, 25 Feb 2022 09:01:02 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[creme]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[model retraining]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=889</guid>

					<description><![CDATA[<p>Have you noticed that a trained ML model&#8217;s performance degrades over time? Why does the model performance degrade? Let&#8217;s say we have a model that takes a person&#8217;s data as input and detects the face. Now with the Covid situation, almost 90% of people wear masks and the model will not be able to detect [&#8230;]</p>
<p>The post <a href="https://turbolab.in/machine-learning-model-retraining-approach-incremental-online-continuous-model-training-using-creme/">Incremental/Online/Continuous Model Training using Creme</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Have you noticed that a trained ML model&#8217;s performance degrades over time? Why does the performance degrade? Let&#8217;s say we have a model that takes a person&#8217;s data as input and detects the face. Now with the Covid situation, almost 90% of people wear masks, so the model will not be able to detect faces, which results in low performance and low accuracy. What is this phenomenon called? It is called Model Drift, and it is categorized into Concept Drift and Data Drift.</p>
<p>Concept Drift is when the properties of the dependent variable, i.e., the output/prediction of the model, change.</p>
<p>Data Drift is when the properties of the independent variable, i.e., the input to the model, change.</p>
<blockquote><p><strong>y = a + bx</strong></p></blockquote>
<p>The change in the dependent variable <strong>y</strong> leads to Concept Drift and the change in the independent variable <strong>x</strong> leads to Data Drift.</p>
<blockquote><p><strong>Change is inevitable</strong></p></blockquote>
<p>The world, and the parameters on which we train the model, are going to change over time. Consider another example: a travel agency has a model that takes a person&#8217;s average salary, the season, and the weather as input to predict the number of people traveling to some country X. With the Covid regulations of border closures and flying restrictions, plus job losses, inflation, and the change in people&#8217;s mindset, the model would go for a toss.</p>
<h2>How to detect model drift?</h2>
<p><strong>Monitoring</strong> the model in production is the only way to detect model drift: set thresholds on metrics such as precision, recall, and F1-score, and trigger alerts through monitoring tools when the metrics fall below them. <a href="https://evidentlyai.com/"><strong>Evidently AI</strong></a> is one such monitoring tool.</p>
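<p>The alerting logic itself can be sketched in plain Python (the labels and the 0.8 threshold are illustrative; in practice a monitoring tool computes these metrics for you):</p>

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical ground truth vs. model predictions gathered in production.
metrics = precision_recall_f1([1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
                              [1, 0, 0, 1, 0, 1, 1, 0, 0, 1])
THRESHOLD = 0.8
alerts = [name for name, value in metrics.items() if value < THRESHOLD]
print(alerts)  # metrics that fell below the threshold and should raise an alert
```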
<h2>How to avoid model drift?</h2>
<p>Either we train the model continuously on the streaming data as it arrives, or we retrain it on a schedule (weekly, monthly, etc.) with the updated data. Retraining from scratch with the latest data is not an efficient way to handle models already deployed in production; online/incremental model training is the more efficient approach.</p>
<h2>Model Retraining Approach</h2>
<p>Let us explore the Python library <strong><a href="https://pypi.org/project/creme/">creme</a></strong> to train an incremental ML model on streaming data, one record at a time.</p>
<p>With creme, we encourage a different approach, which is to continuously learn from a stream of data. This means that the model processes one observation at a time, and can therefore be updated on the fly. This allows learning from massive datasets that don&#8217;t fit in main memory. Online machine learning also integrates nicely in cases where new data is constantly arriving. It shines in many use cases, such as time series forecasting, spam filtering, recommender systems, CTR prediction, and IoT applications. If you&#8217;re bored with retraining models and want to instead build dynamic models, then online machine learning (and therefore creme!) might be what you&#8217;re looking for.</p>
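<p>The core idea of updating a model from a single observation and then discarding it can be illustrated with one online stochastic-gradient step for logistic regression (plain Python, independent of creme; the feature values and learning rate are illustrative):</p>

```python
import math

def sgd_step(weights, x, y, lr=0.1):
    """Update logistic-regression weights from a single (x, y) observation."""
    z = sum(w * xi for w, xi in zip(weights, x))
    pred = 1 / (1 + math.exp(-z))          # current probability estimate
    return [w + lr * (y - pred) * xi for w, xi in zip(weights, x)]

weights = [0.0, 0.0]
stream = [([1.0, 2.0], 1), ([1.0, -1.5], 0), ([1.0, 3.0], 1)]
for x, y in stream:                         # one record at a time
    weights = sgd_step(weights, x, y)       # model is updated on the fly
print(weights)
```

<p>creme&#8217;s <code>fit_one</code> applies the same one-observation-at-a-time pattern behind a scikit-learn-like API.</p>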
<h3>Install from PyPI</h3>
<blockquote><p><em><strong>pip install creme</strong></em></p></blockquote>
<h3>Dataset to train the model</h3>
<blockquote><p><em><strong>docs = [</strong></em><br />
<em><strong>(&#8220;Cricket news: England James Anderson determined to revive international career despite West Indies axing&#8221;, &#8220;Cricket&#8221;),</strong></em><br />
<em><strong>(&#8220;Well Have Just One Head Coach For All Cricket Formats: CA Chairman&#8221;, &#8220;Cricket&#8221;),</strong></em><br />
<em><strong>(&#8220;Rod Marsh: Australian cricket legend in critical condition after suffering heart attack&#8221;, &#8220;Cricket&#8221;),</strong></em><br />
<em><strong>(&#8220;Facebook, Twitter highlight security steps for users in Ukraine&#8221;, &#8220;Technology&#8221;),</strong></em><br />
<em><strong>(&#8220;Apple launching new series of iphone&#8221;, &#8220;Technology&#8221;),</strong></em><br />
<em><strong>(&#8220;Galaxy S22 preorder sales indicate the phone is already a huge success&#8221;, &#8220;Technology&#8221;),</strong></em><br />
<em><strong>]</strong></em></p></blockquote>
<h3>Setting up the model pipeline</h3>
<blockquote><p><em><strong>from creme import compose</strong></em><br />
<em><strong>from creme import feature_extraction</strong></em><br />
<em><strong>from creme import naive_bayes</strong></em></p>
<p><em><strong>model = compose.Pipeline(</strong></em><br />
<em><strong>    ('tokenize', feature_extraction.TFIDF(lowercase=True)),</strong></em><br />
<em><strong>    ('nb', naive_bayes.MultinomialNB(alpha=1))</strong></em><br />
<em><strong>)</strong></em></p></blockquote>
<p>Here, we are using TF-IDF as the feature-extraction method and Naive Bayes as the ML algorithm.</p>
<p>These are other feature-extraction methods we can try:</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="892" data-permalink="https://turbolab.in/machine-learning-model-retraining-approach-incremental-online-continuous-model-training-using-creme/feature-extraction/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/02/feature-extraction.png?fit=591%2C207&amp;ssl=1" data-orig-size="591,207" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="feature extraction" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/02/feature-extraction.png?fit=300%2C105&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/02/feature-extraction.png?fit=591%2C207&amp;ssl=1" class="aligncenter wp-image-892 size-medium" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/02/feature-extraction-300x105.png?resize=300%2C105&#038;ssl=1" alt="" width="300" height="105" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/02/feature-extraction.png?resize=300%2C105&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/02/feature-extraction.png?resize=480%2C168&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/02/feature-extraction.png?w=591&amp;ssl=1 591w" sizes="(max-width: 300px) 100vw, 300px" /></p>
<h3>Fitting the data to the model, one record at a time</h3>
<blockquote><p><em><strong>%%time</strong></em><br />
<em><strong>for sentence, label in docs:</strong></em><br />
<em><strong>    model = model.fit_one(sentence, label)</strong></em></p>
<p>Wall time: 998 µs</p></blockquote>
<h3>Predictions &#8211; Testing the model</h3>
<blockquote><p><em><strong>model.predict_one(&#8220;Traffic arrangements for Australian cricket team’s visit, Pakistan Day events reviewed&#8221;)</strong></em><br />
Out: &#8216;Cricket&#8217;</p>
<p><em><strong>model.predict_one(&#8220;Launching Facebook Reels Globally and New Ways for Creators to Make Money&#8221;)</strong></em><br />
Out: &#8216;Technology&#8217;</p>
<p><em><strong>test = &#8220;Footballer Koo Ja-cheol to Return to K-League After More Than Decade Abroad&#8221;</strong></em><br />
<em><strong>model.predict_one(test)</strong></em><br />
Out: &#8216;Cricket&#8217;</p></blockquote>
<p>As we can see in the tests above, the last record, a piece of football news, is predicted as Cricket. Both are related to the sports category, but we can simply train our model on a new Football category.</p>
<h3 id="Training-on-a-new-data-and-new-category">Training on new data and a new category</h3>
<blockquote><p><em><strong>newDocs = [&#8220;Footballer took out insurance policy on BMW minutes after smashing into parked cars&#8221;, &#8220;Russian footballer Fedor Smolov, a 32-year-old striker currently playing for his country, became one of the first Russian sportsmen to express his heartbreak at the invasion of Ukraine by his country.&#8221;, &#8220;Ukraine’s international footballer Roman Yaremchuk scored the equalizer for Benfica in a Champions League match&#8221;]</strong></em></p>
<p><em><strong>for doc_ in newDocs:</strong></em><br />
<em><strong>    model.fit_one(doc_, &#8220;Football&#8221;)</strong></em></p></blockquote>
<h3>Retesting the model</h3>
<blockquote><p><em><strong>test = &#8220;Footballer Koo Ja-cheol to Return to K-League After More Than Decade Abroad&#8221;</strong></em><br />
<em><strong>model.predict_one(test)</strong></em><br />
Out: &#8216;Football&#8217;</p></blockquote>
<p>We can update the model with new data for an existing category, or with new data for an entirely new category.</p>
<p>Some benefits of using creme (and online machine learning in general):</p>
<ol>
<li>Incremental: models can update themselves in real-time.</li>
<li>Adaptive: models can adapt to concept drift.</li>
<li>Production-ready: working with data streams makes it simple to replicate production scenarios during model development.</li>
<li>Efficient: models don&#8217;t have to be retrained and require little compute power, which lowers their carbon footprint.</li>
<li>Fast: when the goal is to learn and predict with a single instance at a time, then creme is an order of magnitude faster than PyTorch, Tensorflow, and scikit-learn.</li>
</ol>
<p>The post <a href="https://turbolab.in/machine-learning-model-retraining-approach-incremental-online-continuous-model-training-using-creme/">Incremental/Online/Continuous Model Training using Creme</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/machine-learning-model-retraining-approach-incremental-online-continuous-model-training-using-creme/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">889</post-id>	</item>
		<item>
		<title>Lazy Predict &#8211; Find the best suitable ML model</title>
		<link>https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/</link>
					<comments>https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/#respond</comments>
		
		<dc:creator><![CDATA[Anthony]]></dc:creator>
		<pubDate>Tue, 18 Jan 2022 06:38:11 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[prediction]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regression]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=869</guid>

					<description><![CDATA[<p>As in the earlier blog “text classification using machine learning”, we saw how difficult and time-consuming it is to select the best ML model and tune its parameters to achieve better accuracy. To overcome this problem, we will discuss here an awesome Python library, “Lazy Predict”. This module helps us [&#8230;]</p>
<p>The post <a href="https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/">Lazy Predict &#8211; Find the best suitable ML model</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400">As in the earlier blog “<a href="https://turbolab.in/text-classification-using-machine-learning/">text classification using machine learning</a>”, we saw how difficult and time-consuming it is to select the best ML model and tune its parameters to achieve better accuracy. To overcome this problem, we will discuss here an awesome Python library, “<a href="https://lazypredict.readthedocs.io/en/latest/">Lazy Predict</a>”. This module helps us find the best model for classification and regression based on our data.</span></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400">It provides a LazyClassifier for classification problems and a LazyRegressor for regression problems. </span></p>
<ul>
<li><strong>Note: </strong>Lazy Predict requires significant computational power, and it was a little time-consuming for me to run it on high-dimensional data with multiple features.</li>
</ul>
<p>&nbsp;</p>
<p><b>Let us see how it works:</b></p>
<p><span style="font-weight: 400">First, install this library in your local system:</span></p>
<blockquote><p><i><span style="font-weight: 400">pip install lazypredict</span></i></p></blockquote>
<p>&nbsp;</p>
<p>&nbsp;</p>
<h3><b>Dataset</b></h3>
<p>&nbsp;</p>
<p><span style="font-weight: 400">Here we are not concentrating more on the dataset or its feature extraction and transformation steps, as it has been shown in the previous blog on “</span><a href="https://turbolab.in/text-classification-using-machine-learning/"><span style="font-weight: 400">text classification using machine learning</span></a><span style="font-weight: 400">”. </span></p>
<p><span style="font-weight: 400">To demonstrate Lazy Predict on classification and regression problems, we are using the &#8220;Drug Type&#8221; and &#8220;Wine Quality&#8221; datasets, both taken from kaggle.com.</span></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<h3><b>Code</b></h3>
<p>&nbsp;</p>
<h4><b>Importing required libraries</b></h4>
<p>&nbsp;</p>
<blockquote><p><i><span style="font-weight: 400">import lazypredict</span></i></p>
<p><i><span style="font-weight: 400">import pandas as pd </span></i></p>
<p><i><span style="font-weight: 400">from sklearn.model_selection import train_test_split </span></i></p>
<p><i><span style="font-weight: 400">from lazypredict.Supervised import LazyClassifier, LazyRegressor</span></i></p></blockquote>
<p>&nbsp;</p>
<h4><b>Importing data and LazyClassifier model fitting</b></h4>
<p>&nbsp;</p>
<blockquote><p><i><span style="font-weight: 400">classificationData = pd.read_csv(&#8220;drugType.csv&#8221;)</span></i></p>
<p><i><span style="font-weight: 400">classificationData.head()</span></i></p></blockquote>
<p>&nbsp;</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="870" data-permalink="https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/screenshot-from-2022-01-10-20-37-38/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-37-38.png?fit=406%2C185&amp;ssl=1" data-orig-size="406,185" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot from 2022-01-10 20-37-38" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-37-38.png?fit=300%2C137&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-37-38.png?fit=406%2C185&amp;ssl=1" class="wp-image-870 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-37-38.png?resize=443%2C203&#038;ssl=1" alt="" width="443" height="203" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-37-38.png?resize=300%2C137&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-37-38.png?w=406&amp;ssl=1 406w" sizes="(max-width: 443px) 100vw, 443px" /></p>
<p>&nbsp;</p>
<blockquote><p><i><span style="font-weight: 400">X = classificationData.drop(columns="Drug")</span></i></p>
<p><i><span style="font-weight: 400">y = classificationData["Drug"]</span></i></p>
<p><i><span style="font-weight: 400"># Splitting our data into a train and test set</span></i></p>
<p><i><span style="font-weight: 400">X_train, X_test, y_train, y_test = train_test_split(X, y,</span></i></p>
<p><i><span style="font-weight: 400">                                                    test_size=0.2,</span></i></p>
<p><i><span style="font-weight: 400">                                                    random_state=42)</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400">classifiers = LazyClassifier(ignore_warnings=True, custom_metric=None)</span></i></p>
<p><i><span style="font-weight: 400">models,predictions = classifiers.fit(X_train, X_test, y_train, y_test)</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400">print(models)</span></i></p></blockquote>
<p>&nbsp;</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="871" data-permalink="https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/screenshot-from-2022-01-10-20-53-05/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?fit=767%2C568&amp;ssl=1" data-orig-size="767,568" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot from 2022-01-10 20-53-05" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?fit=300%2C222&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?fit=767%2C568&amp;ssl=1" class="wp-image-871 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?resize=534%2C395&#038;ssl=1" alt="" width="534" height="395" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?resize=300%2C222&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?resize=480%2C355&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-10-20-53-05.png?w=767&amp;ssl=1 767w" sizes="(max-width: 534px) 100vw, 534px" /></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400">Here the classifier&#8217;s fit call returns two values: a table of model names with their accuracy metrics, and each model&#8217;s predictions. </span></p>
<p>&nbsp;</p>
<h4><b>Importing data and LazyRegressor model fitting</b></h4>
<p>&nbsp;</p>
<blockquote><p><i><span style="font-weight: 400">regressionData = pd.read_csv(&#8220;winequality.csv&#8221;)</span></i></p>
<p><i><span style="font-weight: 400">regressionData.head()</span></i></p></blockquote>
<p>&nbsp;</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="875" data-permalink="https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/screenshot-from-2022-01-11-17-38-16/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?fit=1063%2C188&amp;ssl=1" data-orig-size="1063,188" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot from 2022-01-11 17-38-16" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?fit=300%2C53&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?fit=800%2C141&amp;ssl=1" class=" wp-image-875 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?resize=663%2C117&#038;ssl=1" alt="" width="663" height="117" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?resize=300%2C53&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?resize=768%2C136&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?resize=1024%2C181&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?resize=980%2C173&amp;ssl=1 980w, 
https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?resize=480%2C85&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-38-16.png?w=1063&amp;ssl=1 1063w" sizes="(max-width: 663px) 100vw, 663px" /></p>
<p>&nbsp;</p>
<blockquote><p><i><span style="font-weight: 400">X = regressionData.drop(columns="quality")</span></i></p>
<p><i><span style="font-weight: 400">y = regressionData["quality"]</span></i></p>
<p><i><span style="font-weight: 400"># Splitting our data into a train and test set</span></i></p>
<p><i><span style="font-weight: 400">X_train, X_test, y_train, y_test = train_test_split(X, y,</span></i></p>
<p><i><span style="font-weight: 400">                                                    test_size=0.2, random_state = 42)</span></i></p>
<p><i><span style="font-weight: 400">regressors = LazyRegressor(ignore_warnings=True, custom_metric=None)</span></i></p>
<p><i><span style="font-weight: 400">models, predictions = regressors.fit(X_train, X_test, y_train, y_test)</span></i></p>
<p><i><span style="font-weight: 400">print(models)</span></i></p></blockquote>
<p>&nbsp;</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="882" data-permalink="https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/screenshot-from-2022-01-11-17-43-05-2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?fit=629%2C696&amp;ssl=1" data-orig-size="629,696" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot from 2022-01-11 17-43-05" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?fit=271%2C300&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?fit=629%2C696&amp;ssl=1" class=" wp-image-882 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?resize=515%2C570&#038;ssl=1" alt="" width="515" height="570" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?resize=271%2C300&amp;ssl=1 271w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?resize=480%2C531&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/01/Screenshot-from-2022-01-11-17-43-05-1.png?w=629&amp;ssl=1 629w" sizes="(max-width: 515px) 100vw, 515px" /></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<h3><b>Conclusion</b></h3>
<p>&nbsp;</p>
<p><span style="font-weight: 400">Here, when we use the &#8220;Lazy Predict&#8221; library, many different models are fitted on our data, and the results provide accuracy metrics for each of them. Observing the results, we can then select the top 5 base models with the best accuracy. </span></p>
<p><span style="font-weight: 400">Later we can tune the parameters of those top models to get better accuracy. </span></p>
<p><span style="font-weight: 400">As this library runs many different models at once, it takes a lot of computational power. If you have limited computational power, I would suggest using Google Colab.</span></p>
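<p>Once Lazy Predict has surfaced the strongest candidates, those models can be tuned with a standard grid search. A minimal sketch using scikit-learn (the synthetic data and the parameter grid are illustrative; substitute the model Lazy Predict ranked highest and your own data):</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Cross-validated search over a small, illustrative parameter grid.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```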
<p>&nbsp;</p>
<p>The post <a href="https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/">Lazy Predict &#8211; Find the best suitable ML model</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/lazy-predict-find-the-best-suitable-ml-model/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">869</post-id>	</item>
		<item>
		<title>Text Classification with Keras and GloVe Word Embeddings</title>
		<link>https://turbolab.in/text-classification-with-keras-and-glove-word-embeddings/</link>
					<comments>https://turbolab.in/text-classification-with-keras-and-glove-word-embeddings/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Fri, 31 Dec 2021 13:25:16 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[deep learning]]></category>
		<category><![CDATA[keras]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[text classification]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=603</guid>

					<description><![CDATA[<p>Deep Learning (DL) is a subset of Machine Learning. It is a method of statistical learning that extracts features or attributes from raw data. DL uses a network of algorithms called artificial neural networks, which imitate the function of the neural networks present in the human brain. DL takes the data into a network of layers (Input, Hidden [&#8230;]</p>
<p>The post <a href="https://turbolab.in/text-classification-with-keras-and-glove-word-embeddings/">Text Classification with Keras and GloVe Word Embeddings</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p class="graf graf--p"><strong class="markup--strong markup--p-strong">Deep Learning (DL)</strong> is a subset of <strong class="markup--strong markup--p-strong">Machine Learning</strong>. It is a method of statistical learning that extracts features or attributes from raw data. <strong class="markup--strong markup--p-strong">DL</strong> uses a network of algorithms called artificial neural networks, which imitate the function of the neural networks present in the human brain. <strong class="markup--strong markup--p-strong">DL</strong> passes the data through a network of layers (Input, Hidden &amp; Output) to extract features and to learn from the data.</p>
<p>In this blog, we will learn how to train a supervised text classification model using the DL python module called <em><strong>Keras</strong></em> and pre-trained <em><strong>GloVe</strong></em> word embeddings to transform the text data into a machine-understandable numerical representation. We will be using <strong>Convolutional Neural Networks</strong>(CNN) architecture to train the classification model.</p>
<p>The dataset and its category labels are discussed in the <a href="https://turbolab.in/text-classification-using-machine-learning/"><strong>Text Classification using Machine Learning</strong></a> blog. Please refer to that blog; we will be using the same dataset here to train our CNN model to predict the category of a given text.</p>
<p>Assume the dataset is loaded as a pandas dataframe called <strong>df</strong> in the code snippets below.</p>
<figure id="attachment_684" aria-describedby="caption-attachment-684" style="width: 512px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="684" data-permalink="https://turbolab.in/text-classification-using-machine-learning/datasample/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?fit=512%2C442&amp;ssl=1" data-orig-size="512,442" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1634309405&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="datasample" data-image-description="" data-image-caption="&lt;p&gt;dataset&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?fit=300%2C259&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?fit=512%2C442&amp;ssl=1" class="size-full wp-image-684" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?resize=512%2C442&#038;ssl=1" alt="dataset" width="512" height="442" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?w=512&amp;ssl=1 512w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?resize=300%2C259&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?resize=480%2C414&amp;ssl=1 480w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption id="caption-attachment-684" class="wp-caption-text">dataset</figcaption></figure>
<h2>Cleaning the dataset</h2>
<p>The data cleaning part is also discussed in the <a href="https://turbolab.in/text-classification-using-machine-learning/"><strong>blog</strong></a>.</p>
<figure id="attachment_849" aria-describedby="caption-attachment-849" style="width: 752px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="849" data-permalink="https://turbolab.in/text-classification-with-keras-and-glove-word-embeddings/dl_cleaning-2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/DL_cleaning-1.jpg?fit=752%2C332&amp;ssl=1" data-orig-size="752,332" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1640861599&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="DL_cleaning" data-image-description="" data-image-caption="&lt;p&gt;Cleaned Dataset&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/DL_cleaning-1.jpg?fit=300%2C132&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/DL_cleaning-1.jpg?fit=752%2C332&amp;ssl=1" class="size-full wp-image-849" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/DL_cleaning-1.jpg?resize=752%2C332&#038;ssl=1" alt="" width="752" height="332" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/DL_cleaning-1.jpg?w=752&amp;ssl=1 752w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/DL_cleaning-1.jpg?resize=300%2C132&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/DL_cleaning-1.jpg?resize=480%2C212&amp;ssl=1 480w" sizes="(max-width: 752px) 100vw, 752px" /><figcaption id="caption-attachment-849" class="wp-caption-text">Cleaned Dataset</figcaption></figure>
<p>We used stemming and stopword removal on the dataset content there. We could replace stemming with lemmatization; check out this blog on <a href="https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/"><strong>stemming vs lemmatization</strong></a> for the differences. For this DL blog, however, we skip stemming, because it can produce meaningless tokens that have no GloVe embedding vector.</p>
<p>As a part of data preparation, we are going to perform these operations on the dataset <strong>df</strong></p>
<ol>
<li>Lowercasing the content, because the GloVe embedding vectors are generated for lowercase words.</li>
<li>Stripping whitespace and keeping only alphanumerics and spaces, i.e., removing special characters.</li>
<li>Dropping the null and empty rows from the df.</li>
<li>Dropping the duplicates from the df.</li>
</ol>
<blockquote><p><em><strong>df = df[['content', 'label']]</strong></em><br />
<em><strong>df = df.dropna()</strong></em><br />
<em><strong>df = df.astype('str').applymap(str.lower)</strong></em><br />
<em><strong>df = df.applymap(str.strip).replace(r"[^a-z0-9 ]+", "", regex=True)</strong></em><br />
<em><strong>df = df.drop_duplicates()</strong></em></p></blockquote>
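<p>The cleaning steps can be sketched end-to-end on a toy DataFrame (the rows and labels below are invented for illustration). Two details matter in recent pandas: <strong>dropna()</strong> should run before casting to <strong>str</strong>, otherwise <strong>None</strong> becomes the literal string 'none' and survives the null check, and <strong>replace()</strong> needs <strong>regex=True</strong> to act as a regular-expression substitution rather than a whole-cell match.</p>

```python
import pandas as pd

# Toy stand-in for the blog's df; the rows and labels are invented.
df = pd.DataFrame({
    "content": ["  Hello, World!  ", "Keras & GloVe", "Keras & GloVe", None],
    "label": ["greet", "tech", "tech", "tech"],
})

df = df[["content", "label"]]
df = df.dropna()                     # drop null rows before casting to str
df = df.astype("str").applymap(str.lower)
df = df.applymap(str.strip).replace(r"[^a-z0-9 ]+", "", regex=True)
df = df.drop_duplicates().reset_index(drop=True)

print(df["content"].tolist())        # ['hello world', 'keras  glove']
```

<p>Note the double space left behind by the removed ampersand: some pipelines also collapse repeated whitespace after this step.</p>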
<h2>Loading the Glove Embeddings</h2>
<p><a href="https://nlp.stanford.edu/projects/glove/"><strong>GloVe</strong></a> is an unsupervised learning algorithm for obtaining vector representations for words. The models were trained on Wikipedia, Twitter, and Common Crawl data, yielding pre-trained word vectors that differ in download size, token count, and vocabulary size. For this blog, we will use the <strong>glove.6B.100d.txt</strong> pre-trained word vectors.</p>
<figure id="attachment_852" aria-describedby="caption-attachment-852" style="width: 500px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="852" data-permalink="https://turbolab.in/text-classification-with-keras-and-glove-word-embeddings/glove/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?fit=1315%2C942&amp;ssl=1" data-orig-size="1315,942" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1640944657&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="glove" data-image-description="" data-image-caption="&lt;p&gt;Glove Embeddings&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?fit=300%2C215&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?fit=800%2C573&amp;ssl=1" class="wp-image-852" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?resize=500%2C358&#038;ssl=1" alt="" width="500" height="358" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?resize=300%2C215&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?resize=768%2C550&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?resize=1024%2C734&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?resize=1080%2C774&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?resize=1280%2C917&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?resize=980%2C702&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?resize=480%2C344&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/glove.jpg?w=1315&amp;ssl=1 1315w" sizes="(max-width: 500px) 100vw, 500px" /><figcaption id="caption-attachment-852" class="wp-caption-text">Glove Embeddings</figcaption></figure>
<p>In the above image, we can see that the words <strong>that</strong>, <strong>on</strong>, <strong>is</strong>, and <strong>was</strong> are each represented by a vector of coefficients.</p>
<blockquote><p><em><strong>def loading_embeddings():</strong></em><br />
<em><strong>    """ loading glove embeddings """</strong></em><br />
<em><strong>    embeddings_index = {}</strong></em><br />
<em><strong>    f = open(glove_path + 'glove.6B.100d.txt', encoding="utf8") # loading the file</strong></em><br />
<em><strong>    for line in f:</strong></em><br />
<em><strong>        values = line.split()</strong></em><br />
<em><strong>        word = values[0]</strong></em><br />
<em><strong>        coefs = np.asarray(values[1:], dtype='float32')</strong></em><br />
<em><strong>        embeddings_index[word] = coefs</strong></em><br />
<em><strong>    f.close()</strong></em><br />
<em><strong>    return embeddings_index</strong></em></p></blockquote>
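<p>Each line of <strong>glove.6B.100d.txt</strong> is a word followed by its vector coefficients, separated by spaces. As a minimal sketch, the same parsing logic can be run on two invented lines in that format (real vectors have 100 floats; these use 4 for brevity), followed by a cosine-similarity check between the two parsed vectors:</p>

```python
import numpy as np

# Two invented lines in the GloVe file format: word, then its coefficients.
sample = "king 0.1 0.3 0.5 0.7\nqueen 0.1 0.2 0.5 0.8"

embeddings_index = {}
for line in sample.splitlines():
    values = line.split()
    embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine(embeddings_index["king"], embeddings_index["queen"])
print(round(sim, 3))  # 0.99 -- the invented vectors point in similar directions
```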
<h2>Preparing the Embedding Matrix</h2>
<div>
<blockquote>
<div><em><strong><span class="pl-v">MAX_NB_WORDS</span> <span class="pl-c1">=</span> <span class="pl-c1">100000</span></strong></em></div>
<div></div>
<div><em><strong>def prepare_embedding_matrix(word_index):</strong></em></div>
<div><em><strong>    &#8220;&#8221;&#8221; preparing embedding matrix with our data set &#8220;&#8221;&#8221;</strong></em></div>
<div></div>
<div><em><strong>    embeddings_index = loading_embeddings()</strong></em></div>
<div><em><strong>    num_words = min(MAX_NB_WORDS, len(word_index))</strong></em></div>
<div><em><strong>    embedding_matrix = np.zeros((num_words + 1, EMBEDDING_DIM))</strong></em></div>
<div></div>
<div><em><strong>    for word, i in word_index.items():</strong></em></div>
<div><em><strong>        if i &gt;= MAX_NB_WORDS:</strong></em></div>
<div><em><strong>            continue</strong></em></div>
<div><em><strong>        embedding_vector = embeddings_index.get(word)</strong></em></div>
<div><em><strong>        if embedding_vector is not None:</strong></em></div>
<div><em><strong>            # </strong>words not found in embedding index will be all-zeros.</em></div>
<div><em><strong>            embedding_matrix[i] = embedding_vector</strong></em></div>
<div><em><strong>    return embedding_matrix, num_words</strong></em></div>
</blockquote>
<div><em><strong>MAX_NB_WORDS </strong></em>is the maximum number of words the tokenizer considers as features.</div>
<div><em><strong>word_index </strong></em>is the tokenizer&#8217;s dictionary of unique words, built by fitting the tokenizer on our dataset content.</div>
</div>
<div></div>
<div>The minimum of these two values is <em><strong>num_words</strong></em>; we pass <em><strong>num_words + 1</strong></em> as the <em><strong>input_dim</strong></em> of the Keras Embedding layer, since word indices start at 1 and index 0 is used for padding.</div>
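<p>A minimal sketch of the matrix preparation with a toy <strong>word_index</strong> and invented vectors (here <strong>EMBEDDING_DIM</strong> is 4 instead of 100). The Keras tokenizer numbers words from 1, so the matrix gets <strong>num_words + 1</strong> rows and row 0 is left as zeros for the padding index; rows for words with no GloVe vector also stay all-zero:</p>

```python
import numpy as np

MAX_NB_WORDS = 100000
EMBEDDING_DIM = 4                     # 100 for glove.6B.100d.txt

# Toy inputs -- the words and vectors below are invented for illustration.
word_index = {"the": 1, "cat": 2, "xyzzy": 3}      # "xyzzy" has no vector
embeddings_index = {
    "the": np.ones(4, dtype="float32"),
    "cat": np.full(4, 2.0, dtype="float32"),
}

num_words = min(MAX_NB_WORDS, len(word_index))
embedding_matrix = np.zeros((num_words + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NB_WORDS:
        continue
    vec = embeddings_index.get(word)
    if vec is not None:
        embedding_matrix[i] = vec      # out-of-vocabulary rows stay all-zero

print(embedding_matrix[3])  # [0. 0. 0. 0.] -- no vector found for "xyzzy"
```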
<div></div>
<h2>Preparing the dataset for the model to train</h2>
<div>
<blockquote>
<div><em><strong><span class="pl-v">MAX_SEQUENCE_LENGTH</span> <span class="pl-c1">=</span> <span class="pl-c1">1000</span></strong></em></div>
<div><em><strong><span class="pl-v">VALIDATION_SPLIT</span> <span class="pl-c1">=</span> <span class="pl-c1">0.1</span></strong></em></div>
<div></div>
<div><em><strong>def vectorizing_data(df):</strong></em></div>
<div><em><strong>    &#8220;&#8221;&#8221; vectorizing and splitting the data for training, testing, validating &#8220;&#8221;&#8221;</strong></em></div>
<div><em><strong>    label_s = df[&#8216;label&#8217;].tolist()</strong></em></div>
<div><em><strong>    l = list(set(label_s))</strong></em></div>
<div><em><strong>    l.sort()</strong></em></div>
<div><em><strong>    labels_index = dict([(j,i) for i, j in enumerate(l)])</strong></em></div>
<div><em><strong>    labels = [labels_index[i] for i in label_s]</strong></em></div>
<div><em><strong>    print(&#8216;Found %s texts.&#8217; % len(df[&#8216;content&#8217;]))</strong></em></div>
<div><em><strong>    print(&#8216;labels_index &#8212; &#8216;, labels_index)</strong></em></div>
<div></div>
<div><em><strong>    tokenizer = Tokenizer(num_words=MAX_NB_WORDS)</strong></em></div>
<div><em><strong>    tokenizer.fit_on_texts(df[&#8216;content&#8217;])</strong></em></div>
<div><em><strong>    sequences = tokenizer.texts_to_sequences(df[&#8216;content&#8217;])</strong></em></div>
<div><em><strong>    word_index = tokenizer.word_index</strong></em></div>
<div><em><strong>    print(&#8216;Found %s unique tokens.&#8217; % len(word_index))</strong></em></div>
<div></div>
<div><em><strong>    df = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)</strong></em></div>
<div><em><strong>    labels = to_categorical(np.asarray(labels))</strong></em></div>
<div></div>
<div><em><strong>    # randomizing and splitting the df into a training set, test set and a validation set</strong></em></div>
<div><em><strong>    indices = np.arange(df.shape[0])</strong></em></div>
<div><em><strong>    np.random.shuffle(indices)</strong></em></div>
<div><em><strong>    df = df[indices]</strong></em></div>
<div><em><strong>    labels = labels[indices]</strong></em></div>
<div><em><strong>    num_validation_samples = int(VALIDATION_SPLIT * df.shape[0])</strong></em></div>
<div><em><strong>    x_val = df[-num_validation_samples:]</strong></em></div>
<div><em><strong>    y_val = labels[-num_validation_samples:]</strong></em></div>
<div><em><strong>    # carve the test set out of a separate slice so it stays disjoint from the training data</strong></em></div>
<div><em><strong>    x_test = df[-2 * num_validation_samples:-num_validation_samples]</strong></em></div>
<div><em><strong>    y_test = labels[-2 * num_validation_samples:-num_validation_samples]</strong></em></div>
<div><em><strong>    x_train = df[:-2 * num_validation_samples]</strong></em></div>
<div><em><strong>    y_train = labels[:-2 * num_validation_samples]</strong></em></div>
<div><em><strong>    return x_train, y_train, x_test, y_test, x_val, y_val, word_index</strong></em></div>
</blockquote>
<div>We split the dataset into training, test, and validation sets, create a tokenizer, and generate the word_index by fitting it on the dataset content. The sequences are padded to a maximum length of <em><strong>MAX_SEQUENCE_LENGTH</strong></em>.</div>
</div>
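<p>With default arguments, Keras' <strong>pad_sequences</strong> left-pads shorter sequences with zeros and truncates longer ones from the front ('pre' padding and truncation). A minimal pure-NumPy stand-in, written here as a sketch rather than the real Keras implementation, makes that behaviour concrete:</p>

```python
import numpy as np

def pad_sequences_sketch(sequences, maxlen):
    """Minimal stand-in for keras pad_sequences defaults: left-pad with 0
    and keep only the last `maxlen` tokens of over-long sequences."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]                       # 'pre' truncation
        out[row, maxlen - len(trimmed):] = trimmed    # 'pre' padding
    return out

print(pad_sequences_sketch([[5, 2], [7, 1, 9, 4, 3]], maxlen=4))
# [[0 0 5 2]
#  [1 9 4 3]]
```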
<h2>Model construction</h2>
<div>
<blockquote>
<div><em><strong><span class="pl-v">EMBEDDING_DIM</span> <span class="pl-c1">=</span> <span class="pl-c1">100</span></strong></em></div>
<div></div>
<div><em><strong>def model_generation(embedding_matrix, num_words):</strong></em></div>
<div><em><strong>    &#8220;&#8221;&#8221; model generation &#8220;&#8221;&#8221;</strong></em></div>
<div><em><strong>    embedding_layer = Embedding(num_words + 1,</strong></em></div>
<div><em><strong>                                EMBEDDING_DIM,</strong></em></div>
<div><em><strong>                                weights=[embedding_matrix],</strong></em></div>
<div><em><strong>                                input_length=MAX_SEQUENCE_LENGTH,</strong></em></div>
<div><em><strong>                                trainable=False)</strong></em></div>
<div><em><strong>    convs = []</strong></em></div>
<div><em><strong>    filter_sizes = [3,4,5]</strong></em></div>
<div><em><strong>    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype=&#8217;int32&#8242;)</strong></em></div>
<div><em><strong>    embedded_sequences = embedding_layer(sequence_input)</strong></em></div>
<div><em><strong>    for fsz in filter_sizes:</strong></em></div>
<div><em><strong>        l_conv = Conv1D(filters=128, kernel_size=fsz, activation=&#8217;relu&#8217;)(embedded_sequences)</strong></em></div>
<div><em><strong>        l_pool = MaxPooling1D(5)(l_conv)</strong></em></div>
<div><em><strong>        convs.append(l_pool)</strong></em></div>
<div><em><strong>    l_merge = Concatenate(axis=1)(convs)</strong></em></div>
<div><em><strong>    l_cov1= Conv1D(filters=128, kernel_size=5, activation=&#8217;relu&#8217;)(l_merge)</strong></em></div>
<div><em><strong>    l_cov1 = Dropout(0.2)(l_cov1)</strong></em></div>
<div><em><strong>    l_pool1 = MaxPooling1D(5)(l_cov1)</strong></em></div>
<div><em><strong>    l_cov2 = Conv1D(filters=128, kernel_size=5, activation=&#8217;relu&#8217;)(l_pool1)</strong></em></div>
<div><em><strong>    l_cov2 = Dropout(0.2)(l_cov2)</strong></em></div>
<div><em><strong>    l_pool2 = MaxPooling1D(30)(l_cov2)</strong></em></div>
<div><em><strong>    l_flat = Flatten()(l_pool2)</strong></em></div>
<div><em><strong>    l_dense = Dense(128, activation=&#8217;relu&#8217;)(l_flat)</strong></em></div>
<div><em><strong>    preds = Dense(label_count, activation=&#8217;softmax&#8217;)(l_dense)</strong></em></div>
<div><em><strong>    model = Model(sequence_input, preds)</strong></em></div>
<div><em><strong>    return model</strong></em></div>
</blockquote>
<div>The model summary looks like this</div>
<div></div>
</div>
<div>
<div>
<figure id="attachment_856" aria-describedby="caption-attachment-856" style="width: 819px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="856" data-permalink="https://turbolab.in/text-classification-with-keras-and-glove-word-embeddings/model_summary/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/model_summary.jpg?fit=819%2C815&amp;ssl=1" data-orig-size="819,815" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1640970904&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="model_summary" data-image-description="" data-image-caption="&lt;p&gt;Model Summary&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/model_summary.jpg?fit=300%2C300&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/model_summary.jpg?fit=800%2C796&amp;ssl=1" class="size-full wp-image-856" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/model_summary.jpg?resize=800%2C796&#038;ssl=1" alt="" width="800" height="796" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/model_summary.jpg?w=819&amp;ssl=1 819w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/model_summary.jpg?resize=150%2C150&amp;ssl=1 150w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/model_summary.jpg?resize=300%2C300&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/model_summary.jpg?resize=768%2C764&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/model_summary.jpg?resize=480%2C478&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /><figcaption id="caption-attachment-856" class="wp-caption-text">Model Summary</figcaption></figure>
</div>
<p>The model is represented by the embedding layer followed by convolutional layers, pooling layers, and dropout layers. The final layer is the dense layer with the output size of labels/category count.</p>
<p class="graf graf--p"><strong class="markup--strong markup--p-strong">Dropout</strong> randomly drops neurons during training to prevent over-fitting in neural networks. It is a regularization approach that reduces interdependent learning among the neurons. In machine learning, regularization prevents over-fitting by adding a penalty to the loss function.</p>
<p class="graf graf--p"><strong class="markup--strong markup--p-strong">Batch normalization</strong> is another method to regularize a convolutional network.</p>
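<p>The dropout described above is usually implemented as "inverted dropout", the formulation Keras applies at training time. The sketch below illustrates the idea in NumPy: each unit is zeroed with probability <strong>rate</strong>, and the survivors are scaled by <strong>1 / (1 - rate)</strong> so the expected activation stays unchanged and no rescaling is needed at inference:</p>

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed so the example is repeatable

def dropout_sketch(x, rate):
    """Inverted dropout: zero units with probability `rate`,
    scale the survivors by 1/(1 - rate)."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

x = np.ones((4, 5))
y = dropout_sketch(x, rate=0.2)
# Every surviving entry is exactly 1/(1 - 0.2) = 1.25; dropped entries are 0.
```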
<h2>f1-score, precision, and recall</h2>
<div>
<div>
<blockquote>
<div><em><strong>def recall_m(y_true, y_pred):</strong></em></div>
<div><em><strong>    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))</strong></em></div>
<div><em><strong>    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))</strong></em></div>
<div><em><strong>    recall = true_positives / (possible_positives + K.epsilon())</strong></em></div>
<div><em><strong>    return recall</strong></em></div>
<div><em><strong>def precision_m(y_true, y_pred):</strong></em></div>
<div><em><strong>    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))</strong></em></div>
<div><em><strong>    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))</strong></em></div>
<div><em><strong>    precision = true_positives / (predicted_positives + K.epsilon())</strong></em></div>
<div><em><strong>    return precision</strong></em></div>
<div><em><strong>def f1_m(y_true, y_pred):</strong></em></div>
<div><em><strong>    precision = precision_m(y_true, y_pred)</strong></em></div>
<div><em><strong>    recall = recall_m(y_true, y_pred)</strong></em></div>
<div><em><strong>    return 2*((precision*recall)/(precision+recall+K.epsilon()))</strong></em></div>
</blockquote>
<div>These metrics evaluate the trained model; the f1-score is the harmonic mean of precision and recall.</div>
<div></div>
</div>
<div>Precision is calculated as the number of true positives divided by the total number of true positives and false positives.</div>
</div>
</div>
<div></div>
<div>Recall is calculated as the number of true positives divided by the total number of true positives and false negatives.</div>
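<p>With invented counts for a single class, the precision, recall, and f1 definitions above work out as follows:</p>

```python
# Invented counts for one class, for illustration only.
tp, fp, fn = 8, 2, 4           # true positives, false positives, false negatives

precision = tp / (tp + fp)     # 8 / 10 = 0.8
recall = tp / (tp + fn)        # 8 / 12 = 0.666...
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```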
<h2>Model training and evaluation</h2>
<div>
<blockquote>
<div><em><strong>def training_evaluating_model(model, x_train, y_train, x_test, y_test, x_val, y_val):</strong></em></div>
<div><em><strong>    &#8220;&#8221;&#8221; training the model with the train and validation data</strong></em></div>
<div><em><strong>    and evaluating the model with the test data &#8220;&#8221;&#8221;</strong></em></div>
<div><em><strong>    model.compile(loss=&#8217;categorical_crossentropy&#8217;,</strong></em></div>
<div><em><strong>                  optimizer=&#8217;rmsprop&#8217;,</strong></em></div>
<div><em><strong>                  metrics=[&#8216;acc&#8217;, f1_m, precision_m, recall_m])</strong></em></div>
<div><em><strong>    # Displays the network structure</strong></em></div>
<div><em><strong>    model.summary()</strong></em></div>
<div><em><strong>    # fitting the model</strong></em></div>
<div><em><strong>    model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=epochs, batch_size=batch_size)</strong></em></div>
<div><em><strong>    &#8220;&#8221;&#8221;</strong></em></div>
<div><em><strong>    model.save_weights(home_path + &#8216;model_trained&#8217;) # Saving the model</strong></em></div>
<div><em><strong>    &#8220;&#8221;&#8221;</strong></em></div>
<div><em><strong>    # evaluating the model</strong></em></div>
<div><em><strong>    loss, accuracy, f1_score, precision, recall = model.evaluate(x_test, y_test, verbose=0)</strong></em></div>
<div><em><strong>    return loss, accuracy, f1_score, precision, recall</strong></em></div>
</blockquote>
<div>
<figure id="attachment_859" aria-describedby="caption-attachment-859" style="width: 1598px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="859" data-permalink="https://turbolab.in/text-classification-with-keras-and-glove-word-embeddings/evaluation/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?fit=1598%2C607&amp;ssl=1" data-orig-size="1598,607" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1640972511&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="evaluation" data-image-description="" data-image-caption="&lt;p&gt;Model Evaluation&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?fit=300%2C114&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?fit=800%2C304&amp;ssl=1" class="size-full wp-image-859" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?resize=800%2C304&#038;ssl=1" alt="" width="800" height="304" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?w=1598&amp;ssl=1 1598w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?resize=300%2C114&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?resize=768%2C292&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?resize=1024%2C389&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?resize=1080%2C410&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?resize=1280%2C486&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?resize=980%2C372&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/evaluation.jpg?resize=480%2C182&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /><figcaption id="caption-attachment-859" class="wp-caption-text">Model Evaluation</figcaption></figure>
<p>The model&#8217;s training accuracy is around 99.33% and its validation accuracy around 90.8%. The validation loss is higher than the training loss, which accounts for the lower validation accuracy. We trained on a sample of only 10,000 rows; training on the complete dataset with more epochs would likely give much better results.</p>
<p class="graf graf--p">The complete code discussed above can be found <em><strong><a class="markup--anchor markup--p-anchor" href="https://github.com/Vasistareddy/text_classification_DL_vs_ML/blob/master/model_training_tutorial_dl.py" target="_blank" rel="noopener">here</a></strong></em>.</p>
</div>
</div>
<p>The post <a href="https://turbolab.in/text-classification-with-keras-and-glove-word-embeddings/">Text Classification with Keras and GloVe Word Embeddings</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/text-classification-with-keras-and-glove-word-embeddings/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">603</post-id>	</item>
		<item>
		<title>How to monitor work-flow of scraping project with Apache-Airflow</title>
		<link>https://turbolab.in/how-to-monitor-work-flow-of-scraping-project-with-apache-airflow/</link>
					<comments>https://turbolab.in/how-to-monitor-work-flow-of-scraping-project-with-apache-airflow/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Wed, 22 Dec 2021 08:16:05 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[airflow]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[monitor]]></category>
		<category><![CDATA[scraping]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=823</guid>

					<description><![CDATA[<p>Apache Airflow is a platform to programmatically monitor workflows, schedule, and authorize projects. In this blog, we will discuss handling the workflow of scraping yelp.com with Apache Airflow. Quick setup of Airflow on ubuntu 20.04 LTS # make sure your system is up-to-date sudo apt update sudo apt upgrade # install airflow dependencies  sudo apt-get install libmysqlclient-dev [&#8230;]</p>
<p>The post <a href="https://turbolab.in/how-to-monitor-work-flow-of-scraping-project-with-apache-airflow/">How to monitor work-flow of scraping project with Apache-Airflow</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p class="graf graf--p">Apache Airflow is a platform to programmatically monitor workflows, schedule, and authorize projects.</p>
<p class="graf graf--p">In this blog, we will discuss handling the workflow of scraping <strong><a class="markup--anchor markup--p-anchor" href="https://www.yelp.com/" target="_blank" rel="noopener">yelp.com</a></strong> with Apache Airflow.</p>
<h2 class="graf graf--h3">Quick setup of Airflow on ubuntu 20.04 LTS</h2>
<p># make sure your system is up-to-date</p>
<blockquote>
<pre class=" prettyprinted"><span class="pln">sudo apt update
sudo apt upgrade</span></pre>
</blockquote>
<p><em># install airflow dependencies </em></p>
<blockquote>
<pre class=" prettyprinted"><span class="pln">sudo apt-get install libmysqlclient-dev
sudo apt-get install libssl-dev
sudo apt-get install libkrb5-dev</span></pre>
</blockquote>
<p class="graf graf--h3"><em># create the virtual env and install the airflow using pip</em></p>
<blockquote>
<pre class=" prettyprinted"><span class="pln">sudo apt install python3</span><span class="pun">-</span><span class="pln">virtualenv
virtualenv airflow_test
cd airflow_test</span><span class="pun">/</span><span class="pln">
source bin/activate
</span><span class="kwd">export</span><span class="pln"> AIRFLOW_HOME</span><span class="pun">=~/</span><span class="pln">airflow # set Airflow home
pip3 install apache</span><span class="pun">-</span><span class="pln">airflow
pip3 install typing_extensions
airflow db init # initialize the db</span></pre>
</blockquote>
<p class="graf graf--p">The <strong class="markup--strong markup--p-strong">db, unittests, logs, and configuration (cfg)</strong> files will be generated inside <strong class="markup--strong markup--p-strong">AIRFLOW_HOME</strong>.</p>
<p class="graf graf--h4"># <em>Start a WebServer &amp; Scheduler</em></p>
<blockquote>
<pre class="graf graf--pre"><em>airflow webserver -p 8080 # start the webserver</em></pre>
<pre class="graf graf--pre"><em>airflow scheduler # start the scheduler
</em></pre>
</blockquote>
<p class="graf graf--p">By default the webserver listens on localhost. If you wish to bind a different host, run the command like this:</p>
<blockquote>
<p>airflow webserver -H xxx.xxx.xxx.xxx -p 9005</p>
</blockquote>
<p class="graf graf--p">Check the quick installation guide <a href="https://airflow.apache.org/docs/apache-airflow/1.10.12/start.html"><strong>here</strong></a>.</p>
<p class="graf graf--p">If everything goes well, we can see the apache airflow web interface</p>
<pre class="graf graf--pre"><em><strong><a class="markup--anchor markup--pre-anchor" href="http://localhost:8080/admin/" target="_blank" rel="nofollow noopener">http://localhost:8080/admin/</a> # web-server</strong></em></pre>
<figure class="graf graf--figure graf--layoutOutsetCenter">
<figure style="width: 1500px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2AOgwTYIIY1G4QghIE9TPl_Q.png?resize=800%2C411&#038;ssl=1" alt="" width="800" height="411" /><figcaption class="wp-caption-text">Airflow WebServer</figcaption></figure>
<figcaption class="imageCaption"></figcaption>
</figure>
<p class="graf graf--p">Everything in Airflow runs as <strong class="markup--strong markup--p-strong">DAGs</strong> (Directed Acyclic Graphs). We create a DAG with a unique dag_id and nest tasks under it. Simply put, a DAG is the collection of tasks we want to run. Parameters like <strong>schedule_interval</strong>, <strong>start_date</strong>, <strong>owner</strong>, and others can also be passed to the DAG object.</p>
<p class="graf graf--p">Create a folder named <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">dags</strong></code> inside <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">AIRFLOW_HOME</strong></code>; the Scheduler checks for new DAGs every 300 seconds, and any new DAGs it finds show up in the web-server UI.</p>
<figure id="attachment_832" aria-describedby="caption-attachment-832" style="width: 1077px" class="wp-caption alignnone"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="832" data-permalink="https://turbolab.in/how-to-monitor-work-flow-of-scraping-project-with-apache-airflow/screenshot-from-2021-12-22-14-12-05/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?fit=1077%2C194&amp;ssl=1" data-orig-size="1077,194" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot from 2021-12-22 14-12-05" data-image-description="" data-image-caption="&lt;p&gt;Airflow Scheduler&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?fit=300%2C54&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?fit=800%2C144&amp;ssl=1" class="size-full wp-image-832" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?resize=800%2C144&#038;ssl=1" alt="" width="800" height="144" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?w=1077&amp;ssl=1 1077w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?resize=300%2C54&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?resize=768%2C138&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?resize=1024%2C184&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?resize=980%2C177&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?resize=480%2C86&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /><figcaption id="caption-attachment-832" class="wp-caption-text">Airflow Scheduler</figcaption></figure>
<figure class="graf graf--figure">
<figcaption class="imageCaption"></figcaption>
</figure>
<p class="graf graf--p">We are going to create a workflow to scrape <strong class="markup--strong markup--p-strong">yelp.com</strong> for business listings &amp; save the data to MongoDB.</p>
<p class="graf graf--p">The code used in this tutorial to scrape <strong class="markup--strong markup--p-strong">yelp.com</strong> can be found <a href="https://gist.githubusercontent.com/Vasistareddy/26a37b841e93756ab3256022e6daa09d/raw/a75b6b277ed64c953e09094e60e5f18d1789573a/yelp_search.py"><em><strong>here</strong></em></a>.</p>
<h2 class="graf graf--h3">Creation of DAG</h2>
<blockquote>
<pre class="graf graf--pre"><em>from airflow import DAG
from datetime import datetime</em></pre>
<pre class="graf graf--pre"><em># dag creation
default_args = {'owner': 'turbolab', 'start_date': datetime(2019, 1, 1), 'depends_on_past': False}
_yelp_workflow = DAG('_yelp_workflow', catchup=False, schedule_interval=None, default_args=default_args) # creating a DAG</em></pre>
</blockquote>
<figure class="graf graf--figure graf--layoutOutsetCenter">
<figure style="width: 1500px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2AoS8FS0p8EM1o3XYfnAxrNA.png?resize=800%2C218&#038;ssl=1" alt="" width="800" height="218" /><figcaption class="wp-caption-text">DAG Created</figcaption></figure>
<figcaption class="imageCaption"></figcaption>
</figure>
<p class="graf graf--p"><code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">_yelp_workflow</strong></code> DAG is created. <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">schedule_interval=None</strong></code> means the DAG is triggered manually. Other options are <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">@daily, @weekly,</strong></code> or a cron schedule such as <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">&#8220;* * * */2 1&#8221;</strong></code>. Read about <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">catchup</strong></code> and <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">depends_on_past</strong></code> in the Airflow documentation <a class="markup--anchor markup--p-anchor" href="https://airflow.apache.org/scheduler.html" target="_blank" rel="noopener"><strong class="markup--strong markup--p-strong">here</strong></a><strong class="markup--strong markup--p-strong">.</strong></p>
<h2 class="graf graf--h3">Task Creation</h2>
<p class="graf graf--p">With the Airflow set of <a class="markup--anchor markup--p-anchor" href="https://airflow.apache.org/concepts.html#operators" target="_blank" rel="noopener"><strong class="markup--strong markup--p-strong">operators</strong></a>, we can define the tasks of the DAG workflow. An operator describes a single task in a workflow: while DAGs describe how to run a workflow, <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">Operators</strong></code> determine what actually gets done. To call a Python function, use a <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">PythonOperator</strong></code>; to send an email, an <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">EmailOperator</strong></code>; to run a Bash command, a <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">BashOperator</strong></code>; to execute a SQL instruction, a <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">MySqlOperator</strong></code>; and so on.</p>
<p class="graf graf--p">Generally, operators run independently with no sharing of information in the order specified. If it absolutely can’t be avoided, Airflow does have a feature for operator cross-communication called <strong class="markup--strong markup--p-strong"><em class="markup--em markup--p-em">XCom.</em></strong></p>
<blockquote>
<pre class="graf graf--pre"><em>from airflow.models import Variable

def url_generator(**kwargs):
    """
    generate the yelp url to find the business listings with place and search_query
    {'place': 'Location | Address | zip code'}
    {'search_query': "Restaurants | Breakfast &amp; Brunch | Coffee &amp; Tea | Delivery | Reservations"}
    """
    place = Variable.get("place")
    search_query = Variable.get("search_query")
    yelp_url = "https://www.yelp.com/search?find_desc={0}&amp;find_loc={1}".format(search_query, place)
    return yelp_url</em></pre>
<pre class="graf graf--pre"><em><strong class="markup--strong markup--pre-strong">"""defining a task"""
yelp_url_generator = PythonOperator(
    task_id='url_generator',
    python_callable=url_generator,
    provide_context=True,
    dag=_yelp_workflow)</strong></em></pre>
</blockquote>
<p class="graf graf--p">Likewise, six tasks were created, and concepts like <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong"><em class="markup--em markup--p-em">variables</em></strong></code> and <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong"><em class="markup--em markup--p-em">xcom</em></strong></code> are used to share data among the tasks.</p>
<h2 class="graf graf--h3">Concept of xcom</h2>
<blockquote>
<pre class="graf graf--pre"><em>import requests

def get_response(**kwargs):
    """
    validate the url and forward the response
    """
    ti = kwargs['ti']
    url = ti.xcom_pull(task_ids='url_generator')
    print('url generated: ', url)
    headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chrome/70.0.3538.77 Safari/537.36'}
    success = False

    for retry in range(10):
        response = requests.get(url, verify=False, headers=headers)
        if response.status_code == 200:
            success = True
            break
        print("Response received: %s. Retrying : %s" % (response.status_code, url))

    if not success:
        print("Failed to process the URL: ", url)
        raise ValueError("Failed to process the URL: %s" % url)
    return response</em></pre>
<pre class="graf graf--pre"><em><strong class="markup--strong markup--pre-strong">response_generator = PythonOperator(
    task_id='response_generator',
    python_callable=get_response,
    provide_context=True,
    dag=_yelp_workflow)</strong></em></pre>
</blockquote>
<p class="graf graf--p">The <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">yelp_url</strong></code> returned by the <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">url_generator</strong></code> task has to be passed to the <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">response_generator</strong></code> task, where we check the response of the URL. If the status_code of the response is <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">200</strong></code>, we return the response; otherwise we raise a <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">ValueError</strong></code> to stop the pipeline.</p>
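<p class="graf graf--p">The retry loop above re-requests immediately after each failure. A common refinement (not part of the original DAG; the helper and names below are made up for illustration) is to wait with exponential backoff between attempts. A minimal, dependency-free sketch:</p>

```python
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Call fetch() until it returns a result, sleeping base_delay * 2**attempt between tries."""
    for attempt in range(max_retries):
        result = fetch()
        if result is not None:  # treat None as a failed attempt
            return result
        time.sleep(base_delay * (2 ** attempt))
    raise ValueError("All %d attempts failed" % max_retries)

# Hypothetical fetcher that fails twice before succeeding
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    return "response-body" if calls["n"] >= 3 else None

print(fetch_with_backoff(flaky, base_delay=0.01))  # -> response-body
```

<p class="graf graf--p">Inside <code class="markup--code markup--p-code">get_response</code>, the same idea would amount to a <code class="markup--code markup--p-code">time.sleep</code> call before the next iteration of the retry loop.</p>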
<p class="graf graf--p"><strong class="markup--strong markup--p-strong">xcom</strong>’s can be viewed at the admin page after the successful task runs.</p>
<figure class="graf graf--figure"><img data-recalc-dims="1" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1000/1%2ADFuCAJ27zH6GGOE0_vrlcg.gif?w=800&#038;ssl=1" /></figure>
<h2 class="graf graf--h3">Concept of variable</h2>
<p class="graf graf--p">This concept is used when the user has to input values (like command-line arguments in Python) to the tasks created.</p>
<blockquote>
<pre class="graf graf--pre"><em><strong class="markup--strong markup--pre-strong">place = Variable.get("place")
search_query = Variable.get("search_query")</strong></em></pre>
</blockquote>
<p class="graf graf--p">These variables <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">place</strong></code> and <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">search_query</strong></code> are used in the <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">url_generator</strong></code> python function of <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">yelp_url_generator</strong></code> task.</p>
<figure class="graf graf--figure">
<figure style="width: 1178px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1000/1%2ARTGpoXRfHbC1VD5kvUsH2A.gif?resize=800%2C535&#038;ssl=1" alt="" width="800" height="535" /><figcaption class="wp-caption-text">Variables Creation</figcaption></figure>
<figcaption class="imageCaption"></figcaption>
</figure>
<h2 class="graf graf--h3">Tasks Relationship/Arrangement</h2>
<p class="graf graf--p">The DAG will make sure that operators run in the correct order. Check <a class="markup--anchor markup--p-anchor" href="https://airflow.apache.org/concepts.html#dag-assignment" target="_blank" rel="noopener"><strong class="markup--strong markup--p-strong">here</strong></a><strong class="markup--strong markup--p-strong">.</strong></p>
<figure class="graf graf--figure graf--layoutOutsetCenter"><img data-recalc-dims="1" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2A8DGPBXdRHVFeiXhk0LbG1Q.gif?w=800&#038;ssl=1" /></figure>
<blockquote>
<pre class="graf graf--pre"><strong class="markup--strong markup--pre-strong">end_task &lt;&lt; validate_db &lt;&lt; writing_to_db &lt;&lt; validate_data &lt;&lt; get_data &lt;&lt; response_generator &lt;&lt; yelp_url_generator &lt;&lt; start_task</strong></pre>
</blockquote>
<p class="graf graf--p">This is the Airflow upstream arrangement of the tasks. <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">start_task</strong></code> and <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">end_task</strong></code> are dummy tasks (<strong class="markup--strong markup--p-strong"><em class="markup--em markup--p-em">optional</em></strong>). The others, <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">yelp_url_generator →response_generator →get_data →validate_data →writing_to_db →validate_db</strong></code>, are Python tasks.</p>
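<p class="graf graf--p">This chaining works because Airflow tasks overload Python&#8217;s bitshift operators. The toy class below only illustrates the idea (it is a stand-in, not Airflow&#8217;s actual implementation):</p>

```python
class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = []  # tasks that must run before this one

    def __lshift__(self, other):
        # "self << other" records other as upstream of self
        self.upstream.append(other)
        return other  # returning other is what lets chains like a << b << c work

a, b, c = Task("end"), Task("middle"), Task("start")
a << b << c  # same shape as end_task << ... << start_task

print([t.task_id for t in a.upstream])  # -> ['middle']
print([t.task_id for t in b.upstream])  # -> ['start']
```

<p class="graf graf--p">Airflow&#8217;s real operators return the right-hand operand from <code class="markup--code markup--p-code">&lt;&lt;</code> in the same way, which is why one long chained line can express the whole pipeline order.</p>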
<p class="graf graf--p"><a class="markup--anchor markup--p-anchor" href="https://gist.githubusercontent.com/Vasistareddy/f0b5f7d73efc900f269e0aa81d04e81b/raw/8cc88cd5bd31fe219368b522b3dea3945e21caf4/yelp_business_listings.py" target="_blank" rel="noopener"><strong class="markup--strong markup--p-strong">Check the complete code here</strong></a><strong class="markup--strong markup--p-strong"> </strong></p>
<h2 class="graf graf--h3">Triggering the DAG</h2>
<p class="graf graf--p">Since we kept <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">schedule_interval=None,</strong></code> we have to manually trigger the DAG. Let’s see how to do that →</p>
<figure class="graf graf--figure graf--layoutOutsetCenter"><img data-recalc-dims="1" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2AsUBKsJsFllw4_xwpfA2EMg.gif?w=800&#038;ssl=1" /></figure>
<figure class="graf graf--figure graf--layoutOutsetCenter">
<figure style="width: 1500px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2ADsa6AxsvP_YrNdarurz1rA.png?resize=800%2C219&#038;ssl=1" alt="" width="800" height="219" /><figcaption class="wp-caption-text">MongoDB data</figcaption></figure>
<figure style="width: 1000px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1000/1%2A7L589dgoMSEiOumFTh5adQ.png?resize=800%2C204&#038;ssl=1" alt="" width="800" height="204" /><figcaption class="wp-caption-text">Tasks Successfully Completed</figcaption></figure>
</figure>
<h2>Tree View of each DAG run</h2>
<figure class="graf graf--figure graf--layoutOutsetCenter">
<figure style="width: 1500px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2AedTrxUAskSUHKaVRyMNNlg.png?resize=800%2C257&#038;ssl=1" alt="" width="800" height="257" /><figcaption class="wp-caption-text">Tree View of each DAG run</figcaption></figure>
</figure>
<h2 class="graf graf--h3">Handling Cases</h2>
<p class="graf graf--p">You must be wondering why we use this Airflow setup for simple scraping. The reasons are:</p>
<ol>
<li class="graf graf--p">We can break the whole job into multiple tasks and have control over each task at any point.</li>
<li class="graf graf--p">We get clear logs at every level.</li>
<li class="graf graf--p">We can easily connect to other servers with Airflow operators to execute the script.</li>
</ol>
<h3 class="graf graf--p">Here are a few cases handled in the workflow</h3>
<ul class="postList">
<li class="graf graf--li">When we are trying to write the same set of data into the Database with multiple DAG runs.</li>
</ul>
<figure class="graf graf--figure graf--layoutOutsetCenter">
<figure style="width: 1344px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2AjOghgSV1knE290aFknHQIQ.png?resize=800%2C209&#038;ssl=1" alt="" width="800" height="209" /><figcaption class="wp-caption-text">Duplicate Key Error</figcaption></figure>
<figcaption class="imageCaption"></figcaption>
</figure>
<p class="graf graf--p"><code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">task_id=writing_to_db</strong></code> will be handling this case.</p>
<ul class="postList">
<li class="graf graf--li">When the scraped data and the data pushed to the database don&#8217;t match.</li>
</ul>
<figure class="graf graf--figure graf--layoutOutsetCenter"><img data-recalc-dims="1" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2AcoOCTtnVRr_svwadeJIDPA.png?w=800&#038;ssl=1" /></figure>
<p class="graf graf--p"><code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">task_id=validate_db</strong></code> will be handling this case. In case an anomaly is detected, we raise a <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">ValueError</strong></code>.</p>


<p></p>
<p>The post <a href="https://turbolab.in/how-to-monitor-work-flow-of-scraping-project-with-apache-airflow/">How to monitor work-flow of scraping project with Apache-Airflow</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/how-to-monitor-work-flow-of-scraping-project-with-apache-airflow/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">823</post-id>	</item>
		<item>
		<title>Text Similarity using fastText Word Embeddings in Python</title>
		<link>https://turbolab.in/text-similarity-using-fasttext-word-embeddings-in-python/</link>
					<comments>https://turbolab.in/text-similarity-using-fasttext-word-embeddings-in-python/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Thu, 09 Dec 2021 09:41:31 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[fasttext]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[word2vec]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=803</guid>

					<description><![CDATA[<p>Text Similarity is one of the essential techniques of NLP which is used to find similarities between two chunks of text. In order to perform text similarity, word embedding techniques are used to convert chunks of text to certain dimension vectors. We also perform some mathematical operations on these vectors to find the similarity between [&#8230;]</p>
<p>The post <a href="https://turbolab.in/text-similarity-using-fasttext-word-embeddings-in-python/">Text Similarity using fastText Word Embeddings in Python</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Text Similarity is one of the essential techniques of NLP which is used to find similarities between two chunks of text. In order to perform text similarity, word embedding techniques are used to convert chunks of text to certain dimension vectors. We also perform some mathematical operations on these vectors to find the similarity between the text chunks. Recommendation System, Text Summarization, Information Retrieval, and Text Categorization are some of the main applications of text similarity.</p>
<p>In this tutorial, we will discuss how sentence similarity can be achieved with the fastText module and also the use-case of generating related news articles.</p>
<h3>Dataset</h3>
<p>Here, we have a science and technology news dataset with a sample of 6188 titles.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="809" data-permalink="https://turbolab.in/text-similarity-using-fasttext-word-embeddings-in-python/dataset-2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/dataset.jpg?fit=478%2C449&amp;ssl=1" data-orig-size="478,449" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1639053610&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="dataset" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/dataset.jpg?fit=300%2C282&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/dataset.jpg?fit=478%2C449&amp;ssl=1" class="size-full wp-image-809 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/dataset.jpg?resize=478%2C449&#038;ssl=1" alt="" width="478" height="449" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/dataset.jpg?w=478&amp;ssl=1 478w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/dataset.jpg?resize=300%2C282&amp;ssl=1 300w" sizes="(max-width: 478px) 100vw, 478px" /></p>
<h3>Problem Statement</h3>
<p>From the above dataset, we are going to pick one article title i.e.,</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>ROI = &#8220;Samsung to spend whopping $22B on artificial intelligence, cars&#8221;</strong></em></p></blockquote>
<p>Henceforth, we are going to call it <strong>ROI </strong>in the tutorial. We will be fetching related articles of <strong>ROI</strong> from the dataset using a fastText sentence vector.</p>
<blockquote><p><em><strong>Out:</strong></em></p>
<p>&nbsp;</p></blockquote>
<h3>Generate sentence vectors</h3>
<ol>
<li>Import the fastText module and load the model (300-dimension vectors).<br />
<blockquote><p><em><strong>import fasttext</strong></em><br />
<em><strong>modelPath = "D://" # user-defined path</strong></em><br />
<em><strong>ft = fasttext.load_model(modelPath + 'cc.en.300.bin')</strong></em></p></blockquote>
</li>
<li>Generate a sentence vector for the <em><strong>ROI</strong></em> and call it <em><strong>vector1</strong></em>.<br />
<blockquote><p><em><strong>def generateVector(sentence):</strong></em><br />
<em><strong>    return ft.get_sentence_vector(sentence)<br />
</strong></em><br />
<strong><strong>vector1 = generateVector('Samsung to spend whopping $22B on artificial intelligence, cars')</strong></strong></p>
<p>Out:</p>
<pre><strong>array([-6.36741472e-03,  1.08614033e-02,  9.33997519e-03, -2.33159624e-02,
       -9.58340534e-04,  1.86185073e-02,  2.20048483e-02, -2.02285256e-02,
       -1.13004427e-02, -1.38842128e-02, -6.33053621e-03,  1.18326535e-02,
       -2.36112420e-02,  9.13483184e-03,  5.59101533e-03,  1.09400013e-02,
        4.77387244e-03, -1.54347951e-02, -1.35055669e-02, -2.90185958e-02,
        1.35819204e-02,  2.80883280e-03,  3.43523137e-02, -2.22271457e-02,
</strong></pre>
<p><strong>        &#8230;&#8230;&#8230;&#8230;..</strong></p></blockquote>
</li>
<li>Generate sentence vectors for the entire dataset.<br />
<blockquote><p><em><strong>df["vector"] = df["title"].apply(generateVector)</strong></em></p>
<p>Out:</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="810" data-permalink="https://turbolab.in/text-similarity-using-fasttext-word-embeddings-in-python/vector/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/vector.jpg?fit=793%2C440&amp;ssl=1" data-orig-size="793,440" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="vector" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/vector.jpg?fit=300%2C166&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/vector.jpg?fit=793%2C440&amp;ssl=1" class="alignnone size-full wp-image-810" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/vector.jpg?resize=793%2C440&#038;ssl=1" alt="" width="793" height="440" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/vector.jpg?w=793&amp;ssl=1 793w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/vector.jpg?resize=300%2C166&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/vector.jpg?resize=768%2C426&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/vector.jpg?resize=480%2C266&amp;ssl=1 480w" sizes="(max-width: 793px) 100vw, 793px" /></p></blockquote>
</li>
</ol>
<h3>Calculate Spatial Distance</h3>
<p>Calculate the spatial distance between the <em><strong>ROI</strong></em> and the rest of the dataframe titles to determine the related articles of <em><strong>ROI</strong></em>. The smaller the distance, the more closely related the content.</p>
<blockquote><p><em><strong>from scipy import spatial</strong></em></p>
<p><em><strong>def spatialDistance(vector1, vector2):</strong></em><br />
<em><strong>    return spatial.distance.euclidean(vector1, vector2)</strong></em></p></blockquote>
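<p>scipy&#8217;s euclidean distance is just the square root of the summed squared element-wise differences. A dependency-free sketch, with short toy vectors standing in for the 300-dimension fastText embeddings:</p>

```python
import math

def euclidean(v1, v2):
    # sqrt of the sum of squared element-wise differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

print(euclidean([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # -> 0.0 (identical vectors)
print(euclidean([0.0, 0.0], [3.0, 4.0]))            # -> 5.0
```

<p>Identical titles therefore score 0.0, and the score grows as the sentence vectors drift apart.</p>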
<p><strong>vector1</strong> is the vector of <em><strong>ROI</strong></em> generated above. <strong>vector2</strong> is the vector column of each title of the dataframe.</p>
<p>Generating distance as a <strong>score</strong> between the static <em><strong>ROI</strong></em> and the rest of the dataframe titles.</p>
<blockquote><p><em><strong>df["score"] = df.apply(lambda x: spatialDistance(vector1, x['vector']), axis=1)</strong></em></p></blockquote>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="811" data-permalink="https://turbolab.in/text-similarity-using-fasttext-word-embeddings-in-python/score/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/score.jpg?fit=907%2C207&amp;ssl=1" data-orig-size="907,207" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1639059705&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="score" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/score.jpg?fit=300%2C68&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/score.jpg?fit=800%2C183&amp;ssl=1" class="alignnone size-full wp-image-811" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/score.jpg?resize=800%2C183&#038;ssl=1" alt="" width="800" height="183" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/score.jpg?w=907&amp;ssl=1 907w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/score.jpg?resize=300%2C68&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/score.jpg?resize=768%2C175&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/score.jpg?resize=480%2C110&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p>Sorting the <strong>score</strong> column of the dataframe to determine the closest related titles.</p>
<blockquote><p><em><strong>df.drop_duplicates(subset=["score"]).sort_values(by=['score'])</strong></em></p></blockquote>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="812" data-permalink="https://turbolab.in/text-similarity-using-fasttext-word-embeddings-in-python/sort/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/sort.jpg?fit=889%2C199&amp;ssl=1" data-orig-size="889,199" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1639060052&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="sort" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/sort.jpg?fit=300%2C67&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/sort.jpg?fit=800%2C179&amp;ssl=1" class="alignnone size-full wp-image-812" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/sort.jpg?resize=800%2C179&#038;ssl=1" alt="" width="800" height="179" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/sort.jpg?w=889&amp;ssl=1 889w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/sort.jpg?resize=300%2C67&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/sort.jpg?resize=768%2C172&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/sort.jpg?resize=480%2C107&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p>From the dataset, the top 10 article titles related to the <em><strong>ROI</strong></em> are:</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>outputs = df.drop_duplicates(subset=["score"]).sort_values(by=['score'])[0:10]["title"].tolist()</strong></em></p>
<p>&nbsp;</p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>[&#8216;OnePlus phones go cheaper on Amazon up to Rs 10,000, lots of EMI and exchange offers on latest models&#8217;,</strong></em><br />
<em><strong>&#8216;Samsung to invest nearly $500 mn to set up display factory in India&#8217;,</strong></em><br />
<em><strong>&#8216;Samsung Galaxy S20+ gets listed on Geekbench, revealed to bring 120Hz display, 8K video and more&#8217;,</strong></em><br />
<em><strong>&#8216;Worldwide spend on robotics systems, drones to hit $128.7 billion in 2020&#8217;,</strong></em><br />
<em><strong>&#8216;This is the pitch deck that the CEO of AI startup Directly used to convince its top customers Microsoft and Samsung to invest in a $20 million round&#8217;,</strong></em><br />
<em><strong>&#8216;Dell is working on a software to let users control iPhones from their laptops&#8217;,</strong></em><br />
<em><strong>&#8220;Here&#8217;s an exclusive look at the pitch deck AI privacy startup Mine used to raise $3 million to help people ask companies to delete their data&#8221;,</strong></em><br />
<em><strong>&#8216;Samsung offering instant cashback of up to Rs 20,000 on Galaxy S10 series&#8217;,</strong></em><br />
<em><strong>&#8216;Google exec reveals how its cloud is helping retailers to keep their sites from crashing on their biggest shopping days of the year&#8217;,</strong></em><br />
<em><strong>&#8220;Here&#8217;s the pitch deck that email startup Front used to get get top tech execs like Zoom CEO Eric Yuan to invest in its $59 million Series C round&#8221;]</strong></em></p></blockquote>
<h3>Another Example:</h3>
<p>If the <em><strong>ROI</strong></em> is <strong>&#8220;</strong><em><strong>SpaceX launches third batch of 60 Starlink mini satellites&#8221;,</strong></em></p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>SpaceX launches third batch of 60 Starlink mini satellites</strong></em></p>
<p><em><strong>Out:</strong></em><br />
<em><strong>[&#8216;SpaceX launches third batch of Starlink satellites&#8217;,</strong></em><br />
<em><strong>&#8220;ISRO&#8217;s GSAT 30 satellite successfully rides the Ariane 5 rocket into orbit abroad the first launch of 2020&#8221;,</strong></em><br />
<em><strong>&#8216;SpaceX launch LIVE stream: Watch Elon Musk blast next Starlink satellites into orbit today&#8217;,</strong></em><br />
<em><strong>&#8216;ISRO targets to launch 19 satellites within a period of 7 months&#8217;,</strong></em><br />
<em><strong>&#8216;Asteroid alert: NASA tracks four large space rocks racing towards Earth in next 48 hours&#8217;,</strong></em><br />
<em><strong>&#8216;Huawei launches Mate 30 Pro 5G outside of China for first time, enters UAE&#8217;,</strong></em><br />
<em><strong>&#8216;ISRO’s first mission of the decade on this date! Ariane rocket to launch GSAT-30 satellite&#8217;,</strong></em><br />
<em><strong>&#8216;Samsung teases launch of new Galaxy phone in 11 Feb event announcement&#8217;,</strong></em><br />
<em><strong>&#8216;SpaceX launch LIVE stream: Watch Elon Musk’s first launch of 2020 online HERE&#8217;,</strong></em><br />
<em><strong>&#8216;NASA news: Space agency outlines goals for 2020 including a launch to Mars&#8217;]</strong></em></p></blockquote>
<h3>Conclusion</h3>
<p>In this tutorial, we have discussed generating related content using fastText sentence embeddings and a mathematical operation called spatial distance. We can also try replacing the spatial distance with the cosine similarity between the vectors to find the related content. Pre-processing techniques like <a href="https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/">lemmatization</a>, stemming, and removal of stopwords can also be applied to the dataset before vector generation to improve the accuracy of the result. This specific use-case of generating related content can be extended into a recommendation system that considers the user&#8217;s interests.</p>
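<p>The cosine-similarity alternative mentioned above can be sketched without extra libraries; the short vectors here stand in for the 300-dimension fastText embeddings:</p>

```python
import math

def cosine_similarity(v1, v2):
    # dot(v1, v2) / (||v1|| * ||v2||); values near 1.0 mean similar direction
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # ≈ 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # -> 0.0 (orthogonal)
```

<p>Unlike euclidean distance, a <em>higher</em> cosine value means more related content, so the dataframe would be sorted in descending order of score.</p>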
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>The post <a href="https://turbolab.in/text-similarity-using-fasttext-word-embeddings-in-python/">Text Similarity using fastText Word Embeddings in Python</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/text-similarity-using-fasttext-word-embeddings-in-python/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">803</post-id>	</item>
		<item>
		<title>Data Cleaning using Regular Expression</title>
		<link>https://turbolab.in/data-cleaning-using-regular-expression/</link>
					<comments>https://turbolab.in/data-cleaning-using-regular-expression/#respond</comments>
		
		<dc:creator><![CDATA[Anthony]]></dc:creator>
		<pubDate>Tue, 30 Nov 2021 12:06:01 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[data cleaning]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[text cleaning]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=779</guid>

					<description><![CDATA[<p>Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. The data format is not always tabular. As we are entering the era of big data, the data comes in an extensively diverse format, including images, texts, graphs, and many more. Because the format is pretty [&#8230;]</p>
<p>The post <a href="https://turbolab.in/data-cleaning-using-regular-expression/">Data Cleaning using Regular Expression</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400">Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.</span></p>
<p><span style="font-weight: 400">Data is not always tabular. In the era of big data, it arrives in widely varied formats, including images, text, graphs, and more. Because the format varies so much from one dataset to another, it&#8217;s essential to preprocess the data into a form computers can read.</span></p>
<p><span style="font-weight: 400">In this blog, we will go over some Regex (Regular Expression) techniques that you can use in your data cleaning process.</span></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400">Regular Expression is a sequence of characters used to match strings of text such as particular characters, words, or patterns of characters.</span></p>
<p><span style="font-weight: 400">In Python, Regular Expressions (REs, regexes, or regex patterns) are provided by the &#8216;re&#8217; module, which is built into Python, so you don&#8217;t need to install it separately.</span></p>
<p><span style="font-weight: 400">The re module offers a set of functions that allow us to search a string for a match.</span></p>
<p><span style="font-weight: 400">The most commonly used methods provided by the &#8216;re&#8217; module are:</span></p>
<p>&nbsp;</p>
<ul>
<li><strong>re.match()</strong></li>
<li><strong>re.search()</strong></li>
<li><strong>re.findall()</strong></li>
<li><strong>re.split()</strong></li>
<li><strong>re.sub()</strong></li>
<li><strong>re.compile()</strong></li>
</ul>
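<p>As a quick, runnable sketch of what each of these functions does (the sample string below is our own illustration, not from the original post):</p>

```python
import re

text = "data cleaning 101: clean, transform, analyze"

# re.match: the pattern must match at the very start of the string
print(re.match(r"data", text))    # a match object
print(re.match(r"clean", text))   # None -- "clean" is not at the start

# re.search: first occurrence anywhere in the string
print(re.search(r"clean", text).group())  # 'clean'

# re.findall: every non-overlapping occurrence
print(re.findall(r"clean\w*", text))      # ['cleaning', 'clean']

# re.split: split the string wherever the pattern matches
print(re.split(r"[:,]\s*", text))  # ['data cleaning 101', 'clean', 'transform', 'analyze']

# re.sub: replace every match with a new substring
print(re.sub(r"\d+", "#", text))   # 'data cleaning #: clean, transform, analyze'

# re.compile: pre-build a pattern object for reuse
digits = re.compile(r"\d+")
print(digits.findall(text))        # ['101']
```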
<p>&nbsp;</p>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Replacing Multi-Spaces</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>Removing extra white spaces from data is an important step as it makes your data look well structured.</strong></p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "if       you hold an empty gatorade bottle up to your ear   you can hear      the sports"</span></i></p>
<p><i><span style="font-weight: 400">x = re.sub(r'\s+', ' ', tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: 'if you hold an empty gatorade bottle up to your ear you can hear the sports'</span></i></p>
<p>&nbsp;</p></blockquote>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Dealing with Special Characters</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>If you are working on an NLP project, you will need to clean your text thoroughly and strip out special characters that do not alter the meaning of the text. For instance:</strong></p>
<p>&nbsp;</p>
<h4><b>1.   Removing special characters and keeping only alphabets and numbers</b></h4>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "if you hold an empty # gatorade # bottle up to your ear @@ you can hear the sports 100 %%"</span></i></p>
<p><i><span style="font-weight: 400">x = re.sub(' +', ' ', re.sub('[^a-zA-Z0-9 ]+', ' ', tweet)).strip()</span></i></p>
<p><i><span style="font-weight: 400">Output: 'if you hold an empty gatorade bottle up to your ear you can hear the sports 100'</span></i></p></blockquote>
<p>&nbsp;</p>
<h4><b>2. Keeping only alphabets or only numbers</b></h4>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "if you hold an empty # gatorade # bottle up to your ear @@ you can hear the sports 100 %%"</span></i></p>
<p><i><span style="font-weight: 400">x = re.sub(' +', ' ', re.sub('[^a-zA-Z ]+', ' ', tweet)).strip()</span></i></p>
<p><i><span style="font-weight: 400">Output: 'if you hold an empty gatorade bottle up to your ear you can hear the sports'</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400">tweet = "if you hold an empty # gatorade # bottle up to your ear @@ you can hear the sports 100 %%"</span></i></p>
<p><i><span style="font-weight: 400">x = re.sub(' +', '', re.sub('[^0-9 ]+', '', tweet))</span></i></p>
<p><i><span style="font-weight: 400">Output: '100'</span></i></p></blockquote>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect and Remove URLs</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>Here we are using “re.compile” to generate a regex pattern and use that saved pattern later for substitution, if needed.</strong></p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = 'follow this website for more details www.knowmore.com and login to http://login.com'</span></i></p>
<p><i><span style="font-weight: 400">pattern = re.compile(r"https?://\S+|www\.\S+")</span></i></p>
<p><i><span style="font-weight: 400">x = re.findall(pattern, tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: ['www.knowmore.com', 'http://login.com']</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400"># remove urls</span></i></p>
<p><i><span style="font-weight: 400">z = re.sub(pattern, '', tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: z</span></i></p>
<p><i><span style="font-weight: 400">Output: 'follow this website for more details  and login to '</span></i></p></blockquote>
<p>&nbsp;</p>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect and Remove HTML Tags<br />
</b></h3>
</li>
</ul>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = '&lt;p&gt;follow this &lt;b&gt;website&lt;/b&gt; for more details. &lt;/p&gt;'</span></i></p>
<p><i><span style="font-weight: 400">x = re.findall('&lt;.*?&gt;', tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: ['&lt;p&gt;', '&lt;b&gt;', '&lt;/b&gt;', '&lt;/p&gt;']</span></i></p>
<p><i><span style="font-weight: 400"># remove html tags</span></i></p>
<p><i><span style="font-weight: 400">z = re.sub('&lt;.*?&gt;', '', tweet).strip()</span></i></p>
<p><i><span style="font-weight: 400">Input: z</span></i></p>
<p><i><span style="font-weight: 400">Output: 'follow this website for more details.'</span></i></p></blockquote>
<p>&nbsp;</p>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect and Remove Email IDs </b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>Here we&#8217;ll use &#8220;re.search&#8221; to find the e-mail ID. re.search() returns only the first match of the specified pattern, whereas re.findall() scans the entire string and returns all non-overlapping matches in a single step.</strong></p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "please send your feedback to myemail@gmail.com"</span></i></p>
<p><i><span style="font-weight: 400">x = re.search(r"[\w.-]+@[\w.-]+\.\w+", tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: &lt;re.Match object; span=(29, 46), match='myemail@gmail.com'&gt;</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400">tweet = "please send your feedback to myemail@gmail.com"</span></i></p>
<p><i><span style="font-weight: 400">z = re.sub(r"[\w.-]+@[\w.-]+\.\w+", '', tweet).strip()</span></i></p>
<p><i><span style="font-weight: 400">Output: 'please send your feedback to'</span></i></p></blockquote>
<p>&nbsp;</p>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect and Remove the Hashtag</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "love to explore. #nature #traveller"</span></i></p>
<p><i><span style="font-weight: 400">x = re.findall('#[_]*[a-z]+', tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: ['#nature', '#traveller']</span></i></p>
<p><i><span style="font-weight: 400"># remove hashtags</span></i></p>
<p><i><span style="font-weight: 400">z = re.sub('#[_]*[a-z]+', '', tweet).strip()</span></i></p>
<p><i><span style="font-weight: 400">Input: z</span></i></p>
<p><i><span style="font-weight: 400">Output: 'love to explore.'</span></i></p></blockquote>
<p>&nbsp;</p>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect Mentions using re.match() and re.findall()</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>Here we&#8217;ll use re.match and re.findall to detect mentions. </strong></p>
<p><strong>re.match matches the pattern from the start of the string whereas re.findall searches for occurrences of the pattern anywhere in the string.</strong></p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "@Bryan appointed as the new team captain"</span></i></p>
<p><i><span style="font-weight: 400">x = re.match(r"(@\w+)", tweet)</span></i></p>
<p><i><span style="font-weight: 400">Output: &lt;re.Match object; span=(0, 6), match='@Bryan'&gt;</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400">tweet = "@Bryan appointed as the new team captain announced in @SportsLive"</span></i></p>
<p><i><span style="font-weight: 400">x = re.findall(r"@\S+", tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: ['@Bryan', '@SportsLive']</span></i></p></blockquote>
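<p>The individual patterns above can be folded into one reusable cleaning helper. This is a hedged sketch of our own; the helper name, the pattern set, and the order of substitutions are our choices, not from the post:</p>

```python
import re

# patterns mirroring the sections above
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
HTML_PATTERN = re.compile(r"<.*?>")
EMAIL_PATTERN = re.compile(r"[\w.-]+@[\w.-]+\.\w+")
HASHTAG_PATTERN = re.compile(r"#\w+")
MENTION_PATTERN = re.compile(r"@\w+")

def clean_tweet(text):
    # substitute each pattern with a space, then collapse the leftovers
    for pattern in (URL_PATTERN, HTML_PATTERN, EMAIL_PATTERN,
                    HASHTAG_PATTERN, MENTION_PATTERN):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("<p>follow @Bryan at www.knowmore.com #sports</p>"))
# prints: follow at
```

Note that the order matters: URLs are removed before HTML tags and mentions so that a pattern never matches inside text another pattern should have consumed.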
<p>&nbsp;</p>
<h3><b>Conclusion</b></h3>
<p>&nbsp;</p>
<p><span style="font-weight: 400">Regular Expressions are very useful for text manipulation in the text-cleaning phase of Natural Language Processing (NLP). In this post, we have used the &#8220;re.findall&#8221;, &#8220;re.sub&#8221;, &#8220;re.search&#8221;, &#8220;re.match&#8221;, and &#8220;re.compile&#8221; functions, but there are many other functions in the re library that can help with data processing and manipulation. For a deeper understanding of Regular Expressions, we recommend Python&#8217;s official documentation on <a href="https://docs.python.org/3/library/re.html">regex</a>.</span></p>
<p>The post <a href="https://turbolab.in/data-cleaning-using-regular-expression/">Data Cleaning using Regular Expression</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/data-cleaning-using-regular-expression/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">779</post-id>	</item>
		<item>
		<title>Build a Custom NER model using spaCy 3.0</title>
		<link>https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/</link>
					<comments>https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Thu, 11 Nov 2021 12:45:37 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[customNER]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[spacy]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=727</guid>

					<description><![CDATA[<p>SpaCy is an open-source python library used for Natural Language Processing(NLP). Unlike NLTK, which is widely used in research, spaCy focuses on production usage. Industrial-strength NLP spaCy is a library for advanced NLP in Python and Cython. As of now, this is the best NLP tool available in the market. SpaCy provides ready-to-use language-specific pre-trained models to perform [&#8230;]</p>
<p>The post <a href="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/">Build a Custom NER model using spaCy 3.0</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>SpaCy is an open-source Python library for <em>Natural Language Processing (NLP)</em>. Unlike <em>NLTK</em>, which is widely used in research, spaCy focuses on production usage. Billed as industrial-strength <em>NLP</em>, <strong><em>spaCy</em></strong> is a library for advanced <em>NLP</em> in Python and Cython, and is currently among the most capable NLP tools available.</p>
<p>SpaCy provides ready-to-use language-specific pre-trained models to perform <em>parsing</em>, <em>tagging</em>, <em>NER</em>, <em><a href="https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/">lemmatizer</a></em>, <em>tok2vec</em>, <em>attribute_ruler</em>, and other NLP tasks. It supports 18 languages and 1 multi-language pipeline. Check the supported language list <strong><a href="https://spacy.io/usage/models#languages">here</a></strong>.</p>
<p><span style="font-weight: 400;">SpaCy provides the following four </span><a href="https://spacy.io/models/en"><b>pre-trained models</b></a><span style="font-weight: 400;"> with MIT license for the English language:</span></p>
<ol>
<li><em><strong>en_core_web_sm</strong></em>(12 mb)</li>
<li><em><strong>en_core_web_md</strong></em>(43 mb)</li>
<li><em><strong>en_core_web_lg</strong></em>(741 mb)</li>
<li><em><strong>en_core_web_trf</strong></em>(438 mb)</li>
</ol>
<p>Support for transformers and the pretrained transformer pipeline (<strong>en_core_web_trf</strong>) was introduced in spaCy 3.0.</p>
<p>Named Entity Recognition (NER) is the NLP task of recognizing entities in a given text. A NER model performs two sub-tasks: <strong>detect</strong> and <strong>categorize</strong>. It has to detect the entities (<strong>India</strong>, <strong>America</strong>, <strong>Abdul Kalam</strong>) in the text and categorize (<strong>LOCATION</strong>, <strong>LOCATION</strong>, <strong>PERSON</strong>) the entities it detected. This helps in information retrieval from bulk uncategorized text.</p>
<h2>Load a spaCy model and check if it has ner pipeline</h2>
<blockquote><p>In:</p>
<p><em><strong>!python -m spacy download en_core_web_sm</strong></em></p>
<p><em><strong>import spacy </strong></em></p>
<p><em><strong>nlp = spacy.load(&#8220;en_core_web_sm&#8221;)</strong></em><br />
<em><strong>nlp.pipe_names</strong></em></p>
<p>&nbsp;</p>
<p>Out:</p>
<p><strong><em>[&#8216;tok2vec&#8217;, &#8216;tagger&#8217;, &#8216;parser&#8217;, &#8216;attribute_ruler&#8217;, &#8216;lemmatizer&#8217;, &#8216;ner&#8217;]</em></strong></p></blockquote>
<p>Since <strong>ner</strong> is in the pipeline, let&#8217;s test how entity detection works on a sentence.</p>
<blockquote><p>In:</p>
<p><em><strong>sentence = &#8220;Daniil Medvedev and Novak Djokovic have built an intriguing rivalry since the Australian Open decider, which the Serb won comprehensively.&#8221;</strong></em><br />
<em><strong>doc = nlp(sentence)</strong></em></p>
<p><em><strong>from spacy import displacy</strong></em><br />
<em><strong>displacy.render(doc, style=&#8221;ent&#8221;, jupyter=True)</strong></em></p></blockquote>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="738" data-permalink="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/entitydetection/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?fit=1252%2C99&amp;ssl=1" data-orig-size="1252,99" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1636586218&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="entityDetection" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?fit=300%2C24&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?fit=800%2C63&amp;ssl=1" class="alignnone size-full wp-image-738" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?resize=800%2C63&#038;ssl=1" alt="" width="800" height="63" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?w=1252&amp;ssl=1 1252w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?resize=300%2C24&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?resize=768%2C61&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?resize=1024%2C81&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?resize=1080%2C85&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?resize=980%2C77&amp;ssl=1 980w, 
https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/entityDetection.jpg?resize=480%2C38&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p>Let&#8217;s observe the doc to see how entities are being identified/tagged by the model.</p>
<blockquote><p>In:</p>
<p><em><strong>[(X, X.ent_iob_, X.ent_type_) for X in doc if X.ent_type_]</strong></em></p>
<p>Out:</p>
<p><em><strong>[(Daniil, &#8216;B&#8217;, &#8216;PERSON&#8217;),</strong></em><br />
<em><strong>(Medvedev, &#8216;I&#8217;, &#8216;PERSON&#8217;),</strong></em><br />
<em><strong>(Novak, &#8216;B&#8217;, &#8216;PERSON&#8217;),</strong></em><br />
<em><strong>(Djokovic, &#8216;I&#8217;, &#8216;PERSON&#8217;),</strong></em><br />
<em><strong>(Australian, &#8216;B&#8217;, &#8216;NORP&#8217;), # LOCATION</strong></em><br />
<em><strong>(Serb, &#8216;B&#8217;, &#8216;NORP&#8217;)]</strong></em></p></blockquote>
<p><strong>Novak</strong> and <strong>Djokovic</strong> are correctly identified as <strong>PERSON</strong>, but as two separate tokens; <strong>displaCy</strong> still renders them as a single entity. <strong>IOB tagging</strong> is what combines tokens that belong to the same entity.</p>
<h2>Inside-Outside-Beginning(IOB) Tagging</h2>
<p><strong>IOB</strong> is the common tagging format for tagging the entities/chunks in the text.</p>
<ul>
<li><em><strong>I</strong></em> stands for Inside and it indicates that the token is an insider of a chunk.</li>
<li><em><strong>B</strong></em> stands for Beginning and it indicates that the token is the beginning of a chunk.</li>
<li><em><strong>O</strong></em> stands for Outside and it indicates that the token doesn&#8217;t belong to any chunk.</li>
</ul>
<p>In the above output, <strong>Daniil</strong> is tagged as <em><strong>B</strong></em>, the beginning of the entity chunk, and <strong>Medvedev</strong> is tagged as <em><strong>I</strong></em>, an inside token continuing the previous token <strong>Daniil</strong>. These two tokens combine to form one <strong>PERSON</strong> entity. The same applies to <strong>Novak</strong> and <strong>Djokovic</strong>.</p>
<p>The tokens tagged as <strong>O</strong> are not classified as an entity type and we can see that no label has been assigned by the model.</p>
<blockquote><p><em><strong>[(and, &#8216;O&#8217;, &#8221;),</strong></em><br />
<em><strong>(have, &#8216;O&#8217;, &#8221;),</strong></em><br />
<em><strong>(built, &#8216;O&#8217;, &#8221;),</strong></em><br />
<em><strong>(an, &#8216;O&#8217;, &#8221;),</strong></em><br />
<em><strong>(intriguing, &#8216;O&#8217;, &#8221;),</strong></em><br />
<em><strong>(rivalry, &#8216;O&#8217;, &#8221;),</strong></em><br />
<em><strong>(since, &#8216;O&#8217;, &#8221;),</strong></em><br />
<em><strong>(the, &#8216;O&#8217;, &#8221;),</strong></em><br />
<em><strong>(Open, &#8216;O&#8217;, &#8221;),</strong></em><br />
<em><strong>(decider, &#8216;O&#8217;, &#8221;)]</strong></em></p></blockquote>
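<p>The way B/I/O tags combine consecutive tokens into entity chunks can be shown with a small, self-contained sketch (the <em>merge_iob</em> helper below is our own illustration of the idea, not a spaCy API):</p>

```python
# (token, IOB tag, entity type) triples, as in the output above
tagged = [("Daniil", "B", "PERSON"), ("Medvedev", "I", "PERSON"),
          ("and", "O", ""), ("Novak", "B", "PERSON"),
          ("Djokovic", "I", "PERSON")]

def merge_iob(tokens):
    entities, current = [], None
    for text, iob, label in tokens:
        if iob == "B":                # beginning: open a new chunk
            if current:
                entities.append(tuple(current))
            current = [text, label]
        elif iob == "I" and current:  # inside: extend the open chunk
            current[0] += " " + text
        else:                         # outside: close any open chunk
            if current:
                entities.append(tuple(current))
            current = None
    if current:
        entities.append(tuple(current))
    return entities

print(merge_iob(tagged))
# [('Daniil Medvedev', 'PERSON'), ('Novak Djokovic', 'PERSON')]
```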
<p><em><strong>CARDINAL</strong></em>, <em><strong>DATE</strong></em>, <em><strong>EVENT</strong></em>, <em><strong>FAC</strong></em>, <em><strong>GPE</strong></em>, <em><strong>LANGUAGE</strong></em>, <em><strong>LAW</strong></em>, <em><strong>LOC</strong></em>, <em><strong>MONEY</strong></em>, <em><strong>NORP</strong></em>, <em><strong>ORDINAL</strong></em>, <em><strong>ORG</strong></em>, <em><strong>PERCENT</strong></em>, <em><strong>PERSON</strong></em>, <em><strong>PRODUCT</strong></em>, <em><strong>QUANTITY</strong></em>, <em><strong>TIME</strong></em>, <em><strong>WORK_OF_ART</strong></em></p>
<p>These are the entity labels provided by the NER pre-trained model. <span style="font-weight: 400;">We can execute the command given below to understand each label.</span></p>
<blockquote><p>In:</p>
<p><em><strong>spacy.explain(&#8220;NORP&#8221;)</strong></em></p>
<p>Out:</p>
<p><em><strong>Nationalities or religious or political groups</strong></em></p></blockquote>
<h2><span style="font-weight: 400;">Why do we need a Custom NER?</span></h2>
<p>SpaCy pre-trained models detect and categorize text chunks into 18 entity types. If the requirement is to extract custom information from job postings, the pre-trained model above provides no support. Let&#8217;s see an example:</p>
<blockquote><p>In:</p>
<p><em><strong>sentence = &#8220;&#8221;&#8221;As a Full Stack Developer, you will develop applications in a very passionate environment being responsible for Front-end and Back-end development. You will perform development and day-to-day maintenance on large applications. You have multiple opportunities to work on cross-system single-page applications.&#8221;&#8221;&#8221;</strong></em><br />
<em><strong>doc = nlp(sentence)</strong></em></p>
<p><em><strong>from spacy import displacy</strong></em><br />
<em><strong>displacy.render(doc, style=&#8221;ent&#8221;, jupyter=True)</strong></em></p>
<p>Out:</p>
<p><strong>UserWarning</strong>: <em>[W006] No entities to visualize found in Doc object. If this is surprising to you, make sure the Doc was processed using a model that supports named entity recognition, and check the `doc.ents` property manually if necessary.</em></p></blockquote>
<p><span style="font-weight: 400;">The warning says that no entities were found in the Doc object.</span></p>
<p><span style="font-weight: 400;">This is where the custom NER model comes into the picture for our custom problem statement i.e., detecting the </span><b>job_role</b><span style="font-weight: 400;"> from the job posts.</span></p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="740" data-permalink="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/jobtitle/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?fit=1401%2C266&amp;ssl=1" data-orig-size="1401,266" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1636630992&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="jobtitle" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?fit=300%2C57&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?fit=800%2C152&amp;ssl=1" class="alignnone size-full wp-image-740" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?resize=800%2C152&#038;ssl=1" alt="" width="800" height="152" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?w=1401&amp;ssl=1 1401w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?resize=300%2C57&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?resize=768%2C146&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?resize=1024%2C194&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?resize=1080%2C205&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?resize=1280%2C243&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?resize=980%2C186&amp;ssl=1 980w, 
https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/jobtitle.jpg?resize=480%2C91&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" />Steps to build the custom NER model for detecting the job role in job postings in spaCy 3.0:</p>
<ol>
<li>Annotate the data to train the model.</li>
<li>Convert the annotated data into the spaCy bin object.</li>
<li>Generate the config file from the spaCy website.</li>
<li>Train the model in the command line.</li>
<li>Load and test the saved model.</li>
</ol>
<p>We will discuss the above steps in detail.</p>
<h3>SpaCy NER annotation tool by agateteam</h3>
<p>The agateteam provides a lightweight <a href="http://agateteam.org/spacynerannotate/"><em><strong>annotation tool</strong></em></a> to generate the spaCy-supported annotated data format.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="744" data-permalink="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/tool/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/tool.gif?fit=1905%2C975&amp;ssl=1" data-orig-size="1905,975" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="tool" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/tool.gif?fit=300%2C154&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/tool.gif?fit=800%2C409&amp;ssl=1" class="alignnone size-full wp-image-744" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/tool.gif?resize=800%2C409&#038;ssl=1" alt="" width="800" height="409" /></p>
<p>Annotation of a sentence is shown in the above gif. We have shown the <strong>job_role</strong> tagging; you can add <strong>work_experience</strong>, <strong>work_location</strong>, <strong>experience</strong> to the entity list. Here is the sample annotated data:</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="745" data-permalink="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/datasample-2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?fit=1333%2C502&amp;ssl=1" data-orig-size="1333,502" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1636637140&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="datasample" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?fit=300%2C113&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?fit=800%2C302&amp;ssl=1" class="alignnone size-full wp-image-745" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?resize=800%2C301&#038;ssl=1" alt="" width="800" height="301" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?w=1333&amp;ssl=1 1333w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?resize=300%2C113&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?resize=768%2C289&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?resize=1024%2C386&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?resize=1080%2C407&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?resize=1280%2C482&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?resize=980%2C369&amp;ssl=1 980w, 
https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/datasample.jpg?resize=480%2C181&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
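<p>For readers who cannot copy from the image, the annotated data follows spaCy&#8217;s training tuple format: a list of <em>(text, {"entities": [(start, end, label)]})</em> pairs, where start and end are character offsets into the text. The sentences and offsets below are illustrative, not the actual dataset:</p>

```python
# illustrative sample of the annotated data format (offsets are character indexes)
trainData = [
    ("Looking for a Full Stack Developer with 3+ years of experience.",
     {"entities": [(14, 34, "job_role")]}),
    ("We are hiring a Data Engineer in Kochi.",
     {"entities": [(16, 29, "job_role")]}),
]

# sanity-check that each span covers the intended text
for text, annot in trainData:
    for start, end, label in annot["entities"]:
        print(label, "->", text[start:end])
# job_role -> Full Stack Developer
# job_role -> Data Engineer
```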
<h3>Convert the annotated data into the spaCy bin object</h3>
<p>In spaCy 2.x, this raw data could be used directly to train a model. In spaCy 3.x, however, we need to convert it to a <strong>DocBin</strong> object. Assume the above-annotated data is assigned to a variable called <strong>trainData</strong>; we can convert it using the snippet below:</p>
<blockquote>
<div>
<div><em><strong>import spacy</strong></em></div>
<div><em><strong>from spacy.tokens import DocBin</strong></em></div>
<div><em><strong>from tqdm import tqdm</strong></em></div>
<div></div>
<div><em><strong>nlp = spacy.blank("en") # load a blank English model</strong></em></div>
<div><em><strong>db = DocBin() # create a DocBin object</strong></em></div>
<div></div>
<div><em><strong>for text, annot in tqdm(trainData): # data in the annotated format</strong></em></div>
<div><em><strong>    doc = nlp.make_doc(text) # create a doc object from the text</strong></em></div>
<div><em><strong>    ents = []</strong></em></div>
<div><em><strong>    for start, end, label in annot["entities"]: # character indexes</strong></em></div>
<div><em><strong>        span = doc.char_span(start, end, label=label, alignment_mode="contract")</strong></em></div>
<div><em><strong>        if span is None:</strong></em></div>
<div><em><strong>            print("Skipping entity")</strong></em></div>
<div><em><strong>        else:</strong></em></div>
<div><em><strong>            ents.append(span)</strong></em></div>
<div><em><strong>    try:</strong></em></div>
<div><em><strong>        doc.ents = ents # label the text with the ents</strong></em></div>
<div><em><strong>        db.add(doc)</strong></em></div>
<div><em><strong>    except ValueError: # e.g. overlapping entity spans</strong></em></div>
<div><em><strong>        print(text, annot)</strong></em></div>
<div></div>
<div><em><strong>db.to_disk("./train.spacy") # save the DocBin object</strong></em></div>
</div>
</blockquote>
<div>Now, we have the trainData saved as <strong>train.spacy</strong>.</div>
<div></div>
<h3>Generate the config file to train via Command line</h3>
<p>Training from the command line with <em><strong>spacy train</strong></em> is the recommended way to train spaCy pipelines. <em><strong>config.cfg</strong></em> includes all settings and hyperparameters, which we can overwrite if necessary.</p>
<p>Go to the spaCy training <strong><a href="https://spacy.io/usage/training"><em>link </em></a></strong>and follow the steps below:</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="747" data-permalink="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/spacyconfig/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/spacyConfig.gif?fit=1288%2C868&amp;ssl=1" data-orig-size="1288,868" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="spacyConfig" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/spacyConfig.gif?fit=300%2C202&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/spacyConfig.gif?fit=800%2C539&amp;ssl=1" class="alignnone size-full wp-image-747" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/spacyConfig.gif?resize=800%2C539&#038;ssl=1" alt="" width="800" height="539" /></p>
<p>Select the preferred language and choose <strong>ner</strong> as the component. Depending on your system, you can choose CPU or GPU. Save this configuration as <strong>base_config.cfg</strong>.</p>
<div>To fill the remaining system defaults, run this command on the command line to generate the <em><strong>config.cfg </strong></em>file<em>.</em></div>
<blockquote>
<div><em><strong><span class="f93e7b95">python -m</span> spacy <span class="_89ba5f03 cea05330">init fill-config</span> <span class="_89ba5f03">base_config.cfg</span> <span class="_89ba5f03">config.cfg</span></strong></em></div>
</blockquote>
<h3>Training the model using the command line</h3>
<blockquote><p><em><strong><span class="token selector">[paths]</span></strong></em></p>
<p><em><strong><span class="token constant">train</span> <span class="token attr-value"><span class="token punctuation">=</span> ./train.spacy</span></strong></em></p>
<p><em><strong><span class="token constant">dev</span> <span class="token attr-value"><span class="token punctuation">=</span> ./dev.spacy</span></strong></em></p></blockquote>
<p>You can specify the train, dev, and output file paths in the config file. The batch size, max steps, epochs, patience, etc., can also be specified there.</p>
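<p>For illustration, a few of these settings might look like this in <em><strong>config.cfg</strong></em> (the section names follow spaCy&#8217;s config schema; the values here are only examples, not recommendations):</p>

```ini
[paths]
train = ./train.spacy
dev = ./dev.spacy

[training]
max_steps = 2000
patience = 400

[nlp]
batch_size = 128
```

<p>Any of these values can also be overridden at the command line when launching training, without editing the file.</p>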
<p><span style="font-weight: 400;">Now that we have the config file and train data, let’s train the model using the command line.</span></p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="750" data-permalink="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/train/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/train.gif?fit=1299%2C866&amp;ssl=1" data-orig-size="1299,866" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="train" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/train.gif?fit=300%2C200&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/train.gif?fit=800%2C534&amp;ssl=1" class="alignnone size-full wp-image-750" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/train.gif?resize=800%2C533&#038;ssl=1" alt="" width="800" height="533" /></p>
<p><span style="font-weight: 400;">The model output will be saved in the specified folder as an argument at the command line.</span></p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="749" data-permalink="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/modeloutput/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/modelOutput.jpg?fit=948%2C464&amp;ssl=1" data-orig-size="948,464" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1636651648&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="modelOutput" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/modelOutput.jpg?fit=300%2C147&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/modelOutput.jpg?fit=800%2C392&amp;ssl=1" class="alignnone size-full wp-image-749" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/modelOutput.jpg?resize=800%2C392&#038;ssl=1" alt="" width="800" height="392" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/modelOutput.jpg?w=948&amp;ssl=1 948w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/modelOutput.jpg?resize=300%2C147&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/modelOutput.jpg?resize=768%2C376&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/modelOutput.jpg?resize=480%2C235&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<h3>Load &amp; Test the model</h3>
<ul>
<li>Load the model.</li>
</ul>
<blockquote><p><em><strong>import spacy</strong></em></p>
<p><em><strong>nlp = spacy.load(&#8220;output/model-last/&#8221;) #load the model</strong></em></p></blockquote>
<ul>
<li>Take the unseen data to test the model prediction.</li>
</ul>
<blockquote><p><em><strong>sentence = &#8220;&#8221;&#8221;We are looking for a Backend Developer who has 4-6 years of experience in designing, developing and implementing backend services using Python and Django.&#8221;&#8221;&#8221;</strong></em></p>
<p><em><strong>doc = nlp(sentence)</strong></em></p>
<p><em><strong>from spacy import displacy</strong></em><br />
<em><strong>displacy.render(doc, style=&#8221;ent&#8221;, jupyter=True)</strong></em></p></blockquote>
<p><em><strong>Out:</strong></em></p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="751" data-permalink="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/final_output/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?fit=1239%2C96&amp;ssl=1" data-orig-size="1239,96" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1636652294&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="final_output" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?fit=300%2C23&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?fit=800%2C62&amp;ssl=1" class="alignnone size-full wp-image-751" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?resize=800%2C62&#038;ssl=1" alt="" width="800" height="62" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?w=1239&amp;ssl=1 1239w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?resize=300%2C23&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?resize=768%2C60&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?resize=1024%2C79&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?resize=1080%2C84&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?resize=980%2C76&amp;ssl=1 980w, 
https://i0.wp.com/turbolab.in/wp-content/uploads/2021/11/final_output.jpg?resize=480%2C37&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p><strong>Backend Developer</strong> is predicted as a <strong>job_role</strong> by the model.</p>
<h2>Applications of NER:</h2>
<ul>
<li>Enabling recommendation systems.</li>
<li>Simplifying customer support.</li>
<li>Classifying data from news sources.</li>
<li>Optimizing search engine algorithms.</li>
</ul>
<h2>EndNote:</h2>
<p>We have taken just 10 records to train the model. For better accuracy and precision, we need a much larger amount of annotated data to train a model.</p>
<p>The post <a href="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/">Build a Custom NER model using spaCy 3.0</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">727</post-id>	</item>
		<item>
		<title>Stemming Vs. Lemmatization with Python NLTK</title>
		<link>https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/</link>
					<comments>https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Fri, 29 Oct 2021 17:00:31 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[lemmatization]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[stemmer]]></category>
		<category><![CDATA[stemming]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=694</guid>

					<description><![CDATA[<p>Stemming and Lemmatization are text/word normalization techniques widely used in text pre-processing. They basically reduce the words to their root form. Here is an example: Let&#8217;s say you have to train the data for classification and you are choosing any vectorizer to transform your data. These vectorizers create a vocabulary(set of unique words) from our [&#8230;]</p>
<p>The post <a href="https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/">Stemming Vs. Lemmatization with Python NLTK</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><strong>Stemming</strong> and <strong>Lemmatization</strong> are text/word normalization techniques widely used in text pre-processing. They basically reduce the words to their root form. Here is an example:</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="707" data-permalink="https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/lemmvsstem2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem2.jpg?fit=408%2C244&amp;ssl=1" data-orig-size="408,244" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1635515383&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="lemmVsStem2" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem2.jpg?fit=300%2C179&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem2.jpg?fit=408%2C244&amp;ssl=1" class="size-full wp-image-707 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem2.jpg?resize=408%2C244&#038;ssl=1" alt="" width="408" height="244" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem2.jpg?w=408&amp;ssl=1 408w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem2.jpg?resize=300%2C179&amp;ssl=1 300w" sizes="(max-width: 408px) 100vw, 408px" /></p>
<p>Let&#8217;s say you are training data for classification and you choose a vectorizer to transform your data. These vectorizers create a vocabulary (a set of unique words) from the data corpus. <span style="font-weight: 400">By applying stemming/lemmatization techniques, we can reduce the vocabulary size by converting words to their base forms. </span><span style="font-weight: 400">This makes the vocabulary more compact, reduces ambiguity for the model during training, and yields better results.</span></p>
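<p>The effect on vocabulary size can be sketched with a toy example (the base-form mapping below is hand-written for illustration; in practice a stemmer or lemmatizer computes it):</p>

```python
# Toy illustration: collapsing inflected forms to a base form shrinks
# the vocabulary a vectorizer would build from the corpus.
tokens = ["plays", "playing", "played", "player", "play"]

# Hand-written base-form mapping (a real stemmer/lemmatizer would compute this):
base = {"plays": "play", "playing": "play", "played": "play"}

vocab_raw = set(tokens)                        # 5 unique words
vocab_norm = {base.get(t, t) for t in tokens}  # {"play", "player"}
print(len(vocab_raw), len(vocab_norm))         # 5 2
```

<p>Two entries now stand in for five surface forms, which is exactly the reduction stemming/lemmatization buys us before vectorization.</p>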
<p>In this post, we will discuss the practical examples of how stemming and lemmatization can be done on words and sentences using the python <strong>nltk</strong> package.</p>
<h1>Stemming</h1>
<p>Stemming is a rule-based normalization approach: it slices off a word&#8217;s prefix and suffix to reduce it to its root form. Stemming is faster than lemmatization because it cuts prefixes (pre-, extra-, in-, im-, ir-, etc.) and suffixes (-ed, -ing, -es, -ity, -ty, -ship, -ness, etc.) without considering the context of the words. <strong>Due to this aggressiveness, the outcome of a stemming algorithm may not be a valid word</strong>.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="710" data-permalink="https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/lemmvsstem3/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?fit=487%2C295&amp;ssl=1" data-orig-size="487,295" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1635522855&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="lemmVsStem3" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?fit=300%2C182&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?fit=487%2C295&amp;ssl=1" class="size-full wp-image-710 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?resize=487%2C295&#038;ssl=1" alt="" width="487" height="295" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?w=487&amp;ssl=1 487w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?resize=300%2C182&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?resize=480%2C291&amp;ssl=1 480w" sizes="(max-width: 487px) 100vw, 487px" /></p>
<p>In the above example, you can see that the outcomes of <strong>badly</strong> and <strong>pharmacies</strong> are invalid words.</p>
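<p>The rule-based idea can be sketched with a toy suffix stripper (this is not the actual Porter algorithm, just an illustration of why slicing suffixes can yield non-words):</p>

```python
# Toy rule-based stemmer: strip the first matching suffix, keeping at
# least three characters of the word. NOT the real Porter algorithm.
SUFFIXES = ("ies", "ing", "ed", "es", "s", "ly")

def toy_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["playing", "played", "pharmacies", "badly"]])
# ['play', 'play', 'pharmac', 'bad'] -- "pharmac" is not a valid word
```

<p>Because the rules never consult a dictionary, &#8220;pharmacies&#8221; is chopped to the non-word &#8220;pharmac&#8221;, mirroring what the real stemmers do below.</p>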
<h3>Porter Stemmer</h3>
<p>The Porter stemming algorithm (or &#8220;Porter stemmer&#8221;) uses suffix stripping to produce stems. Here is Python code using nltk to create a stemmer object and generate results.</p>
<p>Code Snippet to perform Porter Stemming:</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>from nltk.stem import PorterStemmer</strong></em><br />
<em><strong>words = [&#8220;plays&#8221;, &#8220;playing&#8221;, &#8220;played&#8221;, &#8220;player&#8221;, &#8220;pharmacies&#8221;, &#8220;badly&#8221;]</strong></em><br />
<em><strong>ps = PorterStemmer()</strong></em><br />
<em><strong>print([ps.stem(w) for w in words])</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>[&#8216;play&#8217;, &#8216;play&#8217;, &#8216;play&#8217;, &#8216;player&#8217;, &#8216;pharmaci&#8217;, &#8216;badli&#8217;]</strong></em></p></blockquote>
<p><strong>To address the drawbacks of the Porter stemmer, the Snowball stemming algorithm was introduced.</strong></p>
<h3>Snowball Stemmer</h3>
<p>The Snowball stemming algorithm is also known as the Porter2 stemmer. It is an improved version of the Porter stemmer in which a few of the stemming issues discussed above are resolved.</p>
<p>Code Snippet to perform Snowball Stemming:</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>from nltk.stem.snowball import SnowballStemmer</strong></em><br />
<em><strong>words = [&#8220;plays&#8221;, &#8220;playing&#8221;, &#8220;played&#8221;, &#8220;player&#8221;, &#8220;pharmacies&#8221;, &#8220;badly&#8221;]</strong></em><br />
<em><strong>ss = SnowballStemmer(language=&#8217;english&#8217;)</strong></em><br />
<em><strong>print([ss.stem(w) for w in words])</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>[&#8216;play&#8217;, &#8216;play&#8217;, &#8216;play&#8217;, &#8216;player&#8217;, &#8216;pharmaci&#8217;, &#8216;bad&#8217;]</strong></em></p></blockquote>
<p>Here, we can see that &#8220;<em><strong>badly</strong></em>&#8221; now yields a valid stem, but &#8220;<em><strong>pharmacies</strong></em>&#8221; still yields an invalid one.</p>
<h3>Lancaster Stemmer</h3>
<p><span style="font-weight: 400">Compared to the Snowball and Porter stemmers, Lancaster is the most aggressive stemming algorithm, as it tends to over-stem many words. It tries to reduce the word to the shortest stem possible. Here is an example:</span></p>
<blockquote><p><em><strong>&#8220;salty&#8221; &#8212;- &#8220;sal&#8221;</strong></em></p>
<p><em><strong>&#8220;sales&#8221; &#8212;- &#8220;sal&#8221;</strong></em></p></blockquote>
<p>Code Snippet to perform Lancaster Stemming:</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>from nltk.stem import LancasterStemmer</strong></em><br />
<em><strong>words = [&#8220;plays&#8221;, &#8220;playing&#8221;, &#8220;played&#8221;, &#8220;player&#8221;, &#8220;pharmacies&#8221;, &#8220;badly&#8221;]</strong></em><br />
<em><strong>ls = LancasterStemmer()</strong></em><br />
<em><strong>print([ls.stem(w) for w in words])</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>[&#8216;play&#8217;, &#8216;play&#8217;, &#8216;play&#8217;, &#8216;play&#8217;, &#8216;pharm&#8217;, &#8216;bad&#8217;]</strong></em></p></blockquote>
<p>As mentioned at the beginning, stemming reduces the vocabulary size by collapsing inflected forms into a single stem.</p>
<p>Code snippet to perform tokenization and stemming on a paragraph:</p>
<blockquote><p><em><strong>content = &#8220;China’s Huawei overtook Samsung Electronics as the world’s biggest seller of mobile phones in the second quarter of 2020, shipping 55.8 million devices compared to Samsung’s 53.7 million, according to data from research firm Canalys. While Huawei’s sales fell 5 per cent from the same quarter a year earlier, South Korea’s Samsung posted a bigger drop of 30 per cent, owing to disruption from the coronavirus in key markets such as Brazil, the United States and Europe, Canalys said. Huawei’s overseas shipments fell 27 per cent in Q2 from a year earlier, but the company increased its dominance of the China market which has been faster to recover from COVID-19 and where it now sells over 70 per cent of its phones. “Our business has demonstrated exceptional resilience in these difficult times,” a Huawei spokesman said. “Amidst a period of unprecedented global economic slowdown and challenges, we’re continued to grow and further our leadership position.” Nevertheless, Huawei’s position as number one seller may prove short-lived once other markets recover given it is mainly due to economic disruption, a senior Huawei employee with knowledge of the matter told Reuters. Apple is due to release its Q2 iPhone shipment data on Friday.&#8221;</strong></em></p></blockquote>
<p>The above content will hereafter be used as the input to the code snippets.</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>from nltk.stem import PorterStemmer</strong></em><br />
<em><strong>from nltk.tokenize import word_tokenize</strong></em></p>
<p><em><strong>ps = PorterStemmer()</strong></em></p>
<p><em><strong># Porter Stemmed version</strong></em></p>
<p><em><strong>porteredContent = [ps.stem(word) for word in word_tokenize(content)]</strong></em></p></blockquote>
<p><span style="font-weight: 400">Try testing the above code snippet by replacing the Porter stemmer with Snowball and Lancaster stemmers.</span></p>
<p>Let us look at some statistics to compare these three stemming algorithms.</p>
<ul>
<li><strong>The length of the content is 1041 (without spaces)</strong></li>
<li><strong>The length of the content after the Porter stemmer is 943, which took around 0.00499 seconds to process</strong></li>
<li><strong>The length of the content after the Snowball stemmer is 944, which took around 0.00399 seconds to process</strong></li>
<li><strong>The length of the content after the Lancaster stemmer is 835, which took around 0.00399 seconds to process</strong></li>
</ul>
<p>Obviously, the Lancaster stemmer produces the shortest content because of its aggressive over-stemming. With all three stemmers discussed above, we weren&#8217;t able to get the root word of &#8220;<strong>pharmacies</strong>&#8221;. Since stemming didn&#8217;t give us a valid stem in every case, we will now move on to lemmatization. While stemming is fast, it is not 100% accurate.</p>
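<p>Timings like these can be reproduced with a small helper (shown here with a stand-in normalizer; swap in <em><strong>ps.stem</strong></em>, <em><strong>ss.stem</strong></em>, or <em><strong>ls.stem</strong></em> from the snippets above, and note that absolute numbers will vary by machine):</p>

```python
import time

def timed_normalize(normalize, words):
    """Apply `normalize` to each word and report the elapsed seconds."""
    start = time.perf_counter()
    result = [normalize(w) for w in words]
    return result, time.perf_counter() - start

# Stand-in normalizer for demonstration; use ps.stem / ss.stem / ls.stem here.
result, seconds = timed_normalize(str.lower, ["Plays", "PLAYING"])
print(result, round(seconds, 5))
```

<p>Running the same word list through each stemmer with this helper gives a like-for-like speed comparison.</p>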
<h1>Lemmatization</h1>
<p>In lemmatization, the part of speech (POS) is determined first, unlike stemming, which reduces a word to its root form without considering the context. Lemmatization always considers the context and converts the word to its meaningful root/dictionary (WordNet) form, called the lemma.</p>
<h3>WordNet Lemmatizer</h3>
<p><b>WordNet</b> is a lexical database (a collection of words) that has been used by major search engines and IR research projects for many years. It offers lemmatization capabilities as well and is one of the earliest and most commonly used lemmatizers.</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>from nltk.stem import WordNetLemmatizer</strong></em><br />
<em><strong>words = [&#8220;plays&#8221;, &#8220;playing&#8221;, &#8220;played&#8221;, &#8220;player&#8221;, &#8220;pharmacies&#8221;, &#8220;badly&#8221;]</strong></em><br />
<em><strong>lemmatizer = WordNetLemmatizer()</strong></em><br />
<em><strong>print([lemmatizer.lemmatize(word) for word in words])</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>[&#8216;play&#8217;, &#8216;playing&#8217;, &#8216;played&#8217;, &#8216;player&#8217;, &#8216;pharmacy&#8217;, &#8216;badly&#8217;]</strong></em></p></blockquote>
<p>Here, we can see that only &#8220;<strong>plays</strong>&#8221; and the much-anticipated &#8220;<strong>pharmacies</strong>&#8221; have been converted to their root forms, while the remaining words have not. Without a POS tag, the WordNet Lemmatizer treats every word as a noun. We need to pass the respective POS tag along with each word to the WordNet Lemmatizer.</p>
<h3>WordNet Lemmatizer with POS tag:</h3>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>word = &#8220;better&#8221;</strong></em><br />
<em><strong>print(lemmatizer.lemmatize(word, pos=&#8221;n&#8221;)) # n for noun and it is default</strong></em><br />
<em><strong>print(lemmatizer.lemmatize(word, pos=&#8221;a&#8221;)) # a for adjective</strong></em><br />
<em><strong>print(lemmatizer.lemmatize(word, pos=&#8221;v&#8221;)) # v for verb</strong></em><br />
<em><strong>print(lemmatizer.lemmatize(word, pos=&#8221;r&#8221;)) # r for adverb</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>better | </strong></em><em><strong>good | </strong></em><em><strong>better | </strong></em><em><strong>well</strong></em></p></blockquote>
<p>For the word <span style="font-weight: 400">“</span><strong>better<span style="font-weight: 400">”</span></strong>, the output is not the same when the POS is an adjective and an adverb.</p>
<p>Now, determining the POS of each word is an extra task in the lemmatization process. When converting large chunks of text, it is impractical to pass a POS tag for each word by hand; we need to automate fetching the POS tag for each word we lemmatize. Here is a function for that:</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>import nltk</strong></em><br />
<em><strong>from nltk.corpus import wordnet</strong></em></p>
<p><em><strong>def get_wordnet_pos(word):</strong></em><br />
<em><strong>    tag = nltk.pos_tag([word])[0][1][0].upper()</strong></em><br />
<em><strong>    tagDict = {&#8220;J&#8221;: wordnet.ADJ,</strong></em><br />
<em><strong>               &#8220;N&#8221;: wordnet.NOUN,</strong></em><br />
<em><strong>               &#8220;V&#8221;: wordnet.VERB,</strong></em><br />
<em><strong>               &#8220;R&#8221;: wordnet.ADV}</strong></em><br />
<em><strong>    return tagDict.get(tag, wordnet.NOUN)</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<ul>
<li><em><strong>get_wordnet_pos(&#8220;better&#8221;) &#8212; &#8220;r&#8221;</strong></em></li>
<li><em><strong>get_wordnet_pos(&#8220;play&#8221;) &#8212; &#8220;n&#8221;</strong></em></li>
<li><em><strong>get_wordnet_pos(&#8220;bad&#8221;) &#8212; &#8220;a&#8221;</strong></em></li>
</ul>
</blockquote>
<p>Code Snippet to perform WordNet Lemmatization with POS:</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>from nltk.stem import WordNetLemmatizer</strong></em><br />
<em><strong>words = [&#8220;plays&#8221;, &#8220;playing&#8221;, &#8220;played&#8221;, &#8220;player&#8221;, &#8220;pharmacies&#8221;, &#8220;badly&#8221;]</strong></em><br />
<em><strong>lemmatizer = WordNetLemmatizer()</strong></em><br />
<em><strong>print([lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words])</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>[&#8216;play&#8217;, &#8216;play&#8217;, &#8216;played&#8217;, &#8216;player&#8217;, &#8216;pharmacy&#8217;, &#8216;badly&#8217;]</strong></em></p></blockquote>
<p>The spaCy, TextBlob, Stanford CoreNLP, and Gensim lemmatizers are other lemmatizers that can be tried. With the spaCy lemmatizer, lemmatization can be done without passing any POS tag.</p>
<p>Code snippet to perform lemmatization on a paragraph:</p>
<blockquote><p><em><strong>from nltk.tokenize import word_tokenize</strong></em><br />
<em><strong>from nltk.stem import WordNetLemmatizer</strong></em></p>
<p><em><strong>wordnetContent = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in word_tokenize(content)] # content defined earlier</strong></em></p></blockquote>
<p>The time taken to process this content with the WordNet Lemmatizer is <strong>0.2234</strong> seconds, which is much higher compared to stemming.</p>
<h1>Conclusion</h1>
<p>Stemming and lemmatization both generate the root/base form of a word. The difference is that a stem may not be an actual word, whereas a lemma is a meaningful word.</p>
<p>Compared to stemming, lemmatization is slow, but it helps train a more accurate ML model. If your data is huge, the Snowball stemmer (Porter2) is a better alternative. If your ML model uses a count vectorizer and doesn&#8217;t depend on the context of words/sentences, stemming is a good choice.</p>
<p>For deep learning models that use word embeddings, lemmatization is the better choice, because you will not find word embeddings for invalid stem words.</p>
<p>We recommend you try other methods of lemmatization provided by Spacy, Textblob, Gensim, and Stanford core NLP.</p>
<p>The post <a href="https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/">Stemming Vs. Lemmatization with Python NLTK</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">694</post-id>	</item>
		<item>
		<title>Text Classification using Machine Learning</title>
		<link>https://turbolab.in/text-classification-using-machine-learning/</link>
					<comments>https://turbolab.in/text-classification-using-machine-learning/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Fri, 15 Oct 2021 13:11:59 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[text classification]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=601</guid>

					<description><![CDATA[<p>Machine Learning, Deep Learning, Artificial Intelligence are the popular buzzwords in present trends. Artificial Intelligence(AI) is the branch of computer science which deals with developing intelligence artificially to the machines which are able to think, act and behave like humans. Machine Learning(ML) is a subset of AI and is the way to implement artificial intelligence. It [&#8230;]</p>
<p>The post <a href="https://turbolab.in/text-classification-using-machine-learning/">Text Classification using Machine Learning</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p class="graf graf--p"><strong class="markup--strong markup--p-strong">Machine Learning, Deep Learning</strong>, <strong class="markup--strong markup--p-strong">Artificial Intelligence</strong> are the popular buzzwords in present trends.</p>
<p class="graf graf--p"><strong class="markup--strong markup--p-strong">Artificial Intelligence (AI)</strong> is the branch of computer science that deals with building machines that can think, act, and behave like humans.</p>
<p class="graf graf--p"><strong class="markup--strong markup--p-strong">Machine Learning (ML)</strong> is a subset of <strong class="markup--strong markup--p-strong">AI</strong> and a way to implement artificial intelligence. It is a statistical approach in which each instance in a data-set is described by a set of features or attributes. Feature extraction is key in <strong class="markup--strong markup--p-strong">ML</strong>.</p>
<p class="graf graf--p"><strong class="markup--strong markup--p-strong">Deep Learning (DL)</strong> is the next evolution and a subset of <strong class="markup--strong markup--p-strong">ML</strong>. It is a method of statistical learning that extracts features or attributes from raw data. <strong class="markup--strong markup--p-strong">DL</strong> uses a network of algorithms called artificial neural networks, which imitate the neural networks of the human brain. <strong class="markup--strong markup--p-strong">DL</strong> passes the data through a network of layers (input, hidden &amp; output) to extract features and learn from the data. Let&#8217;s stop with <strong>DL</strong> here; we will discuss it more in the coming blogs.</p>
<p><figure id="attachment_675" aria-describedby="caption-attachment-675" style="width: 450px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="675" data-permalink="https://turbolab.in/text-classification-using-machine-learning/1_wvgsubijsbt5ls_5y-vshq/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?fit=1000%2C1204&amp;ssl=1" data-orig-size="1000,1204" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="outline" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?fit=249%2C300&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?fit=800%2C964&amp;ssl=1" class="wp-image-675" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?resize=450%2C542&#038;ssl=1" alt="" width="450" height="542" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?resize=249%2C300&amp;ssl=1 249w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?resize=768%2C925&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?resize=850%2C1024&amp;ssl=1 850w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?resize=980%2C1180&amp;ssl=1 980w, 
https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?resize=480%2C578&amp;ssl=1 480w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wVgSuBIjSBT5lS_5Y-vSHQ.png?w=1000&amp;ssl=1 1000w" sizes="(max-width: 450px) 100vw, 450px" /><figcaption id="caption-attachment-675" class="wp-caption-text">outline</figcaption></figure></p>
<p class="graf graf--p">In <b>ML/DL</b>, there are models that fall into different categories like supervised, unsupervised &amp; reinforcement learning. In this tutorial, we will discuss Supervised learning which involves an output label associated with each instance in the data-set.</p>
<p><figure id="attachment_674" aria-describedby="caption-attachment-674" style="width: 711px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="674" data-permalink="https://turbolab.in/text-classification-using-machine-learning/1_wmwkg_y6jvzu4sg3xhuazq/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wmWKg_y6jvZU4sg3XhUAZQ.png?fit=711%2C335&amp;ssl=1" data-orig-size="711,335" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Supervised Learning Model Flow Chart" data-image-description="" data-image-caption="&lt;p&gt;Supervised Learning Model Flow Chart&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wmWKg_y6jvZU4sg3XhUAZQ.png?fit=300%2C141&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wmWKg_y6jvZU4sg3XhUAZQ.png?fit=711%2C335&amp;ssl=1" class="size-full wp-image-674" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wmWKg_y6jvZU4sg3XhUAZQ.png?resize=711%2C335&#038;ssl=1" alt="" width="711" height="335" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wmWKg_y6jvZU4sg3XhUAZQ.png?w=711&amp;ssl=1 711w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wmWKg_y6jvZU4sg3XhUAZQ.png?resize=300%2C141&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/1_wmWKg_y6jvZU4sg3XhUAZQ.png?resize=480%2C226&amp;ssl=1 480w" sizes="(max-width: 711px) 100vw, 711px" /><figcaption id="caption-attachment-674" class="wp-caption-text">Supervised Learning Model Flow Chart</figcaption></figure></p>
<p><strong>Text (Document) Classification</strong>/<strong>Text (Document) Categorization</strong> is one of the important and typical tasks in supervised <b>ML</b>. This technique allows machines to understand text and categorize it into known, organized groups.</p>
<p>In this post, we will look at how classification of a document dataset can be approached with supervised <strong>ML</strong> algorithms.</p>
<p class="graf graf--p">Some of the <strong>ML</strong> algorithms are:</p>
<ul class="postList">
<li class="graf graf--li"><strong>Naive Bayes.</strong></li>
<li class="graf graf--li"><strong>Decision Trees.</strong></li>
<li class="graf graf--li"><strong>Logistic Regression <em class="markup--em markup--li-em">(Linear Model)</em>.</strong></li>
<li class="graf graf--li"><strong>Support Vector Machines <em class="markup--em markup--li-em">(SVM)</em>.</strong></li>
<li class="graf graf--li"><strong>Random Forest.</strong></li>
<li class="graf graf--li"><strong>K-Means Clustering.</strong></li>
<li class="graf graf--li"><strong>K-Nearest Neighbour.</strong></li>
<li class="graf graf--li"><strong>Gaussian Mixture Model.</strong></li>
<li class="graf graf--li"><strong>Hidden Markov Model. </strong><em>et cetera</em></li>
</ul>
<p class="graf graf--p">Among these <strong>ML</strong> Algorithms, we will discuss how the <strong class="markup--strong markup--p-strong">Naive Bayes</strong>, <strong class="markup--strong markup--p-strong">Logistic Regression</strong> and <strong class="markup--strong markup--p-strong">SVM</strong> classifier models perform on the data-set feature vectors.</p>
<h2>Dataset</h2>
<p><figure id="attachment_681" aria-describedby="caption-attachment-681" style="width: 613px" class="wp-caption alignnone"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="681" data-permalink="https://turbolab.in/text-classification-using-machine-learning/categories/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/categories.jpg?fit=613%2C691&amp;ssl=1" data-orig-size="613,691" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1634302468&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="categories" data-image-description="" data-image-caption="&lt;p&gt;news dataset&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/categories.jpg?fit=266%2C300&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/categories.jpg?fit=613%2C691&amp;ssl=1" class="wp-image-681 size-full" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/categories.jpg?resize=613%2C691&#038;ssl=1" alt="news dataset" width="613" height="691" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/categories.jpg?w=613&amp;ssl=1 613w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/categories.jpg?resize=266%2C300&amp;ssl=1 266w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/categories.jpg?resize=480%2C541&amp;ssl=1 480w" sizes="(max-width: 613px) 100vw, 613px" /><figcaption id="caption-attachment-681" class="wp-caption-text">news dataset</figcaption></figure></p>
<p>The dataset is organized into the above 10 categories, each with 1000 entries, and has <strong>content</strong> and <strong>label</strong> as its two columns. We will refer to this dataset as a dataframe (<strong>df</strong>) in the following code snippets.</p>
<p><figure id="attachment_684" aria-describedby="caption-attachment-684" style="width: 512px" class="wp-caption alignnone"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="684" data-permalink="https://turbolab.in/text-classification-using-machine-learning/datasample/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?fit=512%2C442&amp;ssl=1" data-orig-size="512,442" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1634309405&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="datasample" data-image-description="" data-image-caption="&lt;p&gt;dataset&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?fit=300%2C259&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?fit=512%2C442&amp;ssl=1" class="size-full wp-image-684" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?resize=512%2C442&#038;ssl=1" alt="dataset" width="512" height="442" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?w=512&amp;ssl=1 512w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?resize=300%2C259&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/datasample.jpg?resize=480%2C414&amp;ssl=1 480w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption id="caption-attachment-684" class="wp-caption-text">dataset</figcaption></figure></p>
<h2 class="graf graf--h3">Data Cleaning</h2>
<p class="graf graf--p">Pre-processing of data will have an impact on the output ie., the accuracy, performance of the model. Some of the data-cleaning steps are as follows</p>
<ul class="postList">
<li class="graf graf--li">Removing Stop Words. (<strong class="markup--strong markup--li-strong">NLTK</strong>)</li>
<li class="graf graf--li">Performing Stemming on the text. (<strong class="markup--strong markup--li-strong">NLTK</strong>)</li>
<li class="graf graf--li">Removing special characters &amp; extra spaces or keeping only Alpha-Numeric characters in the text.</li>
</ul>
<pre class="graf graf--pre"><strong># NLTK python module for stemming and stopwords removal</strong>
<strong>from nltk.stem.snowball import SnowballStemmer</strong>
<strong>from nltk.corpus import stopwords</strong>
<strong>import string, re</strong></pre>
<pre class="graf graf--pre"><strong>stemmer = SnowballStemmer('english') # stemmer</strong>
<strong>t = str.maketrans(dict.fromkeys(string.punctuation)) # special char removal</strong></pre>
<pre class="graf graf--pre"><strong>def clean_text(text):  </strong>
<strong>    ## Remove Punctuation</strong>
<strong>    text = text.translate(t) </strong>
<strong>    text = text.split()</strong></pre>
<pre class="graf graf--pre"><strong>    ## Remove stop words</strong>
<strong>    stops = set(stopwords.words("english"))</strong>
<strong>    text = [stemmer.stem(w) for w in text if not w in stops]</strong>
    
<strong>    text = " ".join(text)</strong>
<strong>    text = re.sub(' +',' ', text) # extra consecutive space removal </strong>
<strong>    return text

df["content"] = df["content"].apply(clean_text)
</strong></pre>
<p>This data-cleaning part is optional &#8211; you can test the model accuracy with and without it. Note that removing stop words and performing stemming can strip contextual meaning from the data.</p>
<p><b>Stemming removes or stems the last few characters of a word</b>, often leading to meaningless words. Lemmatization considers the context and converts the word to its meaningful base form, which is called Lemma.</p>
<p><figure id="attachment_688" aria-describedby="caption-attachment-688" style="width: 380px" class="wp-caption alignnone"><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="688" data-permalink="https://turbolab.in/text-classification-using-machine-learning/lemmvsstem/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem.jpg?fit=380%2C82&amp;ssl=1" data-orig-size="380,82" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1634315532&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="lemmVsStem" data-image-description="" data-image-caption="&lt;p&gt;lemmatization Vs Stemming&lt;/p&gt;
" data-medium-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem.jpg?fit=300%2C65&amp;ssl=1" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem.jpg?fit=380%2C82&amp;ssl=1" class="size-full wp-image-688" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem.jpg?resize=380%2C82&#038;ssl=1" alt="" width="380" height="82" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem.jpg?w=380&amp;ssl=1 380w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem.jpg?resize=300%2C65&amp;ssl=1 300w" sizes="(max-width: 380px) 100vw, 380px" /><figcaption id="caption-attachment-688" class="wp-caption-text">lemmatization Vs Stemming</figcaption></figure></p>
<p>We have performed the stemming and stop word removal on the <strong>df</strong> before the data transformation process.</p>
<h2>Data Transformation</h2>
<p class="graf graf--p">Transforming the data into feature vectors with the following methods</p>
<ul class="postList">
<li class="graf graf--li">Count Vectorization.</li>
<li class="graf graf--li">TF-IDF Word Vectorization.</li>
<li class="graf graf--li">TF-IDF N-Gram Vectorization.</li>
</ul>
<p>We recommend you go through these feature extraction methods which are explained in detail in one of our <strong><a href="https://turbolab.in/feature-extraction-in-natural-language-processing-nlp/">blogs.</a></strong></p>
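<p>As a quick, hypothetical illustration of what these vectorizers produce (a toy corpus, not the article&#8217;s dataset): each document becomes one row of a sparse matrix, with one column per word or n-gram.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat", "the dog sat", "the dog barked"]

# count vectorization: one column per unique word
count_vect = CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")
counts = count_vect.fit_transform(corpus)
print(sorted(count_vect.vocabulary_))  # 5 unique words -> 5 columns
print(counts.shape)                    # (3, 5): 3 documents x 5 features

# n-gram TF-IDF: columns are 2- and 3-word sequences instead of single words
tfidf_ngram = TfidfVectorizer(analyzer="word", ngram_range=(2, 3))
X = tfidf_ngram.fit_transform(corpus)
print(X.shape)
```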
<p>The dataset (<strong>df</strong>) is split into train and validation samples in a 75/25% ratio by sklearn&#8217;s <strong>train_test_split</strong> function.</p>
<p>Code snippet to transform the data into vectors using the <strong>scikit-learn</strong> (sklearn) module:</p>
<pre class="graf graf--pre"><strong>from sklearn import model_selection, preprocessing</strong>
<strong>from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer</strong></pre>
<pre class="graf graf--pre"><strong>'''Assume df is the dataset with columns "content" and "label"'''</strong></pre>
<pre class="graf graf--pre"><strong># split the data into training and validation</strong>
<strong>train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['content'], df['label'])</strong></pre>
<pre class="graf graf--pre"><strong># label encode the target variable </strong>
<strong>encoder = preprocessing.LabelEncoder()</strong>
<strong>train_y = encoder.fit_transform(train_y)</strong>
<strong>valid_y = encoder.transform(valid_y) # transform, not fit_transform, so validation labels reuse the training encoding</strong></pre>
<pre class="graf graf--pre"><strong># count vectorization </strong>
<strong>count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')</strong>
<strong>count_vect.fit(df['content'])</strong>
<strong>xtrain_count = count_vect.transform(train_x)</strong>
<strong>xvalid_count = count_vect.transform(valid_x)</strong></pre>
<pre class="graf graf--pre"><strong># word level tf-idf vectorization</strong>
<strong>tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)</strong>
<strong>tfidf_vect.fit(df['content'])</strong>
<strong>xtrain_tfidf = tfidf_vect.transform(train_x)</strong>
<strong>xvalid_tfidf = tfidf_vect.transform(valid_x)</strong></pre>
<pre class="graf graf--pre"><strong># ngram level tf-idf vectorization</strong>
<strong>tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000)</strong>
<strong>tfidf_vect_ngram.fit(df['content'])</strong>
<strong>xtrain_tfidf_ngram = tfidf_vect_ngram.transform(train_x)</strong>
<strong>xvalid_tfidf_ngram = tfidf_vect_ngram.transform(valid_x)</strong></pre>
<h2>Training with Naive Bayes, Logistic Regression, and SVM</h2>
<p class="graf graf--p">We use Naive Bayes, Logistic Regression, and SVM algorithms to train the data-set feature vectors to form classifier models that are used for prediction.</p>
<p>From the data transformation snippet above, we have the train and validation feature matrices along with their encoded labels. We will use them here to fit each classifier and evaluate it on the validation sample.</p>
<p>The report_generation function below is shared by the three ML algorithms: it fits the classifier, predicts on the validation data, prints the model&#8217;s accuracy and returns a classification report.</p>
<pre class="graf graf--pre"><strong>from sklearn import </strong><strong class="markup--strong markup--pre-strong">linear_model, naive_bayes, svm, metrics
</strong><strong>from sklearn.metrics import classification_report

<span class="pl-s1">target_names</span> <span class="pl-c1">=</span> <span class="pl-en">list</span>(<span class="pl-s1">encoder</span>.<span class="pl-s1">classes_</span>) <span class="pl-c"># output labels for report generation</span>
</strong></pre>
<pre class="graf graf--pre"><strong>def report_generation(classifier, train_data, valid_data, train_y, valid_y):</strong>
<strong>   classifier.fit(train_data, train_y)</strong>
<strong>   predictions = classifier.predict(valid_data)</strong>
<strong>   print("Accuracy :", metrics.accuracy_score(predictions, valid_y))</strong>
<strong>   report = classification_report(valid_y, predictions, output_dict=True, target_names=target_names)</strong>
<strong>   return report</strong></pre>
<h3>Naive Bayes</h3>
<pre class="graf graf--pre"><strong class="markup--strong markup--pre-strong"># Naive Bayes</strong>
<strong>classifier = naive_bayes.MultinomialNB()
report = report_generation(classifier, xtrain_count, xvalid_count, train_y, valid_y)</strong>
<strong class="markup--strong markup--pre-strong">print("NB Count Vectorizer Report :", report['weighted avg'])</strong>

<strong>#</strong> <strong class="markup--strong markup--pre-strong">Results</strong>
<strong class="markup--strong markup--pre-strong">Accuracy</strong> : 0.9436
<strong class="markup--strong markup--pre-strong">NB Count Vectorizer Report</strong> : {'precision': 0.9448178637882411, 'recall': 0.9436, 'f1-score': 0.9434664656369504, 'support': 2500}

<strong>report = report_generation(classifier, xtrain_tfidf, xvalid_tfidf, train_y, valid_y)
print("NB TFIDF-Word Report :", report['weighted avg'])</strong>

<strong>#</strong> <strong class="markup--strong markup--pre-strong">Results</strong>
<strong class="markup--strong markup--pre-strong">Accuracy</strong> : 0.9416
<strong class="markup--strong markup--pre-strong">NB Count Vectorizer Report</strong> : {'precision': 0.9430346010709252, 'recall': 0.9416, 'f1-score': 0.9416037073783431, 'support': 2500}</pre>
<pre class="graf graf--pre"><strong>report = report_generation(classifier, xtrain_tfidf_ngram, xvalid_tfidf_ngram, train_y, valid_y)
print("NB TFIDF-NGram Report :", report['weighted avg'])</strong>

<strong>#</strong> <strong class="markup--strong markup--pre-strong">Results</strong>
<strong class="markup--strong markup--pre-strong">Accuracy</strong> : 0.9208
<strong class="markup--strong markup--pre-strong">NB Count Vectorizer Report</strong> : {'precision': 0.9233051466162964, 'recall': 0.9208, 'f1-score': 0.9206511260527037, 'support': 2500}</pre>
<h3>Logistic Regression</h3>
<pre class="graf graf--pre"><strong class="markup--strong markup--pre-strong"># Logistic Regression </strong></pre>
<pre class="graf graf--pre"><strong>classifier = linear_model.LogisticRegression()    
report = report_generation(classifier, xtrain_count, xvalid_count, train_y, valid_y)</strong>    
<strong class="markup--strong markup--pre-strong">print("LogisticRegression Count Vectorizer Report :", report['weighted avg'])

</strong><strong># Results</strong>
<strong>Accuracy</strong> : 0.9804
<strong>LogisticRegression Count Vectorizer Report</strong> : {'precision': 0.9806682334322502, 'recall': 0.9804, 'f1-score': 0.9804527264151257, 'support': 2500}

<strong>report = report_generation(classifier, xtrain_tfidf, xvalid_tfidf, train_y, valid_y)</strong> 
<strong class="markup--strong markup--pre-strong">print("LogisticRegression TFIDF-Word Report :", report['weighted avg'])

# Results
</strong><strong>Accuracy</strong> : 0.9792
<strong>LogisticRegression TFIDF-Word Report</strong> : {'precision': 0.9794911617869886, 'recall': 0.9792, 'f1-score': 0.9792657461379974, 'support': 2500}</pre>
<pre class="graf graf--pre"><strong>report = report_generation(classifier, xtrain_tfidf_ngram, xvalid_tfidf_ngram, train_y, valid_y)</strong>    
<strong class="markup--strong markup--pre-strong">print("LogisticRegression TFIDF-NGram Report :", report['weighted avg'])

# Results
</strong><strong>Accuracy</strong> : 0.932
<strong>LogisticRegression TFIDF-NGram Report</strong> : {'precision': 0.9329064009056843, 'recall': 0.932, 'f1-score': 0.9320786137751711, 'support': 2500}</pre>
<h3>SVM</h3>
<pre class="graf graf--pre"><strong class="markup--strong markup--pre-strong"># Support Vector Machines</strong>
    
<strong>classifier = svm.SVC(gamma="scale")    
report = report_generation(classifier, xtrain_count, xvalid_count, train_y, valid_y)</strong>    
<strong class="markup--strong markup--pre-strong">print("SVM Count Vectorizer Report :", report['weighted avg'])</strong> 
<strong class="markup--strong markup--pre-strong">
# Results</strong>
<strong class="markup--strong markup--pre-strong">Accuracy</strong> : 0.9668
<strong class="markup--strong markup--pre-strong">SVM Count Vectorizer Report</strong> : {'precision': 0.9687847838287942, 'recall': 0.9668, 'f1-score': 0.9672306318670637, 'support': 2500}</pre>
<pre class="graf graf--pre"><strong>report = report_generation(classifier, xtrain_tfidf, xvalid_tfidf, train_y, valid_y)  </strong>  
<strong class="markup--strong markup--pre-strong">print("SVM TFIDF-Word Report :", report['weighted avg'])</strong> 
<strong class="markup--strong markup--pre-strong">
# Results
Accuracy</strong> : 0.9804
<strong class="markup--strong markup--pre-strong">SVM TFIDF-Word Report</strong> : {'precision': 0.980766234757573, 'recall': 0.9804, 'f1-score': 0.9804795388691244, 'support': 2500}</pre>
<pre class="graf graf--pre"><strong>report = report_generation(classifier, xtrain_tfidf_ngram, xvalid_tfidf_ngram, train_y, valid_y)</strong> 
<strong class="markup--strong markup--pre-strong">print("SVM TFIDF-NGram Report :", report['weighted avg'])

</strong><strong class="markup--strong markup--pre-strong"># Results</strong>
<strong class="markup--strong markup--pre-strong">Accuracy</strong> : 0.9304
<strong class="markup--strong markup--pre-strong">SVM TFIDF-NGram Report</strong> : {'precision': 0.9324797933370057, 'recall': 0.9304, 'f1-score': 0.9306949638900389, 'support': 2500}</pre>
<h2>Conclusion</h2>
<p>SVM with the TF-IDF word vectorizer and Logistic Regression with the count vectorizer give better accuracy than the other ML algorithms tested.</p>
<p>We also recommend training the models without the data-cleaning step, following the same approach shown above, to check which ML algorithm then works better.</p>
<p><strong>Disclaimer:</strong> We cannot say which model is best here &#8211; it all depends on your data. So how do we decide which ML algorithm suits our data?</p>
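<p>Before reaching for a dedicated library, one simple way to compare candidate models on your own data is k-fold cross-validation. The sketch below uses a synthetic dataset purely for illustration, and swaps in GaussianNB for MultinomialNB because the synthetic features can be negative:</p>

```python
# Compare candidate classifiers with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

results = {}
for name, model in [("NaiveBayes", GaussianNB()),
                    ("LogisticRegression", LogisticRegression(max_iter=1000)),
                    ("SVM", SVC(gamma="scale"))]:
    scores = cross_val_score(model, X, y, cv=5)   # 5 held-out accuracy scores
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

The model with the highest mean cross-validated score on your own data is usually the safer choice than any single train/validation split.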
<p>Selecting the best model for your <strong>ML</strong> problem is definitely a difficult task. There is an awesome python library called <strong>Lazy Predict</strong> which helps to understand which models work better for your data without any parameter tuning. Check out the documentation <strong><a href="https://lazypredict.readthedocs.io/en/latest/">here</a></strong>. In the coming posts, we will discuss the <strong>Lazy Predict</strong> python module with some examples.</p>
<p>The post <a href="https://turbolab.in/text-classification-using-machine-learning/">Text Classification using Machine Learning</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/text-classification-using-machine-learning/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">601</post-id>	</item>
	</channel>
</rss>
