<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>nltk Archives - Turbolab Technologies</title>
	<atom:link href="https://turbolab.in/tag/nltk/feed/" rel="self" type="application/rss+xml" />
	<link>https://turbolab.in/tag/nltk/</link>
	<description>Big Data and News Analysis Startup in Kochi</description>
	<lastBuildDate>Fri, 05 Aug 2022 14:34:41 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://i0.wp.com/turbolab.in/wp-content/uploads/2018/03/turbo_black_trans-space.png?fit=32%2C32&#038;ssl=1</url>
	<title>nltk Archives - Turbolab Technologies</title>
	<link>https://turbolab.in/tag/nltk/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">98237731</site>	<item>
		<title>Entity Linking &#038; Disambiguation using REL</title>
		<link>https://turbolab.in/entity-linking-disambiguation-using-rel/</link>
					<comments>https://turbolab.in/entity-linking-disambiguation-using-rel/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Tue, 12 Jul 2022 07:02:27 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[entity linking]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[rel]]></category>
		<category><![CDATA[spacy]]></category>
		<category><![CDATA[wikifier]]></category>
		<category><![CDATA[wikipedia]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=907</guid>

					<description><![CDATA[<p>Entity extraction, also known as Named Entity Recognition(NER), is an information extraction process that extracts entities from unstructured text and then classifies them into predefined categories such as people, organizations, places, products, date, time, money, phone numbers and so on. The several terabytes of unstructured text data, that comes from documents, web pages, and social [&#8230;]</p>
<p>The post <a href="https://turbolab.in/entity-linking-disambiguation-using-rel/">Entity Linking &amp; Disambiguation using REL</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400">Entity extraction, also known as </span><em><b>Named Entity Recognition (NER)</b></em><span style="font-weight: 400">, is an information extraction process that extracts entities from unstructured text and then classifies them into predefined categories such as people, organizations, places, products, dates, times, monetary values, phone numbers, and so on. The terabytes of unstructured text that come from documents, web pages, and social media can thus be transformed into structured entities that help analysts query the data and generate insightful reports.</span></p>
<p><span style="font-weight: 400">spaCy provides different models in various languages to perform NER and NLP-related tasks. Building a custom NER model using spaCy has been explained in one of our blogs. You can check out the link</span> <strong><a href="https://turbolab.in/build-a-custom-ner-model-using-spacy-3-0/">here</a></strong>.</p>
<p><span style="font-weight: 400">Now, let’s look into the entity extraction from a random news article using spaCy and Flair:</span></p>
<blockquote><p><em>Defending champion Novak Djokovic battled back from two sets to love down to defeat Jannik Sinner and reach his 11th Wimbledon semi-final on Tuesday. Djokovic triumphed 5-7, 2-6, 6-3, 6-2, 6-2 and will face Britain&#8217;s Cameron Norrie of Belgium for a place in Sunday&#8217;s final. It was the seventh time in the Serb&#8217;s career that he had recovered from two sets to love at the Slams. &#8220;Huge congrats to Jannik for a big fight, he&#8217;s so mature for his age, he has plenty of time ahead of him,&#8221; said Djokovic.</em></p></blockquote>
<h5>Entity Extraction using spaCy:</h5>
<blockquote><p><em><strong>import spacy</strong></em></p>
<p><em><strong>nlp = spacy.load('en_core_web_lg')  # load the large English spaCy model</strong></em></p>
<p><em><strong>ner_ent = {'person': [], 'norp': [], 'fac': [], 'org': [], 'gpe': [], 'loc': [], 'product': [], 'event': [], 'work_of_art': [], 'law': [], 'language': [], 'date': [], 'time': [], 'percent': [], 'money': [], 'quantity': [], 'ordinal': [], 'cardinal': []}</strong></em></p>
<p><em><strong># content holds the text of the news article quoted above</strong></em><br />
<em><strong>doc = nlp(content)</strong></em><br />
<em><strong>for entity in doc.ents:</strong></em><br />
<em><strong>    if entity.label_.lower() in ner_ent:</strong></em><br />
<em><strong>        ner_ent[entity.label_.lower()].append(entity.text)</strong></em></p>
<p><em><strong>print(ner_ent)</strong></em></p>
<p><em><strong># output</strong></em></p>
<p><em><strong>{&#8216;person&#8217;: [&#8216;Novak Djokovic&#8217;, &#8216;Jannik Sinner&#8217;, &#8216;Cameron Norrie&#8217;, &#8216;Jannik&#8217;, &#8216;Djokovic&#8217;, &#8216;Novak Djokovic&#8217;, &#8216;Jannik Sinner&#8217;, &#8216;Cameron Norrie&#8217;, &#8216;Jannik&#8217;, &#8216;Djokovic&#8217;], &#8216;norp&#8217;: [&#8216;Serb&#8217;, &#8216;Serb&#8217;], &#8216;fac&#8217;: [], &#8216;org&#8217;: [], &#8216;gpe&#8217;: [&#8216;Britain&#8217;, &#8216;Belgium&#8217;, &#8216;Britain&#8217;, &#8216;Belgium&#8217;], &#8216;loc&#8217;: [], &#8216;product&#8217;: [], &#8216;event&#8217;: [&#8216;Wimbledon&#8217;, &#8216;Wimbledon&#8217;], &#8216;work_of_art&#8217;: [], &#8216;law&#8217;: [], &#8216;language&#8217;: [], &#8216;date&#8217;: [&#8216;Tuesday&#8217;, &#8216;Sunday&#8217;, &#8216;Tuesday&#8217;, &#8216;Sunday&#8217;], &#8216;time&#8217;: [], &#8216;percent&#8217;: [], &#8216;money&#8217;: [], &#8216;quantity&#8217;: [], &#8216;ordinal&#8217;: [&#8217;11th&#8217;, &#8216;seventh&#8217;, &#8217;11th&#8217;, &#8216;seventh&#8217;], &#8216;cardinal&#8217;: [&#8216;two&#8217;, &#8216;5&#8217;, &#8216;2-6&#8217;, &#8216;6-3&#8217;, &#8216;6&#8217;, &#8216;6-2&#8217;, &#8216;two&#8217;, &#8216;two&#8217;, &#8216;5&#8217;, &#8216;2-6&#8217;, &#8216;6-3&#8217;, &#8216;6&#8217;, &#8216;6-2&#8217;, &#8216;two&#8217;]}</strong></em></p></blockquote>
<h5>Entity Extraction using Flair:</h5>
<blockquote><p><em><strong>from flair.data import Sentence</strong></em><br />
<em><strong>from flair.models import SequenceTagger</strong></em></p>
<p><em><strong>ner_ent = {'per': [], 'org': [], 'loc': [], 'misc': []}</strong></em></p>
<p><em><strong># make a sentence</strong></em><br />
<em><strong>sentence = Sentence(content)</strong></em></p>
<p><em><strong># load the NER tagger</strong></em><br />
<em><strong>tagger = SequenceTagger.load('ner')</strong></em></p>
<p><em><strong># run NER over sentence</strong></em><br />
<em><strong>tagger.predict(sentence)</strong></em></p>
<p><em><strong>print('The following NER tags are found:')</strong></em><br />
<em><strong># iterate over each entity and group its text under the predicted tag</strong></em><br />
<em><strong>for entity in sentence.get_spans('ner'):</strong></em><br />
<em><strong>    label = entity.labels[0].value.lower()</strong></em><br />
<em><strong>    if label in ner_ent:</strong></em><br />
<em><strong>        ner_ent[label].append(entity.text)</strong></em></p>
<p><em><strong>print(ner_ent)</strong></em></p>
<p><em><strong># output</strong></em></p>
<p><em><strong>The following NER tags are found:</strong></em></p>
<p><em><strong>{&#8216;per&#8217;: [&#8216;George Washington&#8217;, &#8216;Novak Djokovic&#8217;, &#8216;Jannik Sinner&#8217;, &#8216;Djokovic&#8217;, &#8216;Cameron Norrie&#8217;, &#8216;Jannik&#8217;, &#8216;Djokovic&#8217;], &#8216;org&#8217;: [], &#8216;loc&#8217;: [&#8216;Washington&#8217;, &#8216;Britain&#8217;, &#8216;Belgium&#8217;], &#8216;misc&#8217;: [&#8216;Wimbledon&#8217;, &#8216;Serb&#8217;, &#8216;Slams&#8217;]}</strong></em></p></blockquote>
<p>Flair&#8217;s default <strong>ner</strong> model used above gives us only 4 entity types (PER, ORG, LOC and MISC), whereas the spaCy model gives 18 entity types.</p>
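<p>Both snippets above collect entity texts into a dictionary keyed by label, and the printed lists can contain the same mention more than once. The following is a minimal sketch, not part of either library, of a helper that groups (text, label) pairs and drops repeats while keeping the order of first appearance. The function name <strong>group_entities</strong> and the sample pairs are illustrative assumptions; with spaCy, the pairs could be built as [(ent.text, ent.label_) for ent in doc.ents].</p>

```python
def group_entities(pairs, labels):
    """Group (text, label) pairs by lower-cased label, skipping repeated texts."""
    grouped = {label: [] for label in labels}
    for text, label in pairs:
        key = label.lower()
        # Ignore labels we are not tracking and texts already collected.
        if key in grouped and text not in grouped[key]:
            grouped[key].append(text)
    return grouped


# Illustrative input, hand-written to resemble the NER output shown above.
pairs = [
    ("Novak Djokovic", "PERSON"),
    ("Britain", "GPE"),
    ("Novak Djokovic", "PERSON"),
    ("Wimbledon", "EVENT"),
]
result = group_entities(pairs, ["person", "gpe", "event"])
print(result)
```

<p>Keeping the grouping logic separate from the model call makes it easy to reuse the same helper for spaCy and Flair output.</p>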
<h2>Entity Linking &amp; Disambiguation</h2>
<p>Entity Linking is the process of linking extracted entities to entries in a target knowledge base. Here, we map the entities to Wikipedia links or page titles, which is why the process is also called Wikification. Entity linking can also be seen as entity validation: the entities extracted by the spaCy or Flair models are validated against a third-party knowledge base.</p>
<p>However, entity linking is intricate because of entity ambiguity and name variants. For example, the word <strong>Amazon</strong> can refer to an organization or to a rainforest.</p>
<p>Let&#8217;s have a detailed discussion on Entity Linking &amp; Entity Disambiguation.</p>
<h5>News Article Clip:</h5>
<blockquote><p>Deforestation in Brazil&#8217;s Amazon rainforest reached a record high for the first six months of the year, as an area five times the size of New York City was destroyed, preliminary government data showed on Friday.</p></blockquote>
<h5>spaCy Output:</h5>
<blockquote><p>&#8216;org&#8217;: [&#8216;Amazon&#8217;], &#8216;gpe&#8217;: [&#8216;Brazil&#8217;, &#8216;New York City&#8217;]</p></blockquote>
<p>Here, <strong>Amazon</strong> is detected as an organization, although the article refers to the rainforest.</p>
<h5>Flair Output:</h5>
<blockquote><p>&#8216;loc&#8217;: [&#8216;Brazil&#8217;, &#8216;Amazon&#8217;, &#8216;New York City&#8217;]</p></blockquote>
<p><span style="font-weight: 400">Here, </span><b>Amazon</b><span style="font-weight: 400"> is detected as a location. The two models disagree on the same mention, so the ambiguity problem is clearly visible here; it can be addressed with the Radboud Entity Linker (REL).</span></p>
<h5><strong>REL</strong> <strong>Output</strong>:</h5>
<p><img data-recalc-dims="1" fetchpriority="high" decoding="async" data-attachment-id="908" data-permalink="https://turbolab.in/entity-linking-disambiguation-using-rel/rel/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?fit=1430%2C266&amp;ssl=1" data-orig-size="1430,266" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="rel" data-image-description="" data-image-caption="&lt;p&gt;REL&lt;/p&gt;
" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?fit=800%2C148&amp;ssl=1" class="size-full wp-image-908" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=800%2C149&#038;ssl=1" alt="" width="800" height="149" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?w=1430&amp;ssl=1 1430w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=300%2C56&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=768%2C143&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=1024%2C190&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=1080%2C201&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=1280%2C238&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=980%2C182&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel.png?resize=480%2C89&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p><a href="https://github.com/informagi/REL"><strong>Radboud Entity Linker (REL)</strong></a> <span style="font-weight: 400">deals with the tasks of Entity Linking and Entity Disambiguation. One can use the public API provided by REL or install it via Docker or from source, following the instructions in the documentation. By default, </span><b>REL</b><span style="font-weight: 400"> uses Flair to extract entities; you can replace Flair with spaCy. REL also provides pre-trained case-sensitive and case-insensitive models with an F1 score of almost 93%.</span></p>
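<p>As a rough sketch of calling the public API mentioned above, the snippet below builds a JSON POST request with the Python standard library. The endpoint URL and the payload keys (text and spans) follow the example in the REL README at the time of writing and may change, so treat them as assumptions; the helper names <strong>build_rel_request</strong> and <strong>link_entities</strong> are hypothetical. Only the request is constructed here; the network call sits in a separate function that is not invoked.</p>

```python
import json
import urllib.request

# Assumption: public endpoint as listed in the REL README; it may change.
REL_API_URL = "https://rel.cs.ru.nl/api"


def build_rel_request(text, spans=None):
    """Build a POST request; an empty spans list asks REL to detect mentions itself."""
    payload = {"text": text, "spans": spans or []}
    return urllib.request.Request(
        REL_API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def link_entities(text, timeout=30):
    """Send the text to the API and return the parsed JSON response."""
    with urllib.request.urlopen(build_rel_request(text), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_rel_request("Deforestation in Brazil's Amazon rainforest reached a record high.")
print(request.get_method(), request.full_url)
print(json.loads(request.data))
```

<p>For larger volumes of text, a local installation is probably a better fit than the shared public endpoint.</p>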
<p><span style="font-weight: 400">The </span><a href="https://pypi.org/project/wikimapper/"><strong>Wikimapper</strong></a> <span style="font-weight: 400">Python library is used to fetch the Wikidata ID for a Wikipedia title. The project maps Wikipedia page titles to Wikidata IDs and vice versa.</span></p>
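<p>A minimal sketch of how a linked Wikipedia title could be turned into a Wikidata ID with Wikimapper is shown below. It assumes that the wikimapper package is installed and that a precomputed index file is available locally; the index path is a placeholder you have to supply. The helper names <strong>normalize_title</strong> and <strong>title_to_wikidata_id</strong> are hypothetical, while WikiMapper and title_to_id come from the library.</p>

```python
def normalize_title(title):
    """Wikipedia page titles use underscores instead of spaces."""
    return title.strip().replace(" ", "_")


def title_to_wikidata_id(title, index_path):
    """Look up the Wikidata ID for a Wikipedia page title.

    index_path is an assumption: the path to a precomputed Wikimapper index.
    """
    # Imported lazily so that normalize_title works without the package installed.
    from wikimapper import WikiMapper

    mapper = WikiMapper(index_path)
    return mapper.title_to_id(normalize_title(title))


print(normalize_title("New York City"))
```

<p>The lookup itself is a local database query, so it works offline once the index has been downloaded or created.</p>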
<p><a href="https://github.com/facebookresearch/BLINK"><b>BLINK</b></a><span style="font-weight: 400">, the entity linking Python library from Facebook Research, uses Wikipedia as the target knowledge base, similar to </span><b>REL</b><span style="font-weight: 400">. However, the BLINK documentation doesn&#8217;t reveal any information regarding entity disambiguation.</span></p>
<p><a href="https://github.com/wetneb/opentapioca"><b>OpenTapioca</b></a><span style="font-weight: 400"> is a simple and fast Named Entity Linking system for Wikidata. A spaCy wrapper of OpenTapioca called</span><a href="https://spacy.io/universe/project/spacyopentapioca"> <b>spaCyOpenTapioca</b></a><span style="font-weight: 400"> is also available for the entity linking process. But the results are not as good as those of REL.</span></p>
<p><span style="font-weight: 400">spaCy includes a pipeline component called</span><a href="https://spacy.io/api/entitylinker"> <b>entitylinker</b></a><span style="font-weight: 400"> for Named Entity Linking and Disambiguation.</span></p>
<h2>Dealing with Disambiguation</h2>
<blockquote><p><span id="w0" class="word annotHilite hasAnnotation underlined">Japan</span><span id="s1" class="space"> </span><span id="w1" class="word hasAnnotation">began</span><span id="s2" class="space"> </span><span id="w2" class="word hasAnnotation">the</span><span id="s3" class="space hasAnnotation"> </span><span id="w3" class="word hasAnnotation">defence</span><span id="s4" class="space hasAnnotation"> </span><span id="w4" class="word hasAnnotation">of</span><span id="s5" class="space"> </span><span id="w5" class="word hasAnnotation">their</span><span id="s6" class="space hasAnnotation"> </span><span id="w6" class="word hasAnnotation">title</span><span id="s7" class="space"> </span><span id="w7" class="word hasAnnotation">with</span><span id="s8" class="space"> </span><span id="w8" class="word hasAnnotation">a</span><span id="s9" class="space"> </span><span id="w9" class="word hasAnnotation">lucky</span><span id="s10" class="space"> </span><span id="w10" class="word hasAnnotation">2-1</span><span id="s11" class="space"> </span><span id="w11" class="word hasAnnotation">win</span><span id="s12" class="space"> </span><span id="w12" class="word hasAnnotation">against</span><span id="s13" class="space"> </span><span id="w13" class="word hasAnnotation underlined">Syria</span><span id="s14" class="space"> </span><span id="w14" class="word hasAnnotation">in</span><span id="s15" class="space"> </span><span id="w15" class="word hasAnnotation">a</span><span id="s16" class="space hasAnnotation"> </span><span id="w16" class="word hasAnnotation">championship</span><span id="s17" class="space hasAnnotation"> </span><span id="w17" class="word hasAnnotation">match</span><span id="s18" class="space"> </span><span id="w18" class="word hasAnnotation">on</span><span id="s19" class="space"> </span><span id="w19" class="word hasAnnotation">Friday</span><span id="s20" class="space"></span><span id="w20" class="word hasAnnotation">.</span></p></blockquote>
<p><span style="font-weight: 400">Using the above statement, we will discuss the different approaches to choosing the appropriate entity in the case of Entity Disambiguation.</span></p>
<h5>Let&#8217;s see how <a href="https://wikifier.org/"><strong>wikifier</strong></a> deals with the disambiguation:</h5>
<p><a href="https://wikifier.org/"><strong>Wikifier</strong></a> <span style="font-weight: 400">doesn&#8217;t rely on an entity extraction model to find mentions; it goes with Parts of Speech (POS) instead.</span></p>
<p><img data-recalc-dims="1" decoding="async" data-attachment-id="911" data-permalink="https://turbolab.in/entity-linking-disambiguation-using-rel/wikifier1/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?fit=1891%2C381&amp;ssl=1" data-orig-size="1891,381" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="wikifier1" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?fit=800%2C161&amp;ssl=1" class="size-full wp-image-911 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=800%2C161&#038;ssl=1" alt="" width="800" height="161" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?w=1891&amp;ssl=1 1891w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=300%2C60&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=768%2C155&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=1024%2C206&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=1080%2C218&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=1280%2C258&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=980%2C197&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?resize=480%2C97&amp;ssl=1 480w, 
https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier1.png?w=1600&amp;ssl=1 1600w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p><span style="font-weight: 400">The entities Syria and Japan are linked to their respective countries’ Wikipedia pages,</span><a href="https://en.wikipedia.org/wiki/Syria"> <b>Syria</b></a><span style="font-weight: 400"> and</span><a href="https://en.wikipedia.org/wiki/Japan"> <b>Japan</b></a><span style="font-weight: 400">. In the context of the above statement, Japan and Syria actually refer to their football teams. Wikifier fetches all the Wikipedia page entities related to the entity and maps the entity with the most link targets.</span></p>
<p><img data-recalc-dims="1" decoding="async" data-attachment-id="912" data-permalink="https://turbolab.in/entity-linking-disambiguation-using-rel/wikifier2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?fit=483%2C671&amp;ssl=1" data-orig-size="483,671" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="wikifier2" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?fit=483%2C671&amp;ssl=1" class="size-full wp-image-912 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?resize=483%2C671&#038;ssl=1" alt="" width="483" height="671" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?w=483&amp;ssl=1 483w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?resize=216%2C300&amp;ssl=1 216w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/wikifier2.png?resize=480%2C667&amp;ssl=1 480w" sizes="(max-width: 483px) 100vw, 483px" /></p>
<p>Wikifier considers the minLinkFrequency parameter to evaluate the score.</p>
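<p>The &#8220;most link targets&#8221; heuristic described above can be sketched as a simple popularity baseline. This is a toy illustration and not Wikifier&#8217;s actual implementation: the link counts are invented for the example, and the <strong>min_link_frequency</strong> argument is only loosely modeled on the minLinkFrequency parameter, under the assumption that rarely used links are ignored.</p>

```python
# Hypothetical counts of how often each mention text links to each Wikipedia page.
LINK_COUNTS = {
    "Japan": {"Japan": 9500, "Japan national football team": 420},
    "Syria": {"Syria": 7200, "Syria national football team": 150},
}


def most_linked_candidate(mention, link_counts, min_link_frequency=1):
    """Return the page that the mention links to most often, or None."""
    candidates = {
        page: count
        for page, count in link_counts.get(mention, {}).items()
        if count >= min_link_frequency  # drop rarely used link targets
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


for mention in ("Japan", "Syria"):
    print(mention, "maps to", most_linked_candidate(mention, LINK_COUNTS))
```

<p>Because the surrounding sentence is never consulted, such a baseline always picks the country page, which matches the behaviour seen in the screenshots.</p>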
<h5>Let&#8217;s see how REL deals with the disambiguation:</h5>
<p>In REL, entity linking decisions depend on the contextual similarity and coherence with the other entity linking decisions in the document. One entity mapping is dependent on the other entities found in the document. You can read the paper <a href="https://arxiv.org/pdf/2006.01969.pdf"><strong>here</strong></a>.</p>
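<p>To make the idea of contextual similarity and coherence more concrete, here is a toy sketch. It is not REL&#8217;s model: REL relies on learned representations, whereas this example uses hand-picked keyword sets (all of them assumptions) and plain set overlap, so it does not reproduce REL&#8217;s output. Each candidate is first scored against the words of the text and then re-scored with a bonus for agreeing with the entities chosen for the other mentions.</p>

```python
import re

# Hypothetical candidate pages with hand-picked keywords, for illustration only.
CANDIDATES = {
    "Japan": {
        "Japan": {"country", "island", "asia", "tokyo"},
        "Japan national football team": {"football", "match", "win", "championship", "title"},
    },
    "Syria": {
        "Syria": {"country", "damascus", "asia", "war"},
        "Syria national football team": {"football", "match", "win", "qualifier"},
    },
}


def tokenize(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def link_mentions(text, candidates):
    context = tokenize(text)
    # Step 1: local decision based on context overlap alone.
    chosen = {
        mention: max(options, key=lambda page: len(context.intersection(options[page])))
        for mention, options in candidates.items()
    }
    # Step 2: one coherence pass that also rewards overlap with the other chosen entities.
    final = {}
    for mention, options in candidates.items():
        others = set()
        for other, page in chosen.items():
            if other != mention:
                others.update(candidates[other][page])
        final[mention] = max(
            options,
            key=lambda page: len(context.intersection(options[page]))
            + len(others.intersection(options[page])),
        )
    return final


sentence = "Japan began the defence of their title with a lucky 2-1 win against Syria in a championship match on Friday."
links = link_mentions(sentence, CANDIDATES)
print(links)
```

<p>The second step is what makes one linking decision depend on the others: a candidate gains score when it shares vocabulary with the entities picked for the remaining mentions.</p>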
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="913" data-permalink="https://turbolab.in/entity-linking-disambiguation-using-rel/rel2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?fit=1435%2C215&amp;ssl=1" data-orig-size="1435,215" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="rel2" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?fit=800%2C120&amp;ssl=1" class="size-full wp-image-913 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=800%2C120&#038;ssl=1" alt="" width="800" height="120" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?w=1435&amp;ssl=1 1435w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=300%2C45&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=768%2C115&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=1024%2C153&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=1080%2C162&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=1280%2C192&amp;ssl=1 1280w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=980%2C147&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel2.png?resize=480%2C72&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p><span style="font-weight: 400">This example doesn&#8217;t have any impact since only two entities are found and the content is a one-liner. Instead of the entity detection method, if we had passed the POS output, the result might have been different.</span></p>
<p>When the entire <a href="https://www.firstpost.com/sports/fifa-world-cup-qualifiers-2022-syria-japan-secure-victories-to-make-it-to-next-round-9694971.html"><strong>article</strong></a> is passed to REL, the results are considerably better. The REL model can now understand the context and relate more entities from the entire article.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="914" data-permalink="https://turbolab.in/entity-linking-disambiguation-using-rel/rel3/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?fit=1135%2C300&amp;ssl=1" data-orig-size="1135,300" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="rel3" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?fit=800%2C212&amp;ssl=1" class="size-full wp-image-914 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=800%2C211&#038;ssl=1" alt="" width="800" height="211" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?w=1135&amp;ssl=1 1135w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=300%2C79&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=768%2C203&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=1024%2C271&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=1080%2C285&amp;ssl=1 1080w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=980%2C259&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/rel3.png?resize=480%2C127&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p><strong>Brazil</strong> and <strong>Dutch</strong> are mapped to their respective football team wiki pages. Mapping <strong>Japan</strong> to its football team is still a mystery, though. LOL.</p>
<h2>Conclusion</h2>
<p><span style="font-weight: 400">Instead of going with the score of the most link targets, REL considers the context and the relationship between the entities detected in the document. With improved mention detection, REL could serve as a strong Entity Disambiguation tool.</span></p>
<p>Last but not least, there is a tool called <a href="https://github.com/SapienzaNLP/extend"><strong>ExtEnD</strong></a> (Extractive Entity Disambiguation), which we have yet to explore. It can be added to the spaCy NLP pipeline.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="915" data-permalink="https://turbolab.in/entity-linking-disambiguation-using-rel/extend/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?fit=665%2C178&amp;ssl=1" data-orig-size="665,178" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="extend" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?fit=665%2C178&amp;ssl=1" class="size-full wp-image-915 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?resize=665%2C178&#038;ssl=1" alt="" width="665" height="178" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?w=665&amp;ssl=1 665w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?resize=300%2C80&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2022/07/extend.png?resize=480%2C128&amp;ssl=1 480w" sizes="(max-width: 665px) 100vw, 665px" /></p>
<p>The output documented by <strong>ExtEnD</strong> looks much better than the REL-generated output. As mentioned above, though, the tool needs to be explored before drawing any conclusion.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/entity-linking-disambiguation-using-rel/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">907</post-id>	</item>
		<item>
		<title>Stemming Vs. Lemmatization with Python NLTK</title>
		<link>https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/</link>
					<comments>https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Fri, 29 Oct 2021 17:00:31 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[lemmatization]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[stemmer]]></category>
		<category><![CDATA[stemming]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=694</guid>

					<description><![CDATA[<p>Stemming and Lemmatization are text/word normalization techniques widely used in text pre-processing. They basically reduce the words to their root form. Here is an example: Let&#8217;s say you have to train the data for classification and you are choosing any vectorizer to transform your data. These vectorizers create a vocabulary(set of unique words) from our [&#8230;]</p>
<p>The post <a href="https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/">Stemming Vs. Lemmatization with Python NLTK</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><strong>Stemming</strong> and <strong>Lemmatization</strong> are text/word normalization techniques widely used in text pre-processing. They basically reduce the words to their root form. Here is an example:</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="707" data-permalink="https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/lemmvsstem2/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem2.jpg?fit=408%2C244&amp;ssl=1" data-orig-size="408,244" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1635515383&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="lemmVsStem2" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem2.jpg?fit=408%2C244&amp;ssl=1" class="size-full wp-image-707 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem2.jpg?resize=408%2C244&#038;ssl=1" alt="" width="408" height="244" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem2.jpg?w=408&amp;ssl=1 408w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem2.jpg?resize=300%2C179&amp;ssl=1 300w" sizes="(max-width: 408px) 100vw, 408px" /></p>
<p>Let&#8217;s say you have to train a classification model and you choose a vectorizer to transform your data. The vectorizer creates a vocabulary (the set of unique words) from the data corpus. <span style="font-weight: 400">By applying stemming/lemmatization techniques, we can reduce the vocabulary size by converting the words to their base forms. </span><span style="font-weight: 400">Inflected variants of the same word then share a single vocabulary entry, which reduces the ambiguity the model has to deal with during training and can yield better results.</span></p>
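<p>The effect on vocabulary size can be illustrated with a small sketch. The lookup table below is a hypothetical, hand-written stand-in for a real stemmer or lemmatizer, and <strong>build_vocabulary</strong> is an illustrative helper rather than part of any vectorizer library.</p>

```python
def build_vocabulary(documents, normalize=None):
    """Collect the set of unique lower-cased tokens, optionally normalized."""
    vocabulary = set()
    for document in documents:
        for token in document.lower().split():
            vocabulary.add(normalize(token) if normalize else token)
    return vocabulary


# Hypothetical lookup standing in for a real stemmer or lemmatizer.
BASE_FORMS = {"playing": "play", "played": "play", "plays": "play"}

documents = ["She plays chess", "He played chess", "They are playing chess"]
raw = build_vocabulary(documents)
reduced = build_vocabulary(documents, normalize=lambda token: BASE_FORMS.get(token, token))
print(len(raw), len(reduced))
```

<p>The three inflected forms of &#8220;play&#8221; collapse into one entry, so the normalized vocabulary is smaller than the raw one.</p>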
<p>In this post, we will walk through practical examples of stemming and lemmatizing words and sentences using the Python <strong>nltk</strong> package.</p>
<h1>Stemming</h1>
<p>Stemming is a rule-based normalization approach: it strips a word&#8217;s affixes to reduce the word to a root form. Stemming is faster than lemmatization because it cuts prefixes (pre-, extra-, in-, im-, ir-, etc.) and suffixes (-ed, -ing, -es, -ity, -ty, -ship, -ness, etc.) without considering the context of the word. <strong>Because it is this aggressive, the output of a stemming algorithm may not be a valid word</strong>.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" data-attachment-id="710" data-permalink="https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/lemmvsstem3/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?fit=487%2C295&amp;ssl=1" data-orig-size="487,295" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;vasista reddy&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1635522855&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="lemmVsStem3" data-image-description="" data-image-caption="" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?fit=487%2C295&amp;ssl=1" class="size-full wp-image-710 aligncenter" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?resize=487%2C295&#038;ssl=1" alt="" width="487" height="295" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?w=487&amp;ssl=1 487w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?resize=300%2C182&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/10/lemmVsStem3.jpg?resize=480%2C291&amp;ssl=1 480w" sizes="(max-width: 487px) 100vw, 487px" /></p>
<p>In the above example, you can see that the outcomes of <strong>badly</strong> and <strong>pharmacies</strong> are invalid words.</p>
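<p>The rule-based idea can be sketched in a few lines of plain Python. This is a toy suffix stripper with an assumed rule list and an assumed minimum stem length. It is not the Porter, Snowball, or Lancaster algorithm, but it shows why cutting suffixes without a dictionary can produce invalid words.</p>

```python
# Toy rules only: suffixes are tried in this order. This is not Porter's algorithm.
SUFFIXES = ("ing", "ed", "es", "ly", "s")
MIN_STEM_LENGTH = 3  # assumed guard so short words such as "is" are left alone


def toy_stem(word):
    """Strip the first matching suffix, ignoring context and dictionary validity."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem_length = len(word) - len(suffix)
        if word.endswith(suffix) and stem_length >= MIN_STEM_LENGTH:
            return word[:stem_length]
    return word


words = ["plays", "playing", "played", "player", "pharmacies", "badly"]
print([toy_stem(w) for w in words])
# ['play', 'play', 'play', 'player', 'pharmaci', 'bad']
```

<p>Even this tiny rule set turns &#8220;pharmacies&#8221; into &#8220;pharmaci&#8221;, because the rule only looks at the characters at the end of the word.</p>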
<h3>Porter Stemmer</h3>
<p>The Porter stemming algorithm (or &#8220;Porter stemmer&#8221;) uses suffix stripping to produce stems. Here is Python code using nltk to create a stemmer object and generate results.</p>
<p>Code Snippet to perform Porter Stemming:</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>from nltk.stem import PorterStemmer</strong></em><br />
<em><strong>words = [&#8220;plays&#8221;, &#8220;playing&#8221;, &#8220;played&#8221;, &#8220;player&#8221;, &#8220;pharmacies&#8221;, &#8220;badly&#8221;]</strong></em><br />
<em><strong>ps = PorterStemmer()</strong></em><br />
<em><strong>print([ps.stem(w) for w in words])</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>[&#8216;play&#8217;, &#8216;play&#8217;, &#8216;play&#8217;, &#8216;player&#8217;, &#8216;pharmaci&#8217;, &#8216;badli&#8217;]</strong></em></p></blockquote>
<p><strong>The Snowball stemming algorithm was introduced to address some of the drawbacks of PorterStemmer.</strong></p>
<h3>Snowball Stemmer</h3>
<p>The Snowball stemming algorithm is also known as the Porter2 stemmer. It is an improved version of the Porter stemmer that resolves a few of the stemming issues discussed above.</p>
<p>Code Snippet to perform Snowball Stemming:</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>from nltk.stem.snowball import SnowballStemmer</strong></em><br />
<em><strong>words = [&#8220;plays&#8221;, &#8220;playing&#8221;, &#8220;played&#8221;, &#8220;player&#8221;, &#8220;pharmacies&#8221;, &#8220;badly&#8221;]</strong></em><br />
<em><strong>ss = SnowballStemmer(language=&#8217;english&#8217;)</strong></em><br />
<em><strong>print([ss.stem(w) for w in words])</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>[&#8216;play&#8217;, &#8216;play&#8217;, &#8216;play&#8217;, &#8216;player&#8217;, &#8216;pharmaci&#8217;, &#8216;bad&#8217;]</strong></em></p></blockquote>
<p>Here, we can see that &#8220;<em><strong>badly</strong></em>&#8221; now yields a valid stem (&#8220;bad&#8221;), but &#8220;<em><strong>pharmacies</strong></em>&#8221; still yields an invalid one (&#8220;pharmaci&#8221;).</p>
<h3>Lancaster Stemmer</h3>
<p><span style="font-weight: 400">Compared to Snowball and Porter, Lancaster is the most aggressive of the three stemming algorithms and tends to over-stem many words. It tries to reduce each word to the shortest stem possible, so unrelated words can collapse to the same stem.</span></p>
<p>Here is an example:</p>
<blockquote><p><em><strong>&#8220;salty&#8221; &#8212;- &#8220;sal&#8221;</strong></em></p>
<p><em><strong>&#8220;sales&#8221; &#8212;- &#8220;sal&#8221;</strong></em></p></blockquote>
<p>Code Snippet to perform Lancaster Stemming:</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>from nltk.stem import LancasterStemmer</strong></em><br />
<em><strong>words = [&#8220;plays&#8221;, &#8220;playing&#8221;, &#8220;played&#8221;, &#8220;player&#8221;, &#8220;pharmacies&#8221;, &#8220;badly&#8221;]</strong></em><br />
<em><strong>ls = LancasterStemmer()</strong></em><br />
<em><strong>print([ls.stem(w) for w in words])</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>[&#8216;play&#8217;, &#8216;play&#8217;, &#8216;play&#8217;, &#8216;play&#8217;, &#8216;pharm&#8217;, &#8216;bad&#8217;]</strong></em></p></blockquote>
<p>As mentioned at the beginning, stemming reduces the vocabulary by collapsing inflected forms of a word into a single stem.</p>
<p>Code snippet to perform tokenization and stemming on a paragraph:</p>
<blockquote><p><em><strong>content = &#8220;China’s Huawei overtook Samsung Electronics as the world’s biggest seller of mobile phones in the second quarter of 2020, shipping 55.8 million devices compared to Samsung’s 53.7 million, according to data from research firm Canalys. While Huawei’s sales fell 5 per cent from the same quarter a year earlier, South Korea’s Samsung posted a bigger drop of 30 per cent, owing to disruption from the coronavirus in key markets such as Brazil, the United States and Europe, Canalys said. Huawei’s overseas shipments fell 27 per cent in Q2 from a year earlier, but the company increased its dominance of the China market which has been faster to recover from COVID-19 and where it now sells over 70 per cent of its phones. “Our business has demonstrated exceptional resilience in these difficult times,” a Huawei spokesman said. “Amidst a period of unprecedented global economic slowdown and challenges, we’re continued to grow and further our leadership position.” Nevertheless, Huawei’s position as number one seller may prove short-lived once other markets recover given it is mainly due to economic disruption, a senior Huawei employee with knowledge of the matter told Reuters. Apple is due to release its Q2 iPhone shipment data on Friday.&#8221;</strong></em></p></blockquote>
<p>The above content will hereafter be used as the input to the code snippets.</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>from nltk.stem import PorterStemmer</strong></em><br />
<em><strong>from nltk.tokenize import word_tokenize</strong></em></p>
<p><em><strong>ps = PorterStemmer()</strong></em></p>
<p><em><strong># Porter Stemmed version</strong></em></p>
<p><em><strong>porteredContent = [ps.stem(word) for word in word_tokenize(content)]</strong></em></p></blockquote>
<p><span style="font-weight: 400">Try testing the above code snippet by replacing the Porter stemmer with Snowball and Lancaster stemmers.</span></p>
<p>Here are some statistics comparing these three stemming algorithms.</p>
<ul>
<li><strong>Length of the content is 1041(without spaces)</strong></li>
<li><strong>Length of the content after Porter Stemmer is 943 which took around 0.00499 seconds to process</strong></li>
<li><strong>Length of the content after Snowball Stemmer is 944 which took around 0.00399 seconds to process</strong></li>
<li><strong>Length of the content after Lancaster Stemmer is 835 which took around 0.00399 seconds to process</strong></li>
</ul>
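<p>The article does not show how these numbers were collected. One way to gather similar statistics is sketched below. The <code>measure</code> helper is an assumption, and <code>str.lower</code> is only a stand-in normalizer so that the sketch runs without nltk. Timings will differ from machine to machine.</p>

```python
import time


def measure(normalize, tokens):
    """Return (total characters after normalization, elapsed seconds)."""
    start = time.perf_counter()
    normalized = [normalize(token) for token in tokens]
    elapsed = time.perf_counter() - start
    return sum(len(token) for token in normalized), elapsed


tokens = "Huawei overtook Samsung as the biggest seller of mobile phones".split()
length, seconds = measure(str.lower, tokens)  # str.lower is a stand-in normalizer
print(f"{length} characters in {seconds:.5f} seconds")
```

<p>Passing <code>ps.stem</code> and the tokens produced by <code>word_tokenize(content)</code> instead would measure the Porter stemmer on the paragraph above.</p>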
<p>As expected, the Lancaster stemmer produces the shortest output because of its aggressive over-stemming. None of the three stemmers discussed above gave us the root word of &#8220;<strong>pharmacies<span style="font-weight: 400">”</span></strong>. Stemming is fast, but it does not always produce a valid word, so we will now move on to lemmatization.</p>
<h1>Lemmatization</h1>
<p>Unlike stemming, which reduces a word to a root form without considering context, lemmatization takes the part of speech (POS) of the word into account and converts the word to its meaningful dictionary (WordNet) form, called the lemma.</p>
<h3>WordNet Lemmatizer</h3>
<p><b>WordNet</b> is a lexical database (a collection of words) that has been used by major search engines and IR research projects for many years. It offers lemmatization capabilities as well and is one of the earliest and most commonly used lemmatizers.</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>from nltk.stem import WordNetLemmatizer</strong></em><br />
<em><strong>words = [&#8220;plays&#8221;, &#8220;playing&#8221;, &#8220;played&#8221;, &#8220;player&#8221;, &#8220;pharmacies&#8221;, &#8220;badly&#8221;]</strong></em><br />
<em><strong>lemmatizer = WordNetLemmatizer()</strong></em><br />
<em><strong>print([lemmatizer.lemmatize(word) for word in words])</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>[&#8216;play&#8217;, &#8216;playing&#8217;, &#8216;played&#8217;, &#8216;player&#8217;, &#8216;pharmacy&#8217;, &#8216;badly&#8217;]</strong></em></p></blockquote>
<p>Here, we can see that only &#8220;<strong>plays</strong>&#8221; and the much-anticipated &#8220;<strong>pharmacies</strong>&#8221; have been converted to their root forms, while the remaining words are unchanged. Without a POS tag, the WordNet lemmatizer treats every word as a noun. We need to pass the appropriate POS tag along with the word.</p>
<h3>WordNet Lemmatizer with POS tag:</h3>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>word = &#8220;better&#8221;</strong></em><br />
<em><strong>print(lemmatizer.lemmatize(word, pos=&#8221;n&#8221;)) # n for noun and it is default</strong></em><br />
<em><strong>print(lemmatizer.lemmatize(word, pos=&#8221;a&#8221;)) # a for adjective</strong></em><br />
<em><strong>print(lemmatizer.lemmatize(word, pos=&#8221;v&#8221;)) # v for verb</strong></em><br />
<em><strong>print(lemmatizer.lemmatize(word, pos=&#8221;r&#8221;)) # r for adverb</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>better | </strong></em><em><strong>good | </strong></em><em><strong>better | </strong></em><em><strong>well</strong></em></p></blockquote>
<p>For the word <span style="font-weight: 400">“</span><strong>better<span style="font-weight: 400">”</span></strong>, the lemma depends on the POS: &#8220;good&#8221; as an adjective and &#8220;well&#8221; as an adverb, while the noun and verb lookups leave the word unchanged.</p>
<p>Determining the POS of each word is an extra step in the lemmatization process. When we are converting a large amount of text, passing a POS tag by hand for each word is impractical, so we need to fetch the POS tag automatically for each word we lemmatize. Here is a function for that:</p>
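<p>Why the POS changes the result can be illustrated with a toy lookup keyed by word and POS. The table below is hypothetical and holds only a handful of entries; WordNet&#8217;s real data and lookup rules are far richer.</p>

```python
# Hypothetical lookup table keyed by (word, POS); not WordNet's actual data structure.
TOY_LEMMAS = {
    ("better", "a"): "good",          # adjective reading
    ("better", "r"): "well",          # adverb reading
    ("pharmacies", "n"): "pharmacy",  # noun plural
    ("playing", "v"): "play",         # verb inflection
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged when not listed."""
    return TOY_LEMMAS.get((word, pos), word)


for pos in ("n", "a", "v", "r"):
    print(pos, toy_lemmatize("better", pos))
```

<p>As with the WordNet lemmatizer, the default POS here is noun, so a verb such as &#8220;playing&#8221; is returned unchanged unless <code>pos="v"</code> is passed.</p>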
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>import nltk</strong></em><br />
<em><strong>from nltk.corpus import wordnet</strong></em></p>
<p><em><strong>def get_wordnet_pos(word):</strong></em><br />
<em><strong>    # Tag the single word and keep the first letter of its Penn Treebank tag</strong></em><br />
<em><strong>    tag = nltk.pos_tag([word])[0][1][0].upper()</strong></em><br />
<em><strong>    tagDict = {&#8220;J&#8221;: wordnet.ADJ,</strong></em><br />
<em><strong>               &#8220;N&#8221;: wordnet.NOUN,</strong></em><br />
<em><strong>               &#8220;V&#8221;: wordnet.VERB,</strong></em><br />
<em><strong>               &#8220;R&#8221;: wordnet.ADV}</strong></em><br />
<em><strong>    return tagDict.get(tag, wordnet.NOUN)  # fall back to noun for other tags</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<ul>
<li><em><strong>get_wordnet_pos(&#8220;better&#8221;) &#8212; &#8220;r&#8221;</strong></em></li>
<li><em><strong>get_wordnet_pos(&#8220;play&#8221;) &#8212; &#8220;n&#8221;</strong></em></li>
<li><em><strong>get_wordnet_pos(&#8220;bad&#8221;) &#8212; &#8220;a&#8221;</strong></em></li>
</ul>
</blockquote>
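<p>The function above tags each word in isolation, so the tagger sees no surrounding context. One alternative is to tag the whole token list once with <code>nltk.pos_tag(tokens)</code> and then map each Penn Treebank tag to a WordNet POS letter. The sketch below shows only the mapping step in plain Python. The <code>penn_to_wordnet</code> helper is hypothetical, and the tagged pairs are hand-written in the shape that <code>nltk.pos_tag</code> returns, not actual tagger output.</p>

```python
# WordNet POS letters as used by NLTK: adjective, noun, verb, adverb.
PENN_PREFIX_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag such as 'VBG' to a WordNet POS letter (noun by default)."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1].upper(), "n")


# Hand-written (token, tag) pairs, illustrative only.
tagged = [("She", "PRP"), ("played", "VBD"), ("better", "RBR"), ("yesterday", "NN")]
print([(token, penn_to_wordnet(tag)) for token, tag in tagged])
# [('She', 'n'), ('played', 'v'), ('better', 'r'), ('yesterday', 'n')]
```

<p>Each resulting pair can then be passed to <code>lemmatizer.lemmatize(token, pos)</code>.</p>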
<p>Code Snippet to perform WordNet Lemmatization with POS:</p>
<blockquote><p><em><strong>In:</strong></em></p>
<p><em><strong>from nltk.stem import WordNetLemmatizer</strong></em><br />
<em><strong>words = [&#8220;plays&#8221;, &#8220;playing&#8221;, &#8220;played&#8221;, &#8220;player&#8221;, &#8220;pharmacies&#8221;, &#8220;badly&#8221;]</strong></em><br />
<em><strong>lemmatizer = WordNetLemmatizer()</strong></em><br />
<em><strong>print([lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words])</strong></em></p>
<p><em><strong>Out:</strong></em></p>
<p><em><strong>[&#8216;play&#8217;, &#8216;play&#8217;, &#8216;played&#8217;, &#8216;player&#8217;, &#8216;pharmacy&#8217;, &#8216;badly&#8217;]</strong></em></p></blockquote>
<p>spaCy, TextBlob, Stanford CoreNLP, and Gensim provide other lemmatizers that can be tried. With the spaCy lemmatizer, lemmatization can be done without passing a POS tag explicitly.</p>
<p>Code snippet to perform lemmatization on a paragraph:</p>
<blockquote><p><em><strong>from nltk.tokenize import word_tokenize</strong></em><br />
<em><strong>from nltk.stem import WordNetLemmatizer</strong></em></p>
<p><em><strong>lemmatizer = WordNetLemmatizer()</strong></em><br />
<em><strong>wordnetContent = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in word_tokenize(content)] # content and get_wordnet_pos defined earlier</strong></em></p></blockquote>
<p>The WordNet lemmatizer took <strong>0.2234</strong> seconds to process this content, which is much slower than any of the stemmers.</p>
<h1>Conclusion</h1>
<p>Stemming and lemmatization both reduce a word to a root or base form. The key difference is that a stem may not be an actual word, whereas a lemma is a meaningful dictionary word.</p>
<p>Compared to stemming, lemmatization is slower but can help train a more accurate ML model. If your data is huge, the Snowball stemmer (Porter2) is a good alternative. If your ML model uses a count vectorizer and does not depend on the context of the words or sentences, stemming is usually sufficient.</p>
<p>For deep learning models that use word embeddings, lemmatization is the better choice because pretrained embeddings are unlikely to contain invalid stems.</p>
<p>We recommend that you also try the lemmatizers provided by spaCy, TextBlob, Gensim, and Stanford CoreNLP.</p>
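<p>The embedding argument can be checked with a simple coverage count. The sketch below uses a hypothetical three-word embedding table with made-up vectors; a real pretrained model would be loaded from disk and would contain far more entries.</p>

```python
# Hypothetical embedding table with made-up vectors, for illustration only.
TOY_EMBEDDINGS = {
    "pharmacy": [0.12, -0.40, 0.33],
    "play": [0.51, 0.08, -0.27],
    "badly": [-0.62, 0.19, 0.05],
}


def coverage(tokens, embeddings):
    """Return the fraction of tokens that have an embedding vector."""
    if not tokens:
        return 0.0
    found = sum(1 for token in tokens if token in embeddings)
    return found / len(tokens)


stems = ["play", "pharmaci", "badli"]   # Porter stemmer output shown earlier
lemmas = ["play", "pharmacy", "badly"]  # WordNet lemmatizer output shown earlier
print(coverage(stems, TOY_EMBEDDINGS), coverage(lemmas, TOY_EMBEDDINGS))
```

<p>Tokens without a vector usually have to be dropped or mapped to an unknown-word vector, which is why invalid stems hurt embedding-based models.</p>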
<p>The post <a href="https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/">Stemming Vs. Lemmatization with Python NLTK</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/stemming-vs-lemmatization-with-python-nltk/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">694</post-id>	</item>
	</channel>
</rss>
