<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>regex Archives - Turbolab Technologies</title>
	<atom:link href="https://turbolab.in/tag/regex/feed/" rel="self" type="application/rss+xml" />
	<link>https://turbolab.in/tag/regex/</link>
	<description>Big Data and News Analysis Startup in Kochi</description>
	<lastBuildDate>Tue, 18 Jan 2022 10:11:50 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://i0.wp.com/turbolab.in/wp-content/uploads/2018/03/turbo_black_trans-space.png?fit=32%2C32&#038;ssl=1</url>
	<title>regex Archives - Turbolab Technologies</title>
	<link>https://turbolab.in/tag/regex/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">98237731</site>	<item>
		<title>Data Cleaning using Regular Expression</title>
		<link>https://turbolab.in/data-cleaning-using-regular-expression/</link>
					<comments>https://turbolab.in/data-cleaning-using-regular-expression/#respond</comments>
		
		<dc:creator><![CDATA[Anthony]]></dc:creator>
		<pubDate>Tue, 30 Nov 2021 12:06:01 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[data cleaning]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[text cleaning]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=779</guid>

					<description><![CDATA[<p>Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. The data format is not always tabular. As we are entering the era of big data, the data comes in an extensively diverse format, including images, texts, graphs, and many more. Because the format is pretty [&#8230;]</p>
<p>The post <a href="https://turbolab.in/data-cleaning-using-regular-expression/">Data Cleaning using Regular Expression</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400">Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.</span></p>
<p><span style="font-weight: 400">Data is not always tabular. In the era of big data, it arrives in many different formats, including images, text, graphs, and more. Because the format varies so much from one dataset to another, it is essential to preprocess the data into a form that computers can read.</span></p>
<p><span style="font-weight: 400">In this blog, we will go over some Regex (Regular Expression) techniques that you can use in your data cleaning process.</span></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400">A regular expression is a sequence of characters that defines a pattern, used to match strings of text such as particular characters, words, or patterns of characters.</span></p>
<p><span style="font-weight: 400">In Python, regular expressions (also called REs, regexes, or regex patterns) are available through the built-in 're' module, so you don’t need to install anything separately.</span></p>
<p><span style="font-weight: 400">The re module offers a set of functions that let us search a string for a match, split it, or replace parts of it.</span></p>
<p><span style="font-weight: 400">The most commonly used functions provided by the ‘re’ module are:</span></p>
<p>&nbsp;</p>
<ul>
<li><strong>re.match()</strong></li>
<li><strong>re.search()</strong></li>
<li><strong>re.findall()</strong></li>
<li><strong>re.split()</strong></li>
<li><strong>re.sub()</strong></li>
<li><strong>re.compile()</strong></li>
</ul>
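<p><span style="font-weight: 400">As a quick orientation, the sketch below runs each of the functions listed above on one sample string. The sample text and variable names are made up for illustration and are not part of the examples that follow.</span></p>

```python
import re

# Hypothetical sample text, used only to compare the six functions.
text = "order 66 shipped, order 77 pending"

# re.match() only succeeds if the pattern matches at the start of the string.
at_start = re.match(r"order", text)

# re.search() returns the first match found anywhere in the string.
first_number = re.search(r"\d+", text)

# re.findall() returns every non-overlapping match as a list of strings.
all_numbers = re.findall(r"\d+", text)

# re.split() cuts the string wherever the pattern matches.
parts = re.split(r",\s*", text)

# re.sub() replaces every match with the given replacement.
masked = re.sub(r"\d+", "#", text)

# re.compile() builds a reusable pattern object with the same methods.
number_pattern = re.compile(r"\d+")
compiled_numbers = number_pattern.findall(text)

print(at_start.group())      # order
print(first_number.group())  # 66
print(all_numbers)           # ['66', '77']
print(parts)                 # ['order 66 shipped', 'order 77 pending']
print(masked)                # order # shipped, order # pending
print(compiled_numbers)      # ['66', '77']
```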
<p>&nbsp;</p>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Replacing Multi-Spaces</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>Removing extra white spaces from data is an important step as it makes your data look well structured.</strong></p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "if       you hold an empty gatorade bottle up to your ear   you can hear      the sports"</span></i></p>
<p><i><span style="font-weight: 400"># \s+ matches one or more whitespace characters; each run becomes a single space</span></i></p>
<p><i><span style="font-weight: 400">x = re.sub(r"\s+", " ", tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: 'if you hold an empty gatorade bottle up to your ear you can hear the sports'</span></i></p></blockquote>
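<p><span style="font-weight: 400">Because \s also matches tabs and newlines, the same substitution can be wrapped in a small helper that trims leading and trailing whitespace as well. The function name normalize_spaces and the sample string are illustrative assumptions.</span></p>

```python
import re

def normalize_spaces(text):
    """Collapse runs of whitespace into one space and trim both ends."""
    # \s matches spaces, tabs, and newlines, so all of them become a single space.
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical input mixing spaces, a tab, and blank lines.
messy = "  if   you hold\tan empty\n\ngatorade bottle   "
print(normalize_spaces(messy))  # if you hold an empty gatorade bottle
```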
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Dealing with Special Characters</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>If you are working on an NLP project, you will usually need very clean text, which means getting rid of special characters whose removal does not alter the meaning of the text. For instance:</strong></p>
<p>&nbsp;</p>
<h4><b>1. Removing special characters and keeping only letters and numbers</b></h4>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "if you hold an empty # gatorade # bottle up to your ear @@ you can hear the sports 100 %%"</span></i></p>
<p><i><span style="font-weight: 400"># replace every run of characters that are not letters, digits, or spaces</span></i></p>
<p><i><span style="font-weight: 400">x = re.sub(r"[^a-zA-Z0-9 ]+", " ", tweet)</span></i></p>
<p><i><span style="font-weight: 400"># collapse the extra spaces left behind and trim the ends</span></i></p>
<p><i><span style="font-weight: 400">x = re.sub(r"\s+", " ", x).strip()</span></i></p>
<p><i><span style="font-weight: 400">Output: 'if you hold an empty gatorade bottle up to your ear you can hear the sports 100'</span></i></p></blockquote>
<p>&nbsp;</p>
<h4><b>2. Keeping only letters or only numbers</b></h4>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "if you hold an empty # gatorade # bottle up to your ear @@ you can hear the sports 100 %%"</span></i></p>
<p><i><span style="font-weight: 400"># keep letters only, then collapse the leftover spaces</span></i></p>
<p><i><span style="font-weight: 400">x = re.sub(r"[^a-zA-Z ]+", " ", tweet)</span></i></p>
<p><i><span style="font-weight: 400">x = re.sub(r"\s+", " ", x).strip()</span></i></p>
<p><i><span style="font-weight: 400">Output: 'if you hold an empty gatorade bottle up to your ear you can hear the sports'</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400"># keep digits only: drop everything except digits and spaces, then drop the spaces</span></i></p>
<p><i><span style="font-weight: 400">x = re.sub(r" +", "", re.sub(r"[^0-9 ]+", "", tweet))</span></i></p>
<p><i><span style="font-weight: 400">Output: '100'</span></i></p></blockquote>
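<p><span style="font-weight: 400">One caveat with the character class [^a-zA-Z0-9 ] is that it treats accented letters as special characters. The sketch below contrasts it with a \w-based class, which is Unicode-aware for str patterns in Python 3. The sample sentence is an assumption chosen to show the difference.</span></p>

```python
import re

# Hypothetical sample with accented letters ("Caf\u00e9" is Cafe with an acute e).
text = "Caf\u00e9 Zo\u00eb scored 100% today!!"

# ASCII-only class: accented letters are removed along with the punctuation.
ascii_only = re.sub(r"[^a-zA-Z0-9 ]+", "", text)

# \w matches Unicode letters and digits, so accented letters survive.
# Note that \w also keeps the underscore.
unicode_aware = re.sub(r"[^\w\s]+", "", text)

print(ascii_only)     # Caf Zo scored 100 today
print(unicode_aware)  # same sentence with the accented letters intact
```

<p><span style="font-weight: 400">Which variant fits depends on the data: the ASCII-only class is stricter, while the \w-based one is usually safer for multilingual text.</span></p>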
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect and Remove URLs</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>Here we use re.compile() to build a regex pattern once and then reuse that saved pattern, first to detect URLs and later to substitute them.</strong></p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = 'follow this website for more details www.knowmore.com and login to http://login.com'</span></i></p>
<p><i><span style="font-weight: 400"># match either http:// or https:// links, or links starting with www.</span></i></p>
<p><i><span style="font-weight: 400">pattern = re.compile(r"https?://\S+|www\.\S+")</span></i></p>
<p><i><span style="font-weight: 400">x = re.findall(pattern, tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: ['www.knowmore.com', 'http://login.com']</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400"># remove urls</span></i></p>
<p><i><span style="font-weight: 400">z = re.sub(pattern, "", tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: z</span></i></p>
<p><i><span style="font-weight: 400">Output: 'follow this website for more details  and login to '</span></i></p>
<p><i><span style="font-weight: 400"># note the leftover double space and trailing space; collapse them with the \s+ substitution from the multi-space section</span></i></p></blockquote>
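<p><span style="font-weight: 400">A compiled pattern also exposes findall() and sub() as methods, so detection and removal can share one object. The following is a minimal sketch under that assumption; the names URL_PATTERN and strip_urls are illustrative, and the helper additionally tidies the spaces that the removed URLs leave behind.</span></p>

```python
import re

# Same URL pattern as in the example above, compiled once for reuse.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def strip_urls(text):
    """Remove URLs, then tidy up the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()

tweet = "follow this website for more details www.knowmore.com and login to http://login.com"
print(URL_PATTERN.findall(tweet))  # ['www.knowmore.com', 'http://login.com']
print(strip_urls(tweet))           # follow this website for more details and login to
```

<p><span style="font-weight: 400">Keep in mind that \S+ runs to the next whitespace, so punctuation directly attached to a URL is removed together with it.</span></p>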
<p>&nbsp;</p>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect and Remove HTML Tags</b></h3>
</li>
</ul>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = '&lt;p&gt;follow this &lt;b&gt;website&lt;/b&gt; for more details. &lt;/p&gt;'</span></i></p>
<p><i><span style="font-weight: 400"># .*? is non-greedy, so each match stops at the first closing angle bracket</span></i></p>
<p><i><span style="font-weight: 400">x = re.findall(r'&lt;.*?&gt;', tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: ['&lt;p&gt;', '&lt;b&gt;', '&lt;/b&gt;', '&lt;/p&gt;']</span></i></p>
<p><i><span style="font-weight: 400"># remove html tags</span></i></p>
<p><i><span style="font-weight: 400">z = re.sub(r'&lt;.*?&gt;', "", tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: z</span></i></p>
<p><i><span style="font-weight: 400">Output: 'follow this website for more details. '</span></i></p></blockquote>
<p>&nbsp;</p>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect and Remove Email IDs </b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>Here we'll use re.search() to find an email ID. re.search() returns only the first occurrence that matches the specified pattern. In contrast, re.findall() scans the whole string and returns all non-overlapping matches of the pattern in a single step.</strong></p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "please send your feedback to myemail@gmail.com "</span></i></p>
<p><i><span style="font-weight: 400">x = re.search(r"[\w\.-]+@[\w\.-]+\.\w+", tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: &lt;re.Match object; span=(29, 46), match='myemail@gmail.com'&gt;</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400"># remove email ids and trim the trailing spaces</span></i></p>
<p><i><span style="font-weight: 400">z = re.sub(r"[\w\.-]+@[\w\.-]+\.\w+", "", tweet).strip()</span></i></p>
<p><i><span style="font-weight: 400">Input: z</span></i></p>
<p><i><span style="font-weight: 400">Output: 'please send your feedback to'</span></i></p></blockquote>
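<p><span style="font-weight: 400">To make the difference between re.search() and re.findall() concrete, here is a small sketch with two addresses in one string. The addresses and the name EMAIL_PATTERN are made-up examples. It also shows that search() returns None when nothing matches, which is worth checking before calling group().</span></p>

```python
import re

# Same email pattern as above, compiled for reuse.
EMAIL_PATTERN = re.compile(r"[\w.-]+@[\w.-]+\.\w+")

# Hypothetical text containing two addresses.
text = "contact alice@example.com or bob@example.org for feedback"

# search() stops at the first match and returns a Match object.
first = EMAIL_PATTERN.search(text)
print(first.group())  # alice@example.com
print(first.span())   # (8, 25)

# findall() returns every match as a list of strings.
print(EMAIL_PATTERN.findall(text))  # ['alice@example.com', 'bob@example.org']

# With no match, search() returns None instead of a Match object.
missing = EMAIL_PATTERN.search("no address here")
print(missing)  # None
```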
<p>&nbsp;</p>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect and Remove the Hashtag</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "love to explore. #nature #traveller"</span></i></p>
<p><i><span style="font-weight: 400"># a hash sign, optional underscores, then one or more lowercase letters</span></i></p>
<p><i><span style="font-weight: 400">x = re.findall(r'#[_]*[a-z]+', tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: ['#nature', '#traveller']</span></i></p>
<p><i><span style="font-weight: 400"># remove hashtags and trim the trailing spaces</span></i></p>
<p><i><span style="font-weight: 400">z = re.sub(r'#[_]*[a-z]+', '', tweet).strip()</span></i></p>
<p><i><span style="font-weight: 400">Input: z</span></i></p>
<p><i><span style="font-weight: 400">Output: 'love to explore.'</span></i></p></blockquote>
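<p><span style="font-weight: 400">The pattern above only accepts lowercase letters after the hash sign. If your hashtags may contain capitals, digits, or underscores, a \w-based pattern is one possible alternative. The sketch below compares the two on a made-up tweet.</span></p>

```python
import re

# Hypothetical tweet with a capitalised hashtag and one containing digits.
tweet = "Love to explore. #Nature #travel_2021"

# The lowercase-only pattern skips #Nature and stops at the underscore.
print(re.findall(r"#[_]*[a-z]+", tweet))  # ['#travel']

# \w covers letters, digits, and underscores in any case.
print(re.findall(r"#\w+", tweet))  # ['#Nature', '#travel_2021']

# Remove the hashtags and trim the spaces left at the end.
cleaned = re.sub(r"#\w+", "", tweet).strip()
print(cleaned)  # Love to explore.
```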
<p>&nbsp;</p>
<p>&nbsp;</p>
<ul>
<li>
<h3><b>Detect Mentions using re.match() and re.findall()</b></h3>
</li>
</ul>
<p>&nbsp;</p>
<p><strong>Here we&#8217;ll use re.match and re.findall to detect mentions. </strong></p>
<p><strong>re.match matches the pattern from the start of the string whereas re.findall searches for occurrences of the pattern anywhere in the string.</strong></p>
<blockquote><p><i><span style="font-weight: 400">import re</span></i></p>
<p><i><span style="font-weight: 400">tweet = "@Bryan appointed as the new team captain"</span></i></p>
<p><i><span style="font-weight: 400">x = re.match(r"(@\w+)", tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: &lt;re.Match object; span=(0, 6), match='@Bryan'&gt;</span></i></p>
<p>&nbsp;</p>
<p><i><span style="font-weight: 400">tweet = "@Bryan appointed as the new team captain announced in @SportsLive"</span></i></p>
<p><i><span style="font-weight: 400">x = re.findall(r"@\S+", tweet)</span></i></p>
<p><i><span style="font-weight: 400">Input: x</span></i></p>
<p><i><span style="font-weight: 400">Output: ['@Bryan', '@SportsLive']</span></i></p></blockquote>
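<p><span style="font-weight: 400">The sketch below illustrates the point about re.match() with a tweet that does not begin with a mention, and also compares \S+ with \w+ when punctuation follows a handle. The tweet text is a made-up example.</span></p>

```python
import re

# Hypothetical tweet where the mentions are not at the start.
tweet = "Congrats to @Bryan, the new captain, says @SportsLive!"

# re.match() returns None because the tweet does not start with a mention.
print(re.match(r"@\w+", tweet))  # None

# re.search() finds the first mention wherever it occurs.
print(re.search(r"@\w+", tweet).group())  # @Bryan

# \S+ grabs trailing punctuation; \w+ stops at it.
print(re.findall(r"@\S+", tweet))  # ['@Bryan,', '@SportsLive!']
print(re.findall(r"@\w+", tweet))  # ['@Bryan', '@SportsLive']
```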
<p>&nbsp;</p>
<h3><b>Conclusion</b></h3>
<p>&nbsp;</p>
<p><span style="font-weight: 400">Regular expressions are very useful for text manipulation in the text cleaning phase of Natural Language Processing (NLP). In this post, we used the re.findall(), re.sub(), re.search(), re.match(), and re.compile() functions, but the re module offers many others that can help with data processing and manipulation. If you would like a deeper understanding of regular expressions, we recommend going through Python’s official documentation on <a href="https://docs.python.org/3/library/re.html">regex</a>.</span></p>
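<p><span style="font-weight: 400">As a closing illustration, several of the steps from this post can be chained into one helper. This is only a sketch under assumptions: the function name clean_tweet, the order of the steps, and the sample tweet are ours, and a real pipeline would be tuned to its own data. Emails are removed before mentions so that the part after the @ sign is not mistaken for a handle.</span></p>

```python
import re

# Patterns taken from the sections above, compiled once.
URL = re.compile(r"https?://\S+|www\.\S+")
EMAIL = re.compile(r"[\w.-]+@[\w.-]+\.\w+")
MENTION_OR_HASHTAG = re.compile(r"[@#]\w+")
NON_ALNUM = re.compile(r"[^a-zA-Z0-9 ]+")
SPACES = re.compile(r"\s+")

def clean_tweet(text):
    """Apply the cleaning steps in order, then normalise whitespace."""
    for pattern in (URL, EMAIL, MENTION_OR_HASHTAG, NON_ALNUM):
        # Replace with a space so neighbouring words do not get glued together.
        text = pattern.sub(" ", text)
    return SPACES.sub(" ", text).strip()

# Hypothetical tweet combining the cases covered in this post.
raw = "@Bryan check www.knowmore.com for #nature trips!! Questions? mail me@example.com"
print(clean_tweet(raw))  # check for trips Questions mail
```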
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/data-cleaning-using-regular-expression/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">779</post-id>	</item>
	</channel>
</rss>
