Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
Data is not always tabular. In the era of big data it arrives in many formats, including images, text, graphs, and more. Because the format varies so much from one dataset to the next, it’s essential to preprocess the data into a form that computers can work with.
In this blog, we will go over some Regex (Regular Expression) techniques that you can use in your data cleaning process.
A regular expression (regex) is a sequence of characters that defines a pattern for matching text, such as particular characters, words, or character sequences.
In Python, regular expressions (also called REs, regexes, or regex patterns) are available through the re module, which is built into the standard library, so you don’t need to install it separately.
The re module offers a set of functions that allow us to search a string for a match.
The most commonly used functions provided by the re module are:
- re.match()
- re.search()
- re.findall()
- re.split()
- re.sub()
- re.compile()
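As a quick orientation, the sketch below runs each of these functions against one made-up string. The sample text and variable names are illustrative assumptions, not part of any dataset used in this post.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "order 66 shipped, order 7 pending"

# re.match: anchored at the start of the string; returns a Match object or None.
first_word = re.match(r"\w+", text).group()      # 'order'

# re.search: first match anywhere in the string.
first_number = re.search(r"\d+", text).group()   # '66'

# re.findall: every non-overlapping match, as a list of strings.
all_numbers = re.findall(r"\d+", text)           # ['66', '7']

# re.split: cut the string wherever the pattern matches.
parts = re.split(r",\s*", text)                  # ['order 66 shipped', 'order 7 pending']

# re.sub: replace every match with the replacement string.
masked = re.sub(r"\d+", "#", text)               # 'order # shipped, order # pending'

# re.compile: build the pattern once and reuse it through its methods.
number = re.compile(r"\d+")
count = len(number.findall(text))                # 2
```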
-
Replacing Multi-Spaces
Removing extra whitespace is an important step because it leaves your text consistently structured.
import re
# the runs of extra spaces between words are what we want to collapse
tweet = "if you   hold an empty   gatorade bottle up to    your ear you can hear the sports"
# replace every run of whitespace with a single space
x = re.sub(r"\s+", " ", tweet)
Input: x
Output: 'if you hold an empty gatorade bottle up to your ear you can hear the sports'
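In real data the extra whitespace is often a mix of spaces, tabs, and newlines, and it can also sit at the ends of the string. A minimal sketch of how that could be handled, assuming a hypothetical helper named normalize_whitespace and a made-up input:

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace to one space and trim both ends."""
    # \s matches spaces, tabs, and newlines, so mixed runs collapse too.
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical messy input with leading/trailing spaces, a tab, and blank lines.
messy = "  if you   hold\tan empty\n\ngatorade bottle  "
clean = normalize_whitespace(messy)
print(repr(clean))  # 'if you hold an empty gatorade bottle'
```

The call to strip() is needed because the substitution alone would still leave one space at each end.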
-
Dealing with Special Characters
If you are working on an NLP project, you will want clean text, which usually means getting rid of special characters that do not contribute to the meaning of the text. For instance:
1. Removing special characters and keeping only letters and numbers
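import re
tweet = "if you hold an empty # gatorade # bottle up to your ear @@ you can hear the sports 100 %%"
# replace every run of characters other than letters, digits, and spaces with a space
x = re.sub(r"[^a-zA-Z0-9 ]+", " ", tweet)
# collapse the leftover runs of spaces and trim the ends
x = re.sub(r" +", " ", x).strip()
Output: 'if you hold an empty gatorade bottle up to your ear you can hear the sports 100'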
2. Keeping only letters or only numbers
import re
tweet = "if you hold an empty # gatorade # bottle up to your ear @@ you can hear the sports 100 %%"
# keep letters and spaces, then collapse the leftover runs of spaces
x = re.sub(r"[^a-zA-Z ]+", " ", tweet)
x = re.sub(r" +", " ", x).strip()
Output: 'if you hold an empty gatorade bottle up to your ear you can hear the sports'
tweet = "if you hold an empty # gatorade # bottle up to your ear @@ you can hear the sports 100 %%"
# keep digits only: drop everything except digits and spaces, then drop the spaces
x = re.sub(r" +", "", re.sub(r"[^0-9 ]+", "", tweet))
Output: '100'
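The character classes above only cover ASCII letters, so accented or non-Latin letters would be removed along with the special characters. One possible alternative, shown here as a sketch rather than as the approach used in this post, is to negate \w and \s instead. Note that \w also keeps the underscore. The sample string is made up.

```python
import re

# Hypothetical sample; \u00e9 is the accented letter 'é'.
sample = "caf\u00e9_latte costs 100 %% !!"

# ASCII-only class: the accented letter and the underscore are removed too.
ascii_only = re.sub(r"[^a-zA-Z0-9 ]+", "", sample).strip()

# Unicode-aware variant: keep word characters (\w) and whitespace (\s).
unicode_aware = re.sub(r"[^\w\s]+", "", sample).strip()

print(ascii_only)     # caflatte costs 100
print(unicode_aware)  # café_latte costs 100
```

Which variant fits depends on whether the downstream step expects plain ASCII or should preserve the original letters.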
-
Detect and Remove URLs
Here we use re.compile() to build the regex pattern once and then reuse the compiled pattern for both detection and substitution.
import re
tweet = 'follow this website for more details www.knowmore.com and login to http://login.com'
pattern = re.compile(r"https?://\S+|www\.\S+")
x = pattern.findall(tweet)
Input: x
Output: ['www.knowmore.com', 'http://login.com']
# remove URLs, then collapse the spaces they leave behind
z = re.sub(r"\s+", " ", pattern.sub("", tweet)).strip()
Input: z
Output: 'follow this website for more details and login to'
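Compiling pays off mostly when the same pattern is applied to many strings. The sketch below reuses the URL pattern from above over a small, made-up list of tweets; the list contents and variable names are assumptions for illustration.

```python
import re

# Same URL pattern as above, compiled once.
url_pattern = re.compile(r"https?://\S+|www\.\S+")

# Hypothetical batch of tweets.
tweets = [
    "read the docs at https://example.com/start",
    "mirror: www.example.org",
    "no links here",
]

# The compiled object exposes findall() and sub() as methods.
found = [url_pattern.findall(t) for t in tweets]
cleaned = [url_pattern.sub("", t).strip() for t in tweets]

print(found)    # [['https://example.com/start'], ['www.example.org'], []]
print(cleaned)  # ['read the docs at', 'mirror:', 'no links here']
```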
-
Detect and Remove HTML Tags
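import re
tweet = '<p>follow this <b>website</b> for more details. </p>'
# the non-greedy .*? stops at the first closing '>', so each tag is matched on its own
x = re.findall(r"<.*?>", tweet)
Input: x
Output: ['<p>', '<b>', '</b>', '</p>']
# remove HTML tags and trim the space left before </p>
z = re.sub(r"<.*?>", "", tweet).strip()
Input: z
Output: 'follow this website for more details.'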
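The question mark in the pattern matters. A short sketch of the difference between the greedy and the non-greedy form, using a made-up HTML snippet:

```python
import re

# Hypothetical snippet with nested tags.
html = "<p>follow this <b>website</b></p>"

# Greedy: .* grabs as much as possible, so a single match runs
# from the first '<' to the last '>'.
greedy = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the first '>' it can, so each tag is matched separately.
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<p>follow this <b>website</b></p>']
print(lazy)    # ['<p>', '<b>', '</b>', '</p>']
```

With the greedy form, re.sub() would delete the text between the tags as well, which is rarely what a cleaning step wants.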
-
Detect and Remove Email IDs
Here we’ll use re.search() to find an email ID. re.search() returns only the first occurrence that matches the pattern. In contrast, re.findall() scans the whole string and returns all non-overlapping matches in a single call.
import re
tweet = "please send your feedback to myemail@gmail.com "
x = re.search(r"[\w.-]+@[\w.-]+\.\w+", tweet)
Input: x
Output: <re.Match object; span=(29, 46), match='myemail@gmail.com'>
# remove the email ID and trim the trailing spaces
z = re.sub(r"[\w.-]+@[\w.-]+\.\w+", "", tweet).strip()
Input: z
Output: 'please send your feedback to'
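To make the contrast between re.search() and re.findall() concrete, the sketch below applies the same email pattern to a made-up string that contains two addresses, and reads the matched text and offsets from the Match object. The addresses are placeholders.

```python
import re

email_pattern = r"[\w.-]+@[\w.-]+\.\w+"

# Hypothetical text with two addresses.
text = "write to first@example.com or second@example.org"

# re.search returns a Match object for the first hit, or None if nothing matches.
match = re.search(email_pattern, text)
if match is not None:
    first_email = match.group()  # the matched text
    first_span = match.span()    # (start, end) offsets into text

# re.findall returns every match as a plain list of strings.
all_emails = re.findall(email_pattern, text)

print(first_email, first_span)  # first@example.com (9, 26)
print(all_emails)               # ['first@example.com', 'second@example.org']
```

Checking for None before calling group() avoids an AttributeError when the text contains no email at all.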
-
Detect and Remove the Hashtag
import re
tweet = "love to explore. #nature #traveller"
# matches '#', optional underscores, then lowercase letters only
x = re.findall(r"#[_]*[a-z]+", tweet)
Input: x
Output: ['#nature', '#traveller']
# remove hashtags and trim the leftover spaces
z = re.sub(r"#[_]*[a-z]+", " ", tweet).strip()
Input: z
Output: 'love to explore.'
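The pattern above only accepts lowercase letters after the #, so hashtags containing capitals, digits, or inner underscores are missed or cut short. A sketch of a broader pattern, assuming hashtags are made of word characters; the tweet is made up:

```python
import re

# Hypothetical tweet with mixed-case and numbered hashtags.
tweet = "love to explore. #Nature #travel_2024 #traveller"

# Pattern from above: lowercase letters only.
narrow = re.findall(r"#[_]*[a-z]+", tweet)

# Broader variant: '#' followed by letters, digits, or underscores.
broad = re.findall(r"#\w+", tweet)

print(narrow)  # ['#travel', '#traveller']
print(broad)   # ['#Nature', '#travel_2024', '#traveller']
```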
-
Detect Mentions using re.match() and re.findall()
Here we’ll use re.match and re.findall to detect mentions.
re.match matches the pattern from the start of the string whereas re.findall searches for occurrences of the pattern anywhere in the string.
import re
tweet = "@Bryan appointed as the new team captain"
x = re.match(r"(@\w+)", tweet)
Input: x
Output: <re.Match object; span=(0, 6), match='@Bryan'>
tweet = "@Bryan appointed as the new team captain announced in @SportsLive"
x = re.findall(r"@\S+", tweet)
Input: x
Output: ['@Bryan', '@SportsLive']
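The difference becomes visible when the mention is not the first thing in the string. The sketch below uses a made-up tweet and also contrasts \w+ with \S+, since \S+ picks up punctuation attached to a mention.

```python
import re

# Hypothetical tweet where the first mention is not at the start.
tweet = "congrats to @Bryan, announced in @SportsLive"

at_start = re.match(r"@\w+", tweet)      # None: the string does not start with a mention
anywhere = re.search(r"@\w+", tweet)     # first mention anywhere in the string
mentions = re.findall(r"@\w+", tweet)    # all mentions; \w+ stops before the comma
with_punct = re.findall(r"@\S+", tweet)  # \S+ also captures the trailing comma

print(at_start)          # None
print(anywhere.group())  # @Bryan
print(mentions)          # ['@Bryan', '@SportsLive']
print(with_punct)        # ['@Bryan,', '@SportsLive']
```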
Conclusion
Regular expressions are very useful for text manipulation in the text-cleaning phase of Natural Language Processing (NLP). In this post we used re.findall(), re.sub(), re.search(), re.match(), and re.compile(), but the re module has many other functions that can help with data processing and manipulation. If you are not yet comfortable with regular expressions, we recommend going through Python’s official documentation for the re module.