Stemming and Lemmatization are text/word normalization techniques widely used in text pre-processing. They reduce words to their root form. For example, "playing" and "played" both normalize to "play".

Let's say you have to train data for classification and you choose a vectorizer to transform it. These vectorizers create a vocabulary (the set of unique words) from our data corpus. By applying stemming/lemmatization techniques, we can reduce the vocabulary size by converting words to their base forms. A smaller, less ambiguous vocabulary makes it easier for the model to train and can yield better results.
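As an illustration, here is a minimal sketch of that effect (assuming scikit-learn is available; the tiny corpus here is made up for this example):

from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer

corpus = ["he plays well", "she played yesterday", "they are playing now"]
ps = PorterStemmer()
stemmed_corpus = [" ".join(ps.stem(w) for w in doc.split()) for doc in corpus]

# The vocabulary shrinks because plays/played/playing all collapse into "play"
print(len(CountVectorizer().fit(corpus).vocabulary_))          # 10
print(len(CountVectorizer().fit(stemmed_corpus).vocabulary_))  # 8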

In this post, we will walk through practical examples of how stemming and lemmatization can be applied to words and sentences using the Python NLTK package.

Stemming

Stemming is a rule-based normalization approach: it slices off a word's prefix and/or suffix to reduce the word to a root form. Stemming is faster than lemmatization because it cuts prefixes (pre-, extra-, in-, im-, ir-, etc.) and suffixes (-ed, -ing, -es, -ity, -ty, -ship, -ness, etc.) without considering the context of the words. Due to this aggressiveness, the outcome of a stemming algorithm may not be a valid word.

As the snippets below show, the stemmed outputs of "badly" and "pharmacies" are invalid words.

Porter Stemmer

The Porter stemming algorithm (or "Porter stemmer") uses suffix stripping to produce stems. Here is Python code using NLTK to create a stemmer object and generate results.

Code Snippet to perform Porter Stemming:

In:

from nltk.stem import PorterStemmer
words = ["plays", "playing", "played", "player", "pharmacies", "badly"]
ps = PorterStemmer()
print([ps.stem(w) for w in words])

Out:

['play', 'play', 'play', 'player', 'pharmaci', 'badli']

To address the drawbacks of the Porter stemmer, the Snowball stemming algorithm was introduced.

Snowball Stemmer

The Snowball stemming algorithm is also known as the Porter2 stemmer. It is an improved version of the Porter stemmer in which a few of the above-discussed stemming issues are resolved.

Code Snippet to perform Snowball Stemming:

In:

from nltk.stem.snowball import SnowballStemmer
words = ["plays", "playing", "played", "player", "pharmacies", "badly"]
ss = SnowballStemmer(language="english")
print([ss.stem(w) for w in words])

Out:

['play', 'play', 'play', 'player', 'pharmaci', 'bad']

Here, we can see that "badly" now yields the valid stem "bad", but "pharmacies" still yields an invalid stem.

Lancaster Stemmer

Compared to the Snowball and Porter stemmers, Lancaster is the most aggressive stemming algorithm, because it tends to over-stem many words: it tries to reduce each word to the shortest stem possible. Here is an example:

"salty" → "sal"

"sales" → "sal"

Code Snippet to perform Lancaster Stemming:

In:

from nltk.stem import LancasterStemmer
words = ["plays", "playing", "played", "player", "pharmacies", "badly"]
ls = LancasterStemmer()
print([ls.stem(w) for w in words])

Out:

['play', 'play', 'play', 'play', 'pharm', 'bad']

As mentioned in the beginning, stemming reduces the vocabulary size by collapsing related word forms into a single stem.

Code snippet to perform tokenization and stemming on a paragraph:

content = "China's Huawei overtook Samsung Electronics as the world's biggest seller of mobile phones in the second quarter of 2020, shipping 55.8 million devices compared to Samsung's 53.7 million, according to data from research firm Canalys. While Huawei's sales fell 5 per cent from the same quarter a year earlier, South Korea's Samsung posted a bigger drop of 30 per cent, owing to disruption from the coronavirus in key markets such as Brazil, the United States and Europe, Canalys said. Huawei's overseas shipments fell 27 per cent in Q2 from a year earlier, but the company increased its dominance of the China market which has been faster to recover from COVID-19 and where it now sells over 70 per cent of its phones. \"Our business has demonstrated exceptional resilience in these difficult times,\" a Huawei spokesman said. \"Amidst a period of unprecedented global economic slowdown and challenges, we've continued to grow and further our leadership position.\" Nevertheless, Huawei's position as number one seller may prove short-lived once other markets recover given it is mainly due to economic disruption, a senior Huawei employee with knowledge of the matter told Reuters. Apple is due to release its Q2 iPhone shipment data on Friday."

The above content will hereafter be used as the input to the code snippets.

In:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

# Porter Stemmed version

porteredContent = [ps.stem(word) for word in word_tokenize(content)]

Try testing the above code snippet with the Snowball and Lancaster stemmers in place of the Porter stemmer.
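For reference, a sketch of that experiment on the same content might look like this:

from nltk.stem import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

ss = SnowballStemmer(language="english")
ls = LancasterStemmer()

# Snowball- and Lancaster-stemmed versions of the same content
snowballedContent = [ss.stem(word) for word in word_tokenize(content)]
lancasteredContent = [ls.stem(word) for word in word_tokenize(content)]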

Here are some statistics comparing the three stemming algorithms (a sketch of how such measurements could be reproduced follows the list):

  • Length of the content is 1041 (without spaces)
  • Length of the content after the Porter stemmer is 943, processed in around 0.00499 seconds
  • Length of the content after the Snowball stemmer is 944, processed in around 0.00399 seconds
  • Length of the content after the Lancaster stemmer is 835, processed in around 0.00399 seconds
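These numbers depend on hardware and library versions; here is a minimal sketch of how such length and timing measurements could be taken with time.perf_counter (shown for the Porter stemmer):

import time
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
tokens = word_tokenize(content)

start = time.perf_counter()
stemmed = [ps.stem(word) for word in tokens]
elapsed = time.perf_counter() - start

# Length of the stemmed content without spaces, and the time taken in seconds
print(len("".join(stemmed)), elapsed)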

As expected, the Lancaster stemmer produces the shortest output because of its aggressive over-stemming. With all three stemmers discussed above, we weren't able to get the root word of "pharmacies". Since stemming is fast but not always accurate, we will now move on to lemmatization.

Lemmatization

In lemmatization, the part of speech (POS) is determined first, unlike in stemming, which reduces a word to its root form without considering the context. Lemmatization always considers the context and converts the word to its meaningful root/dictionary (WordNet) form, called the lemma.

WordNet Lemmatizer

WordNet is a lexical database (a collection of words) that has been used by major search engines and IR research projects for many years. NLTK's WordNet lemmatizer builds on it and is one of the earliest and most commonly used lemmatizers.

In:

from nltk.stem import WordNetLemmatizer
words = ["plays", "playing", "played", "player", "pharmacies", "badly"]
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(word) for word in words])

Out:

['play', 'playing', 'played', 'player', 'pharmacy', 'badly']

Here, we can see that only "plays" and, notably, "pharmacies" have been converted to their root forms, while the remaining words have not. Without a POS tag, the WordNet lemmatizer treats every word as a noun. We need to pass the respective POS tag along with each word to the WordNet lemmatizer.

WordNet Lemmatizer with POS tag:

In:

word = "better"
print(lemmatizer.lemmatize(word, pos="n")) # n for noun (the default)
print(lemmatizer.lemmatize(word, pos="a")) # a for adjective
print(lemmatizer.lemmatize(word, pos="v")) # v for verb
print(lemmatizer.lemmatize(word, pos="r")) # r for adverb

Out:

better
good
better
well

For the word "better", the lemma differs depending on the POS: "good" as an adjective and "well" as an adverb.

Now, determining the POS of each word becomes an extra task in the lemmatization process. When converting large amounts of text, it is impractical to pass a POS tag manually for each word, so we need to automate fetching the POS tag for each word we lemmatize. Here is a function for that:

In:

import nltk
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    # Map the first letter of the NLTK POS tag to a WordNet POS constant,
    # defaulting to noun (WordNet lemmatizer's own default)
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

Out:

  • get_wordnet_pos("better") → "r"
  • get_wordnet_pos("play") → "n"
  • get_wordnet_pos("bad") → "a"

Code Snippet to perform WordNet Lemmatization with POS:

In:

from nltk.stem import WordNetLemmatizer
words = ["plays", "playing", "played", "player", "pharmacies", "badly"]
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words])

Out:

['play', 'play', 'played', 'player', 'pharmacy', 'badly']

Other lemmatizers worth trying include the spaCy, TextBlob, Stanford CoreNLP, and Gensim lemmatizers. With spaCy, lemmatization can be done without passing any POS tag, since POS tagging is part of its processing pipeline.
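As a quick illustration, a minimal spaCy sketch might look like this (assuming spaCy is installed and the small English model has been downloaded with python -m spacy download en_core_web_sm):

import spacy

# spaCy infers POS as part of its pipeline, so no tag needs to be passed
nlp = spacy.load("en_core_web_sm")
doc = nlp("plays playing played player pharmacies badly")
print([token.lemma_ for token in doc])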

Code snippet to perform lemmatization on a paragraph:

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
wordnetContent = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in word_tokenize(content)] # content defined earlier

The WordNet lemmatizer took around 0.2234 seconds to process this content, considerably longer than any of the stemmers.

Conclusion

Stemming and lemmatization both generate the root/base form of a word. The key difference is that a stem may not be an actual word, whereas a lemma is always a meaningful dictionary word.

Compared to stemming, lemmatization is slow but helps to train a more accurate ML model. If your data is huge, the Snowball stemmer (Porter2) is a good alternative. If your ML model uses a count vectorizer and does not depend on the context of words/sentences, then stemming is a sensible choice.

When deep learning models and word embeddings are in use, lemmatization is the better choice, because you will not find word embeddings for invalid stems.

We recommend you try the other methods of lemmatization provided by spaCy, TextBlob, Gensim, and Stanford CoreNLP.
