Feature Extraction in Natural Language Processing

In simple terms, Feature Extraction is transforming textual data into numerical data. In Natural Language Processing, Feature Extraction is an essential step for capturing the context of text. After cleaning and normalizing textual data, we need to transform it into features for modeling, because a machine cannot compute on raw text. So we use numerical representations of individual words, as numbers are what the computer can actually process.

In this blog, we will discuss various feature extraction methods with examples using sklearn and gensim.

 

  • CountVectorizer
  • TF-IDF Vectorizer
  • Word Embeddings

 

CountVectorizer

 

 

It is a simple and flexible way of extracting features from documents. A CountVectorizer model represents text by describing the occurrence of words within a document. We just keep track of word counts and disregard grammatical details and word order. It is called a “bag of words” because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document they occur.

Here is a basic snippet that uses count vectorization to get vectors.

 

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["We become what we think about", "Happiness is not something readymade."]

# initialize count vectorizer object
vect = CountVectorizer()

# get counts of each token (word) in the text data
X = vect.fit_transform(corpus)

# view token vocabulary, matrix shape, and counts
# (toarray() converts the sparse matrix to a dense numpy array for viewing)
print("vocabulary", vect.vocabulary_)
print("shape", X.shape)
print("vectors: ", X.toarray())

 

Output

 

Vocabulary :  {'we': 8, 'become': 1, 'what': 9, 'think': 7, 'about': 0, 'happiness': 2, 'is': 3, 'not': 4, 'something': 6, 'readymade': 5}

 

Shape :  (2, 10)

 

Vectors :  [[1 1 0 0 0 0 0 1 2 1]
 [0 0 1 1 1 1 1 0 0 0]]
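
To see which column of the count matrix belongs to which word, the column-ordered token list can be zipped with a row of counts. A minimal sketch, assuming a recent scikit-learn release that provides get_feature_names_out (older releases expose get_feature_names instead):

# map each column of the count matrix back to its token, for the first document
for token, count in zip(vect.get_feature_names_out(), X.toarray()[0]):
    print(token, count)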

 

 

 

TF-IDF Vectorizer (Term Frequency – Inverse Document Frequency)

 

 

TF-IDF is short for term frequency-inverse document frequency. It’s designed to reflect how important a word is to a document in a collection or corpus.

The TF-IDF value increases proportionally with the number of times a word appears in a document, and it is offset by the number of documents in the corpus that contain the word. This offset helps adjust for the fact that some words appear more frequently in general.
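
To make the numbers in the example below less mysterious: by default, scikit-learn's TfidfVectorizer uses a smoothed idf of ln((1 + n) / (1 + df(t))) + 1 and then l2-normalizes each row. A minimal sketch of that arithmetic for our two-document corpus (the variable names here are just for illustration):

import math

n_docs = 2   # documents in the corpus
df = 1       # each term in this toy corpus appears in exactly one document

# smoothed idf used by TfidfVectorizer (smooth_idf=True, the default)
idf = math.log((1 + n_docs) / (1 + df)) + 1
print(idf)   # ~1.4054651, matching the idf_ values shown below

# with equal idf values, l2 normalization means a word occurring twice
# (like "we") gets weight 2/sqrt(8) ~ 0.7071 and the rest get 1/sqrt(8) ~ 0.3536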

Similar to CountVectorizer, sklearn.feature_extraction.text provides the TfidfVectorizer class for this.

 

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["We become what we think about", "Happiness is not something readymade."]

# initialize tf-idf vectorizer object
vectorizer = TfidfVectorizer()

# compute bag-of-words counts and tf-idf values
tf = vectorizer.fit_transform(corpus)

# view vocabulary, idf values, and tf-idf vectors
# (toarray() converts the sparse matrix to a dense numpy array for viewing)
print("Vocabulary", vectorizer.vocabulary_)
print("idf", vectorizer.idf_)
print("Vectors", tf.toarray())

 

Output:

 

Vocabulary : {'we': 8, 'become': 1, 'what': 9, 'think': 7, 'about': 0, 'happiness': 2, 'is': 3, 'not': 4, 'something': 6, 'readymade': 5}

 

idf : [1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511

 1.40546511 1.40546511 1.40546511 1.40546511]

 

Vectors : [[0.35355339 0.35355339 0.         0.         0.         0.
  0.         0.35355339 0.70710678 0.35355339]
 [0.         0.         0.4472136  0.4472136  0.4472136  0.4472136
  0.4472136  0.         0.         0.        ]]

 

 

Word Embeddings

 

Word embedding is a learned representation of text, where each word is represented as a real-valued vector in a lower-dimensional space.

In simple terms, word embeddings are texts converted into numbers; there may be different numerical representations of the same text, but texts with similar context have similar representations.

Word embeddings preserve the context and relationships of words, so similar words are detected more accurately.

Word embedding has several different implementations, such as word2vec, GloVe, and FastText.

Here we will explain word2vec, as it is the most popular implementation.

 

Word2vec

 

Word2vec is widely used in NLP models. It transforms each word into a vector and captures the contextual meaning of words well. Word vectors are positioned in the vector space such that words sharing common contexts in the corpus are located close to one another in that space.

There are two neural network architectures for training word2vec:

 

  • Continuous Bag-of-Words (CBOW) – predicts the target word from its surrounding context
  • Skip-gram – predicts the context words from the target word (choosing between the two in Gensim is sketched below)
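
In Gensim, the choice between the two is made with the sg argument of Word2Vec (sg=0, the default, trains CBOW; sg=1 trains skip-gram). A minimal sketch with a toy sentence:

from gensim.models import Word2Vec

toy_sentences = [["we", "become", "what", "we", "think", "about"]]

# sg=0 (default) uses CBOW, sg=1 uses skip-gram
cbow_model = Word2Vec(toy_sentences, sg=0, min_count=1)
skipgram_model = Word2Vec(toy_sentences, sg=1, min_count=1)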

 

 

Here is an example of Word2vec using Gensim, a Python library for NLP.

 

from gensim.models import Word2Vec

# document data: a small corpus of tokenized sentences
common_texts = [
    ['interface', 'computer', 'technology'],
    ['survey', 'computer', 'system', 'response'],
    ['brother', 'boy', 'man', 'animal', 'human'],
]

# initialize and train the model
model = Word2Vec(common_texts, window=5, min_count=1, workers=4)
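
Once the model is trained, the learned vector for any vocabulary word can be read from model.wv. A small sketch (the embedding size defaults to 100 dimensions when it is not specified):

# look up the learned embedding for a word in the vocabulary
vec = model.wv["computer"]
print(len(vec))  # 100, the default embedding size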

 

Result 1 :

 

# Get most similar words of "computer"
model.wv.most_similar("computer")

 

Output : 

 

[('technology', 0.21617145836353302),
 ('system', 0.09291724115610123),
 ('interface', 0.06285080313682556),
 ('survey', 0.027057476341724396),
 ('response', 0.016134709119796753),
 ('human', -0.010839173570275307),
 ('boy', -0.02775038219988346),
 ('animal', -0.052346907556056976),
 ('brother', -0.05987627059221268),
 ('man', -0.111670583486557)]

 

Result 2 :

 

# Get most similar words of "human"
model.wv.most_similar("human")

 

Output :

 

[('man', 0.0679759532213211),
 ('survey', 0.03364055976271629),
 ('brother', 0.00939119141548872),
 ('boy', 0.004503018222749233),
 ('computer', -0.010839177295565605),
 ('animal', -0.02365921437740326),
 ('technology', -0.09575347602367401),
 ('response', -0.11410721391439438),
 ('system', -0.11555543541908264),
 ('interface', -0.13429945707321167)]
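
Besides ranked lists, Gensim's KeyedVectors can also report the cosine similarity between a specific pair of words, which is handy for spot checks. A small sketch (with a corpus this tiny the scores are essentially noise; meaningful similarities require much more training data):

# cosine similarity between two words from the training vocabulary
print(model.wv.similarity("computer", "technology"))
print(model.wv.similarity("computer", "man"))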

 

Conclusion

 

In this post, we have discussed different types of text feature extraction methods, moving from simple count- and frequency-based vectorization methods (CountVectorizer and TF-IDF) to context-preserving methods (word embeddings). We explored these methods practically using the Scikit-learn (sklearn) and Gensim libraries.

There are other advanced techniques for Word Embeddings like Facebook’s FastText. We will discuss them in our coming blogs.

Apart from Word Embeddings, Dimensionality Reduction is also a Feature Extraction technique that aims to reduce the number of features in a dataset by creating new features from the existing ones and then discarding the original features.

Different techniques that you can explore for dimensionality reduction are Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), t-distributed Stochastic Neighbor Embedding (t-SNE), and many more.
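
As a quick, purely illustrative sketch of how dimensionality reduction can sit on top of the text features from earlier, here is PCA applied to the toy TF-IDF vectors (the corpus and component count are assumptions for the example):

from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["We become what we think about", "Happiness is not something readymade."]

# 10-dimensional tf-idf features for the toy corpus
X = TfidfVectorizer().fit_transform(corpus).toarray()

# project the tf-idf vectors down to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (2, 2)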

