In simple terms, feature extraction is the process of transforming textual data into numerical data. In Natural Language Processing, feature extraction is a fundamental step for helping a model understand context. After cleaning and normalizing textual data, we need to transform it into features for modeling, because machine learning models cannot compute on raw text. So we use numerical representations of individual words, since numbers are easy for a computer to process.
In this blog, we will discuss various feature extraction methods with examples using sklearn and gensim.
- CountVectorizer
- TF-IDF Vectorizer
- Word Embeddings
CountVectorizer
It is a simple and flexible way of extracting features from documents. A CountVectorizer model is a representation of text that describes the occurrence of words within a document. We simply keep track of word counts and disregard grammatical details and word order. It is called a "bag of words" because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document they occur.
Here is a basic snippet using count vectorization to get vectors:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["We become what we think about", "Happiness is not something readymade."]
# initialize count vectorizer object
vect = CountVectorizer()
# get counts of each token (word) in text data
X = vect.fit_transform(corpus)
# convert sparse matrix to numpy array to view
X.toarray()
# view token vocabulary and counts
print("vocabulary", vect.vocabulary_)
print("shape", X.shape)
print("vectors: ", X.toarray())
Output
Vocabulary : {'we': 8, 'become': 1, 'what': 9, 'think': 7, 'about': 0, 'happiness': 2, 'is': 3, 'not': 4, 'something': 6, 'readymade': 5}
Shape : (2, 10)
Vectors : [[1 1 0 0 0 0 0 1 2 1]
[0 0 1 1 1 1 1 0 0 0]]
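To see which column of these vectors corresponds to which token, the learned vocabulary can be mapped back to feature names. A small sketch, assuming a recent scikit-learn version (older versions expose get_feature_names() instead):
# Column order of the count matrix (alphabetical by token)
print(vect.get_feature_names_out())
# ['about' 'become' 'happiness' 'is' 'not' 'readymade' 'something' 'think' 'we' 'what']
# Column 8 is 'we', hence the count of 2 in the first row.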
TF-IDF Vectorizer (Term Frequency - Inverse Document Frequency)
TF-IDF is short for term frequency-inverse document frequency. It’s designed to reflect how important a word is to a document in a collection or corpus.
The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
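To make this concrete, scikit-learn's TfidfVectorizer (used below) computes a smoothed idf by default: idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. A quick sketch for the two-document corpus used below:
import math

n = 2   # documents in the corpus
df = 1  # every word in this toy corpus appears in exactly one document
print(math.log((1 + n) / (1 + df)) + 1)  # 1.4054651081081644, matching the idf_ values below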
Similar to CountVectorizer, sklearn.feature_extraction.text provides the TfidfVectorizer class:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["We become what we think about", "Happiness is not something readymade."]
# initialize tf-idf vectorizer object
vectorizer = TfidfVectorizer()
# compute bag of word counts and tf-idf values
tf = vectorizer.fit_transform(corpus)
# view vocabulary, idf values, and tf-idf vectors
print("Vocabulary", vectorizer.vocabulary_)
print("idf", vectorizer.idf_)
print("Vectors", tf.toarray())
Output:
Vocabulary : {'we': 8, 'become': 1, 'what': 9, 'think': 7, 'about': 0, 'happiness': 2, 'is': 3, 'not': 4, 'something': 6, 'readymade': 5}
idf : [1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511
1.40546511 1.40546511 1.40546511 1.40546511]
Vectors : [[0.35355339 0.35355339 0. 0. 0. 0.
0. 0.35355339 0.70710678 0.35355339]
[0. 0. 0.4472136 0.4472136 0.4472136 0.4472136
0.4472136 0. 0. 0. ]]
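The reason "we" gets 0.70710678 while the other terms of the first document get 0.35355339 is that "we" occurs twice and each row of the TF-IDF matrix is L2-normalized by default (norm='l2'). A quick check, reusing the tf matrix from above:
import numpy as np
# Every row of the TF-IDF matrix has unit L2 norm under the default settings
print(np.linalg.norm(tf.toarray(), axis=1))  # [1. 1.]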
Word Embeddings
Word embedding is a learned representation of text, where each word is represented as a real-valued vector in a lower-dimensional space.
In simple terms, word embeddings are the texts converted into numbers and there may be different numerical representations of the same text, but texts with similar context have similar representations.
Word embedding preserves the context and relationships of words, so similar words can be detected more accurately.
Word embedding has several different implementations, such as word2vec, GloVe, and FastText.
Here we will explain word2vec, as it is the most popular implementation.
Word2vec
Word2vec is widely used in NLP models. It transforms every word into a vector and captures the contextual meaning of words well: word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another in that space.
There are two neural embedding algorithms:
- Continuous Bag-of-Words (CBOW) – predicts target word from context
- Skip-gram – predicts context from the target word
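In Gensim (used in the example that follows), the training algorithm is selected with the sg parameter of Word2Vec. A minimal sketch with hypothetical toy sentences:
from gensim.models import Word2Vec

# Hypothetical toy sentences, just to show the sg flag
sentences = [["machine", "learning", "is", "fun"],
             ["deep", "learning", "is", "powerful"]]
cbow_model = Word2Vec(sentences, sg=0, min_count=1)       # CBOW (the default)
skipgram_model = Word2Vec(sentences, sg=1, min_count=1)   # Skip-gram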
Here is a fuller example of Word2vec using Gensim, a Python library for topic modelling and natural language processing.
from gensim.models import Word2Vec
# Get document data.
common_texts = [['interface', 'computer', 'technology'],
                ['survey', 'computer', 'system', 'response'],
                ['brother', 'boy', 'man', 'animal', 'human']]
# Initializing Model
model = Word2Vec(common_texts, window=5, min_count=1, workers=4)
Result 1 :
# Get most similar words of "computer"
model.wv.most_similar("computer")
Output :
[('technology', 0.21617145836353302),
('system', 0.09291724115610123),
('interface', 0.06285080313682556),
('survey', 0.027057476341724396),
('response', 0.016134709119796753),
('human', -0.010839173570275307),
('boy', -0.02775038219988346),
('animal', -0.052346907556056976),
('brother', -0.05987627059221268),
('man', -0.111670583486557)]
Result 2 :
# Get most similar words of "human"
model.wv.most_similar("human")
Output :
[('man', 0.0679759532213211),
('survey', 0.03364055976271629),
('brother', 0.00939119141548872),
('boy', 0.004503018222749233),
('computer', -0.010839177295565605),
('animal', -0.02365921437740326),
('technology', -0.09575347602367401),
('response', -0.11410721391439438),
('system', -0.11555543541908264),
('interface', -0.13429945707321167)]
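Beyond most_similar, the trained model also exposes the learned vectors themselves and pairwise similarities. A short sketch reusing the model trained above (vectors are 100-dimensional by default in recent Gensim versions):
# The learned vector for a single word
print(model.wv["computer"].shape)                     # (100,) with the default vector size
# Cosine similarity between two specific words
print(model.wv.similarity("computer", "technology"))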
Conclusion
In this post, we have covered different text feature extraction methods, moving from count-based representations that ignore word order and context (CountVectorizer and TF-IDF) to context-preserving word embeddings. We explored these methods practically using the Scikit-learn (sklearn) and Gensim libraries.
There are other advanced techniques for Word Embeddings like Facebook’s FastText. We will discuss them in our coming blogs.
Apart from word embeddings, dimensionality reduction is also a feature extraction technique; it aims to reduce the number of features in a dataset by creating new features from the existing ones and then discarding the originals.
Different techniques that you can explore for dimensionality reduction are Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), t-distributed Stochastic Neighbor Embedding (t-SNE), and many more.
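As a quick illustration of the idea (not tied to the text examples above), here is a minimal sketch using scikit-learn's PCA on hypothetical numeric data:
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 samples with 50 original features
X = np.random.rand(100, 50)
# Build 10 new features as linear combinations of the originals and drop the rest
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (100, 10)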