We talked about word embeddings briefly in our last article, using word2vec. Word embeddings are one of the most powerful tools available to NLP developers today, and most NLP tasks use some kind of word embedding at one stage or another. It is therefore worth exploring them in more detail and seeing how they work.
But Why Do We Need Word Embeddings?
Computers understand only numbers, so natural language is a big challenge for them. To make sense of words, we have to convert them into some kind of numerical representation. In the past we used fairly primitive methods, such as counting how often each word appears in a document (the count vectorizer) or weighting those counts by how rare a word is across documents (TF-IDF). These methods work fine for trivial use cases, but more advanced NLP tasks need a more powerful representation. These primitive methods also do not take into account the context in which words are used, which is very important in natural language.
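For reference, here is a minimal sketch of these two baseline representations using scikit-learn; the two toy sentences are made up purely for illustration.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["It's a nice day outside", "Today was a nice day"]

counts = CountVectorizer().fit_transform(docs)  # raw word counts per document
tfidf = TfidfVectorizer().fit_transform(docs)   # counts re-weighted by rarity across documents

print(counts.toarray())
print(tfidf.toarray())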
One of the ideas word embeddings build on is that words that appear together are similar in meaning. For example:
“It’s a nice day outside”
“Today was a nice day”.
Here we have two example sentences. Note that in both sentences, the words nice and day appear together. Word vectors will infer that they may be similar in meaning and context.
If we train this over a large enough corpus, we get close to capturing the context in which words are used. Each word is represented as a vector with many dimensions, so words are automatically converted into a numerical representation.
GloVe Word Embeddings: How Do They Work?
Previously we saw the general idea behind word embeddings: words that appear together are assumed to have similar meaning and context. That may not always be the case, so instead of only considering words that are immediately next to each other, we can widen the proximity boundary and take into account words that appear near each other. This window can be 5 words, 10 words, 15 words, or even more; in practice the window size is a hyperparameter you tune for your task. This is one of the key ideas behind GloVe, which is short for Global Vectors.
There are also some very common words in the English language, such as the, I, at, on, etc. These words are called stop words. GloVe also takes care that these words are not given too much weight, so that less common words still get a chance to form a good vector representation.
Finally, GloVe weights co-occurring words by their distance: words that appear in close context to each other contribute more than words that are far apart. This gives words close to each other a final boost in score over those that are not.
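To make this concrete, here is a small, simplified sketch of the kind of windowed, distance-weighted co-occurrence counting described above. It is only an illustration of the idea, not the actual GloVe implementation (which also fits vectors to these counts using a weighted least-squares objective); the window size of 5 and the toy sentence are arbitrary choices.

from collections import defaultdict

def cooccurrence_counts(tokens, window=5):
    # Count co-occurrences within a fixed window, weighting each
    # pair by 1/distance so that nearer words contribute more.
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), i):
            distance = i - j
            counts[(word, tokens[j])] += 1.0 / distance
            counts[(tokens[j], word)] += 1.0 / distance
    return counts

tokens = "it is a nice day outside today was a nice day".split()
counts = cooccurrence_counts(tokens)
print(counts[("nice", "day")])  # higher than the count for unrelated pairs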
How To Use GloVe Word Embeddings With Gensim
The best part about most word embeddings is that you can get pretrained models, which let you use word embeddings right away. Most of them are available in a keyed-vectors format, which is easy to import into Gensim. You can download them from the GloVe site; here we use the version trained on 6 billion tokens.
In this file, each word appears on its own line, followed by its vector: a series of numbers, one per dimension. Since it is a plain text file, you can glance at it in any normal text editor. Generally, the higher the dimensionality in which a word is represented, the better the vector representation tends to be.
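If you would rather peek at the file programmatically, here is a quick sketch, assuming the extracted file glove.6B.50d.txt sits in the working directory.

# Print the first word in the file and the length of its vector.
with open('glove.6B.50d.txt', encoding='utf-8') as f:
    first_line = f.readline().split()
print(first_line[0])        # the word itself
print(len(first_line) - 1)  # number of dimensions (50 for this file)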
We can load the keyed vector file in gensim as follows.
import gensim

# no_header=True because the raw GloVe file lacks the header line
# that the word2vec text format normally includes.
model = gensim.models.KeyedVectors.load_word2vec_format('glove.6B.50d.txt', no_header=True)
Here, I am loading the 50-dimensional vector file. You can load other-dimensional files too.
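As an aside, if you prefer not to download and unzip the file manually, Gensim's downloader module can fetch comparable pretrained GloVe vectors directly; the dataset name below comes from the gensim-data catalogue and is downloaded on first use.

import gensim.downloader as api

# 50-dimensional GloVe vectors trained on Wikipedia + Gigaword (6B tokens)
model = api.load("glove-wiki-gigaword-50")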
To check whether a word exists in the vector vocabulary, we can use Python's in operator. For example:
'king' in model  # returns True if 'king' is in the vocabulary
We can also get the vector representation of the word:
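Indexing the model with a word returns its vector as a NumPy array (using the 50-dimensional model loaded above):

model['king']  # a NumPy array with 50 components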
A quick way to check the dimensions:
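Either of the following works:

model['king'].shape  # (50,) for the 50-dimensional file
model.vector_size    # also reports the dimensionality of the model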
Word Similarity
Word vectors are based on the idea that similar words will have similar vectors. We can check this easily with GloVe.
How similar are the words night and day? Let’s find out.
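The similarity method computes the cosine similarity between the two word vectors:

model.similarity('night', 'day')  # cosine similarity between 'night' and 'day'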
The result is a cosine similarity, which ranges from -1 to 1; the score here is high, so we can say that the words are quite similar. Similar results can be observed for king and queen, or cat and dog.
We can also get a list of similar words, i.e. the vectors that are most similar to a given word.
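The most_similar method returns the closest words along with their cosine similarity scores; topn controls how many are returned:

model.most_similar('king', topn=5)  # a list of (word, cosine similarity) pairs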
A Note About Dimensions
So far, we have worked with 50-dimensional word vectors. The quality of the representation improves dramatically as we move to higher-dimensional vectors. For example, the same similarity tests run with 300-dimensional vectors show noticeably better results.
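To try this yourself, load the 300-dimensional file from the same glove.6B archive and rerun the comparisons; the file name below assumes you extracted the full zip, and the import from earlier is reused.

model_300 = gensim.models.KeyedVectors.load_word2vec_format('glove.6B.300d.txt', no_header=True)
model_300.similarity('night', 'day')
model_300.most_similar('king', topn=5)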
The caveat? Higher-dimensional vectors take up more memory and are slower to process.
Disadvantages Of Word Vectors
Although powerful, word embeddings have some disadvantages depending on how they are used. The major one is that they cannot distinguish between different senses of the same word: in dog bark versus tree bark, both uses of bark get the same vector. They are also compute- and memory-intensive to train, and the training corpus needs to be of good quality to obtain good results. Finally, they can introduce unintended bias.
Despite these disadvantages, word vectors are well suited to a large number of NLP tasks and are widely used in industry.
We have studied GloVe and word2vec word embeddings so far in our posts. In the next post, we will look at the fastText model, a much more powerful word embedding model, and see how it compares with these two.