Text Similarity using fastText Word Embeddings in Python

Text similarity is an essential NLP technique used to measure how alike two chunks of text are. To compute it, word embedding techniques convert each chunk of text into a fixed-dimension vector, and mathematical operations on these vectors yield a similarity measure. Recommendation systems, text summarization, information retrieval, and text categorization are some of the main applications of text similarity.

In this tutorial, we will discuss how sentence similarity can be achieved with the fastText module, along with a use case: generating related news articles.

Dataset

Here, we have a science and technology news dataset containing a sample of 6,188 article titles.

Problem Statement

From the above dataset, we are going to pick one article title:

In:

ROI = "Samsung to spend whopping $22B on artificial intelligence, cars"

Henceforth, we will call this title the ROI in the tutorial. We will fetch articles related to the ROI from the dataset using fastText sentence vectors.


Generate sentence vectors

  1. Import the fastText module and load the pre-trained model (300-dimensional vectors).

    import fasttext
    modelPath = "D://"  # user-defined path
    ft = fasttext.load_model(modelPath + "cc.en.300.bin")

  2. Generate a sentence vector for the ROI and call it vector1.

    def generateVector(sentence):
        return ft.get_sentence_vector(sentence)

    vector1 = generateVector("Samsung to spend whopping $22B on artificial intelligence, cars")

    Out:

    array([-6.36741472e-03,  1.08614033e-02,  9.33997519e-03, -2.33159624e-02,
           -9.58340534e-04,  1.86185073e-02,  2.20048483e-02, -2.02285256e-02,
           -1.13004427e-02, -1.38842128e-02, -6.33053621e-03,  1.18326535e-02,
           -2.36112420e-02,  9.13483184e-03,  5.59101533e-03,  1.09400013e-02,
            4.77387244e-03, -1.54347951e-02, -1.35055669e-02, -2.90185958e-02,
            1.35819204e-02,  2.80883280e-03,  3.43523137e-02, -2.22271457e-02,
    

           …………..

     

  3. Generate sentence vectors for the entire dataset.

    df["vector"] = df["title"].apply(generateVector)


Calculate Spatial Distance

Calculate the spatial distance between the ROI vector and the vectors of the remaining dataframe titles to determine which articles are related to the ROI. The smaller the distance, the more closely related the content.

from scipy import spatial

def spatialDistance(vector1, vector2):
    return spatial.distance.euclidean(vector1, vector2)

Here, vector1 is the ROI vector generated above, and vector2 is the vector of each title in the dataframe's vector column.
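As a quick sanity check, spatialDistance can be exercised on toy vectors; the 2-D vectors below are illustrative stand-ins, not real fastText embeddings:

```python
import numpy as np
from scipy import spatial

def spatialDistance(vector1, vector2):
    return spatial.distance.euclidean(vector1, vector2)

# Toy 2-D vectors (real fastText sentence vectors are 300-dimensional).
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

d = spatialDistance(a, b)  # sqrt(3^2 + 4^2) = 5.0
```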

Generate the distance as a score between the static ROI and each of the remaining dataframe titles:

df["score"] = df.apply(lambda x: spatialDistance(vector1, x["vector"]), axis=1)

Sort the dataframe by the score column to determine the most closely related titles:

df.drop_duplicates(subset=["score"]).sort_values(by=["score"])
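The scoring and sorting steps above can be sketched end to end on toy data; everything below (the titles and 2-D vectors) is invented for illustration and stands in for the real 300-dimensional embeddings:

```python
import numpy as np
import pandas as pd
from scipy import spatial

# Invented titles with toy 2-D "embeddings" standing in for fastText vectors.
df = pd.DataFrame({
    "title": ["about phones", "about rockets", "about chips"],
    "vector": [np.array([1.0, 0.1]), np.array([0.0, 1.0]), np.array([0.9, 0.2])],
})
vector1 = np.array([1.0, 0.0])  # stand-in for the ROI vector

# Score every row by its Euclidean distance to the ROI, then sort ascending.
df["score"] = df.apply(lambda x: spatial.distance.euclidean(vector1, x["vector"]), axis=1)
ranked = df.drop_duplicates(subset=["score"]).sort_values(by=["score"])
```

With these toy vectors, the titles closest in vector space rank first, mirroring how the real dataset is ranked below.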

From the dataset, the top 10 article titles related to the ROI are:

In:

outputs = df.drop_duplicates(subset=["score"]).sort_values(by=["score"])[0:10]["title"].tolist()


Out:

['OnePlus phones go cheaper on Amazon up to Rs 10,000, lots of EMI and exchange offers on latest models',
 'Samsung to invest nearly $500 mn to set up display factory in India',
 'Samsung Galaxy S20+ gets listed on Geekbench, revealed to bring 120Hz display, 8K video and more',
 'Worldwide spend on robotics systems, drones to hit $128.7 billion in 2020',
 'This is the pitch deck that the CEO of AI startup Directly used to convince its top customers Microsoft and Samsung to invest in a $20 million round',
 'Dell is working on a software to let users control iPhones from their laptops',
 "Here's an exclusive look at the pitch deck AI privacy startup Mine used to raise $3 million to help people ask companies to delete their data",
 'Samsung offering instant cashback of up to Rs 20,000 on Galaxy S10 series',
 'Google exec reveals how its cloud is helping retailers to keep their sites from crashing on their biggest shopping days of the year',
 "Here's the pitch deck that email startup Front used to get get top tech execs like Zoom CEO Eric Yuan to invest in its $59 million Series C round"]

Another Example:

If the ROI is "SpaceX launches third batch of 60 Starlink mini satellites",

In:

SpaceX launches third batch of 60 Starlink mini satellites

Out:
['SpaceX launches third batch of Starlink satellites',
 "ISRO's GSAT 30 satellite successfully rides the Ariane 5 rocket into orbit abroad the first launch of 2020",
 'SpaceX launch LIVE stream: Watch Elon Musk blast next Starlink satellites into orbit today',
 'ISRO targets to launch 19 satellites within a period of 7 months',
 'Asteroid alert: NASA tracks four large space rocks racing towards Earth in next 48 hours',
 'Huawei launches Mate 30 Pro 5G outside of China for first time, enters UAE',
 "ISRO's first mission of the decade on this date! Ariane rocket to launch GSAT-30 satellite",
 'Samsung teases launch of new Galaxy phone in 11 Feb event announcement',
 "SpaceX launch LIVE stream: Watch Elon Musk's first launch of 2020 online HERE",
 'NASA news: Space agency outlines goals for 2020 including a launch to Mars']

Conclusion

In this tutorial, we have discussed generating related content using fastText sentence embeddings and a spatial-distance calculation. We can also try replacing the Euclidean distance with the cosine similarity between the vectors. Pre-processing techniques such as lemmatization, stemming, and stop-word removal can be applied to the dataset before vector generation to improve the quality of the results. This specific use case of generating related content can also be extended into a recommendation system that accounts for users' interests.
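For instance, swapping in cosine similarity is a small change with SciPy; here is a minimal sketch on toy vectors (note that spatial.distance.cosine returns the cosine *distance*, i.e. 1 minus the similarity):

```python
from scipy import spatial

# SciPy exposes cosine distance; similarity is recovered by subtracting from 1.
def cosineSimilarity(v1, v2):
    return 1.0 - spatial.distance.cosine(v1, v2)

# Toy vectors for illustration (not real fastText embeddings).
sim_same = cosineSimilarity([1.0, 0.0], [1.0, 0.0])  # identical direction -> 1.0
sim_orth = cosineSimilarity([1.0, 0.0], [0.0, 1.0])  # orthogonal -> 0.0
```

Unlike with a distance, a higher similarity score means more closely related content, so the dataframe would be sorted in descending order of score.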

 

 
