In the last article, we saw how to perform extractive summarization, which ranks the sentences of a text and outputs the most important ones without changing them. While extractive summaries are suitable for some cases, they do not achieve the sophistication of human-written summaries. For that, we need to look at abstractive summarization.
Put simply, in abstractive summarization a model reads the text and tries to make sense of it. It then writes the summary from scratch, including the most important information it finds. This is how a human summarizes text, and for this reason abstractive summarization can achieve results that are close to human-written summaries.
Language Models In AI
Almost all abstractive summarization techniques use some kind of language model. A language model mimics how humans understand natural language. Abstractive summarization is mostly done by taking a pre-trained language model and fine-tuning it for a specific task, such as summarization or question-answer generation. We will briefly discuss one of the most popular language models available to us, BERT.
BERT (Bidirectional Encoder Representations from Transformers)
BERT was developed by Google as a model that understands human language well and can be used for several language-related tasks. To make this work, BERT is pre-trained on two unsupervised learning tasks, which are run simultaneously. The first is masked language modeling, where BERT is given a sentence with some words taken out and tries to guess the missing words, a bit like fill in the blanks. The second is next sentence prediction, where BERT is given a pair of sentences and tries to guess whether the second one actually follows the first in the original text. This process is repeated, and over time the model gets really good at understanding language context as well as the underlying meaning.
This creates a basic language model, which is then fine-tuned for the specific task we need the model to do. For example, say we need to fine-tune BERT for summarization. We take the model and add a few task-specific layers on top of it, with the text as the input and the summary as the output. Since BERT already has an understanding of the language and the context of words, this step is fairly easy and does not take much training time.
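To make the two pre-training tasks concrete, here is a minimal sketch of how training examples for them could be built from plain text. It is a toy illustration, not BERT's actual data pipeline: the function names and the seed parameter are made up for this example, and the masking is simplified to always substitute a [MASK] placeholder (real BERT picks about 15% of tokens and sometimes swaps in a random word or leaves the word unchanged). Next sentence prediction is shown as a sentence pair plus a label saying whether the second sentence really follows the first.

```python
import random

MASK = "[MASK]"  # placeholder string used by this sketch


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide some tokens; labels keep the original word at hidden positions, None elsewhere."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # the model must guess this word
            labels.append(tok)    # the answer it is trained towards
        else:
            masked.append(tok)
            labels.append(None)   # nothing to predict here
    return masked, labels


def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) examples from consecutive sentences."""
    if len(sentences) < 3:
        raise ValueError("need at least three sentences to draw negative examples")
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the true next sentence.
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: any sentence except sentence i and its true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs


if __name__ == "__main__":
    print(mask_tokens("the cat sat on the mat".split(), mask_prob=0.3))
    print(make_nsp_pairs(["I went out.", "It was raining.", "I took an umbrella."]))
```

During pre-training, the model only sees the masked tokens and the sentence pairs; the labels are what its predictions are compared against.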
The Pegasus Model, And How It Is Different
Later, researchers at Google found that fine-tuning general-purpose pre-trained language models for summarization may not work well in all cases. So they created Pegasus, a model whose pre-training is designed specifically for abstractive summarization.
They call this new training method Gap Sentences Generation. The idea is that instead of masking out only words, as happens in BERT, they mask out entire sentences and ask the model to generate the removed sentences. As the model tries to recover the missing context, it gets better at tasks that resemble abstractive summarization.
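The sentence-masking idea can be sketched in a few lines. The example below is a toy version under stated assumptions: it picks the sentence to remove by a simple word-overlap F1 score against the rest of the document, which stands in for the ROUGE-based selection used in the Pegasus paper, and the `<mask_1>` string and all function names are placeholders chosen for this sketch.

```python
import re

MASK_SENT = "<mask_1>"  # placeholder for a removed sentence (assumption of this sketch)


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def words(sentence):
    """Lowercased set of words in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def overlap_score(index, sentences):
    """F1 of the word overlap between one sentence and the rest of the document."""
    target = words(sentences[index])
    rest = set().union(*(words(s) for j, s in enumerate(sentences) if j != index))
    common = len(target & rest)
    if common == 0:
        return 0.0
    precision = common / len(target)
    recall = common / len(rest)
    return 2 * precision * recall / (precision + recall)


def make_gsg_example(text, n_masked=1):
    """Return (source, target): the document with sentences masked, and the removed sentences."""
    sentences = split_sentences(text)
    # Rank sentences by how much they share with the rest of the document.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: overlap_score(i, sentences), reverse=True)
    chosen = sorted(ranked[:n_masked])
    source = " ".join(MASK_SENT if i in chosen else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in chosen)
    return source, target


if __name__ == "__main__":
    doc = ("Huawei overtook Samsung in phone shipments. "
           "Huawei shipped 55.8 million phones. "
           "The weather was nice.")
    src, tgt = make_gsg_example(doc)
    print("source:", src)
    print("target:", tgt)
```

The model is trained to produce the target from the source, which is structurally the same job as writing a summary: generating sentences that capture what the rest of the document is about.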
Using Pegasus through HuggingFace Transformers
We now show an example of using Pegasus through the HuggingFace Transformers library.
The first thing you need to do is install the necessary Python packages.
pip install transformers sentencepiece torch
We import them into our file
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch
We define the source text we are going to summarize.
src_text = [ """China’s Huawei overtook Samsung Electronics as the world’s biggest seller of mobile phones in the second quarter of 2020, shipping 55.8 million devices compared to Samsung’s 53.7 million, according to data from research firm Canalys. While Huawei’s sales fell 5 per cent from the same quarter a year earlier, South Korea’s Samsung posted a bigger drop of 30 per cent, owing to disruption from the coronavirus in key markets such as Brazil, the United States and Europe, Canalys said. Huawei’s overseas shipments fell 27 per cent in Q2 from a year earlier, but the company increased its dominance of the China market which has been faster to recover from COVID-19 and where it now sells over 70 per cent of its phones. “Our business has demonstrated exceptional resilience in these difficult times,” a Huawei spokesman said. “Amidst a period of unprecedented global economic slowdown and challenges, we’re continued to grow and further our leadership position.” Nevertheless, Huawei’s position as number one seller may prove short-lived once other markets recover given it is mainly due to economic disruption, a senior Huawei employee with knowledge of the matter told Reuters. Apple is due to release its Q2 iPhone shipment data on Friday.""" ]
We will be using the Pegasus XSum model, which we define as below.
model_name = 'google/pegasus-xsum'
Next, we choose the device to run the model on: the GPU if one is available, otherwise the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
Next, we create a tokenizer that will be used to parse and tokenize the input text.
tokenizer = PegasusTokenizer.from_pretrained(model_name)
This will download and initialize the model.
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)
We tokenize the text, which returns PyTorch tensors.
batch = tokenizer(src_text, truncation=True, padding='longest', return_tensors="pt").to(device)
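Note that truncation=True cuts off any input beyond the model's maximum input length, so the end of a very long article never reaches the model. One possible workaround, sketched below, is to split the text into chunks of whole sentences and summarize each chunk separately. The chunk_text helper and its max_words budget are assumptions made for this example; word count is only a rough proxy for the tokenizer's real token count.

```python
import re


def chunk_text(text, max_words=300):
    """Group whole sentences into chunks of roughly max_words words each."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        # A single sentence longer than max_words still becomes its own chunk.
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The resulting list of chunks can be passed to the tokenizer in place of src_text, which gives one summary per chunk.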
We take these tensors and get the abstractive summary.
translated = model.generate(**batch)
Since the output is a tensor of token IDs, we have to decode it back into text.
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
Print the tgt_text variable to see the summary.
['Huawei has overtaken Samsung as the world’s biggest seller of mobile phones, according to data from research firm Canalys.']
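If you plan to summarize more than one text, the tokenize, generate, and decode steps can be wrapped in a small helper. This is a minimal sketch: the summarize function is a name made up for this example, and the num_beams and max_length values are illustrative defaults rather than tuned settings. Both are standard arguments of the generate method.

```python
def summarize(texts, tokenizer, model, device="cpu", num_beams=4, max_length=64):
    """Tokenize, generate, and decode in one call; returns one summary per input text."""
    # Same tokenization as in the walkthrough: truncate long inputs, pad to the longest one.
    batch = tokenizer(texts, truncation=True, padding="longest",
                      return_tensors="pt").to(device)
    # num_beams controls the beam search width, max_length caps the summary length in tokens.
    summary_ids = model.generate(**batch, num_beams=num_beams, max_length=max_length)
    return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
```

With the objects created earlier, it would be called as summarize(src_text, tokenizer, model, device).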
Here we have implemented abstractive summarization using a non-BERT language model. There are alternatives if you specifically want to take a BERT model and fine-tune it for summarization. One such example is PreSumm, which produces some decent outputs with good performance. If a Pegasus-like model doesn’t suit your task, it is worth giving PreSumm a try.