In this article, we discuss abstractive summarization using T5 and how T5 differs from BERT-based models.
T5 (Text-To-Text Transfer Transformer) is a transformer model that is trained in an end-to-end manner with text as input and modified text as output, in contrast to BERT-style models that can only output either a class label or a span of the input. This text-to-text formatting makes the T5 model fit for multiple NLP tasks like Summarization, Question-Answering, Machine Translation, and Classification problems.
How is T5 different from BERT?
Both T5 and BERT are pre-trained with an MLM (Masked Language Model) objective.
What is MLM?
MLM is a fill-in-the-blank task: part of the input text is masked, and the model tries to predict what the masked word should be.
- “I like to eat peanut butter and <MASK> sandwiches,”
- “I like to eat peanut butter and jelly sandwiches,”
In terms of masking, the main difference is that T5 replaces a span of multiple consecutive tokens with a single mask (sentinel) token, whereas BERT uses one mask token for each masked token. This is illustrated below.
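To make the difference concrete, here is a minimal sketch in plain Python. It is not the pre-processing code of either model: it assumes whitespace tokenization and a hand-picked span, and the helper names are hypothetical. The `[MASK]` token follows BERT's convention, and `<extra_id_0>` and `<extra_id_1>` follow the sentinel naming of the HuggingFace T5 tokenizer.

```python
def bert_style_mask(tokens, start, end, mask_token="[MASK]"):
    """Replace every token in tokens[start:end] with its own mask token."""
    return tokens[:start] + [mask_token] * (end - start) + tokens[end:]


def t5_style_mask(tokens, start, end,
                  sentinel="<extra_id_0>", end_sentinel="<extra_id_1>"):
    """Replace the whole span tokens[start:end] with a single sentinel.

    Returns the corrupted input and the target the model learns to generate:
    the sentinel, the dropped tokens, and a closing sentinel.
    """
    corrupted = tokens[:start] + [sentinel] + tokens[end:]
    target = [sentinel] + tokens[start:end] + [end_sentinel]
    return corrupted, target


tokens = "I like to eat peanut butter and jelly sandwiches".split()

# mask the two-token span "peanut butter" (positions 4 and 5)
bert_input = bert_style_mask(tokens, 4, 6)
t5_input, t5_target = t5_style_mask(tokens, 4, 6)

print(" ".join(bert_input))
print(" ".join(t5_input))
print(" ".join(t5_target))
```

The BERT-style input contains two mask tokens, one per masked word, while the T5-style input contains a single sentinel for the whole span, and the model generates the missing words as text.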
About T5 Models
Google has released pre-trained T5 models, which were trained on a large unlabelled text corpus called C4 (Colossal Clean Crawled Corpus). C4 is about 800 GB of cleaned text extracted from the web. The cleaning process involves deduplication, discarding incomplete sentences, and removing offensive or noisy content.
You can get these T5 pre-trained models from the HuggingFace website:
- T5-small with 60 million parameters.
- T5-base with 220 million parameters.
- T5-large with 770 million parameters.
- T5-3B with 3 billion parameters.
- T5-11B with 11 billion parameters.
T5 expects a prefix before the input text so that it knows which task to perform. For example, “summarize:” for summarization, “cola sentence:” for classification (linguistic acceptability), and “translate English to German:” for machine translation (the released checkpoints cover translation from English to German, French, and Romanian). The image below illustrates this.
Every task we consider uses text as input to the model, which is trained to generate some target text. This allows us to use the same model, loss function, and hyperparameters across our diverse set of tasks including translation (green), linguistic acceptability (red), sentence similarity (yellow), and document summarization (blue).
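As a rough sketch of how the prefix is applied, the input string is simply the prefix followed by the text. The helper name and dictionary keys below are hypothetical; “summarize:” and “cola sentence:” are the prefixes named above, and “translate English to German:” is the translation prefix used in the T5 paper's examples.

```python
# Task prefixes; the dictionary keys are hypothetical labels for illustration.
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "acceptability": "cola sentence: ",
    "translation_en_de": "translate English to German: ",
}


def build_t5_input(task, text):
    """Prepend the task prefix so the model knows which task to perform."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"Unknown task: {task!r}")
    return TASK_PREFIXES[task] + text.strip()


print(build_t5_input("summarization", "Huawei overtook Samsung in Q2 2020."))
```

Because every task is reduced to a prefixed string, the same model and the same generation code can serve all of them.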
Besides the improved transformer architecture and massive unsupervised training data, better decoding methods have also played an important role. Currently, the most prominent decoding methods are Greedy Search, Beam Search, Top-K Sampling, and Top-p Sampling.
Visit this link for detailed information about these methods.
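For intuition, here is a toy sketch in plain Python of how three of these methods choose the next token from a probability distribution. The distribution and the function names are made up for illustration, and beam search, which keeps several candidate sequences alive at each step, is left out to keep the sketch short. Real implementations work on model logits rather than a dictionary.

```python
def greedy_pick(probs):
    """Greedy search: always take the single most likely next token."""
    return max(probs, key=probs.get)


def top_k_filter(probs, k):
    """Top-K sampling: keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p):
    """Top-p sampling: keep the smallest set of most likely tokens whose
    cumulative probability reaches p, then renormalize."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


# hypothetical next-token distribution, for illustration only
next_token_probs = {"jelly": 0.5, "honey": 0.25, "banana": 0.125, "cheese": 0.125}

print(greedy_pick(next_token_probs))
print(top_k_filter(next_token_probs, k=2))
print(top_p_filter(next_token_probs, p=0.8))
```

In the two sampling methods, the next token would then be drawn at random from the filtered distribution, for example with `random.choices`.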
Using T5 through the HuggingFace transformers:
HuggingFace Transformers is an open-source NLP library that helps load pre-trained models, much as scikit-learn does for machine learning algorithms.
We define the content we are going to summarize.
content = "China’s Huawei overtook Samsung Electronics as the world’s biggest seller of mobile phones in the second quarter of 2020, shipping 55.8 million devices compared to Samsung’s 53.7 million, according to data from research firm Canalys. While Huawei’s sales fell 5 per cent from the same quarter a year earlier, South Korea’s Samsung posted a bigger drop of 30 per cent, owing to disruption from the coronavirus in key markets such as Brazil, the United States and Europe, Canalys said. Huawei’s overseas shipments fell 27 per cent in Q2 from a year earlier, but the company increased its dominance of the China market which has been faster to recover from COVID-19 and where it now sells over 70 per cent of its phones. “Our business has demonstrated exceptional resilience in these difficult times,” a Huawei spokesman said. “Amidst a period of unprecedented global economic slowdown and challenges, we’re continued to grow and further our leadership position.” Nevertheless, Huawei’s position as number one seller may prove short-lived once other markets recover given it is mainly due to economic disruption, a senior Huawei employee with knowledge of the matter told Reuters. Apple is due to release its Q2 iPhone shipment data on Friday."
Importing the necessary packages
from transformers import T5Tokenizer, T5ForConditionalGeneration
Loading the tokenizer and model architecture with weights
T5_PATH = 't5-large' # T5 model name
# initialize the model architecture and weights
t5_model = T5ForConditionalGeneration.from_pretrained(T5_PATH)
# initialize the model tokenizer
t5_tokenizer = T5Tokenizer.from_pretrained(T5_PATH)
The pre-trained model used here is t5-large; the other pre-trained T5 models are listed above.
Encode the text
# prepend the task prefix and encode the text into a tensor of token ids using the tokenizer
inputs = t5_tokenizer.encode("summarize: " + content, return_tensors="pt", max_length=512, padding="max_length", truncation=True)
Generate the summarized text and decode it
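# generate the summary with beam search (min_length and max_length are varied in the runs below)
summary_ids = t5_model.generate(inputs, num_beams=2, min_length=50, max_length=100)
# generate returns a batch of sequences; decode the first one
output = t5_tokenizer.decode(summary_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)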
The decoding method used here is Beam Search with num_beams set to 2.
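The encode, generate, and decode steps can be wrapped in one function so that the length settings are easy to vary. This is a sketch rather than the exact code behind the results in this article: the function name and its defaults are assumptions, and it expects a model and tokenizer that were already loaded with `from_pretrained`.

```python
def summarize(text, model, tokenizer, num_beams=2, min_length=50, max_length=100):
    """Summarize text with an already loaded T5 model and tokenizer."""
    # T5 needs the task prefix to know that it should summarize
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt",
                              max_length=512, truncation=True)
    # beam search decoding; generate returns a batch of token-id sequences
    summary_ids = model.generate(inputs, num_beams=num_beams,
                                 min_length=min_length, max_length=max_length)
    # decode the first (and only) sequence in the batch
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

With the objects defined earlier, a call would look like `summarize(content, t5_model, t5_tokenizer, min_length=50, max_length=100)`.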
With min_length 50 and max_length 50, the output is:
“Huawei overtakes Samsung as world’s biggest seller of mobile phones in second quarter of 2020. Company shipped 55.8 million devices compared to Samsung’s 53.7 million, Canalys says. Sales of Huawei’s”
The time taken to generate the summary is 8.07 seconds on a 16-core CPU host.
With min_length 50 and max_length 100, the output is:
“Huawei overtakes Samsung as world’s biggest seller of mobile phones in second quarter of 2020. Company shipped 55.8 million devices compared to Samsung’s 53.7 million, Canalys says. Sales fell 5% from same quarter a year earlier, owing to disruption from coronavirus. But company increased its dominance of the china market which has been faster to recover from COVID-19.”
The time taken to generate the summary is 14.32 seconds on a 16-core CPU host.
With min_length 100 and max_length 200, the output is:
“Huawei overtakes Samsung as world’s biggest seller of mobile phones in second quarter of 2020. Company shipped 55.8 million devices compared to Samsung’s 53.7 million, Canalys says. Sales fell 5% from same quarter a year earlier, owing to disruption from coronavirus. But Huawei increased its dominance of the china market which has been faster to recover from COVID-19.. Apple is due to release its Q2 iPhone shipment data on friday.”
The time taken to generate the summary is 23.15 seconds on a 16-core CPU host.
As you increase any of the parameters num_beams, min_length, and max_length, the time taken to generate the summary increases.
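One simple way to take such measurements is a small timing helper built on `time.perf_counter`. The helper name is hypothetical, and the workload below is a stand-in so the sketch runs without downloading a model.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# stand-in workload; replace it with the generate call to time a summary
result, seconds = timed(sum, range(1_000_000))
print(f"result={result}, took {seconds:.4f} s")
```

To time a summary, the same helper could be called as `timed(t5_model.generate, inputs, num_beams=2, min_length=50, max_length=100)`. Timings depend on the hardware, so numbers from different machines are not directly comparable.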
In this article, we have used the Beam Search decoding method. For a better summary, we suggest increasing the num_beams value and trying the other decoding methods mentioned above (Greedy Search, Top-K Sampling, and Top-p Sampling).
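As a starting point for such experiments, the sketch below collects keyword arguments for the `generate` method of HuggingFace transformers, one set per decoding method. The argument names (`num_beams`, `do_sample`, `early_stopping`, `top_k`, `top_p`) are real `generate` parameters, but the numeric values and the helper name are illustrative assumptions, not settings tested in this article.

```python
# Keyword arguments for `generate`; the numeric values are illustrative
# assumptions, not settings used in this article.
DECODING_CONFIGS = {
    "greedy": {"num_beams": 1, "do_sample": False},
    "beam_search": {"num_beams": 4, "do_sample": False, "early_stopping": True},
    "top_k": {"do_sample": True, "top_k": 50},
    # top_k=0 switches off top-k filtering so that only top-p applies
    "top_p": {"do_sample": True, "top_p": 0.92, "top_k": 0},
}


def generation_kwargs(method, min_length=50, max_length=100):
    """Merge the length limits with the settings for one decoding method."""
    if method not in DECODING_CONFIGS:
        raise ValueError(f"Unknown decoding method: {method!r}")
    return {"min_length": min_length, "max_length": max_length,
            **DECODING_CONFIGS[method]}


print(generation_kwargs("top_p"))
```

A call such as `t5_model.generate(inputs, **generation_kwargs("top_k"))` would then switch the decoding method without touching the rest of the code. The sampling methods are random, so their summaries can differ from run to run.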
With Pegasus, we can only perform abstractive summarization, but T5 can perform various NLP tasks such as Classification (e.g., Sentiment Analysis), Question-Answering, Machine Translation, and Document Summarization. We recommend that you explore the other NLP tasks T5 supports.