Text Summarization

Summarization is the task of condensing a piece of text into a shorter version, reducing the size of the original text while preserving its key information and meaning. Since manual text summarization is time-consuming and laborious, automating it has become increasingly popular and is a strong motivation for academic research.

Finalized Approaches

All Approaches

Pegasus vs BART

Both use almost the same architecture, but they are pre-trained differently.
Neither supports long sequences: the sequence length limit is 512 for Pegasus and 1024 for BART.
Practically speaking, the inputs and outputs of both models are pretty much the same and follow the original Transformer.
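
To make those limits concrete, here is a minimal sketch (the two checkpoint names, google/pegasus-xsum and facebook/bart-large-cnn, are my assumptions; any Pegasus/BART checkpoint works the same way) that reads the limit straight from each tokenizer:

    from transformers import AutoTokenizer

    # model_max_length reflects the longest input each checkpoint accepts
    for name in ["google/pegasus-xsum", "facebook/bart-large-cnn"]:
        tok = AutoTokenizer.from_pretrained(name)
        print(name, tok.model_max_length)
    # expected: 512 for the Pegasus checkpoint, 1024 for the BART checkpoint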

The base architecture of PEGASUS is a standard Transformer encoder-decoder. Two pre-training objectives are applied simultaneously: GSG (gap-sentence generation) and MLM (masked language modeling). In the paper's example of a three-sentence document, one sentence is masked with [MASK1] and used as the target text to generate (GSG). The other two sentences remain in the input, but some of their tokens are randomly masked with [MASK2] (MLM).
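
To make the two objectives concrete, here is a toy, library-free sketch of how a single pre-training example could be built (in the actual paper the "important" sentence is chosen by scoring sentences with ROUGE against the rest of the document, and MLM was dropped for the final PEGASUS-Large model):

    import random

    # Toy 3-sentence document, mirroring the example described above.
    doc = [
        "Pegasus is pre-trained on large web and news corpora.",
        "One important sentence is removed from the document.",
        "The decoder must generate the removed sentence.",
    ]

    # GSG: mask out one whole sentence and use it as the generation target.
    gsg_target = doc[1]
    encoder_sentences = [doc[0], "[MASK1]", doc[2]]

    # MLM: randomly mask ~15% of the tokens that remain in the input.
    masked_sentences = []
    for sent in encoder_sentences:
        tokens = [t if (t == "[MASK1]" or random.random() > 0.15) else "[MASK2]"
                  for t in sent.split()]
        masked_sentences.append(" ".join(tokens))

    print("encoder input :", " ".join(masked_sentences))
    print("decoder target:", gsg_target)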

In BART, inputs to the encoder need not be aligned with the decoder outputs, which allows arbitrary noising transformations. During pre-training, a document is corrupted by replacing spans of text with mask symbols; the corrupted document is encoded with a bidirectional encoder, and the likelihood of the original document is computed with an autoregressive decoder. For fine-tuning, an uncorrupted document is fed to both the encoder and the decoder, and representations from the final hidden state of the decoder are used.
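
As a toy illustration of that corruption step (this is not BART's actual implementation; in the paper span lengths are sampled from a Poisson distribution and a span may even be empty), text infilling can be sketched like this:

    # Replace a contiguous span of tokens with a single mask symbol; the decoder
    # is trained to reproduce the original, uncorrupted token sequence.
    def text_infill(tokens, span_start, span_len, mask="<mask>"):
        return tokens[:span_start] + [mask] + tokens[span_start + span_len:]

    original = "the quick brown fox jumps over the lazy dog".split()
    corrupted = text_infill(original, span_start=2, span_len=3)

    print("encoder input (corrupted):", " ".join(corrupted))
    print("decoder target (original):", " ".join(original))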


LongFormer ED vs BigBird

Note: RoBERTa is essentially Facebook's version of Google's BERT. The architecture stays the same, but it is trained on much more data and for much longer than BERT.

To be extra clear, both of them come in two versions: an encoder-only version for tasks like sequence classification, and an encoder-decoder version for tasks like text summarization.

LongFormer's encoder-only version is based on RoBERTa, and its encoder-decoder version (LED) is based on BART.
BigBird's encoder-only version is also based on RoBERTa, and its encoder-decoder version is based on Pegasus (see the loading sketch below).
Obviously, I'll be talking about the encoder-decoder version of both, considering our main target.
For architectural details, see BART for LongFormer and Pegasus for BigBird.
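
A minimal loading sketch in Hugging Face Transformers (the checkpoint names are my assumptions) showing the two versions of each model side by side:

    from transformers import (
        LongformerForSequenceClassification,     # LongFormer, encoder-only (RoBERTa-based)
        LEDForConditionalGeneration,             # LongFormer Encoder-Decoder (BART-based)
        BigBirdForSequenceClassification,        # BigBird, encoder-only (RoBERTa-based)
        BigBirdPegasusForConditionalGeneration,  # BigBird, encoder-decoder (Pegasus-based)
    )

    longformer_encoder = LongformerForSequenceClassification.from_pretrained("allenai/longformer-base-4096")
    led = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")
    bigbird_encoder = BigBirdForSequenceClassification.from_pretrained("google/bigbird-roberta-base")
    bigbird_pegasus = BigBirdPegasusForConditionalGeneration.from_pretrained("google/bigbird-pegasus-large-arxiv")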

LongFormer replaces the full self-attention pattern with a combination of sliding-window (local) attention and global attention on selected tokens.
Unlike BigBird, LongFormer doesn't settle on a single final attention configuration; the paper experiments with several variants (e.g., different window sizes per layer, with and without dilation).
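
In Transformers this pattern is exposed through two knobs: a per-layer sliding-window size in the config and a user-supplied mask that marks which tokens get global attention. A minimal sketch (the allenai/led-base-16384 checkpoint is an assumption):

    import torch
    from transformers import AutoConfig, LEDTokenizer

    cfg = AutoConfig.from_pretrained("allenai/led-base-16384")
    print(cfg.attention_window)  # sliding-window size per encoder layer (e.g. 1024 each)

    tok = LEDTokenizer.from_pretrained("allenai/led-base-16384")
    inputs = tok("a long document " * 1000, return_tensors="pt",
                 truncation=True, max_length=4096)

    # 0 = local (sliding-window) attention, 1 = global attention.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1  # give the first token global attention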

BigBird's attention is built from three building blocks: random attention (each query attends to r random keys), sliding-window attention (each query attends to a window of w neighbouring tokens), and global attention on g tokens that attend to, and are attended by, everything. The combined BIGBIRD pattern uses all three (in the paper's figure, r = 2, w = 3, g = 2).
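
These building blocks show up directly as config options in Transformers. A quick sketch (the checkpoint name is an assumption) of the knobs that control the sparse pattern:

    from transformers import AutoConfig

    cfg = AutoConfig.from_pretrained("google/bigbird-pegasus-large-arxiv")
    print(cfg.attention_type)     # "block_sparse" (vs. "original_full" for dense attention)
    print(cfg.block_size)         # tokens per attention block (e.g. 64)
    print(cfg.num_random_blocks)  # random blocks each query block attends to (e.g. 3)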

  1. LongFormer's encoder-only version has a MaxSeq length of 4096
  2. LongFormer's encoder-decoder version (LED) has a MaxSeq length of 16384
  3. BigBird's encoder-only version has a MaxSeq length of 4096
  4. BigBird's encoder-decoder version has a MaxSeq length of 4096
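
These limits can be double-checked straight from the model configs. A minimal sketch (checkpoint names are my assumptions):

    from transformers import AutoConfig

    print(AutoConfig.from_pretrained("allenai/longformer-base-4096").max_position_embeddings)
    # 4098 = 4096 usable tokens + RoBERTa's 2-position offset
    print(AutoConfig.from_pretrained("allenai/led-base-16384").max_encoder_position_embeddings)
    # 16384
    print(AutoConfig.from_pretrained("google/bigbird-roberta-base").max_position_embeddings)
    # 4096
    print(AutoConfig.from_pretrained("google/bigbird-pegasus-large-arxiv").max_position_embeddings)
    # 4096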

Will Pegasus work here ?

The MaxSeqLen of Pegasus is 512, so it won't work here.

Answer: NO

Will BART work here ?

The MaxSeqLen of BART is 1024, and the average size of our sequences (in the dataset) is ~1000 tokens (ranging from 500 to 1500).
I think it's worth a try, truncating a few hundred tokens from the end of the longer documents (sketch below).

Answer: YES (maybe)
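
A minimal sketch of the truncation idea (facebook/bart-large-cnn is an assumed checkpoint; in practice we'd fine-tune on our own data): let the tokenizer drop everything past BART's 1024-token limit, so documents near the 1500-token end of our range lose their tail.

    from transformers import BartTokenizer, BartForConditionalGeneration

    tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

    document = "..."  # one of our ~500-1500 token documents
    inputs = tok(document, return_tensors="pt", truncation=True, max_length=1024)

    summary_ids = model.generate(inputs["input_ids"],
                                 attention_mask=inputs["attention_mask"],
                                 num_beams=4, max_length=128)
    print(tok.decode(summary_ids[0], skip_special_tokens=True))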

Will LongFormer ED (LED) work here ?

The MaxSeqLen of LED is 16384, and the average size of our sequences (in the dataset) is ~1000 tokens (ranging from 500 to 1500).
I think this will work for sure (see the sketch below)!

Answer: YES
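
A minimal LED sketch (allenai/led-base-16384 is an assumed checkpoint; it would still be fine-tuned on our data). No truncation is needed at our lengths; the only LED-specific detail is the global attention mask on the first token:

    import torch
    from transformers import LEDTokenizer, LEDForConditionalGeneration

    tok = LEDTokenizer.from_pretrained("allenai/led-base-16384")
    model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

    document = "..."  # a full document from our dataset, no truncation needed
    inputs = tok(document, return_tensors="pt")

    # LED convention for summarization: global attention on the first token only.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1

    summary_ids = model.generate(inputs["input_ids"],
                                 attention_mask=inputs["attention_mask"],
                                 global_attention_mask=global_attention_mask,
                                 num_beams=4, max_length=128)
    print(tok.decode(summary_ids[0], skip_special_tokens=True))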

Will BigBird work here ?

The MaxSeqLen of BigBird is 4096, and the average size of our sequences (in the dataset) is ~1000 tokens (ranging from 500 to 1500).
I think this will work for sure (see the sketch below)!

Answer: YES
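
A minimal BigBird-Pegasus sketch (google/bigbird-pegasus-large-arxiv is an assumed checkpoint, pre-trained on scientific papers, so it would still need fine-tuning on our data). Note that for inputs too short for the sparse pattern, Transformers falls back from block-sparse to full attention automatically:

    from transformers import AutoTokenizer, BigBirdPegasusForConditionalGeneration

    tok = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")
    model = BigBirdPegasusForConditionalGeneration.from_pretrained("google/bigbird-pegasus-large-arxiv")

    document = "..."  # a full document from our dataset (fits easily within 4096 tokens)
    inputs = tok(document, return_tensors="pt", truncation=True, max_length=4096)

    summary_ids = model.generate(inputs["input_ids"],
                                 attention_mask=inputs["attention_mask"],
                                 num_beams=4, max_length=128)
    print(tok.decode(summary_ids[0], skip_special_tokens=True))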

Will RoBERTa (BERT) work here ?

The MaxSeqLen of both RoBERTa and BERT is 512, so they won't work directly.
Even though some models (e.g., LongFormer [encoder-only] and BigBird [encoder-only]) fix the sequence-length problem by customizing the attention pattern, they are still encoder-only models, so we would still need to train a decoder from scratch (see the sketch below).

Answer: NO
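
For completeness, here is a sketch (roberta-base is an assumed checkpoint) of what bolting a decoder onto RoBERTa looks like with Transformers' EncoderDecoderModel; the decoder's cross-attention is freshly initialized, so the whole thing still has to be trained end-to-end on summarization data:

    from transformers import EncoderDecoderModel, RobertaTokenizer

    tok = RobertaTokenizer.from_pretrained("roberta-base")
    # Warm-start both encoder and decoder from RoBERTa; the cross-attention
    # layers added to the decoder are randomly initialized.
    model = EncoderDecoderModel.from_encoder_decoder_pretrained("roberta-base", "roberta-base")

    # Required before seq2seq training / generation.
    model.config.decoder_start_token_id = tok.cls_token_id
    model.config.pad_token_id = tok.pad_token_id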

Will GPT-2 work here ?

The MaxSeqLen of GPT is 512 and of GPT-2 is 1024 (respectively), and the average size of our sequences (in the dataset) is ~1000 tokens (ranging from 500 to 1500).
I think it's worth a try, truncating a few hundred tokens from the end of the longer documents (sketch below).

Answer: YES (maybe)
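
A minimal sketch of the zero-shot trick from the GPT-2 paper (the gpt2 checkpoint name and the 900-token truncation budget are my assumptions): append "TL;DR:" to the truncated document and let the model continue; the document plus the generated summary must fit in the 1024-token window.

    import torch
    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    document = "..."  # a document from our dataset
    # Truncate the document first so the "TL;DR:" prompt itself never gets cut off.
    doc_ids = tok(document, truncation=True, max_length=900)["input_ids"]
    prompt_ids = tok("\nTL;DR:")["input_ids"]
    input_ids = torch.tensor([doc_ids + prompt_ids])

    # The GPT-2 paper generated ~100 tokens with top-k sampling (k = 2).
    output_ids = model.generate(input_ids, max_new_tokens=100,
                                do_sample=True, top_k=2,
                                pad_token_id=tok.eos_token_id)
    print(tok.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True))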