T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Introduction
The T5 paper is a classic and is currently widely used as a standard model in research and industry. Notably, at the time of publication, the 11B-parameter variant was among the largest models in the world by parameter count. In addition, it popularized pretrained encoder-decoder Transformers for transfer learning (concurrently with Facebook's BART). Its exhaustive experiments provide insight into pretraining objectives, architectures, datasets, and finetuning strategies.
-
What is the name of the T5 paper?
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel et al., Google
-
T5 is short for
Text-to-Text Transfer Transformer
-
T5 frames every problem as a
Text-to-Text Task
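For concreteness, here is a minimal sketch of the text-to-text framing. The task prefixes follow the format illustrated in the paper (translation, CoLA, STS-B, summarization), but the exact strings below are illustrative rather than copied from the released code.

```python
# Every task becomes an (input string, target string) pair with a task prefix.
examples = [
    # Machine translation
    ("translate English to German: That is good.", "Das ist gut."),
    # CoLA grammatical acceptability: the class label is emitted as text
    ("cola sentence: The course is jumping well.", "not acceptable"),
    # STS-B similarity: even the regression score becomes a string
    ("stsb sentence1: The rhino grazed. sentence2: A rhino is grazing.", "3.8"),
    # Abstractive summarization
    ("summarize: state authorities dispatched emergency crews tuesday ...",
     "six people hospitalized after a storm ..."),
]

for source, target in examples:
    print(f"input:  {source}\ntarget: {target}\n")
```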
-
T5 introduces the
Colossal Clean Crawled Corpus
C4 - 750 GB of cleaned English text
- It is built from a single month of scraped web text (a Common Crawl snapshot from April 2019)
- What is the pretraining objective for T5?
- Span corruption: contiguous spans are masked out and each is replaced by a sentinel token (a denoising objective similar in spirit to BART)
- Maximum likelihood using teacher forcing
- Different from BART in that the target is only the masked-out spans (delimited by sentinels) rather than the full original text
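A minimal sketch of the span-corruption objective on the example sentence from the paper. The sentinel names <extra_id_N> follow the released T5 vocabulary, but the function below is a toy illustration (the real preprocessing operates on SentencePiece ids and samples span positions and lengths from the corruption rate).

```python
def span_corrupt(tokens, spans):
    """Toy span corruption: given a token list and (start, end) spans to mask,
    build the (input, target) pair used by T5's denoising objective."""
    inp, tgt, sid = [], [], 0
    i = 0
    for start, end in sorted(spans):
        inp.extend(tokens[i:start])
        sentinel = f"<extra_id_{sid}>"
        inp.append(sentinel)          # span replaced by one sentinel in the input
        tgt.append(sentinel)          # target lists each sentinel ...
        tgt.extend(tokens[start:end]) # ... followed by the dropped-out tokens
        sid += 1
        i = end
    inp.extend(tokens[i:])
    tgt.append(f"<extra_id_{sid}>")   # final sentinel marks the end of targets
    return " ".join(inp), " ".join(tgt)

tokens = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(tokens, [(2, 4), (8, 9)])  # mask "for inviting" and "last"
print(inp)  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(tgt)  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```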
-
In T5 the unsupervised pretraining objective is
cast as text-to-text, with the target being the masked-out tokens
- T5 was trained for
1/4
as much pretraining (measured in tokens) as BERT base (note: the T5 baseline is also a larger model)
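Rough arithmetic behind that 1/4 figure, assuming the baseline pretraining configuration reported in the paper (2^19 steps, batches of 128 sequences of length 512) and BERT's reported ~137B-token budget:

```python
steps = 2 ** 19                       # 524,288 baseline pretraining steps
tokens_per_batch = 128 * 512          # 2^16 tokens per batch
t5_tokens = steps * tokens_per_batch  # 2^35, roughly 34B tokens
bert_tokens = 137e9                   # BERT's reported pretraining budget

print(f"T5 baseline: {t5_tokens / 1e9:.0f}B tokens")       # ~34B
print(f"fraction of BERT: {t5_tokens / bert_tokens:.2f}")   # ~0.25
```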
-
T5 explores the idea of a prefix attention mask (prefix LM) to allow for better
representation of task-specific prefixes
Intuition: the prefix should be able to attend to itself bidirectionally, not just attend to preceding tokens:
- Example: English prefix for NMT
- Prefix Masking is useful in NMT to allow the
input sequence to attend to itself
- Prefix Masking in T5 is analogous to BERT for classification in that
the input can attend to itself
- BERT pools its representation into the [CLS] classification token
- T5 instead outputs the class label as text directly from the language model
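A minimal sketch of what a prefix-LM attention mask looks like (NumPy, boolean mask where True means "position i may attend to position j"); packing the prefix and target into a single sequence is an assumption for illustration.

```python
import numpy as np

def prefix_lm_mask(prefix_len, target_len):
    """Prefix positions get fully-visible (bidirectional) attention over the
    prefix; target positions attend causally to everything before them."""
    n = prefix_len + target_len
    mask = np.tril(np.ones((n, n), dtype=bool))  # standard causal mask
    mask[:prefix_len, :prefix_len] = True        # prefix attends to the full prefix
    return mask

# e.g. a 7-token prefix ("translate English to German: That is good.")
# followed by a 3-token target ("Das ist gut.")
print(prefix_lm_mask(prefix_len=7, target_len=3).astype(int))
```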
- How does the T5 paper compare encoder vs decoder vs encoder-decoder architectures?
- Assume the number of parameters / FLOPs is roughly the same across variants
- They compared: a standard encoder-decoder (plus parameter-shared and reduced-layer variants), a decoder-only language model, and a prefix LM
- The T5 paper experiments
validate the importance
of the BERT-style denoising objective: it gives better performance than a simple language modeling objective
- The T5 paper compares various unsupervised objectives, including BERT-style masking,
prefix language modeling,
and deshuffling
- BERT-style does the best
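A toy illustration of the two alternative objectives, reusing the example sentence from the paper; the split point and shuffling seed are arbitrary choices for illustration, not the paper's recipe.

```python
import random

text = "Thank you for inviting me to your party last week .".split()

# Prefix language modeling: condition on the first part, predict the rest.
prefix_lm = (" ".join(text[:6]), " ".join(text[6:]))

# Deshuffling: the input is a shuffled copy; the target is the original order.
shuffled = text.copy()
random.Random(0).shuffle(shuffled)
deshuffling = (" ".join(shuffled), " ".join(text))

print("prefix LM  :", prefix_lm)
print("deshuffling:", deshuffling)
```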
- What simplifications to the BERT pretraining objective does the T5 paper explore?
- Mask all corrupted tokens (dropping BERT's trick of replacing some corrupted tokens with random tokens)
- Replace each corrupted span with a single sentinel token
- Drop corrupted tokens entirely
All showed similar performance
- The T5 variants of dropping corrupted tokens or replacing corrupted spans are appealing because they
reduce training time
compared to the regular BERT objective: the target sequences are shorter, so there are fewer tokens to process and attend over
- The T5 paper compared the effect of different corruption rates during pretraining and showed
no significant difference
- Stick with 15% from BERT
- note: would expect performance to eventually degrade at higher rates
- T5 showed that corrupting spans rather than individual tokens had
a small effect
- Larger spans do have lower computational cost, since each span is replaced by a single sentinel token, shortening the sequences
- A similar result was found in the SpanBERT paper
- Overall the T5 study on pretraining objectives showed
small differences among MLM variants
- Should also consider which variants have the lowest pretraining time
- T5 compared different pretraining datasets and showed that domain-specific datasets impacted
performance on specific downstream tasks
- Should pretrain on a relevant dataset for your task
- T5 experiments showed performance
decreased
with less pretraining data (the comparison uses an equal number of pretraining steps, so smaller datasets are simply repeated more often)
- Training on the full dataset once outperformed training on 1/64th of the dataset 64 times
- note: this provides decent empirical support for the current paradigm of training large language models for a single epoch
-
What transformer finetuning approaches are compared in T5?
Note: Normal finetuning of a classifier on top of [CLS] token representations as in BERT is not relevant for these encoder-decoder models.
- Adapter Layers: Add additional small dense layers within each transformer block (see the sketch after this card)
- Vary d, the inner dimension of these adapter layers
- Also have to update layer norm parameters
- Gradual Unfreezing: Gradually finetune more and more layers
- T5 finetuning experiments showed
degradation in performance
- It was best to finetune all parameters
- This is similar to the finding in the Gopher finetuning experiments
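A minimal sketch of an adapter layer as described above (a bottleneck of inner dimension d plus a residual connection), written in PyTorch; the class name and exact placement inside the block are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project to inner dim d, nonlinearity, up-project,
    plus a residual connection. During adapter finetuning only these layers (and
    layer norm parameters) are updated; the pretrained weights stay frozen."""

    def __init__(self, d_model: int, d: int):
        super().__init__()
        self.down = nn.Linear(d_model, d)
        self.up = nn.Linear(d, d_model)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

# e.g. inserted after a sublayer of each transformer block:
adapter = Adapter(d_model=768, d=32)  # d is the swept hyperparameter
hidden = torch.randn(4, 128, 768)     # (batch, seq, d_model)
out = adapter(hidden)
```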
-
What approaches to multitask learning are explored in T5?
Multitask learning in their text-to-text framework simply amounts to training on a mixture of different datasets (the supervised tasks alongside the unsupervised objective)
- Examples-proportional mixing
- Temperature-scaled mixing:
- Increasing the temperature decreases the imbalance between dataset sizes
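A minimal sketch of the two mixing strategies; the rate formula r_m = min(e_m, K) / Σ min(e_n, K) and the 1/T exponent follow the paper's description, but the dataset sizes and the default K below are made up for illustration.

```python
def examples_proportional(sizes, K=2**21):
    """r_m = min(e_m, K) / sum_n min(e_n, K); K caps the effective dataset size."""
    capped = [min(e, K) for e in sizes]
    total = sum(capped)
    return [c / total for c in capped]

def temperature_scaled(sizes, T, K=2**21):
    """Raise each examples-proportional rate to 1/T and renormalize.
    T = 1 recovers examples-proportional mixing; larger T moves toward equal mixing."""
    rates = [r ** (1.0 / T) for r in examples_proportional(sizes, K)]
    total = sum(rates)
    return [r / total for r in rates]

sizes = [10_000_000, 400_000, 9_000]   # hypothetical dataset sizes
print(temperature_scaled(sizes, T=1))  # heavily skewed toward the big dataset
print(temperature_scaled(sizes, T=4))  # noticeably more balanced
```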
- How does the T5 paper reveal the “bitter lesson” in machine learning?
- The bitter lesson is that scale/ computation is all that matters
- T5 showed marginal difference between lots of different training objectives and datasets
- Large scale of the model (11B parameters) seems to be all that is necessary to perform well
- Ex: GPT-3
- What scaling techniques does the T5 paper explore?
- Increasing size
- Increasing training steps
- Increasing batch size
- Ensembling (average logits of multiple models)
Increasing size seemed to do the best
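A minimal sketch of the ensembling variant (average the logits of several independently trained models, then apply the softmax once); the shapes and random logits are placeholders.

```python
import numpy as np

def ensemble_logits(logits_list):
    """Average per-model logits, then take a single softmax over the average."""
    avg = np.mean(np.stack(logits_list), axis=0)
    exp = np.exp(avg - avg.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# e.g. 4 models, each producing logits over a 32k vocabulary for one decode step
rng = np.random.default_rng(0)
logits_list = [rng.normal(size=(1, 32_000)) for _ in range(4)]
probs = ensemble_logits(logits_list)
print(probs.shape, probs.sum())  # (1, 32000) 1.0
```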
-
T5 results: achieved SOTA on
most GLUE and SuperGLUE tasks
Note: the 11B model is roughly 30 times larger than BERT-large
Conclusion
T5 is an important paper that introduced a new encoder-decoder transformer model. The exhaustive experiments provide insight into pretraining objectives, finetuning approaches, and dataset selection. The T5 model forms the basis for many other works, notably Prompt Tuning, ExT5, and T0. Many checkpoints are available from Hugging Face.
Reference
@article{DBLP:journals/corr/abs-1910-10683,
author = {Colin Raffel and
Noam Shazeer and
Adam Roberts and
Katherine Lee and
Sharan Narang and
Michael Matena and
Yanqi Zhou and
Wei Li and
Peter J. Liu},
title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text
Transformer},
journal = {CoRR},
volume = {abs/1910.10683},
year = {2019},
url = {http://arxiv.org/abs/1910.10683},
eprinttype = {arXiv},
eprint = {1910.10683},
timestamp = {Fri, 05 Feb 2021 15:43:41 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-1910-10683.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}