Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Since the debut of BERT, many efforts have been made to revise pre-training methods. Among them, this paper might be the most extensive and comprehensive one. The authors from Google spent a lot of compute (and dollars) experimenting with many pre-training options and then contributed their empirical findings in this paper, which is definitely worth reading if you want to pre-train a transformer from scratch.

Exploration

  • Architectures
    • Encoder-decoder with a denoising objective performed the best
  • Unsupervised Objectives
    • BERT-style wins
      • Corruption rate of 15%
      • Corrupt spans with an average length of 3 (see the span-corruption sketch after this list)
  • Pre-training Data set
    • These experiments pre-train on only 2^{35} (≈ 34B) tokens, so some corpora might not be seen many times during pre-training.
    • Domain
      • 17GB of highly-rated web text (WebText-like) performs comparably to, or even better than, the 745GB C4.
      • The 20GB Wikipedia + Toronto Books Corpus combination ranks at the top on SQuAD and SuperGLUE (SGLUE).
    • Data set size
      • Training for the same number of steps without repeating data, the full data set ultimately wins
  • Training Strategy
    • Adapter layers perform the worst (a minimal adapter sketch follows this list)
    • Gradual unfreezing hardly yields results comparable to full fine-tuning
    • Multi-task learning
      • Multi-task training underperforms pre-training followed by fine-tuning on most tasks
      • Multi-task pre-training + fine-tuning improves some machine-translation (MT) tasks
  • Scaling
    • Increasing model size and training steps together works the best
    • Increasing only the batch size is not the most effective option
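
To make the winning objective concrete, here is a minimal sketch of BERT-style span corruption in T5's text-to-text format. It is an illustration only, not the authors' implementation: the sentinel names (`<extra_id_0>`, ...) follow the released T5 vocabulary, and the span sampling is simplified while still targeting roughly 15% corruption with an average span length of 3.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3):
    """Toy BERT-style span corruption in the T5 text-to-text format.

    Replaces roughly `corruption_rate` of the tokens with sentinel tokens,
    grouping corrupted positions into spans of average length ~`mean_span_len`.
    Returns (input_tokens, target_tokens).
    """
    n = len(tokens)
    num_noise = max(1, round(n * corruption_rate))
    num_spans = max(1, round(num_noise / mean_span_len))
    span_len = max(1, num_noise // num_spans)

    # Pick span start positions (simplified: overlapping spans simply merge).
    starts = sorted(random.sample(range(n), num_spans))
    corrupted = {i for s in starts for i in range(s, min(s + span_len, n))}

    inputs, targets, sentinel, i = [], [], 0, 0
    while i < n:
        if i in corrupted:
            tok = f"<extra_id_{sentinel}>"
            inputs.append(tok)            # the sentinel replaces the whole span
            targets.append(tok)           # the target repeats it, then the span
            while i < n and i in corrupted:
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(f"<extra_id_{sentinel}>")  # final sentinel closes the target
    return inputs, targets

# The inputs keep the uncorrupted text plus sentinels; the targets list the
# dropped-out spans, each preceded by its sentinel.
inp, tgt = span_corrupt("Thank you for inviting me to your party last week .".split())
```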
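
For context on the training-strategy comparison, "adapter layers" refers to small bottleneck modules inserted into an otherwise frozen pre-trained network, so that only the adapters (and layer norms) are updated during fine-tuning. Below is a minimal PyTorch-style sketch under that reading; the bottleneck size is illustrative, not a value from the paper.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: project down, apply a nonlinearity, project back up,
    and add a residual connection. During adapter-based fine-tuning only these
    parameters are trained; the original transformer weights stay frozen."""

    def __init__(self, d_model: int, d_bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```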

Summary

T5 applies the following settings:

  • Objective
    • use a mean span length of 3 and corrupt 15% of the original sequence
  • Longer training
    • The models are pre-trained for 1 million steps on a batch size of 2^{11} sequences of length 512, corresponding to a total of about 1 trillion pre-training tokens (a quick token-count check follows this list)
    • Since repetition could be harmful, the authors opted to continue using the full C4 data set rather than repeating a smaller corpus
  • Model sizes
    • 11B parameters for the largest model (feed-forward dimension d_ff = 65,536 with 128-headed attention; a rough parameter-count estimate follows this list)
  • Multi-task pre-training
    • pre-training on a multi-task mixture of unsupervised and supervised tasks before fine-tuning worked as well as pre-training on the unsupervised task alone
    • It also has the practical benefit of being able to monitor “downstream” performance for the entire duration of training, rather than just during fine-tuning.
  • Beam search
    • For tasks with long output sequences, the authors found improved performance from using beam search (an illustrative decoding call follows this list)
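
As a quick back-of-the-envelope check of the "about 1 trillion pre-training tokens" figure from the longer-training setting above:

```python
steps, batch_size, seq_len = 1_000_000, 2**11, 512
total_tokens = steps * batch_size * seq_len
print(f"{total_tokens:,}")  # 1,048,576,000,000 ≈ 1.05 trillion tokens
```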
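
And a rough parameter-count estimate for the largest model. The feed-forward dimension and head count come from the summary above; the remaining configuration values (24-layer encoder and decoder, d_model = 1024, d_kv = 128 per head, ~32k shared vocabulary, no bias terms) are taken from the T5 paper's description of the 11B variant, so treat this as a sanity check rather than an exact count.

```python
d_model, d_ff, heads, d_kv, vocab = 1024, 65_536, 128, 128, 32_128
layers = 24                            # per stack (encoder and decoder)

attn  = 4 * d_model * heads * d_kv     # Q, K, V, O projections ≈ 67M per block
ffn   = 2 * d_model * d_ff             # up- and down-projection ≈ 134M per block
embed = vocab * d_model                # shared token embedding ≈ 33M

total = (layers * (attn + ffn)         # encoder: self-attention + FFN
         + layers * (2 * attn + ffn)   # decoder: self- and cross-attention + FFN
         + embed)
print(f"{total / 1e9:.1f}B parameters")  # ≈ 11.3B, i.e. the 11B model
```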
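
Finally, an illustrative beam-search decoding call. This uses the Hugging Face `transformers` API rather than the paper's Mesh TensorFlow codebase, and the checkpoint name is just an example; the beam width of 4 and length penalty of 0.6 follow the values the paper reports for its long-output tasks (translation and summarization).

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# T5 casts every task as text-to-text, so the task is specified with a prefix.
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, length_penalty=0.6, max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```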

Rating
  • 5: Transformative: This paper is likely to change our field. It should be considered for a best paper award.
  • 4.5: Exciting: It changed my thinking on this topic. I would fight for it to be accepted.
  • 4: Strong: I learned a lot from it. I would like to see it accepted.
  • 3.5: Leaning positive: It can be accepted more or less in its current form. However, the work it describes is not particularly exciting and/or inspiring, so it will not be a big loss if people don’t see it in this conference.
  • 3: Ambivalent: It has merits (e.g., it reports state-of-the-art results, the idea is nice), but there are key weaknesses (e.g., I didn’t learn much from it, evaluation is not convincing, it describes incremental work). I believe it can significantly benefit from another round of revision, but I won’t object to accepting it if my co-reviewers are willing to champion it.
  • 2.5: Leaning negative: I am leaning towards rejection, but I can be persuaded if my co-reviewers think otherwise.
  • 2: Mediocre: I would rather not see it in the conference.
  • 1.5: Weak: I am pretty confident that it should be rejected.
  • 1: Poor: I would fight to have it rejected.
