Reservoir Transformers

This paper demonstrates that Transformers obtain similar or better performance when some of the layers are randomly initialized and never updated. These random layers are interleaved with normal layers to create random features. Experiments on machine translation (MT) and masked language modeling (MLM) provide evidence that performance is not hurt at all.
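To make the setup concrete for myself, here is a toy sketch of the idea. It is not the paper's implementation: it uses a stack of scalar `tanh` layers instead of Transformer layers, and the names `build_stack`, `forward`, `sgd_step`, and the `frozen_every` parameter are all hypothetical. It only illustrates the pattern of randomly initializing every layer, marking some as frozen, and skipping those in the update step.

```python
import math
import random


def build_stack(num_layers, frozen_every, seed=0):
    """Build a toy stack; every `frozen_every`-th layer is a frozen random layer."""
    rng = random.Random(seed)
    layers = []
    for i in range(num_layers):
        layers.append({
            # Every layer starts from a random weight, frozen or not.
            "w": rng.uniform(-1.0, 1.0),
            # Assumption: frozen layers sit at a fixed interval in the stack.
            "frozen": (i % frozen_every) == frozen_every - 1,
        })
    return layers


def forward(layers, x):
    """Return the input followed by the activation of each layer."""
    acts = [x]
    for layer in layers:
        acts.append(math.tanh(layer["w"] * acts[-1]))
    return acts


def sgd_step(layers, x, target, lr=0.1):
    """One SGD step on squared error; frozen layers are never updated."""
    acts = forward(layers, x)
    grad = 2.0 * (acts[-1] - target)  # d(loss)/d(output)
    for i in reversed(range(len(layers))):
        layer = layers[i]
        local = 1.0 - acts[i + 1] ** 2       # tanh'(z) = 1 - tanh(z)^2
        grad_w = grad * local * acts[i]      # gradient w.r.t. this layer's weight
        grad = grad * local * layer["w"]     # gradient passed to the layer below
        if not layer["frozen"]:
            layer["w"] -= lr * grad_w        # only trainable layers move
    return (acts[-1] - target) ** 2


layers = build_stack(num_layers=4, frozen_every=2)
for _ in range(20):
    sgd_step(layers, x=0.5, target=0.3)
```

Note that the gradient still flows through a frozen layer to the trainable layers below it, so those lower layers keep learning even though the frozen layer's own weight never changes.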


  • Hard to believe this. What is happening under the hood? I know I cannot expect an ACL paper to explain these dark secrets, but I’m really confused.
  • The following statement is not true:

These findings raise interesting repercussions for the study of “BERTology”, as it clearly shows that even completely random and frozen layers can represent linguistic phenomena.

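because even though these layers are random, at least one layer below each of them is not random and is trained to encode some linguistic knowledge. Whatever a probe finds in a frozen layer may therefore just be a random transformation of features learned by the trained layer beneath it. The authors seem to be satirizing linguists.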

