This paper demonstrates that Transformers achieve similar or better performance when some of their layers are randomly initialized and never updated. These random layers are interleaved with normally trained layers, effectively acting as random feature extractors. Experiments on machine translation (MT) and masked language modeling (MLM) provide evidence that performance does not suffer.
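For concreteness, here is a minimal sketch of the setup as I understand it (my own illustration, not the authors' code; the layer count, hyperparameters, and interleaving pattern are assumptions): every other layer keeps its random initialization and is excluded from the optimizer.

```python
# Sketch of interleaving frozen, randomly initialized Transformer layers
# ("reservoirs") with ordinary trainable layers. Illustrative only.
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6  # hypothetical sizes

layers = nn.ModuleList()
for i in range(n_layers):
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads)
    if i % 2 == 1:  # assumed pattern: freeze every other layer at random init
        for p in layer.parameters():
            p.requires_grad = False  # never updated during training
    layers.append(layer)

# Only the trainable layers' parameters are handed to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in layers.parameters() if p.requires_grad), lr=1e-4
)

x = torch.randn(10, 2, d_model)  # (seq_len, batch, d_model)
for layer in layers:
    x = layer(x)  # frozen layers still transform the input (random features)
```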
- This is hard to believe. What is happening under the hood? I understand that I cannot expect an ACL paper to fully explain these mechanisms, but I am genuinely confused.
- The following statement from the paper is not true:

  > These findings raise interesting repercussions for the study of “BERTology”, as it clearly shows that even completely random and frozen layers can represent linguistic phenomena.

  Even though these layers are random, at least one layer below each of them is not random and has been trained to encode some linguistic knowledge, so probing a random layer still reflects the trained representations beneath it. The authors seem to be satirizing linguists.