Finetuning Pretrained Transformers into RNNs

This paper proposes a recurrent structure that mimics auto-regressive Transformers, reducing the time complexity of attention from quadratic to linear in the sequence length.

They replace the softmax attention in each head with an MLP-dot one (a dot product between MLP feature maps of the queries and keys) and restrict each token to attend only to previous tokens.
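To make the quadratic-to-linear claim concrete, here is a minimal NumPy sketch of the idea, not the authors' implementation. It assumes a single head, a one-layer ReLU feature map shared by queries and keys, and hypothetical names (`feature_map`, `W`, `b`, `eps`) that do not come from the paper. It shows that causal attention with such a kernel can be computed either with an n-by-n masked score matrix or as an RNN that carries a fixed-size state.

```python
import numpy as np

def feature_map(x, W, b):
    """Hypothetical one-layer MLP feature map: ReLU(x W + b)."""
    return np.maximum(x @ W + b, 0.0)

def causal_attention_quadratic(q, k, v, W, b, eps=1e-6):
    """Masked kernel attention computed the O(n^2) way, for reference."""
    phi_q, phi_k = feature_map(q, W, b), feature_map(k, W, b)
    scores = phi_q @ phi_k.T      # (n, n) similarity matrix
    scores = np.tril(scores)      # token i sees only tokens j <= i
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)

def causal_attention_recurrent(q, k, v, W, b, eps=1e-6):
    """Same output, computed step by step with a fixed-size state."""
    phi_q, phi_k = feature_map(q, W, b), feature_map(k, W, b)
    S = np.zeros((phi_k.shape[1], v.shape[1]))  # running sum of phi(k_j) v_j^T
    z = np.zeros(phi_k.shape[1])                # running sum of phi(k_j)
    out = np.empty_like(v)
    for i in range(q.shape[0]):
        S += np.outer(phi_k[i], v[i])           # fold token i into the state
        z += phi_k[i]
        out[i] = (phi_q[i] @ S) / (phi_q[i] @ z + eps)
    return out

# Toy inputs (random, for illustration only).
rng = np.random.default_rng(0)
n, d, r = 6, 4, 8   # sequence length, head dimension, feature dimension
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
W, b = rng.normal(size=(d, r)), rng.normal(size=r)

same = np.allclose(
    causal_attention_quadratic(q, k, v, W, b),
    causal_attention_recurrent(q, k, v, W, b),
)
print("recurrent form matches quadratic form:", same)
```

The recurrent version only ever stores `S` and `z`, whose sizes do not depend on the sequence length, which is where the linear time and constant per-step memory come from. Note that neither version reproduces softmax attention exactly; the kernel only approximates it.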


  • The method would be better described as “compression” or “distillation” than as “conversion”, since it is very lossy.
  • Although they claim that any pre-trained Transformer can be “converted”, they only experiment on a toy Transformer trained on WikiText; they do not experiment on any real-world Transformers such as BART or T5.
