Finetuning Pretrained Transformers into RNNs

Jungo Kasai♡∗ Hao Peng♡ Yizhe Zhang♣ Dani Yogatama♠ Gabriel Ilharco♡ Nikolaos Pappas♡ Yi Mao♣ Weizhu Chen♣ Noah A. Smith♡♢
♡ Paul G. Allen School of Computer Science & Engineering, University of Washington ♣ Microsoft ♠ DeepMind ♢ Allen Institute for AI
{jkasai,hapeng,gamaga,npappas,nasmith}@cs.washington.edu {Yizhe.Zhang, maoyi, wzchen}@microsoft.com dyogatama@google.com

Abstract

Transformers have outperformed recurrent neural networks (RNNs) in natural language generation. But this comes with a significant computational cost, as the attention mechanism’s complexity scales quadratically with sequence length. Efficient transformer variants have received increasing interest in recent works. Among them, a linear-complexity recurrent variant has proven well suited for autoregressive generation. It approximates the softmax attention with randomized or heuristic feature maps, but can be difficult to train and may yield suboptimal accuracy. This work aims to convert a pretrained transformer into its efficient recurrent counterpart, improving efficiency while maintaining accuracy. Specifically, we propose a swap-then-finetune procedure: in an off-the-shelf pretrained transformer, we replace the softmax attention with its linear-complexity recurrent alternative and then finetune. With a learned feature map, our approach provides an improved tradeoff between efficiency and accuracy over the standard transformer and other recurrent variants. We also show that the finetuning process has lower training cost relative to training these recurrent variants from scratch. As many models for natural language tasks are increasingly dependent on large-scale pretrained transformers, this work presents a viable approach to improving inference efficiency without repeating the expensive pretraining process.1

1 Introduction

Transformer models (Vaswani et al., 2017) have advanced the state of the art beyond recurrent neural network models (e.g., LSTMs, Hochreiter and Schmidhuber, 1997; GRUs, Cho et al., 2014) across a wide range of natural language processing tasks. In particular, the transformer architecture has been widely used in autoregressive modeling such as language modeling (Baevski and Auli, 2019) and machine translation (Vaswani et al., 2017).

∗ Work was done during an internship at Microsoft.
1 https://github.com/jungokasai/T2R/.

The transformer makes crucial use of interactions between feature vectors over the input sequence through the attention mechanism (Bahdanau et al., 2015). However, this comes with significant computation and memory footprint during generation. Since the output is incrementally predicted conditioned on the prefix, generation steps cannot be parallelized over time steps and require quadratic time complexity in sequence length. The memory consumption in every generation step also grows linearly as the sequence becomes longer. This bottleneck for long sequence generation limits the use of large-scale pretrained transformers, such as GPT-3 (Brown et al., 2020), Image Transformer (Parmar et al., 2018), and DALL-E (Ramesh et al., 2021).

Recent work aims at reducing the overhead of autoregressive transformers (Child et al., 2019; Kitaev et al., 2020; Beltagy et al., 2020, inter alia). Among them are recurrent alternatives that approximate the standard softmax attention (Katharopoulos et al., 2020; Peng et al., 2021; Choromanski et al., 2021; Schlag et al., 2021).
Similar to recurrent neural networks (RNNs), those models represent the context by a recurrent state with a fixed size, thereby achieving linear time and constant memory complexity in generation sequence length. When the recurrent state size is smaller than the sequence length, these variants provide substantial speed and memory advantages over the transformer. A small state size, however, tends to deteriorate the generation quality (Peng et al., 2021), leading to a tradeoff between efficiency and accuracy.

This work improves the balance between efficiency and accuracy by a conversion approach: instead of training a recurrent alternative from scratch, we develop a method to convert a pretrained transformer into an efficient RNN that speeds up generation and reduces memory footprints. Our conversion proceeds with a swap-then-finetune process. Specifically, we change the exponential similarity function in the attention mechanism to the dot product after a single-layer MLP feature mapping. We then finetune the MLP parameters and the other network parameters. Our experiments in language modeling and machine translation show that the conversion can compress the context into a much smaller recurrent state than the sequence length (e.g., 1/16 of the sequence length in WikiText-103 language modeling) while retaining high accuracy. In addition, this conversion requires much less GPU time than training randomly initialized models from scratch.

State-of-the-art models in many natural language tasks are increasingly dependent on large-scale pretrained transformer models (e.g., GPT-2, Radford et al., 2019; BERT, Devlin et al., 2019; RoBERTa, Liu et al., 2019; T5, Raffel et al., 2020; BART, Lewis et al., 2020; DeBERTa, He et al., 2021). Converting a large off-the-shelf transformer to a lightweight inference model without repeating the whole training procedure is particularly useful in many downstream applications. Our work focuses on text generation and presents a viable approach towards efficient inference with high accuracy.

2 Convert a Transformer into an RNN

The transformer architecture consists of multihead attention, feedforward, and layer normalization modules (Vaswani et al., 2017). When a transformer is trained for a sequence generation task with teacher forcing (Williams and Zipser, 1989), the attention can be parallelized over positions because the target sequence is fully available. During generation, on the other hand, the output is incrementally constructed. As a result, the attention becomes an inference bottleneck for long sequences. We present a method to eliminate this bottleneck by converting a pretrained transformer into an efficient RNN of linear time and constant space complexity. We provide a detailed complexity analysis in terms of the sequence length and model dimensions.

2.1 Multihead Attention

The attention module takes as input sequences of source and target vectors. The source vectors are used to produce key and value features, while the target vectors are mapped to query vectors. More formally, denote by $\{\mathbf{x}^{\mathrm{tgt}}_i\}_{i=1}^{N}$ and $\{\mathbf{x}^{\mathrm{src}}_j\}_{j=1}^{M}$ the target and source vectors, where $\mathbf{x}^{\mathrm{tgt}}_i, \mathbf{x}^{\mathrm{src}}_j \in \mathbb{R}^{h}$ and $h$ is the model dimensionality. We assume $r$ attention heads of $d$ dimensions ($h = dr$). For each head, the input vectors are first mapped to $d$ dimensional query, key, and value features by learned affine transformations with $\mathbf{W}_{*} \in \mathbb{R}^{d \times h}$ and $\mathbf{b}_{*} \in \mathbb{R}^{d}$:

$$\mathbf{q}_i = \mathbf{W}_q \mathbf{x}^{\mathrm{tgt}}_i + \mathbf{b}_q, \tag{1a}$$
$$\mathbf{k}_j = \mathbf{W}_k \mathbf{x}^{\mathrm{src}}_j + \mathbf{b}_k, \qquad \mathbf{v}_j = \mathbf{W}_v \mathbf{x}^{\mathrm{src}}_j + \mathbf{b}_v. \tag{1b}$$
The similarities of each query vector $\mathbf{q}_i$ with all $M$ key vectors are computed and normalized to produce attention coefficients, which are then used to output a weighted average of the value vectors (Vaswani et al., 2017):

$$\mathbf{x}^{\mathrm{out}}_i = \sum_{j=1}^{M} \frac{\mathrm{sim}(\mathbf{q}_i, \mathbf{k}_j)}{\sum_{\ell=1}^{M} \mathrm{sim}(\mathbf{q}_i, \mathbf{k}_\ell)} \, \mathbf{v}_j, \tag{2a}$$
$$\mathrm{sim}(\mathbf{x}, \mathbf{y}) = \exp\!\left(\mathbf{x} \cdot \mathbf{y} / \sqrt{d}\right). \tag{2b}$$

Multihead attention runs this procedure for each of the $r$ heads in parallel and concatenates $r$ output vectors to get the final $h$ dimensional vector.2

2 Layer normalization (Ba et al., 2016), residual connection (He et al., 2016), and projection are suppressed for brevity.

Figure 1: Attention computation steps and their time complexity in pretrained transformer and T2R models during inference generation. Features $\phi(\mathbf{q}_i)$ and $\phi(\mathbf{k}_j)$ are directly computed from input vectors, and $\mathbf{q}_i$ and $\mathbf{k}_j$ are never constructed. $M$: source length; $N$: target length; $h$: model dimensions; $k$: feature size; $r$: # heads.

Generation Speed Overhead Fig. 1 depicts the transformer computation steps from input vectors and their time complexity. We assume that the time complexity of multiplying an $n \times m$ matrix by an $m \times k$ one is $O(nmk)$ as implemented in cuBLAS (NVIDIA, 2014).3 It consists of the following two stages.

• Feature Mapping: computation of $\{\mathbf{q}_i\}_{i=1}^{N}$, $\{\mathbf{k}_j\}_{j=1}^{M}$, and $\{\mathbf{v}_j\}_{j=1}^{M}$ for all $r$ heads from input vectors (Eqs. 1a–1b). Time complexity of $O(Nh^2)$, $O(Mh^2)$, and $O(Mh^2)$.
• Attention: weighted average over the value vectors (Eq. 2a). $O(MNh)$, quadratic in sequence length ($M$, $N$).

3 If the batch size is small enough, parallelization can speed up matrix multiplication.

Generation Memory Overhead In autoregressive generation, query, key, and value vectors consume space complexity of $O(h)$, $O(Mh)$, and $O(Mh)$ in every generation step. Every step’s attention weight (Eq. 2a) spans over $M$ source positions, taking $O(Mr)$ space, linear in sequence length $M$.

2.2 Converting Transformers to RNNs

To address this generation bottleneck of quadratic time and linear space, we propose Transformer-to-RNN (T2R), a method to convert a pretrained transformer to an RNN inference model of linear time and constant memory complexity in sequence length (Fig. 1). T2R follows a swap-then-finetune procedure that modifies the attention computation of a pretrained transformer, and finetunes the model with the task objective. We first replace the dot-then-exponential similarity function in a pretrained transformer (Eq. 2b) by

$$\widetilde{\mathrm{sim}}(\mathbf{x}, \mathbf{y}) = \phi(\mathbf{x}) \cdot \phi(\mathbf{y}), \tag{3a}$$
$$\phi(\mathbf{x}) = \mathrm{relu}(\mathbf{W}_\phi \mathbf{x} + \mathbf{b}_\phi). \tag{3b}$$

Here $\mathbf{W}_\phi \in \mathbb{R}^{k \times d}$ and $\mathbf{b}_\phi \in \mathbb{R}^{k}$ are learned parameters of a single-layer MLP. They map a $d$ dimensional vector to a $k$ dimensional kernel feature space. The relu activation (Fukushima, 1980) ensures that the features are non-negative.4 Different MLP parameters are used for different attention heads, and thus we add a total of $rk(d+1)$ learnable parameters per layer (less than 0.2% parameter increase in our language model, §3). We then finetune all parameters in this modified network, including the MLP parameters, with the original task objective.5

4 We found that relu stabilized training by prohibiting negative similarities $\phi(\mathbf{q}) \cdot \phi(\mathbf{k})$. Other activation functions, such as cos, tanh, and elu, did not improve performance.
5 We tried training the MLP parameters only, but this setting resulted in degraded development performance.
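To make the swap of Eq. 2b for Eqs. 3a–3b concrete, the following single-head sketch contrasts the two similarity functions (PyTorch; the tensor names, shapes, and the toy usage at the end are our own illustration, not the released T2R code):

```python
import torch

d, k = 128, 32                          # head dimensions and feature size (as in our LM setup)

def softmax_sim(q, K):
    # Eq. 2b: exp(q . k_j / sqrt(d)) for every key; K has shape (M, d), q has shape (d,)
    return torch.exp(K @ q / d ** 0.5)

class MLPFeatureMap(torch.nn.Module):
    # Eq. 3b: phi(x) = relu(W_phi x + b_phi); one such map per attention head
    def __init__(self, d, k):
        super().__init__()
        self.proj = torch.nn.Linear(d, k)
    def forward(self, x):
        return torch.relu(self.proj(x))

phi = MLPFeatureMap(d, k)

def t2r_sim(q, K):
    # Eq. 3a: phi(q) . phi(k_j); non-negative because of the relu
    return phi(K) @ phi(q)

# Either similarity can be normalized into attention weights and applied to values (Eq. 2a):
q, K, V = torch.randn(d), torch.randn(10, d), torch.randn(10, d)
for sim in (softmax_sim, t2r_sim):
    w = sim(q, K)
    out = (w / w.sum()) @ V             # weighted average of the value vectors
```

In the swap-then-finetune procedure, only the similarity function changes; the query, key, and value projections of Eq. 1 are kept from the pretrained model and finetuned together with the MLP.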
During inference generation, we reformulate the attention computation (Eq. 2a) as

$$\widetilde{\mathbf{x}}^{\mathrm{out}}_i = \sum_{j=1}^{M} \frac{\widetilde{\mathrm{sim}}(\mathbf{q}_i, \mathbf{k}_j)}{\sum_{\ell=1}^{M} \widetilde{\mathrm{sim}}(\mathbf{q}_i, \mathbf{k}_\ell)} \, \mathbf{v}_j
= \left( \frac{\phi(\mathbf{q}_i) \cdot \sum_{j=1}^{M} \phi(\mathbf{k}_j) \otimes \mathbf{v}_j}{\phi(\mathbf{q}_i) \cdot \sum_{\ell=1}^{M} \phi(\mathbf{k}_\ell)} \right)^{\!\top} \tag{4}$$

by the associativity of matrix multiplication. This formulation lends itself to recurrent computation. In causal attention where each query only attends to its prefix to predict the next word ($M = i$), define states:

$$\mathbf{S}_i = \sum_{j=1}^{i} \phi(\mathbf{k}_j) \otimes \mathbf{v}_j, \qquad \mathbf{z}_i = \sum_{j=1}^{i} \phi(\mathbf{k}_j) \tag{5}$$

where $\mathbf{S}_i \in \mathbb{R}^{k \times d}$ and $\mathbf{z}_i \in \mathbb{R}^{k}$. These states can be computed recurrently (Katharopoulos et al., 2020):

$$\mathbf{S}_i = \mathbf{S}_{i-1} + \phi(\mathbf{k}_i)\,\mathbf{v}_i^{\top}, \qquad \mathbf{z}_i = \mathbf{z}_{i-1} + \phi(\mathbf{k}_i) \tag{6}$$

In the self-attention or encoder-to-decoder (cross) attention of a sequence-to-sequence model, $\mathbf{S}_i$ and $\mathbf{z}_i$ are constant with respect to $i$ and only need to be computed once. Given the two states at position $i$, we can obtain the output vector:

$$\widetilde{\mathbf{x}}^{\mathrm{out}}_i = \left( \frac{\phi(\mathbf{q}_i)^{\top} \mathbf{S}_i}{\phi(\mathbf{q}_i)^{\top} \mathbf{z}_i} \right)^{\!\top} \tag{7}$$

This avoids quadratic computation with respect to the input sequence length. We also speed up inference by merging the MLP feature map with the affine feature maps that produce queries and keys:

$$\phi(\mathbf{q}_i) = \mathrm{relu}(\widetilde{\mathbf{W}}_q \mathbf{x}^{\mathrm{tgt}}_i + \widetilde{\mathbf{b}}_q), \tag{8a}$$
$$\phi(\mathbf{k}_j) = \mathrm{relu}(\widetilde{\mathbf{W}}_k \mathbf{x}^{\mathrm{src}}_j + \widetilde{\mathbf{b}}_k), \tag{8b}$$
$$\text{where } \widetilde{\mathbf{W}}_q = \mathbf{W}_\phi \mathbf{W}_q, \quad \widetilde{\mathbf{W}}_k = \mathbf{W}_\phi \mathbf{W}_k, \tag{8c}$$
$$\widetilde{\mathbf{b}}_q = \mathbf{b}_\phi + \mathbf{W}_\phi \mathbf{b}_q, \quad \widetilde{\mathbf{b}}_k = \mathbf{b}_\phi + \mathbf{W}_\phi \mathbf{b}_k. \tag{8d}$$

After the model is trained, Eqs. 8c–8d are computed once before generation; the intermediate features of $\mathbf{q}_i$ and $\mathbf{k}_j$ are never computed during inference.

Generation Speed Overhead The time complexity of each step in a T2R model is shown in Fig. 1. Similar to the transformer, it proceeds over two stages.

• Feature Mapping: computation of $\{\phi(\mathbf{q}_i)\}_{i=1}^{N}$, $\{\phi(\mathbf{k}_j)\}_{j=1}^{M}$, and $\{\mathbf{v}_j\}_{j=1}^{M}$ for all $r$ heads (Eqs. 8a–8b). Time complexity of $O(Nhkr)$, $O(Mhkr)$, and $O(Mh^2)$.
• Attention: the RNN states and the outputs for $r$ heads (Eqs. 5–7) are computed with $O(Mhk)$ and $O(Nhk)$.

Comparing this with the pretrained transformer, we see that if the feature size is much smaller than input sequence lengths ($k \ll M, N$), the change in the attention stage from $O(MNh)$ to $O(hk(M+N))$ in T2R brings a substantial speedup.

Generation Memory Overhead T2R only needs to store the RNN state, and thus its space complexity is $O(hk)$, constant in sequence length. This implies a reduction in memory footprint when $k \ll M$, compared to the transformer’s $O(Mh)$.

2.3 Autoregressive Linear Transformers

In principle, any kernel function can be used as the similarity function in Eq. 2a (Tsai et al., 2019). Previous work proposed several untrainable feature map functions $\phi$ and developed autoregressive transformer variants with linear time and constant space complexity in sequence length (Katharopoulos et al., 2020; Peng et al., 2021; Choromanski et al., 2021). While those models follow similar computation steps to T2R, there are several differences in generation efficiency. Since the feature map in Katharopoulos et al. (2020) preserves input dimensions, the feature size is always the same as the head dimensions ($k = d$). This means that the speedup and memory savings from using a small feature size are restricted by design. In our experiments (§3.3), our T2R models gain further efficiency by using a feature size that is even smaller than the head dimensions ($k = 32$ and $d = 128$ for language modeling). Peng et al. (2021) and Choromanski et al. (2021) scale query and key vectors by their norms before the random approximation to bound the error. Consequently, the feature mapping stage needs additional steps of producing intermediate $\mathbf{q}$ and $\mathbf{k}$ and scaling them. T2R suppresses these steps and speeds up generation further (§3.3).
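Putting §2.2 together, here is a minimal single-head decoding sketch of the recurrence in Eqs. 5–7 with the merged projections of Eqs. 8a–8d (PyTorch; the weights are random placeholders, the tensor names are ours, and the small eps in the normalizer is our own safeguard, not part of the paper):

```python
import torch

h, d, k = 1024, 128, 32   # model dim, head dim, feature size (illustrative values)

# Merged projections (Eqs. 8c-8d), computed once after training:
#   W_q_tilde = W_phi @ W_q, b_q_tilde = b_phi + W_phi @ b_q, and likewise for keys.
W_q_tilde, b_q_tilde = torch.randn(k, h), torch.randn(k)
W_k_tilde, b_k_tilde = torch.randn(k, h), torch.randn(k)
W_v, b_v = torch.randn(d, h), torch.randn(d)

def init_state():
    # RNN state per head: S in R^{k x d}, z in R^k (Eq. 5)
    return torch.zeros(k, d), torch.zeros(k)

def step(x_i, S, z, eps=1e-6):
    """One causal-attention step: consume input vector x_i, update (S, z), emit the output."""
    phi_q = torch.relu(W_q_tilde @ x_i + b_q_tilde)   # Eq. 8a
    phi_k = torch.relu(W_k_tilde @ x_i + b_k_tilde)   # Eq. 8b
    v = W_v @ x_i + b_v
    S = S + torch.outer(phi_k, v)                     # Eq. 6 (state update)
    z = z + phi_k
    out = (phi_q @ S) / (phi_q @ z + eps)             # Eq. 7 (eps guards a zero normalizer)
    return out, S, z

# The state has a constant size regardless of how many tokens have been consumed:
S, z = init_state()
for x_i in torch.randn(5, h):
    out, S, z = step(x_i, S, z)
```

Because the per-step work touches only the fixed-size state rather than all previous keys and values, decoding time per token and memory stay constant in sequence length, which is the source of the gains analyzed above.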
3 Experiments

We present extensive experiments on standard benchmarks for language modeling and machine translation. Our results show that T2R achieves efficient autoregressive generation while retaining high accuracy.

3.1 Baselines and Comparison

We compare performance with previous transformer models for autoregressive generation with linear time and constant space complexity in input sequence length.6 As discussed in §2.3, those prior methods correspond to two different untrainable feature maps $\phi$. We experiment with two types of feature maps for comparisons: ELU ($\phi(\mathbf{x}) = \mathrm{elu}(\mathbf{x}) + 1$, Katharopoulos et al., 2020); RFA (random feature approximation with softmax temperature reparameterization, Peng et al., 2021). Each feature map is evaluated in two settings: random initialization and pretrain. Random initialization is our reimplementation of the experiments in Katharopoulos et al. (2020) and Peng et al. (2021). The pretrain setting follows the same protocol as T2R except that we use different feature maps $\phi$ than our proposed one-layer MLP with relu activation. Positive orthogonal random features (Performer, Choromanski et al., 2021) provide a similar random approximation to RFA and were evaluated in the biology domain, but we found that this method caused training divergence in the language modeling task.7

6 See §5 for our discussion on more transformer variants with linear time complexity, but most of those variants need modifications for autoregressive modeling and have yet to be empirically evaluated in autoregressive generation tasks.
7 Our implementation closely follows the code released by the authors (https://github.com/lucidrains/performer-pytorch/blob/main/performer_pytorch/performer_pytorch.py#L75-L81), but does not subtract the maximum logit; otherwise it would disallow the linear complexity in causal attention. We conjecture that this is the reason why Performer becomes less stable in our experiments. We suspect that some techniques are necessary to improve numerical stability in language modeling and machine translation.

3.2 Setup and Implementations

We apply our method to causal attention in language models and both cross and causal attention in machine translation. For language modeling, we use a 32-dimensional feature map function. We do not modify the encoder in machine translation as its generation speed overhead is much less significant than the decoder (Kasai et al., 2021). Our exploration showed that reducing the feature size of causal attention tends to have less impact on the final translation accuracy as opposed to cross attention; we use feature sizes of 32 and 4 for cross and causal attention, respectively. This observation is consistent with previous work that showed that causal attention can be more drastically simplified than cross attention in transformer machine translation models (You et al., 2020; Tay et al., 2021).

3.2.1 Language Modeling

We use the WikiText-103 benchmark, which consists of 103M tokens sampled from English Wikipedia (Merity et al., 2017). We choose similar hyperparameters to prior work (Baevski and Auli, 2019; Fan et al., 2020): 32 layers, 8 heads, 128 head dimensions, 1024 model dimensions, 4096 fully connected dimensions, and dropout (Srivastava et al., 2014) and layer dropout rates of 0.2. We partition the training data into non-overlapping blocks of 512 contiguous tokens ignoring document boundaries and train the model to predict each token from left to right (Baevski and Auli, 2019).
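As a small illustration of this block construction, a sketch of the partitioning step might look like the following (our own code, assuming the corpus has already been tokenized into a flat list of token ids):

```python
# Minimal sketch (not the authors' code): partition a token stream into
# non-overlapping 512-token training blocks, ignoring document boundaries.
def make_blocks(token_ids, block_size=512):
    """Split a flat list of token ids into contiguous, non-overlapping blocks."""
    blocks = []
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks  # each block is trained with a left-to-right LM objective
```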
Validation and test perplexity are measured by predicting the last 256 words out of the input of 512 consecutive words to avoid evaluating tokens in the beginning with limited context (early token curse, Press et al., 2021). We generally follow the optimization method from Baevski and Auli (2019), but some hyperparameters, such as the learning rate for the T2R finetuning, are adjusted for better convergence than randomly initialized training. See Appendix A.1 for more details.

3.2.2 Machine Translation

We experiment with 3 translation benchmarks: WMT14 EN-DE (4.5M train pairs, Bojar et al., 2016), WMT14 EN-FR (36M, Bojar et al., 2014), and WMT17 ZH-EN (20M, Bojar et al., 2017). We follow the preprocessing and data splits by previous work (EN-DE: Vaswani et al., 2017; EN-FR: Gehring et al., 2017; EN-ZH: Hassan et al., 2018). We use the hyperparameters of the large sized transformer (Vaswani et al., 2017): 6 layers, 16 attention heads, 1024 model dimensions, and 4096 hidden dimensions for both the encoder and decoder. We apply dropout with 0.3 and label smoothing with ε = 0.1. Following Ott et al. (2018), we use an increased batch size of approximately 460K tokens. Each randomly initialized model is trained for 30K (60K for the large EN-FR dataset) steps using Adam with a learning rate of 5 ⋅ 10−4 and β = (0.9, 0.98) (Kingma and Ba, 2015). We observed that convergence of the T2R conversion can be achieved with 20K (40K for EN-FR) steps and a reduced learning rate of 2 ⋅ 10−4. We average the checkpoints from the last five epochs to obtain the final model (Vaswani et al., 2017). In inference, we apply beam search with size 5 and length penalty 0.6. Consistent with previous practice, we evaluate with tokenized BLEU (Papineni et al., 2002). Further details are described in Appendix A.1.

| Model | k | dev. ppl. | test ppl. | train time |
| ELU + Random Init. | 128 | 22.0 | 22.8 | 470h |
| RFA + Random Init. | 32 | 20.4 | 21.3 | 512h |
| T2R + Random Init. | 32 | 20.1 | 20.8 | 474h |
| ELU + Pretrain | 128 | 21.5 | 22.2 | 97h |
| RFA + Pretrain | 32 | 20.8 | 21.6 | 104h |
| T2R + Pretrain | 32 | 19.0 | 19.6 | 98h |
| T2R 75% + Pretrain | 32 | 17.9 | 18.5 | 95h |
| Pretrained Transformer | – | 17.9 | 18.5 | – |
| Baevski and Auli (2019) | – | – | 18.7 | – |

Table 1: WikiText-103 language modeling results (perplexity). Train time is measured in GPU hours. The top two rows are our reimplementations of Katharopoulos et al. (2020) and Peng et al. (2021). Pretrain indicates initialization with a pretrained transformer for language modeling. T2R 75% indicates a model where every fourth layer from the top is kept as the original transformer layer. Perplexity (ppl.) is measured by predicting the last 256 words out of the input of 512 consecutive words. All models use 128 head dimensions. We assume access to a pretrained transformer model and measure the finetuning time in GPU hours.
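The perplexities in Table 1 use the windowed protocol described in §3.2.1 (feed 512 consecutive tokens, score only the last 256). One way to implement it is sketched below; this is our own code, `log_prob` is an assumed scoring helper, and the stride of 256 is our reading of the protocol rather than a detail stated in the paper:

```python
import math

def windowed_perplexity(tokens, log_prob, window=512, scored=256):
    """Score only the last `scored` tokens of each `window`-token block so that
    every scored token has at least `window - scored` tokens of context."""
    total_logprob, total_count = 0.0, 0
    for start in range(0, len(tokens) - window + 1, scored):
        block = tokens[start:start + window]
        for i in range(window - scored, window):
            # log_prob is an assumed helper returning log p(block[i] | block[:i])
            total_logprob += log_prob(block[:i], block[i])
            total_count += 1
    return math.exp(-total_logprob / total_count)
```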
3.3 Results

Language Modeling Seen in Table 1 are language modeling results in perplexity. We observe that T2R with the learnable MLP feature map outperforms the other two linear transformer models by more than 2.0 perplexity points in the pretrain setting. Unlike the other linear transformer models, T2R greatly benefits from pretraining (T2R + Pretrain: 19.6 vs. T2R + Random Init.: 20.8 test perplexity points). We attribute this advantage of T2R to the fact that the MLP feature map is able to learn attention patterns that are similar to those of the pretrained transformer, as evidenced in §4. Notice also that the T2R conversion is ∼5x faster (measured in GPU hours) than training a model from scratch. These results illustrate that a lightweight model can be obtained without repeating the expensive training of large-scale pretrained language models such as GPT-2 and GPT-3 (Radford et al., 2019; Brown et al., 2020). T2R’s generation speedup (∼4x when producing 512 consecutive words) and memory savings are later benchmarked with varying sequence lengths. There remains a gap of 1.1 perplexity points between the T2R and pretrained transformer models (19.6 vs. 18.5). However, the gap can be closed when every fourth layer from the top is kept as the original transformer layer and the model is finetuned in the same way (T2R 75%). This suggests that keeping a small fraction of the quadratic attention layers can provide an effective middle ground between efficiency and accuracy.8

8 Concurrent work (Lei, 2021) also explores reducing the number of attention layers for efficiency.

| Model | k (cross) | k (causal) | WMT14 EN-DE | WMT14 EN-FR | WMT17 ZH-EN | Train Time (GPU hours) |
| ELU + Random Init. | 64 | 64 | 28.4 | * | 23.4 | 120h |
| RFA + Random Init. | 32 | 4 | 28.1 | 41.7 | 23.4 | 135h |
| T2R + Random Init. | 32 | 4 | 27.5 | 39.8 | 23.1 | 123h |
| ELU + Pretrain | 64 | 64 | 28.4 | 41.8 | 23.8 | 80h |
| RFA + Pretrain | 32 | 4 | 27.6 | 41.8 | 23.2 | 90h |
| T2R + Pretrain | 32 | 4 | 28.7 | 42.1 | 23.8 | 82h |
| Pretrained Transformer Large | – | – | 28.9 | 42.2 | 24.2 | – |
| Vaswani et al. (2017) | – | – | 28.4 | 41.8 | – | – |

Table 2: Machine translation test results in BLEU scores. The top two rows are our reimplementations of Katharopoulos et al. (2020) and Peng et al. (2021). Pretrain indicates initialization with a trained transformer-large model. *: diverged even when running with multiple random seeds and smaller learning rates. We assume access to a pretrained transformer model and measure the finetuning time in GPU hours.

Machine Translation Seen in Table 2 are machine translation results in BLEU from various configurations. Departing from the language modeling experiments, the T2R model underperforms the other two linear transformer models when initialized randomly. However, consistent with language modeling, the T2R model substantially benefits from pretraining (e.g., 28.7 vs. 27.5 BLEU points in EN-DE). As a result, the T2R model achieves similar BLEU scores to the original transformer across all language pairs. ELU trained from the pretrained transformer yields comparable performance to T2R, but the feature size is much larger (64 vs. 32 and 64 vs. 4 in cross and causal attention), thus leading to increased overhead, as shown later. Note that the T2R finetuning time is only moderately smaller than that of randomly initialized training here, but further speedup in conversion can be potentially achieved with more extensive hyperparameter tuning.9

9 We found that the batch size could be reduced for T2R conversion without hurting accuracy, while randomly initialized models deteriorate with small batch sizes. This suggests that the computational cost for conversion can be much lighter than training from scratch, and T2R is advantageous when only a limited number of GPUs are available.

Figure 2: Machine translation speed of various models. Speed is measured on a single TPU v2 accelerator with batch size 16 and beam size 1, following Peng et al. (2021). 32-4 indicates the feature sizes of 32 and 4 for cross and causal attention, respectively.
Speedup and Memory Savings in Generation We run a conditional generation experiment to compare the decoding speed of the models in Table 2 (Fig. 2). Here we assume the input and output sequences are of the same length. All models are tested using greedy decoding with the same batch size of 16 on a TPU v2 accelerator.10 We see that indeed the linear transformer models can generate an almost constant number of tokens per second regardless of the sequence length and outpace the transformer model dramatically as the sequence becomes longer. The T2R model achieves a 15%+ speedup over ELU and RFA due to its smaller feature sizes and faster feature mapping respectively; this confirms our analysis on T2R’s speed advantage over them (§2.3).

10 https://opensource.google/projects/jax.

Figure 3: Memory consumption from the attention computation of various machine translation models in inference with batch size 16 and beam size 1.

Fig. 3 plots memory consumption from the attention computation during decoding for machine translation. Since the T2R, RFA, and ELU models compress keys and values into a $k \times d$ matrix $\mathbf{S}$ and a $k$ dimensional vector $\mathbf{z}$ (§2.2), the required memory at each decoding step is constant over varying sequence lengths. It is also roughly proportional to the feature size $k$. The MLP feature map in the T2R model allows for smaller feature dimensions than the ELU feature map, whose size is tied to the head dimensions, resulting in a 70% memory reduction. The attention computation in the standard transformer, on the other hand, consumes memory linearly in sequence length at each decoding step because all previous key and value vectors have to be stored. We also found a similar speedup and memory savings in unconditional generation with the T2R language model (∼4x speedup in generating 512 consecutive words over the transformer).

4 Analysis and Ablations

We presented T2R, a method to convert a pretrained transformer into an efficient RNN. In this section, we analyze our conversion approach by examining the impact of the feature size and induced attention weight distributions. Our analysis shows that T2R implicitly learns attention distributions similar to the original transformer.

Feature Size and Pretraining We saw that T2R benefits substantially from transformer pretraining. Fig. 4 compares T2R with pretraining and random initialization in terms of the relation between the validation perplexity from WikiText-103 and the feature sizes. We see that as the feature size (RNN state size) becomes smaller, pretraining becomes particularly important to achieve low perplexity. Transformer pretraining achieves a Pareto improvement over random initialization in the tradeoff between efficiency (small feature size) and accuracy (low perplexity).

Figure 4: WikiText-103 validation perplexity with varying feature sizes.

Figure 5: Average Euclidean distance of T2R models from the transformer attention weights with varying feature sizes. The distances are computed on the Wikitext-103 validation data for predicting a word given the preceding 512 words. All models are initialized with a pretrained transformer model.
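As a rough back-of-the-envelope count of why the feature size governs the state size and the memory savings reported above (our own arithmetic, using the language modeling configuration of §3.2.1 and ignoring batch size and numeric precision):

```latex
% Per layer and per decoding step, with r = 8 heads, d = 128, h = rd = 1024,
% feature size k = 32, and context length M = 512:
\underbrace{M \cdot 2h}_{\text{transformer key/value cache}} = 512 \cdot 2048 \approx 1.0 \times 10^{6}
\qquad \text{vs.} \qquad
\underbrace{r\,k\,(d+1)}_{\text{T2R state } (\mathbf{S}_i,\, \mathbf{z}_i)} = 8 \cdot 32 \cdot 129 \approx 3.3 \times 10^{4}
```

On this count the per-layer attention state is roughly 30x smaller and, unlike the cache, independent of the context length; with $k = 32$ and 512-token contexts this is the 1/16-of-the-sequence-length compression noted in the abstract.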
Attention Distribution T2R is not explicitly trained to mimic the original attention distributions, and there is no guarantee that the MLP feature map approximates the exponential similarity function, unlike previous approximation approaches (Peng et al., 2021; Choromanski et al., 2021). Here, we analyze the properties of the attention weight dis- tributions that are induced by finetuning. We use the validation data from WikiText-103 and run language models to predict the next word given the input of 512 contiguous words. We compute the attention weight distribution over the 512 words for each attention head in the model layers. Fig. 5 compares the attention distributions from T2R in various configurations. T2R MLP frozen indicates a model that is finetuned with the MLP parameters frozen. Euclidean distances in attention distributions between the original transformer and each model are averaged across validation samples, model layers, and attention heads.11 Comparing T2R before finetuning and the full T2R model, we see that the finetuning process induces much more similar attention distributions, and the distance diminishes as the feature size increases (and the perplexity approaches the original transformer, Fig. 4). We also observed that when the MLP parameters are not trained (T2R MLP frozen), the distance from the original attention distributions increases. These results suggest that finetuning of the whole network in T2R implicitly develops similar attention distributions to the original transformer even though the training supervision comes solely from language modeling. 5 Further Related Work In addition to the work we already discussed, we highlight related methods from prior work that make transformer models efficient. 5.1 Knowledge Distillation Knowledge distillation (Hinton et al., 2015) is closely related to our T2R conversion and uses a similar pipeline: a teacher model with large capacity is first trained and is used to generate silver training data for a new lightweight inference model. It has been successfully applied to machine translation (e.g., Kim and Rush, 2016; Gu et al., 2018) to make generation efficient. In particular, several prior works distill a transformer translation model to an RNN (Senellart et al., 2018; Kim et al., 2019). We share the same motivation toward fast generation with light memory, but our approach differs in two ways: the original training data are used for finetuning an RNN model, and its model parameters are initialized with the “teacher” transformer. 11 We do not consider random initialization baselines here because random initialization makes it impossible to align attention heads and layers between models. Our method does not use the computationally expensive teacher model to generate new training data. While data generation is a one-time computational cost, it becomes expensive as the teacher model size and training data increase. Moreover, since the pretrained parameters can be directly used, conversion requires fewer GPU hours than training a brand new lightweight model from scratch (§3.3). 5.2 Efficient Transformers Prior work suggested many other strategies to improve efficiency in transformers, such as weight sharing and factorization (Dehghani et al., 2019; Lan et al., 2020), weight and layer pruning (Michel et al., 2019; Fan et al., 2020), quantization (Zafrir et al., 2019; Shen et al., 2020), and modifying the combination of sublayers (Press et al., 2020; Mandava et al., 2020). 
Some of these methods present orthogonal design choices and can be integrated into our T2R model to gain further efficiency. For a more comprehensive survey, see Tay et al. (2020b). Below we describe several prior works along two major strategies: compressing the attention context and sparsifying the attention patterns. Attention Context Compression This strand of methods compresses the context that is attended to, thereby reducing the time and memory overhead in the attention. RNN models that we converted pretrained transformers into compress the context into a recurrent state. Other approaches include low rank approximation of the attention computation (Wang et al., 2020; Tay et al., 2021) and adding a memory module that can access multiple tokens at once (Liu et al., 2018; Dai et al., 2019; Lee et al., 2019; Ainslie et al., 2020; Rae et al., 2020; Beltagy et al., 2020; Zaheer et al., 2020). Sparse Attention Patterns Another approach to reducing the time and memory overhead from the attention computation is to limit the tokens that are attended to by sparsifying the attention patterns. These patterns can be set in advance or learned during training (Tay et al., 2020b). For example, prior works introduced fixed patterns of blockwise attention (Qiu et al., 2020) and strided attention (Child et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020). Other previous works presented methods to learn attention patterns from data (Sukhbaatar et al., 2019; Roy et al., 2020; Tay et al., 2020a). It should be noted that significant modifications are necessary to apply many of these methods to autoregressive generation tasks such as language modeling and machine translation, and their empirical evaluation in these generation settings has yet to be conducted (Peng et al., 2021). This work presents extensive empirical evaluation in autoregressive generation settings. 6 Conclusion and Future Work We present T2R, a method that converts a pretrained transformer to a recurrent neural network that reduces the time and memory cost of autoregressive generation. Our experiments in language modeling and machine translation demonstrated that our model produces an improved tradeoff between efficiency and accuracy over randomly initialized training and previous models with lightweight attention. Our work provides further support for the claim that large-scale pretrained models can be compressed into efficient inference models that facilitate downstream applications. Acknowledgments We thank Ofir Press, Bill Dolan, Lei Li, and the anonymous reviewers for their valuable feedback and discussion on this work. Nikolaos Pappas was supported by the Swiss National Science Foundation grant P400P2_183911. References Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. 2020. ETC: Encoding long and structured inputs in transformers. In Proc. of EMNLP. Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. Alexei Baevski and Michael Auli. 2019. Adaptive input representations for neural language modeling. In Proc. of ICLR. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR. Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. 
Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proc. of WMT. Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 conference on machine translation (WMT17). In Proc. of WMT. Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proc. of WMT. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Proc. of NeurIPS. Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proc. of SSST-8. Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamás Sarlós, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. 2021. Rethinking attention with Performers. In Proc. of ICLR. Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proc. of ACL. Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal transformers. In Proc. of ICLR. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL. Angela Fan, Edouard Grave, and Armand Joulin. 2020. Reducing transformer depth on demand with structured dropout. In Proc. of ICLR. Kunihiko Fukushima. 1980. Neocognitron: A selforganizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics. Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann Dauphin. 2017. Convolutional sequence to sequence learning. In Proc. of ICML. Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher. 2018. Nonautoregressive neural machine translation. In Proc. of ICLR. 
Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mengnan Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving human parity on automatic Chinese to English news translation. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proc. of CVPR. Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTa: Decodingenhanced BERT with disentangled attention. Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. In Proc. of NeurIPS Deep Learning and Representation Learning Workshop. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation. Hakan Inan, Khashayar Khosravi, and Richard Socher. 2017. Tying word vectors and word classifiers: A loss framework for language modeling. In Proc. of ICLR. Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, and Noah A. Smith. 2021. Deep encoder, shallow decoder: Reevaluating non-autoregressive machine translation. In Proc. of ICLR. Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proc. of ICML. Yoon Kim and Alexander M. Rush. 2016. Sequencelevel knowledge distillation. In Proc. of EMNLP. Young Jin Kim, Marcin Junczys-Dowmunt, Hany Hassan, Alham Fikri Aji, Kenneth Heafield, Roman Grundkiewicz, and Nikolay Bogoychev. 2019. From research to production and back: Ludicrously fast neural machine translation. In Proc. of WNGT. Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proc. of ICLR. Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In Proc. of ICLR. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In Proc. of ICLR. Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. 2019. Set transformer: A framework for attention-based permutation-invariant neural networks. In Proc. of ICML. Tao Lei. 2021. When attention meets fast recurrence: Training language models with reduced compute. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pretraining for natural language generation, translation, and comprehension. In Proc. of ACL. Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In Proc. of ICLR. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke S. Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. Ilya Loshchilov and Frank Hutter. 2017. stochastic gradient descent with restarts. SGDR: Swetha Mandava, Szymon Migacz, and Alex Fit Florea. 2020. Pay attention when required. Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In Proc. of ICLR. Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In Proc. of NeurIPS. 
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed precision training. In Proc. of ICLR. NVIDIA. 2014. The NVIDIA CUDA basic linear algebra subroutines (CUBLAS). Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL Demonstrations. Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proc. of WMT. Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. 2021. Linear transformers are secretly fast weight memory systems. In Proc. of ICML. Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. of ACL. Jean Senellart, Dakun Zhang, Bo Wang, Guillaume Klein, Jean-Pierre Ramatchandirin, Josep Crego, and Alexander Rush. 2018. OpenNMT system description for WNMT 2018: 800 words/sec on a single-core CPU. In Proc. of WNG. Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image transformer. In Proc. of ICML. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proc. of NeurIPS. Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, and Lingpeng Kong. 2021. Random feature attention. In Proc. of ICLR. Ofir Press, Noah A. Smith, and Omer Levy. 2020. Improving transformer models by reordering their sublayers. In Proc. of ACL. Ofir Press, Noah A. Smith, and Mike Lewis. 2021. Shortformer: Better language modeling using shorter inputs. In Proc. of ACL. Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proc. of EACL. Jiezhong Qiu, Hao Ma, Omer Levy, Wen-tau Yih, Sinong Wang, and Jie Tang. 2020. Blockwise selfattention for long document understanding. In Proc. of EMNLP. Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. Compressive transformers for long-range sequence modelling. In Proc. of ICLR. Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proc. of ACL. Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2020. Q-BERT: hessian based ultra low precision quantization of BERT. In Proc. of AAAI. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR. Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive attention span in transformers. In Proc. of ACL. Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. 2021. Synthesizer: Rethinking self-attention in transformer models. In Proc. of ICML. Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and DaCheng Juan. 2020a. Sparse sinkhorn attention. 
In Proc of ICML. Yi Tay, M. Dehghani, Dara Bahri, and Donald Metzler. 2020b. Efficient Transformers: A survey. Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Transformer dissection: An unified understanding for transformer’s attention via the lens of kernel. In Proc. of EMNLP. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. of NeurIPS. Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR. Ronald J. Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation. Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. 2019. Pay less attention with lightweight and dynamic convolutions. In Proc. of ICLR. Aurko Roy, Mohammad Saffar, Ashish Vaswani, and Efficient content-based David Grangier. 2020. sparse attention with routing transformers. TACL. Weiqiu You, Simeng Sun, and Mohit Iyyer. 2020. Hard-coded Gaussian attention for neural machine translation. In Proc. of ACL. Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. Q8BERT: quantized 8bit BERT. In Proc. of EMC2 . A Appendix Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big Bird: Transformers for longer sequences. In Proc. of NeurIPS. All training is implemented in fairseq (Ott et al., 2019) and run with PyTorch 1.7.1 (Paszke et al., 2019), 8 Telsa V100 GPUs, and CUDA 11.0. We used mixed precision and distributed training over 8 GPUs (Micikevicius et al., 2018; Ott et al., 2018). Apart from EN→ZH where we used separate BPE operations and only tied the decoder input and output embeddings, we tie all embeddings (Press and Wolf, 2017; Inan et al., 2017). We experimented with feature sizes of [16, 32, 64] and [4, 8, 16, 32] for language modeling and machine translation respectively, and chose the smallest feature sizes that retained the development performance compared to the standard transformer. A.1 Hyperparameters and Setting A.1.1 Language Modeling We generally follow the optimization method from Baevski and Auli (2019). For optimizing a model from random initialization, the learning rate is linearly warmed up from 10−7 to 1 for the initial 16K steps and then annealed using a cosine learning rate schedule with cycles (Loshchilov and Hutter, 2017). Each period lasts for twice the number of updates than the previous cycle, and we lower the maximum and minimum learning rates by 25% compared to the previous cycle. The initial minimum and maximum learning rates are 10−5 and 1 respectively (Baevski and Auli, 2019). We train the model with a batch size of about 74K tokens with a total of 286K steps (Baevski and Auli, 2019). 
When we convert a pretrained transformer to an RNN model by finetuning, we found that we could speed up training by reducing the warm-up steps, total update steps, maximum and minimum rates, and batch size to 8K steps, 142K steps, 5 ⋅ 10−6, 0.5, and 25K tokens without loss in validation perplexity.

Randomly Initialized Training We generally follow the hyperparameters chosen in Baevski and Auli (2019); Fan et al. (2020). Specifically, we list the hyperparameters in Table 3 for easy replication. All other hyperparameter options are left as default values in fairseq.

| Hyperparameter | Value |
| architecture | transformer_lm_wiki103 |
| criterion | adaptive_loss |
| tokens-per-sample | 512 |
| sample-break-mode | none |
| # max tokens | 3072 |
| dropout rate | 0.2 |
| layer dropout rate | 0.2 |
| decoder embed dim | 1024 |
| decoder ffn dim | 4096 |
| # decoder attn heads | 8 |
| optimizer | nag |
| lr-scheduler | cosine |
| lr-period-updates | 270K |
| lr-shrink | 0.75 |
| t-mult | 2 |
| max-lr | 1 |
| min-lr | 1e-9 |
| lr | 1e-4 |
| clip-norm | 0.1 |
| warm-up lr | 1e-7 |
| # warmup updates | 16K |
| # max updates | 286K |
| # GPUs | 8 |
| update-freq | 3 |

Table 3: Language modeling hyperparameters when randomly initialized in the fairseq library.

Finetuning Pretrained Transformer Seen in Table 4 are the hyperparameters for finetuning a pretrained transformer to RNN models. The learning rates, the max number of updates, and the learning period length are all reduced.

| Hyperparameter | Value |
| architecture | transformer_lm_wiki103 |
| criterion | adaptive_loss |
| tokens-per-sample | 512 |
| sample-break-mode | none |
| # max tokens | 3072 |
| dropout rate | 0.2 |
| layer dropout rate | 0.2 |
| decoder embed dim | 1024 |
| decoder ffn dim | 4096 |
| # decoder attn heads | 8 |
| optimizer | nag |
| lr-scheduler | cosine |
| lr-period-updates | 135K |
| lr-shrink | 0.75 |
| t-mult | 2 |
| max-lr | 0.5 |
| min-lr | 1e-9 |
| lr | 5e-5 |
| clip-norm | 0.1 |
| warm-up lr | 1e-7 |
| # warmup updates | 8K |
| # max updates | 142K |
| # GPUs | 8 |
| update-freq | 1 |

Table 4: Finetuning language modeling hyperparameters in the fairseq library. The learning rates are smaller than in randomly initialized training.

A.1.2 Machine Translation

We experiment with 3 translation benchmarks: WMT14 EN-DE (4.5M train pairs, Bojar et al., 2016), WMT14 EN-FR (36M, Bojar et al., 2014), and WMT17 ZH-EN (20M, Bojar et al., 2017). We follow the preprocessing and data splits by previous work (EN-DE: Vaswani et al., 2017; EN-FR: Gehring et al., 2017; EN-ZH: Hassan et al., 2018; Wu et al., 2019). These datasets are all encoded into subwords by BPE (Sennrich et al., 2016). We run joint BPE on all language pairs except EN-ZH. We use the hyperparameters of the large sized transformer (Vaswani et al., 2017): 6 layers, 16 attention heads, 1024 model dimensions, and 4096 hidden dimensions for both the encoder and decoder. We apply dropout with 0.3, weight decay with 0.01, and label smoothing with ε = 0.1. Following Ott et al. (2018), we use an increased batch size of approximately 460K tokens by accumulating gradients without updating parameters.

Randomly Initialized Training We generally follow the hyperparameters chosen in Vaswani et al. (2017); Ott et al. (2018). Specifically, we list the hyperparameters in Table 5 for easy replication. All other hyperparameter options are left as default values in fairseq. The parameters from the last five epochs were averaged to obtain the final model.

Finetuning Pretrained Transformer Seen in Table 6 are the hyperparameters for finetuning a pretrained transformer to RNN models. The learning rate and the max number of updates are reduced. The parameters from the last five epochs were again averaged to obtain the final model.
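For reference, the checkpoint averaging mentioned above can be done in a few lines of PyTorch; this is a generic sketch with hypothetical file names, not the fairseq averaging script the authors used:

```python
import torch

def average_checkpoints(paths):
    """Average model parameters across a list of checkpoint files."""
    avg = None
    for path in paths:
        # assumes fairseq-style checkpoints that store parameters under a "model" key
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {name: tensor.clone().float() for name, tensor in state.items()}
        else:
            for name, tensor in state.items():
                avg[name] += tensor.float()
    return {name: tensor / len(paths) for name, tensor in avg.items()}

# e.g., averaged = average_checkpoints([f"checkpoint{i}.pt" for i in range(26, 31)])
```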
| Hyperparameter | Value |
| architecture | transformer_vaswani_en_de_big |
| criterion | label_smoothed_cross_entropy |
| label smoothing | 0.1 |
| # max tokens | 3584 |
| dropout rate | 0.3 |
| weight decay | 0.0 |
| encoder embed dim | 1024 |
| encoder ffn dim | 4096 |
| # encoder attn heads | 16 |
| # encoder layers | 6 |
| decoder embed dim | 1024 |
| decoder ffn dim | 4096 |
| # decoder attn heads | 16 |
| # decoder layers | 6 |
| max source positions | 1024 |
| max target positions | 1024 |
| Adam lrate | 5e-4, 3e-4 (T2R)* |
| Adam β1 | 0.9 |
| Adam β2 | 0.98 |
| lr-scheduler | inverse square |
| warm-up lr | 1e-7 |
| # warmup updates | 4000 |
| # max updates | 30K, 60K (EN-FR) |
| length penalty | 0.6 |
| beam size | 5 |
| # GPUs | 8 |
| update-freq | 16 |

Table 5: Machine translation hyperparameters when randomly initialized in the fairseq library. *: we reduced the learning rate for T2R to avoid training divergence.

| Hyperparameter | Value |
| architecture | transformer_vaswani_en_de_big |
| criterion | label_smoothed_cross_entropy |
| label smoothing | 0.1 |
| # max tokens | 3584 |
| dropout rate | 0.3 |
| weight decay | 0.0 |
| encoder embed dim | 1024 |
| encoder ffn dim | 4096 |
| # encoder attn heads | 16 |
| # encoder layers | 6 |
| decoder embed dim | 1024 |
| decoder ffn dim | 4096 |
| # decoder attn heads | 16 |
| # decoder layers | 6 |
| max source positions | 1024 |
| max target positions | 1024 |
| Adam lrate | 2e-4 |
| Adam β1 | 0.9 |
| Adam β2 | 0.98 |
| lr-scheduler | inverse square |
| warm-up lr | 1e-7 |
| # warmup updates | 4000 |
| # max updates | 20K, 40K (EN-FR) |
| length penalty | 0.6 |
| beam size | 5 |
| # GPUs | 8 |
| update-freq | 16 |

Table 6: Finetuning machine translation hyperparameters. The learning rate is smaller than in randomly initialized training.

A.2 Attention Distribution

Peakiness of Attention Fig. 6 plots the average entropy of the T2R models with and without pretraining. Entropy is averaged across validation samples, layers, and attention heads. Comparing Figs. 4 and 6, we see that there is a strong correlation between validation perplexity and entropy. The entropy decreases (and thus the attention distribution gets peakier) when a large feature size is used or the transformer pretraining is applied. This observation hints at potential future improvement of linear transformer models by introducing an inductive bias towards peaky attention distributions.

Figure 6: Average entropy of the attention weights. They are computed on the Wikitext-103 validation data for predicting a word given the preceding 512 words.
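The entropy statistic in Fig. 6 can be computed directly from saved attention weights; a minimal sketch (our own code, assuming attention tensors of shape (heads, queries, keys) whose rows sum to one):

```python
import torch

def average_attention_entropy(attn):
    """attn: tensor of shape (num_heads, num_queries, num_keys), rows summing to 1.
    Returns the entropy of each attention distribution, averaged over heads and queries."""
    eps = 1e-12                                    # guard against log(0)
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)
    return entropy.mean()

# Averaging this quantity over validation samples and layers gives the curves in Fig. 6.
```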