When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute

Tao Lei
ASAPP, Inc.
taoleics@gmail.com

arXiv:2102.12459v3 [cs.CL] 15 Sep 2021

Abstract

Large language models have become increasingly difficult to train because of the growing computation time and cost. In this work, we present SRU++, a highly-efficient architecture that combines fast recurrence and attention for sequence modeling. SRU++ exhibits strong modeling capacity and training efficiency. On standard language modeling tasks such as the ENWIK8, WIKI-103 and BILLION WORD datasets, our model obtains better bits-per-character and perplexity while using 3x-10x less training cost compared to top-performing Transformer models. For instance, our model achieves a state-of-the-art result on the ENWIK8 dataset using 1.6 days of training on an 8-GPU machine. We further demonstrate that SRU++ requires minimal attention for near state-of-the-art performance. Our results suggest jointly leveraging fast recurrence with little attention as a promising direction for accelerating model training and inference.[1]

[1] Our code, experimental setup and models are available at https://github.com/asappresearch/sru.

1 Introduction

Many recent advances in language modeling have come from leveraging ever larger datasets and model architectures. As a result, the associated computation cost for developing such models has grown enormously, requiring hundreds of GPU hours or days per experiment and raising concerns about the environmental sustainability of current research (Schwartz et al., 2020). As a consequence, it has become imperative to build computationally efficient models that retain top modeling power while reducing computational costs.

The Transformer architecture (Vaswani et al., 2017) was proposed to accelerate model training and has become the predominant architecture in NLP. Specifically, it is built entirely upon self-attention and avoids the use of recurrence to enable strong parallelization. While this change has led to many empirical successes and improved computational efficiency, we are interested in revisiting the architectural question: Is attention all we need for modeling?

The attention mechanism permits learning dependencies between any parts of the input, making it an extremely powerful neural component in many machine learning applications (Bahdanau et al., 2015; Lin et al., 2017). We hypothesize that this advantage can still be complemented with other computation that is directly designed for sequential modeling. Indeed, several recent works have studied and confirmed the same hypothesis by leveraging recurrence in conjunction with attention. For example, Merity (2019) demonstrates that single-headed attention LSTMs can produce results competitive to Transformer models in language modeling. Other work has incorporated RNNs into Transformers, and obtained better results in machine translation (Lei et al., 2018; Hao et al., 2019) and language understanding benchmarks (Huang et al., 2020).

Figure 1: Bits-per-character on the ENWIK8 dev set vs. GPU hours used for training. SRU++ obtains better BPC by using 1/8 of the resources. We compare with Transformer-XL as it is one of the strongest models on the datasets tested. Models are trained with single precision and comparable training settings.
These results highlight one possibility: we could build more efficient models by combining attention and fast recurrent networks (Bradbury et al., 2017; Zhang and Sennrich, 2019). In this work, we validate this idea and present a self-attentive recurrent unit that achieves strong computational efficiency. Our work builds upon the SRU (Lei et al., 2018), a highly parallelizable RNN implementation that has been shown effective in language and speech applications (Park et al., 2018; Kim et al., 2019; Hsu et al., 2020; Shangguan et al., 2019). We incorporate attention into the SRU by simply replacing the linear transformation of the input with a self-attention component. The proposed architecture, called SRU++, enjoys enhanced modeling capacity and remains equally parallelizable. Figure 1 compares its performance with the Transformer-XL model (Dai et al., 2019) on the ENWIK8 dataset. SRU++ achieves better results while using a fraction of the training resources needed by the baseline.

We evaluate SRU++ on standard language modeling benchmarks including the ENWIK8, WIKI-103 and BILLION WORD datasets. SRU++ consistently outperforms various Transformer models on these datasets, delivering better or on-par results while using 3x-10x less computation. Our model does not use positional encoding, multi-head attention or other techniques useful to Transformer models. Furthermore, we demonstrate that a couple of attention layers are sufficient for SRU++ to obtain near state-of-the-art performance. These changes not only highlight the effectiveness of recurrence but also enable strong computation reduction in training and inference. Finally, we also showcase the effectiveness of SRU++ on the IWSLT'14 De→En translation task, and open-source our implementation in PyTorch to facilitate future research.

2 Background: SRU

We first describe the Simple Recurrent Unit (SRU) in this section. A single layer of SRU involves the following computation:

f[t] = σ(W x[t] + v ⊙ c[t-1] + b)
r[t] = σ(W′ x[t] + v′ ⊙ c[t-1] + b′)
c[t] = f[t] ⊙ c[t-1] + (1 − f[t]) ⊙ (W′′ x[t])
h[t] = r[t] ⊙ c[t] + (1 − r[t]) ⊙ x[t]

where ⊙ is the element-wise multiplication, W, W′ and W′′ are parameter matrices and v, v′, b and b′ are parameter vectors to be learnt during training. The SRU architecture consists of a light recurrence component which successively computes the hidden states c[t] by reading the input vector x[t] for each step t. The computation resembles other gated recurrent networks such as LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014). Specifically, the state vector c[t] is a weighted average between the previous state c[t-1] and a linear transformation of the input W′′ x[t]. The weighted aggregation is controlled by a forget gate f[t], which is a sigmoid function over the current input and hidden state. Once the internal state c[t] is produced, SRU uses a highway network to introduce a skip connection and compute the final output state h[t]. Similarly, the information flow in the highway network is controlled by a reset gate r[t].

Two important code-level optimizations are performed to enhance the parallelism and speed of SRU. First, given the input sequence X = {x[1], · · · , x[L]} where each x[t] ∈ R^d is a d-dimensional vector, SRU combines the three matrix multiplications across all time steps into a single multiplication. This significantly improves the computation intensity (e.g. GPU utilization). Specifically, the batched multiplication is a linear projection of the input tensor X ∈ R^{L×d}:

U⊤ = [W; W′; W′′] X⊤    (1)

where the three weight matrices are stacked, U ∈ R^{L×3×d} is the output tensor, L is the sequence length and d is the hidden state size. The second optimization performs all element-wise operations in an efficient way. This involves

f[t] = σ(U[t, 0] + v ⊙ c[t-1] + b)    (2)
r[t] = σ(U[t, 1] + v′ ⊙ c[t-1] + b′)    (3)
c[t] = f[t] ⊙ c[t-1] + (1 − f[t]) ⊙ U[t, 2]    (4)
h[t] = r[t] ⊙ c[t] + (1 − r[t]) ⊙ x[t].    (5)

Similar to other built-in operations such as attention and cuDNN LSTM (Appleyard et al., 2016), SRU implements all these operations as a single CUDA kernel to accelerate computation. Note that each dimension of the hidden vectors is independent once U is computed. The computation can run in parallel across each hidden dimension (and across each input sequence given a mini-batch of multiple sequences).
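As a concrete illustration of Equations (1)-(5), the following PyTorch sketch performs the batched projection once and then runs the cheap element-wise recurrence. The explicit Python loop, tensor shapes and function name are ours for exposition; the released implementation fuses the loop into a single CUDA kernel.

```python
import torch

def sru_layer(x, W3, v, v2, b, b2):
    """Sketch of one SRU layer following Eq. (1)-(5).

    x:  (L, B, d) input sequence
    W3: (d, 3*d)  the stacked matrices [W; W'; W''] of Eq. (1)
    v, v2, b, b2: (d,) per-dimension parameter vectors
    """
    L, B, d = x.shape
    U = (x @ W3).view(L, B, 3, d)   # Eq. (1): one large matmul for all time steps
    c = x.new_zeros(B, d)           # initial internal state
    outputs = []
    for t in range(L):              # element-wise recurrence, fused on GPU in practice
        f = torch.sigmoid(U[t, :, 0] + v * c + b)    # forget gate, Eq. (2)
        r = torch.sigmoid(U[t, :, 1] + v2 * c + b2)  # reset gate,  Eq. (3)
        c = f * c + (1.0 - f) * U[t, :, 2]           # light recurrence, Eq. (4)
        h = r * c + (1.0 - r) * x[t]                 # highway output, Eq. (5)
        outputs.append(h)
    return torch.stack(outputs), c
```

Because the loop body contains no matrix products and every hidden dimension is updated independently, the recurrence parallelizes over both the hidden dimensions and the sequences in a mini-batch.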
3 SRU++

The key modification of SRU++ is to incorporate more expressive non-linear operations into the recurrent network. Note that the computation of U (Equation 1) is a linear transformation of the input sequence X. We can replace this linear transformation with a self-attention operation to enhance modeling capacity.

Figure 2: An illustration of SRU and SRU++ networks: (a) the original SRU, (b) the SRU variant with projection to reduce the number of parameters, experimented in Lei et al. (2018), and (c) SRU++ proposed in this work. Numbers indicate the dimension of intermediate inputs/outputs given hidden size d = 2048 and attention size d′ = 512.

Specifically, given the input sequence represented as a matrix X ∈ R^{L×d}, the attention component computes the query, key and value representations using the following multiplications,

Q = W^q X⊤
K = W^k Q
V = W^v Q

where W^q ∈ R^{d′×d} and W^k, W^v ∈ R^{d′×d′} are model parameters. d′ is the attention dimension, which is typically much smaller than d. Note that the keys K and values V are computed using Q instead of X such that the weight matrices W^k and W^v are significantly smaller. We also tested another variant in which we first project X′ = W X⊤ into the lower dimension d′, and then apply three independent d′-by-d′ matrix multiplications over X′ to obtain the query, key and value representations. This variant achieves similar results.

Next, we compute a weighted average output A ∈ R^{d′×L} using the scaled dot-product attention introduced in Vaswani et al. (2017),

A⊤ = softmax(Q⊤ K / √d′) V⊤.

The final output U required by the element-wise recurrence is obtained by another linear projection,

U⊤ = W^o (Q + α · A)

where α ∈ R is a learned scalar and W^o ∈ R^{3d×d′} is a parameter matrix. Q + α · A is a residual connection which improves gradient propagation and stabilizes training. We initialize α to zero and as a result, U⊤ = W^o Q = (W^o W^q) X⊤ initially falls back to a linear transformation of the input X, skipping the attention transformation. Intuitively, skipping attention encourages leveraging recurrence to capture sequential patterns during the early stage of training. As |α| grows, the attention mechanism can learn long-range dependencies for the model. In addition, W^o W^q can be interpreted as applying a matrix factorization trick with a small inner dimension d′ < d, reducing the total number of parameters.
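A minimal PyTorch sketch of this attention component is given below. It assumes single-head attention and the α-gated residual described above; the module and argument names are ours, and batching, dropout and the layer normalization introduced in the next paragraphs are omitted.

```python
import math
import torch
import torch.nn as nn

class SRUppAttention(nn.Module):
    """Sketch: replaces the linear projection U = [W; W'; W''] X of Eq. (1) with attention."""

    def __init__(self, d, d_attn):
        super().__init__()
        self.wq = nn.Linear(d, d_attn, bias=False)       # W^q: d  -> d'
        self.wk = nn.Linear(d_attn, d_attn, bias=False)  # W^k: d' -> d'
        self.wv = nn.Linear(d_attn, d_attn, bias=False)  # W^v: d' -> d'
        self.wo = nn.Linear(d_attn, 3 * d, bias=False)   # W^o: d' -> 3d
        self.alpha = nn.Parameter(torch.zeros(1))        # residual scale, initialized to zero
        self.scale = 1.0 / math.sqrt(d_attn)

    def forward(self, x, attn_mask=None):
        # x: (L, d) for a single sequence; batching is omitted for brevity
        q = self.wq(x)               # queries, (L, d')
        k = self.wk(q)               # keys and values are computed from Q, not X,
        v = self.wv(q)               # so W^k and W^v stay small d'-by-d' matrices
        scores = (q @ k.t()) * self.scale
        if attn_mask is not None:    # e.g. an additive causal mask for language modeling
            scores = scores + attn_mask
        a = torch.softmax(scores, dim=-1) @ v
        # alpha starts at 0, so U initially equals the factorized linear map W^o W^q X;
        # the paper additionally applies layer normalization before W^o (omitted here)
        return self.wo(q + self.alpha * a)   # U: (L, 3d), fed to the SRU recurrence
```

The returned tensor plays the role of U in Equations (2)-(4), so the rest of the layer is unchanged.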
Figure 2 (a)-(c) compares SRU, the SRU variant with this factorization trick (but without attention), and SRU++ proposed in this section.

The last modification is adding layer normalization (Ba et al., 2016) to each SRU++ layer. In our implementation, we apply normalization after the attention operation and before the matrix multiplication with W^o,

U⊤ = W^o layernorm(Q + α · A).

This implementation is post-layer normalization, in which the normalization is added after the residual connection. Alternatively, pre-layer normalization (Xiong et al., 2020) only applies to the non-linear transformation. While pre-normalization tends to be less sensitive to different learning rates, we use post-normalization for better results, following the observations in Liu et al. (2020b). We analyze the effectiveness of layer normalization in Appendix A.2.

Model | Batch size B × M | BPC ↓
Trans-XL | 24×512 | 1.06
SRU++ | 24×512 | 1.03
SRU++ | 16×768 | 1.02

Table 1: Test BPC of SRU++ and Transformer-XL on the ENWIK8 dataset. We train SRU++ using the same setting as the Transformer-XL base model. Lower numbers are better. B is the number of sequences. M is the unroll size (and additional context size).

4 Experimental setup

Datasets  We evaluate our model on four standard NLP benchmarks.

• ENWIK8 (Hutter, 2006) is a character-level language modeling dataset consisting of 100M tokens taken from Wikipedia. The vocabulary size of this dataset is about 200. We use the standard 90M/5M/5M splits as the training, dev and test sets, and report bits-per-character (BPC) as the evaluation metric.

• WIKI-103 (Merity et al., 2017) is a word-level language modeling dataset. The training data contains 100M tokens extracted from Wikipedia articles. Following prior work, we use a vocabulary of 260K tokens, and adaptive embedding and softmax layers (Grave et al., 2017; Baevski and Auli, 2019).

• BILLION WORD (Chelba et al., 2013) is one of the largest language modeling datasets, containing 768M tokens for training. Unlike WIKI-103, in which sentences in the same article are treated as consecutive inputs to model long context, the sentences in BILLION WORD are randomly shuffled. Following Baevski and Auli (2019), we use a vocabulary of 800K tokens, and adaptive embedding and softmax layers.

• IWSLT'14 De→En is a low-resource machine translation dataset consisting of 170K translation pairs. We showcase that SRU++ can be applied to other tasks such as translation. We follow the same setup as Lin et al. (2020) and other previous work. The dataset uses a shared vocabulary of 14K BPE tokens.

Model | Param | BPC ↓ | GPU hrs ↓
Trans-XL | 41M | 1.06 | 356
SHA-LSTM | 54M | 1.07 | 28†
SRU++ (k = 1) | 42M | 1.022 | 37†
SRU++ (k = 2) | | 1.025 | 29†
SRU++ (k = 5) | | 1.032 | 24†
SRU++ (k = 10) | | 1.033 | 22†
SRU++ (no attention) | | 1.190 | 20†

Table 2: Results of SRU++ on ENWIK8 by enabling attention every k layers. We adjust the hidden size so that the number of parameters is comparable. † indicates mixed precision training.

Models  All our language models are constructed with a word embedding layer, multiple layers of SRU++ and an output linear layer followed by a softmax operation. We use single-head attention in each layer and 10 SRU++ layers for all our models. We use the same dropout probability for all layers and tune this value according to the model size and the results on the dev set. By default, we set the hidden dimension ratio d : d′ = 4 : 1. We report additional analysis and tune this ratio for best results in Section 5 and Appendix A.
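When attention is enabled only every k layers (as in Table 2), the remaining layers fall back to the projection-only variant of Figure 2 (b). A small helper, with a name of our choosing, makes the layer selection explicit:

```python
def attention_layer_indices(n_layers: int = 10, k: int = 1):
    """Return the 1-indexed layers that keep the attention component when
    attention is enabled every k layers.  k=1 puts attention in every layer,
    k=5 keeps layers 5 and 10, and k=10 keeps only the last layer."""
    return [i for i in range(1, n_layers + 1) if i % k == 0]
```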
For simplicity, SRU++ does not use recent techniques that have been shown useful for Transformers, such as multi-head attention, compressed memory (Rae et al., 2020), relative position (Shaw et al., 2018; Press et al., 2021), nearest-neighbor interpolation (Khandelwal et al., 2020) and attention variants that handle very long context (Sukhbaatar et al., 2019a; Roy et al., 2021). We compare with previous Transformer models that incorporate one or several of these techniques. However, we do not compare with results that use additional data or dynamic evaluation (Graves, 2013; Krause et al., 2018), for a fair comparison between all models.

Optimization  We use RAdam (Liu et al., 2020a) with the default β values as our optimizer. RAdam is a variant of the Adam optimizer (Kingma and Ba, 2014) that is reported to be less sensitive to the choice of learning rate and warmup steps while achieving similar results at the end. We use a fixed weight decay of 0.1 and an initial learning rate of 0.0003 in our experiments. These values are selected based on the ENWIK8 dev set and used for the other tasks. See Appendix A.3 for more details. We use a cosine learning rate schedule following Dai et al. (2019). We do not change the initial learning rate unless otherwise specified. See Appendix B for the detailed training configuration of each model.

Figure 3: Dev BPC vs. total GPU hours used on ENWIK8 for each model. Using automatic mixed precision (amp) and only one attention sub-layer achieves a 16x reduction. To compute the dev BPC, the maximum attention length is the same as the unroll size M during training.

Each training batch contains B sequences (i.e. the batch size) and M consecutive tokens for each sequence (i.e. the unroll size), which gives an effective size of B × M tokens per batch. Following standard practice, the previous training batch is provided as additional context for attention, which results in a maximum attention length of 2 × M. For the ENWIK8 and WIKI-103 datasets, the training data is partitioned into B chunks by concatenating articles and ignoring the boundaries between articles. For the BILLION WORD dataset, we follow Dai et al. (2019) and concatenate sentences to create the training batches. Sentences are randomly shuffled and separated by a special token indicating sentence boundaries.
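The sketch below shows one way to realize this batching scheme; the helper name and the generator interface are ours, and the released code may construct batches differently.

```python
import torch

def make_stream_batches(tokens: torch.Tensor, B: int, M: int):
    """Split a 1-D token stream into B parallel chunks and yield (context, segment)
    pairs of M tokens each, so attention can span at most 2*M positions."""
    L = tokens.numel() // B
    stream = tokens[: B * L].view(B, L)          # article boundaries are ignored
    prev = None
    for start in range(0, L - M + 1, M):
        seg = stream[:, start:start + M]         # (B, M) current training segment
        ctx = prev if prev is not None else seg.new_empty(B, 0)
        yield ctx, seg                           # attention keys/values cover cat([ctx, seg])
        prev = seg
```

At evaluation time the same scheme is used, only with the longer maximum attention lengths reported for testing in the result sections.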
5 Results

Does recurrence improve upon attention-only models?  We first conduct a comparison with the Transformer-XL model (Dai et al., 2019) on the ENWIK8 dataset.[2] Their base model consists of 41M parameters and 12 Transformer layers. Following the official instructions, we reproduced the reported test BPC of 1.06 by training with 4 Nvidia 2080 Ti GPUs. The training took about 4 days, or a total of 360 GPU hours equivalently. We train a 10-layer SRU++ model with 42M parameters. For a fair comparison, we use the same hyperparameter setting, including the effective batch size, attention context length, learning rate and the number of training iterations, as the Transformer-XL base model. Notably, our base model can be trained using 2 GPUs due to less GPU memory usage. After training, we set the attention context length to 2048 for testing, similarly to the Transformer-XL baseline. Table 1 presents the results. Our model achieves a test BPC of 1.03, outperforming the baseline by a large margin. This result suggests that combining recurrence and attention can greatly outperform an attention-only model. We obtain a BPC of 1.02 by extending the attention context length from 512 to 768, while keeping the number of tokens per batch the same.

[2] https://github.com/kimiyoung/transformer-xl/tree/master/pytorch

How much attention is needed?  Merity (2019) demonstrated that using a single attention layer with an LSTM retains most of the modeling capacity compared to using multiple attention layers. We conduct a similar analysis to understand how much attention is needed in SRU++. To do so, we only enable attention every k layers. The layers without attention become the variant with dimension projection illustrated in Figure 2 (b). Note that k = 1 gives the default SRU++ model with attention in every layer, and k = 10 means only the last layer has attention in a 10-layer model.

Table 2 presents the results of varying k. Our base model is the same 10-layer SRU++ model as in Table 1. We see that using 50% less attention (k = 2) results in almost no increase in test BPC. Moreover, using only a single attention module (k = 10) leads to a marginal loss of 0.01 BPC but reduces the training time by 40%. Our results still outperform the Transformer-XL model and the single-headed attention LSTM (Merity, 2019) by 0.03 BPC. Figure 3 showcases the training efficiency of our model. SRU++ is 5x faster to reach the dev BPC obtained by the Transformer-XL model. Furthermore, using automatic mixed precision training and a single attention layer (k = 10) achieves a 16x reduction in training cost.
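For reference, the two metrics reported throughout this section are simple transforms of the average cross-entropy loss; the helpers below are generic, not part of the released code.

```python
import math

def bits_per_character(avg_char_loss_nats: float) -> float:
    """Convert average character-level cross-entropy (in nats) to BPC."""
    return avg_char_loss_nats / math.log(2)

def perplexity(avg_token_loss_nats: float) -> float:
    """Convert average token-level cross-entropy (in nats) to perplexity."""
    return math.exp(avg_token_loss_nats)
```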
Figure 4: Analyzing where to apply attention. We enable only one attention layer (top figure) or two (bottom figure) in the SRU++ model. For the latter, we always apply attention in the last layer and move the location of the other. The x-axis is the layer index. The layer closest to the input embedding layer has index 1.

Where to use attention?  Next, we analyze whether the location of attention in SRU++ makes a non-trivial difference. Figure 4 (top) compares the results of enabling attention in only one of the SRU++ layers. Applying attention in the first, bottom layer achieves a significantly worse result. We believe this is due to the lack of positional information for attention, since SRU++ does not use positional encoding. Enabling attention in subsequent layers gives much better and comparable results, because recurrence can encode positional information. Moreover, SRU++ consistently achieves worse results when the attention is moved to a lower layer closer to the input embedding. We also enable a second attention layer while fixing the first one in the 10th layer. The corresponding results are shown in Figure 4 (bottom). Similarly, SRU++ achieves worse results if the attention is added to one of the lower layers. In contrast, results are comparable once the attention is placed in a high-enough layer. These observations suggest that the model should first learn local features before attention plays its most effective role at capturing long-range dependencies. More analyses can be found in Appendix A.

Does the ratio d : d′ matter?  Transformer models by default use an FFN dimension that is 4 times larger than the attention dimension (Vaswani et al., 2017). We analyze the ratio of the recurrence dimension d to the attention dimension d′ for SRU++. A small value of d′ can reduce the amount of computation and the number of parameters used in attention layers but may limit the modeling capacity. Table 4 compares the results of using different d : d′ ratios given a similar amount of model parameters. We fix the model size to around 108M and use 10 SRU++ layers. Changing this ratio from 4 to a higher value gives better results. The best dev result is obtained with a ratio of 8. Given this observation, we report SRU++ results using a default ratio of 4 as well as a ratio of 8 in the subsequent result sections. This ensures we conduct a comparison that uses a setup similar to the default of Transformer models, but also showcases the stronger results SRU++ can achieve.

Ratio | Dimensions d, d′ | Dev BPC ↓
4 | 3072, 768 | 0.997
6 | 3840, 640 | 0.992
8 | 4480, 560 | 0.991
10 | 5040, 504 | 0.992

Table 4: Dev BPC on ENWIK8 by changing the ratio d : d′ in the SRU++ model while fixing the number of parameters to 108M.
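To make the trade-off concrete, here is a back-of-the-envelope parameter count implied by the weight shapes in Section 3; the small per-dimension vectors, biases and normalization parameters are ignored.

```python
def sru_projection_params(d: int) -> int:
    # the stacked projection [W; W'; W''] of Eq. (1): three d-by-d matrices
    return 3 * d * d

def srupp_attention_params(d: int, d_attn: int) -> int:
    # W^q (d' x d) + W^k, W^v (d' x d' each) + W^o (3d x d')
    return d_attn * d + 2 * d_attn * d_attn + 3 * d * d_attn

d = 3072
print(sru_projection_params(d))            # ~28.3M per layer without the factorization
print(srupp_attention_params(d, d // 4))   # ~10.6M per layer at ratio 4 (d' = 768)
print(srupp_attention_params(d, d // 8))   # ~5.0M  per layer at ratio 8 (d' = 384)
```

A larger ratio shrinks the attention block, which is why the ratio-8 configuration in Table 4 can spend the freed-up parameters on a larger recurrence dimension d within the same 108M budget.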
ENWIK8  Table 3 compares our model with other top-performing models on the ENWIK8 dataset. We train a base model with d = 3072 and a large model with d = 4096 using 400K training steps. The unroll size and attention context length are set to 1024 during training and 3072 during evaluation. To compare the computation efficiency, we report the effective GPU days – the number of GPUs used multiplied by the number of days needed to finish training. Our base model achieves better BPC and uses a fraction of the training cost reported in previous work. Furthermore, our large models achieve a new state-of-the-art result on this dataset, reaching a test BPC of 0.96 when d = 4 d′ and 0.95 when d = 8 d′.

Model | Parameters ↓ | Test BPC ↓ | GPU days ↓
Longformer 30L (Beltagy et al., 2020) | 102M | 0.99 | 104†
All-attention network 36L (Sukhbaatar et al., 2019b) | 114M | 0.98 | 64
Transformer-XL 24L (Dai et al., 2019) | 277M | 0.99 |
◦ Compressive memory (Rae et al., 2020) | | 0.97 |
Feedback Transformer (Fan et al., 2020) | 77M | 0.96 |
SRU++ Base | 108M | 0.97 | 6†
◦ only 2 attention layers (k = 5) | 98M | 0.98 | 4†
SRU++ Large | 191M | 0.96 | 12†
◦ d = 8 d′ | 195M | 0.95 | 13†

Table 3: Comparison with top-performing models on the ENWIK8 dataset. We include the training cost (measured by the number of GPUs used × the number of days) if it is reported in the previous work. Our results are obtained using an AWS p3dn instance with 8 V100 GPUs. The reported training time of the all-attention network is based on V100 GPUs, while the training time of Longformer is based on RTX8000 GPUs (which are about 90% of the speed of V100). † indicates mixed precision training.

WIKI-103  Table 5 presents the results of SRU++ models and other top results on the WIKI-103 dataset. We train one base model with 148M parameters and a few large models which contain about 230M parameters. As shown in the table, our base model obtains a test perplexity of 18.3 using 8 GPU days of training, about a 3x reduction compared to the Transformer model in Baevski and Auli (2019) and over a 10x reduction compared to Feedback Transformer (Fan et al., 2020). Again, changing the hidden size ratio to d = 8 d′ improves the modeling capacity. Our big model achieves a test perplexity of 17.1. The required training cost remains significantly lower.

Model | Parameters ↓ | Test PPL ↓ | GPU days ↓
All-attention network 36L (Sukhbaatar et al., 2019b) | 133M | 20.6 |
Feedback Transformer (Fan et al., 2020) | 139M | 18.2 | 214
Transformer (Baevski and Auli, 2019) | 247M | 18.7 | 22†
Transformer-XL 18L (Dai et al., 2019) | 257M | 18.3 |
◦ Compressive memory (Rae et al., 2020) | | 17.1 |
Routing Transformer (Roy et al., 2021) | | 15.8 |
kNN-LM (Khandelwal et al., 2020) | | 15.8 |
SRU++ Base | 148M | 18.3 | 8†
SRU++ Large | 232M | 17.4 | 14†
◦ d = 8 d′ | 234M | 17.1 | 15†
◦ only 2 attention layers (k = 5) | 225M | 17.3 | 11†

Table 5: Comparison with top-performing models on the WIKI-103 dataset. We include the training cost (measured by the number of GPUs used × the number of days) if it is reported in the previous work. The reported training costs are based on V100 GPUs. Our results are similarly obtained using an AWS p3dn instance with 8 V100 GPUs. † indicates mixed precision training.

BILLION WORD  We double our training iterations to 800K and use a learning rate of 0.0002 for the BILLION WORD dataset. We train a base model using d = 4096, d′ = 1024 and an effective batch size of 65K tokens per gradient update. We also train a large model by increasing the hidden size d to 7616 and the batch size to 98K. In addition, we use only 2 attention layers (k = 5) for the large model. Table 6 reports the test perplexity and the associated training cost. Our base and large models obtain a test perplexity of 25.1 and 23.5 respectively, outperforming the Transformer model of Baevski and Auli (2019) given a similar model size. Moreover, SRU++ achieves a 3-4x training cost reduction and is trained using 8 GPUs. In comparison, the Transformer model uses 32 or 64 V100 GPUs.

Model | Param | PPL ↓ | Days ↓
Transformer | 331M | 25.6 | 57†
Transformer | 465M | 25.2 | 147†
Transformer | | 23.9 | 192†
SRU++ | 328M | 25.1 | 36†
SRU++ (k = 5) | 465M | 23.5 | 63†

Table 6: Test perplexity and effective GPU days for training of SRU++ models and the Transformer models of Baevski and Auli (2019) on the BILLION WORD dataset.

Inference speed  Table 7 compares the inference speed of SRU++ with other top-performing models on the WIKI-103 test set. We use a single V100 GPU for inference. Our large model runs at least 4.5x faster than all baseline models except Shortformer (Press et al., 2021). In addition, our model achieves 0.9-1.1 lower perplexity than Shortformer and runs 50% faster when using 2 attention layers (k = 5).

Model | Speed ↑ | PPL ↓
kNN-LM (Khandelwal et al.) | 145 | 15.8
Trans (Baevski and Auli) | 2.5k | 18.7
Trans-XL (Dai et al.) | 3.2k | 18.3
Shortformer (Press et al.) | 15k | 18.2
SRU++ Large | 15k | 17.1
SRU++ Large (k = 5) | 22k | 17.3

Table 7: Inference speed (tokens/second) on the WIKI-103 test set. Results of baselines are taken from Press et al. (2021). We use a single V100 GPU, a batch size of 1 and a maximum attention length of 2560 for consistency.
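The speeds in Table 7 are tokens per second with batch size 1 on a single GPU. A hedged sketch of such a measurement is shown below; the incremental `model(step, state)` interface is an assumption, not the released API.

```python
import time
import torch

@torch.no_grad()
def tokens_per_second(model, token_ids, device="cuda"):
    """Measure decoding throughput for one sequence with batch size 1."""
    model.eval().to(device)
    x = token_ids.to(device).view(-1, 1)       # (length, batch=1)
    state = None
    torch.cuda.synchronize()
    start = time.time()
    for t in range(x.size(0)):
        _, state = model(x[t:t + 1], state)    # assumed incremental interface
    torch.cuda.synchronize()
    return x.size(0) / (time.time() - start)
```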
IWSLT  Does SRU++ work well for other tasks? We study this question by evaluating SRU++ on the IWSLT'14 De→En translation task. We use the open-sourced training and evaluation code of Lin et al. (2020). The baseline is an 8-layer Transformer model containing 20M parameters. We train SRU++ models using 6 layers and d = 1024, resulting in a similar number of parameters. We use the original settings such as learning rate and batch size, except that we use the RAdam optimizer for consistency and increase the number of training epochs to 50. Both architectures achieve much higher BLEU scores given more training epochs.[3] Table 8 presents the test results. Without additional hyperparameter tuning, SRU++ achieves a 0.4 higher BLEU score and less training time compared to the Transformer model tuned in Lin et al. (2020).

[3] Lin et al. (2020) report a test BLEU of 35.2. We obtain 35.9 for the same Transformer model by training longer.

Model | Param | BLEU ↑ | Hrs ↓
Transformer | 20.1M | 35.9±0.1 | 10.5
SRU++ | 20.4M | 36.3±0.2 | 8.5
SRU++ (k = 2) | 19.6M | 36.1±0.1 | 7.5

Table 8: Results on the IWSLT'14 De→En test set. We use a beam size of 5. BLEU scores and training time are averaged over 4 independent runs.

Why does SRU++ reduce training cost in our experiments?  Several factors contribute to the computation reduction observed in our experiments. First, combining attention and recurrence gives stronger modeling capacity. As shown in our experiments, SRU++ often achieves comparable results using fewer layers and/or fewer parameters. The required computation is much lower for shallower and smaller models. We also observe higher training efficiency, requiring fewer training steps and a smaller training batch compared to several Transformer models. For example, SRU++ uses a maximum effective batch size of 98K tokens and 800K training steps on the BILLION WORD dataset, while the Transformer model in comparison (Baevski and Auli, 2019) uses 128K tokens and nearly 1000K steps. The reduced batch size and gradient updates cut down the training cost.

Finally, model implementation is an important factor for computation saving. Our implementation is highly efficient for two reasons. First, the fast recurrence operation of SRU is a reusable module that is already optimized for speed (Lei et al., 2018). Second, since recurrence encodes positional information, we can use simple single-head attention and remove positional encoding. On the contrary, advanced attention and positional encoding mechanisms can generate non-trivial computation overhead. To see this, we measure the running time of SRU++ and Transformer-XL using the PyTorch profiler. Figure 5 (a) shows the average model forward time of a single batch. SRU++ runs 4-5x faster compared to the Transformer-XL implementation. Figure 5 (b) breaks down the computation and highlights the most time-consuming operations in both models. The matrix multiplications are among the most expensive operations for both models. Surprisingly, many operations in the relative attention of Transformer-XL are computationally expensive. For example, the relative attention requires shifting the attention scores and adding up different attention score matrices. Both require a lot of time, but they are not needed in non-relative attention. In addition, the last column shows the running time of the tensor transpose operators needed by the batch matrix-matrix multiplications in attention. Again, the relative attention uses an order of magnitude more time compared to the simple single-head attention used in our model implementation.[4]

[4] Note that this high latency of tensor transpose might be caused by sub-optimal implementation choices such as a poor arrangement of tensor axes in the open-sourced model. There is room for improvement. Nevertheless, relative attention and positional encoding are reported to be non-trivially slower in other works (Shaw et al., 2018; Tian et al., 2021).

Figure 5: Profiling of SRU++ and Transformer-XL: (a) forward time (in milliseconds) of small (41M) and large (139M) models and (b) forward time used in various types of time-consuming operations. We use a single GPU for profiling to avoid extra overhead such as data synchronization between GPUs. We use an unroll size / context length M = 512 and 1024 respectively for small and large models. All models use a batch size B = 16 for profiling.
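The per-operator breakdown in Figure 5 (b) can be produced with the PyTorch profiler; a minimal usage sketch follows, where the model interface and the number of timed steps are placeholders.

```python
import torch
from torch.profiler import profile, ProfilerActivity

def profile_forward(model, batch, steps=20):
    """Print the average per-operator GPU time of the forward pass."""
    model.cuda().eval()
    batch = batch.cuda()
    with torch.no_grad(), profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(steps):
            model(batch)                      # forward pass only
        torch.cuda.synchronize()
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```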
6 Related Work

Accelerating common architectures for NLP has become an increasingly important research topic recently (Tay et al., 2020; Sun et al., 2020; Lan et al., 2020). Our work is closely related to two lines of research under this topic.

First, previous works have tackled the speed problem of recurrent neural networks (RNNs) and have proposed various fast RNN implementations (Diamos et al., 2016; Campos et al., 2018; Zhang and Sennrich, 2019). Notably, the Quasi-RNN (Bradbury et al., 2017) and SRU (Lei et al., 2018) introduced highly parallelizable recurrence and combined it with convolutions or highway networks respectively. The resulting architectures achieve parallelism equivalent to convolutional and attention models. This advancement eliminates the need to avoid recurrence computation in exchange for training efficiency, a design choice made by the Transformer architecture. Our model builds on top of SRU.

Second, several recent works have argued that using attention alone is not the best architecture in terms of model expressiveness. For example, Dong et al. (2021) demonstrate theoretically and empirically that using pure attention results in performance degeneration. Gulati et al. (2020) combined convolution and attention and obtained new state-of-the-art results for speech recognition. Moreover, RNNs have been incorporated into Transformer architectures, resulting in improved results in machine translation and language understanding tasks (Lei et al., 2018; Huang et al., 2020). Our work is built upon a similar hypothesis that recurrence and attention are complementary at sequence modeling. We demonstrate that jointly leveraging fast recurrence and attention not only achieves state-of-the-art modeling results but also obtains significant computation reduction.

Orthogonal to our work, many recent works improve the efficiency of Transformer models by accelerating the attention computation (Zaheer et al., 2020; Katharopoulos et al., 2020; Vyas et al., 2020; Peng et al., 2021). Examples include Longformer (Beltagy et al., 2020), Reformer (Kitaev et al., 2020), Linformer (Wang et al., 2020) and Routing Transformer (Roy et al., 2021). In contrast, our work optimizes computational efficiency using recurrence combined with minimal attention, and our model can incorporate these attention variants for additional speed improvement.

7 Conclusion

We present a highly-efficient architecture combining fast recurrence and attention, and evaluate its effectiveness on various language modeling datasets. We demonstrate that fast RNNs with little attention not only achieve top results but also reduce training cost significantly. Our work takes a different route than accelerating attention, therefore providing an orthogonal direction for advancing state-of-the-art model architectures. As future work, we believe the model can be improved using stronger attention or recurrence implementations, and better normalization or optimization techniques.

Acknowledgement

We would like to thank ASAPP Inc. for making this work possible. We thank Hugh Perkins, Joshua Shapiro, Sam Bowman, Danqi Chen and Yu Zhang for providing invaluable feedback for this work.
Finally, we thank Jeremy Wohlwend, Jing Pan, Prashant Sridhar and Kyu Han for helpful discussions, and ASAPP Language Technology and Infra teams for the compute cluster setup for our research experiments. References Jeremy Appleyard, Tomas Kocisky, and Phil Blunsom. 2016. Optimizing performance of recurrent neural networks on gpus. arXiv preprint arXiv:1604.01946. Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450. Alexei Baevski and Michael Auli. 2019. Adaptive input representations for neural language modeling. In International Conference on Learning Representations (ICLR). Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR). Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150. James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2017. Quasi-Recurrent Neural Networks. In International Conference on Learning Representations (ICLR). Andrew Brock, Soham De, and Samuel L Smith. 2021. Characterizing signal propagation to close the performance gap in unnormalized resnets. In International Conference on Learning Representations. Víctor Campos, Brendan Jou, Xavier Giró i Nieto, Jordi Torres, and Shih-Fu Chang. 2018. Skip rnn: Learning to skip state updates in recurrent neural networks. In International Conference on Learning Representations (ICLR). Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. Technical report, Google. Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Greg Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, and Sanjeev Satheesh. 2016. Persistent rnns: Stashing recurrent weights on-chip. In Proceedings of The 33rd International Conference on Machine Learning (ICML). Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. 2021. Attention is not all you need: pure attention loses rank doubly exponentially with depth. In Proceedings of the 38th International Conference on Machine Learning (ICML). Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin, and Sainbayar Sukhbaatar. 2020. Accessing higher-level representations in sequential transformers with feedback memory. arXiv preprint arXiv:2002.09402. Edouard Grave, Armand Joulin, Moustapha Cissé, Hervé Jégou, et al. 2017. Efficient softmax approximation for gpus. In Proceedings of the 34th International Conference on Machine Learning (ICML). Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. 
Conformer: Convolution-augmented transformer for speech recognition. In Proceedings of the 21st Annual Conference of the International Speech (INTERSPEECH). Jie Hao, Xing Wang, Baosong Yang, Longyue Wang, Jinfeng Zhang, and Zhaopeng Tu. 2019. Modeling recurrence for transformer. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation. Elad Hoffer, Itay Hubara, and Daniel Soudry. 2017. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems. Yi-Te Hsu, Sarthak Garg, Yi-Hsiu Liao, and Ilya Chatsviorkin. 2020. Efficient inference for neural machine translation. In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing. Zhiheng Huang, Peng Xu, Davis Liang, Ajay Mishra, and Bing Xiang. 2020. Trans-blstm: Transformer with bidirectional lstm for language understanding. arXiv preprint arXiv:2003.07000. Marcus Hutter. 2006. The human knowledge compression contest. http://prize.hutter1.net/. Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning (ICML). Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations (ICLR). Young Jin Kim, Marcin Junczys-Dowmunt, Hany Hassan, Alham Fikri Aji, Kenneth Heafield, Roman Grundkiewicz, and Nikolay Bogoychev. 2019. From research to production and back: Ludicrously fast neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation. Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR). Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In International Conference on Learning Representations (ICLR). Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. 2018. Dynamic evaluation of neural sequence models. In Proceedings of the 35th International Conference on Machine Learning (ICML). Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. Albert: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations (ICLR). Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, and Yoav Artzi. 2018. Simple recurrent units for highly parallelizable recurrence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP). Alexander Lin, Jeremy Wohlwend, Howard Chen, and Tao Lei. 2020. Autoregressive knowledge distillation through imitation learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In International Conference on Learning Representations (ICLR). Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2020a. On the variance of the adaptive learning rate and beyond. 
In International Conference on Learning Representations (ICLR). Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. 2020b. Understanding the difficulty of training transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Stephen Merity. 2019. Single headed attention rnn: Stop thinking with your head. arXiv preprint arXiv:1911.11423. Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In International Conference on Learning Representations (ICLR). Jinhwan Park, Yoonho Boo, Iksoo Choi, Sungho Shin, and Wonyong Sung. 2018. Fully neural network based speech recognition on mobile and embedded devices. In Advances in Neural Information Processing Systems (NeurIPS). Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong. 2021. Random feature attention. In International Conference on Learning Representations (ICLR). Ofir Press, Noah A. Smith, and Mike Lewis. 2021. Shortformer: Better language modeling using shorter inputs. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations (ICLR). Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. 2021. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics. Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. 2020. Green AI. Communications of the ACM. Yuan Shangguan, Jian Li, Qiao Liang, Raziel Alvarez, and Ian McGraw. 2019. Optimizing speech recognition for the edge. arXiv preprint arXiv:1909.12408. Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Sheng Shen, Zhewei Yao, Amir Gholami, Michael Mahoney, and Kurt Keutzer. 2020. PowerNorm: Rethinking batch normalization in transformers. In Proceedings of the 37th International Conference on Machine Learning (ICML). Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019a. Adaptive attention span in transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, and Armand Joulin. 2019b. Augmenting self-attention with persistent memory. arXiv preprint arXiv:1907.01470. Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. MobileBERT: a compact task-agnostic BERT for resourcelimited devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732. Ran Tian, Joshua Maynez, and Ankur P Parikh. 2021. Shatter: An efficient transformer encoder with single-headed self-attention and relative sequence partitioning. arXiv preprint arXiv:2108.13032. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. 
In Advances in Neural Information Processing Systems (NeurIPS).

Apoorv Vyas, Angelos Katharopoulos, and François Fleuret. 2020. Fast transformers with clustered attention. In Advances in Neural Information Processing Systems (NeurIPS).

Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. 2020. On layer normalization in the transformer architecture. In Proceedings of the 37th International Conference on Machine Learning (ICML).

Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, and Junyang Lin. 2019. Understanding and improving layer normalization. In Advances in Neural Information Processing Systems (NeurIPS).

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems (NeurIPS).

Biao Zhang and Rico Sennrich. 2019. A lightweight recurrent network for sequence modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

A Additional results

A.1 Detailed analysis of attention

Table 10 presents a more comprehensive analysis of attention in SRU++ models. First, we change the number of attention layers and their locations in the model. As shown in the top block of Table 10, using attention in 50% of the layers leads to no (or negligible) loss in model performance. This is consistent with the results in Table 2 using a smaller model. Enabling attention in higher layers performs slightly better than evenly distributing attention from the bottom to the top layers.

We also experiment with using more than one attention head in each attention layer, as shown in the middle block of the table. Unlike for Transformer models, however, we do not observe a significant improvement using multiple heads. We hypothesize that the recurrence states can already carry different features or information that are present at different input positions, making redundant heads unnecessary.

Finally, changing the ratio d : d′ from 4 to 8 gives similar improvements regardless of using 2 attention layers or 10 attention layers. This suggests that the amount of attention and the hidden size ratio can be tuned independently for best model performance.

A.2 The effectiveness of layer normalization

In our experiments, we have always used layer normalization to stabilize training. However, we also found layer normalization to achieve worse generalization for larger models that are more prone to over-fitting. Figure 6 showcases our empirical observation on the ENWIK8 dataset. Using layer normalization achieves more rapid training progress and lower training loss, but results in higher dev loss in the case of training a 108M model. This generalization gap remains even if we tune the dropout rate carefully. In addition, although using layer normalization in the smaller model with 41M parameters gives slightly better dev results, we still observe a larger generalization gap (indicated by the difference between training loss and dev loss) compared to the run without layer normalization. Similar over-fitting patterns are observed on the WIKI-103 dataset, and also in previous work (Xu et al., 2019).

On the other hand, turning off layer normalization can achieve better generalization but makes training sensitive to the learning rate and parameter initialization. For example, we have to use a smaller learning rate of 0.00025 or lower to avoid sudden gradient explosion during training. These results suggest possible future work on improving the normalization method (Shen et al., 2020; Brock et al., 2021).

A.3 Tuning weight decay and learning rate

We find that tuning the weight decay and learning rate is critical to the success of training SRU++ and achieving the best results. Table 9 provides a sensitivity analysis by testing different learning rates and weight decay values. Increasing the weight decay consistently gives better results for all learning rates tested. Tuning the learning rate is also needed to reach the best result. The non-trivial effect of weight decay seems to be unique to SRU++. On the other hand, the performance of SRU++ remains robust once the appropriate weight decay and learning rate are set.
As shown in previous results and analyses, SRU++ achieves strong and relatively stable results across various hidden sizes, numbers of attention layers and datasets. In particular, using the same weight decay value generalizes well for all datasets (including language modeling and translation tasks) and model configurations tested.

Weight decay | LR 3 × 10⁻⁴ | LR 2 × 10⁻⁴ | LR 1.5 × 10⁻⁴
0.10 | 1.014 | 1.022 | 1.030
0.01 | - | 1.035 | 1.038
0.00 | - | 1.047 | 1.040

Table 9: Dev BPC of SRU++ given a learning rate ∈ {1.5, 2, 3} × 10⁻⁴ and a weight decay ∈ {0.1, 0.01, 0}. '-' means the training run diverged or got gradient explosion.

B Training details

Language modeling  We use the RAdam optimizer[5] with the default hyperparameters β1 = 0.9 and β2 = 0.999 for all our experiments. We use a cosine learning rate schedule with only 1 cycle for simplicity. For faster training, we also leverage the native automatic mixed precision (AMP) training and distributed data parallel (DDP) of PyTorch in all experiments, except those in Table 1 and Figure 1, for a fair comparison with the Transformer-XL implementation.

[5] https://github.com/LiyuanLucasLiu/RAdam
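A condensed sketch of this optimization setup in present-day PyTorch is given below. The function name and the warmup handling are ours, torch.optim.RAdam stands in for the open-sourced RAdam implementation referenced above, and the DDP wrapping of the model is omitted.

```python
import math
import torch

def build_optimizer(model, lr=3e-4, weight_decay=0.1, total_steps=400_000, warmup=16_000):
    opt = torch.optim.RAdam(model.parameters(), lr=lr,
                            betas=(0.9, 0.999), weight_decay=weight_decay)

    def lr_lambda(step):
        if step < warmup:                            # linear warmup
            return step / max(1, warmup)
        progress = (step - warmup) / max(1, total_steps - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # single-cycle cosine decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    scaler = torch.cuda.amp.GradScaler()             # automatic mixed precision
    return opt, sched, scaler
```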
Table 11 shows the detailed training configuration of SRU++ models on the ENWIK8 dataset. Most training options are kept the same for all models. We tune the dropout probability more carefully, as we found training to be more prone to over-fitting and under-fitting on this dataset. The large model is trained with a 2x batch size. As a result, we increase the learning rate proportionally by a factor of √2 (Hoffer et al., 2017), which results in a rounded learning rate of 0.0004.

Table 12 presents the detailed training configuration on the WIKI-103 dataset. Similarly, we use d = 3072 and d = 4096 for the base and large model respectively for a hidden size ratio d : d′ = 4 : 1. Following Baevski and Auli (2019), we use an adaptive word embedding layer and an adaptive softmax layer for our models, and we tie the weight matrices of the two layers. We keep the total number of parameters comparable when we use a different hidden size ratio d : d′ = 8 : 1.

Machine translation  We use the open-sourced code from Lin et al. (2020) for the IWSLT'14 De→En translation task. The Transformer model tuned by the original work uses 8 layers for both the encoder and decoder and a total of 20M parameters. Most of the training configuration remains the same as the original work,[6] except for a couple of changes. First, we use the RAdam optimizer and the same β values for consistency with the language modeling task. We use the same weight decay value of 0.1 for SRU++. The Transformer model uses a weight decay of 0 that is tuned based on dev set performance. Second, we increase the number of training epochs to 50 (or equivalently 64K training steps) since all models achieve better BLEU scores by training longer. This ensures we compare models when they reach their maximum performance.

[6] https://github.com/asappresearch/imitkd/blob/master/configs/iwslt/teacher.yaml

Our SRU++ model uses a hidden size d = 1024, an attention size d′ = 256 and 6 layers for the encoder and decoder, resulting in a similar number of parameters as the Transformer model in comparison. Let Xsrc be the output representation of the SRU++ encoder. Each SRU++ decoder layer makes use of Xsrc by simply treating it as extra attention context. That is, the query, key and value representations are computed by concatenating the input of the current layer Xtgt with Xsrc,

Q = [Qsrc, Qtgt] = W^q [Xsrc, Xtgt]⊤
K = W^k Q
V = W^v Q

The resulting representations Qtgt, K and V are used for the rest of the attention computation. The attention mask is set such that each target token can only attend to all source tokens and preceding target tokens.
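A sketch of this decoder attention, following the equations above with a single head; the function and argument names are ours, and the weights are passed in as plain matrices.

```python
import math
import torch

def decoder_attention(x_src, x_tgt, wq, wk, wv, alpha, d_attn):
    """Queries, keys and values over the concatenation [x_src; x_tgt].

    x_src: (S, d) encoder output, x_tgt: (T, d) decoder layer input,
    wq: (d', d), wk/wv: (d', d') weight matrices, alpha: learned scalar.
    """
    S, T = x_src.size(0), x_tgt.size(0)
    q = torch.cat([x_src, x_tgt], dim=0) @ wq.t()          # (S+T, d')
    k, v = q @ wk.t(), q @ wv.t()
    q_tgt = q[S:]                                           # only target queries are used
    scores = (q_tgt @ k.t()) / math.sqrt(d_attn)            # (T, S+T)
    # each target token may attend to all source tokens and to preceding target tokens
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    mask = torch.cat([torch.zeros(T, S, dtype=torch.bool), causal], dim=1)
    scores = scores.masked_fill(mask, float("-inf"))
    a = torch.softmax(scores, dim=-1) @ v                   # (T, d')
    return q_tgt + alpha * a                                # residual before W^o, as in Section 3
```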
Layers with attention | Num of heads | d | d′ | Model size | Dev BPC
All layers | 1 | 3072 | 768 | 108M | 0.997
6,7,8,9,10 | 1 | 3072 | 768 | 102M | 0.997
2,4,6,8,10 | 1 | 3072 | 768 | 102M | 0.999
8,9,10 | 1 | 3136 | 784 | 103M | 1.000
3,6,9 | 1 | 3136 | 784 | 103M | 1.001
5,10 | 1 | 3072 | 768 | 98M | 1.002
5,10 | 2 | 3072 | 768 | 98M | 1.002
10 | 1 | 3072 | 768 | 97M | 1.007
10 | 2 | 3072 | 768 | 97M | 1.006
All layers | 1 | 3072 | 768 | 108M | 0.997
5,10 | 1 | 3072 | 768 | 98M | 1.002
All layers | 1 | 4480 | 560 | 109M | 0.991
5,10 | 1 | 4480 | 560 | 104M | 0.997

Table 10: Results of 10-layer SRU++ models by varying the attention setting. We report the dev BPC on the ENWIK8 dataset. The first column indicates the layers where attention is located. Smaller index numbers represent layers closer to the input of the model.

Figure 6: Understanding the empirical effect of layer normalization. We show the training and dev loss of SRU++ models using 41M parameters and 108M parameters on the ENWIK8 dataset. The model with layer normalization fits the training data better, but achieves worse generalization.

Setting | Base model (k = 5) | Base model | Large model | Large model
Attention / unroll size - train | 1024 | 1024 | 1024 | 1024
Attention / unroll size - test | 3072 | 3072 | 3072 | 3072
Batch size × Num of GPUs | 4×8 | 4×8 | 8×8 | 8×8
Dropout | 0.22 | 0.22 | 0.32 | 0.35
Gradient clipping | 1.0 | 1.0 | 1.0 | 1.0
Hidden size ratio d : d′ | 4 | 4 | 4 | 8
Hidden size d | 3072 | 3072 | 4096 | 6016
Hidden size d′ | 768 | 768 | 1024 | 752
Learning rate | 0.0003 | 0.0003 | 0.0004 | 0.0004
LR warmup steps | 16K | 16K | 16K | 16K
Training steps | 400K | 400K | 400K | 400K
Weight decay | 0.1 | 0.1 | 0.1 | 0.1
Model size | 98M | 108M | 191M | 195M
Dev BPC | 1.002 | 0.997 | 0.985 | 0.974
Test BPC | 0.980 | 0.974 | 0.963 | 0.953

Table 11: Training details of SRU++ models on the ENWIK8 dataset.

Setting | Base model | Large model | Large model (k = 5) | Large model
Attention / unroll size - train | 768 | 1024 | 1024 | 1024
Attention / unroll size - test | 2560 | 2560 | 2560 | 2560
Batch size × Num of GPUs | 8×8 | 8×8 | 8×8 | 8×8
Dropout | 0.15 | 0.2 | 0.2 | 0.2
Gradient clipping | 1.0 | 1.0 | 1.0 | 1.0
Hidden size ratio d : d′ | 4 | 4 | 8 | 8
Hidden size d | 3072 | 4096 | 5952 | 5952
Hidden size d′ | 768 | 1024 | 744 | 744
Learning rate | 0.0003 | 0.0003 | 0.0003 | 0.0003
LR warmup steps | 16K | 16K | 16K | 16K
Training steps | 400K | 400K | 400K | 400K
Weight decay | 0.1 | 0.1 | 0.1 | 0.1
Model size | 148M | 232M | 225M | 234M
Dev PPL | 17.5 | 16.7 | 16.6 | 16.4
Test PPL | 18.3 | 17.4 | 17.3 | 17.1

Table 12: Training details of SRU++ models on the WIKI-103 dataset.