Shortformer: Better Language Modeling Using Shorter Inputs

Ofir Press 1,2   Noah A. Smith 1,3   Mike Lewis 2
1 Paul G. Allen School of Computer Science & Engineering, University of Washington
2 Facebook AI Research
3 Allen Institute for AI
ofirp@cs.washington.edu

arXiv:2012.15832v2 [cs.CL] 3 Jun 2021

Abstract

Increasing the input length has been a driver of progress in language modeling with transformers. We identify conditions where shorter inputs are not harmful, and achieve perplexity and efficiency improvements through two new methods that decrease input length. First, we show that initially training a model on short subsequences before moving on to longer ones both reduces overall training time and, surprisingly, substantially improves perplexity. Second, we show how to improve the efficiency of recurrence methods in transformers, which let models condition on previously processed tokens when generating sequences that exceed the maximal length the transformer can handle at once. Existing methods require computationally expensive relative position embeddings; we introduce a simple alternative of adding absolute position embeddings to queries and keys instead of to word embeddings, which efficiently produces superior results. We show that these recurrent models also benefit from short input lengths. Combining these techniques speeds up training by a factor of 1.65, reduces memory usage, and substantially improves perplexity on WikiText-103, without adding any parameters.[1]

[1] Our code is available at https://github.com/ofirpress/shortformer

1 Introduction

Scaling up transformer (Vaswani et al., 2017) language models (Radford et al., 2019; Lewis et al., 2019; Raffel et al., 2019; Brown et al., 2020) has been an important driver of progress in NLP. Language models require data to be segmented into subsequences for both training and inference: memory constraints limit a language model to handling at most a few thousand tokens at once, while many training and evaluation datasets are much longer.

Recent work focuses on increasing the length of input subsequences, which determines the maximum number of tokens a model can attend to (Baevski and Auli, 2018; Sukhbaatar et al., 2019; Kitaev et al., 2020; Roy et al., 2020). We challenge the assumption that longer input subsequences are always better by showing that existing transformers do not always effectively use them. We then introduce new methods based on shorter input subsequences that improve runtime, memory efficiency, and perplexity.

We first investigate how input subsequence length affects transformer language models (§3). Naïve evaluation—where we split a large evaluation set into multiple nonoverlapping subsequences, each evaluated independently—initially supports the commonly held belief that models that train and do inference on longer subsequences achieve better perplexity (Table 1, col. 3). However, when we evaluate each model with a sliding window (Baevski and Auli, 2018), outputting one token at a time using the maximal amount of context, we find—surprisingly—that models using subsequences exceeding 1,024 tokens do not further improve performance (Table 1, col. 5). We conclude that the performance gains (using naïve evaluation) of models that use longer subsequences occur not only because of their better modeling ability, but partly because they divide the evaluation set into longer subsequences.
This division helps because of an issue we call the early token curse: by default, early tokens in a subsequence will have short histories to attend to. Using longer subsequences means fewer tokens will suffer from the early token curse. For example, when using inputs of length 1,024, about 94% of tokens get to attend to more than 64 preceding tokens. If we use inputs of length 128, only 50% of tokens get to attend to 64 or more preceding tokens. Based on this analysis, we explore how to improve models by using shorter inputs. We introduce two techniques.

Staged Training (§4) First, we show that initially training on shorter subsequences (before moving to longer ones) leads not only to much faster and more memory-efficient training, but it surprisingly also greatly improves perplexity, suggesting that longer inputs are harmful early in training.

Position-Infused Attention (§5) Second, we consider a natural way to avoid the early token curse during training and inference: attending to cached representations from the previously evaluated subsequence (Dai et al., 2019). This approach interferes with conventional absolute position embeddings in a way that forced Dai et al. to use relative position embeddings, which are computationally expensive. We introduce a fast, simple alternative: instead of adding absolute position embeddings to word embeddings—thereby entangling a word's content and positional information—we add them to the keys and queries in the self-attention mechanism (but not to the values). This does not increase parameter count or runtime. Token representations can then be cached and reused in subsequent computations. We show that when using this method, shorter subsequence models outperform longer ones.

Finally, we show additive gains from combining staged training and position-infused attention (Shortformer, §6), resulting in a model that trains much quicker and achieves better perplexity on WikiText-103. We also show that these results transfer to language modeling on the Toronto Book Corpus (§A.5, appendix).

2 Background and Experimental Setup

Transformer language models map a list of tokens x_{n-L:n-1} to a probability distribution over the next token x_n. We refer to the list of tokens as the current input subsequence (whose length is L). Causal masking lets us make L predictions at once, with the prediction for token i + 1 conditioned on the ith token and all previous inputs x_{n-L:i-1}, but not on future inputs. We define the number of tokens the model can attend to at each timestep as its effective context window. Note that L is not to be confused with the (typically much greater) length of a training or evaluation dataset.

During inference, language models can be used for two distinct tasks: generation and evaluation. In order to define these tasks, we first define nonoverlapping and sliding window inference.

Nonoverlapping Inference To evaluate a string longer than L, we can evaluate each subsequence of L tokens independently. This fast approach is commonly used during training; if used, tokens in one subsequence cannot condition on those in the previous subsequence, giving rise to the early token curse discussed in §1. See Figure 1(a).
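To make the early-token-curse figures from §1 concrete, here is a small sketch (ours, not from the paper) that counts how many positions in a nonoverlapping block of length L see at least 64 preceding tokens; it reproduces the roughly 94% (L = 1,024) and 50% (L = 128) numbers quoted above.

```python
# Sketch (not from the paper): fraction of positions inside a nonoverlapping
# block of length L that can attend to at least `min_ctx` preceding tokens.
def fraction_with_context(L: int, min_ctx: int = 64) -> float:
    # Position i (0-indexed) in a block attends to i preceding tokens, so
    # positions min_ctx, ..., L - 1 have at least min_ctx tokens of history.
    return max(L - min_ctx, 0) / L

print(fraction_with_context(1024))  # 0.9375 -> roughly the 94% quoted in Section 1
print(fraction_with_context(128))   # 0.5    -> the 50% quoted in Section 1
```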
Sliding Window Inference An alternative to the above is to use a sliding window during inference. Here, we choose a stride S between 1 and L − 1 and advance the window by S tokens after each forward pass.[2] This means that L − S tokens from the previous block are re-encoded, and only S new tokens are outputted. The advantage is that all outputs in each subsequence after the first have at least L − S previous tokens to condition on. However, since tokens must be re-encoded multiple times, this approach is much slower. When S = 1, we output one token every inference pass, each using the maximal context window, but this is the slowest approach. See Figure 1(b).

[2] Nonoverlapping inference can be viewed as sliding window inference with stride L.

Minimal and Maximal Effective Context Window Sizes In the nonoverlapping approach, the min. and max. effective context window sizes are 1 and L, respectively. In the sliding window approach, the max. context window size is still L, but the min. context window size is now L − S + 1.

Evaluation vs. Generation In evaluation, a model assigns a perplexity score to a given sequence. Evaluation is done using either nonoverlapping inference or with a sliding window of any stride; since we already have the target sequence, we can simultaneously make predictions for multiple timesteps using causal masking. In generation, a model generates a new sequence, as in demonstrations of GPT-3 (Brown et al., 2020). Generation is done only with a sliding window with stride S = 1, which we refer to as token-by-token generation. During generation, we append to the input a single new token, get a prediction from the model about the next token (e.g., using beam search or picking the token with the highest probability); the process is then repeated.[3]

[3] In this paper we do not consider open-ended generation; we generate the dev. set, and for next-token prediction we use the ground truth token. This has the same complexity as sampling the token with the highest probability.

Figure 1: Language model modes for generating or evaluating 6 tokens (a, b, ..., f) when subsequence length L = 3. The numbers denote the position embeddings (P.E.). (a) Nonoverlapping (§2). (b) Sliding window, stride S = 1. Here, after the first inference pass we ignore all outputs other than the last (§2). (c) Caching (§5.2), where each subsequence attends to representations of the previous one. (In the next iteration, tokens d, e, and f become the cache, with P.E. 1, 2, and 3; the three new tokens get P.E. 4, 5, and 6.)
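The sketch below (our illustration, not code from the paper) contrasts the two inference modes used for evaluation: nonoverlapping inference scores each length-L block independently, while sliding-window inference with stride S re-encodes L − S old tokens per pass and keeps only the S new outputs. `score_block` is an assumed stand-in for a causally masked forward pass of a real model.

```python
import math
from typing import Callable, List, Sequence

Scorer = Callable[[Sequence[int]], List[float]]

# Stand-in for a causally masked LM forward pass: given a block of token ids, it
# returns one log-probability per token, each conditioned only on the tokens
# before it in the block. A uniform dummy keeps the sketch self-contained.
def uniform_scorer(block: Sequence[int], vocab_size: int = 100) -> List[float]:
    return [-math.log(vocab_size) for _ in block]

def nonoverlapping_ppl(tokens: Sequence[int], L: int, score_block: Scorer) -> float:
    logps: List[float] = []
    for start in range(0, len(tokens), L):          # each block is scored independently
        logps.extend(score_block(tokens[start:start + L]))
    return math.exp(-sum(logps) / len(logps))

def sliding_window_ppl(tokens: Sequence[int], L: int, S: int, score_block: Scorer) -> float:
    logps: List[float] = list(score_block(tokens[:L]))   # first window: keep all outputs
    end = L
    while end < len(tokens):
        n_new = min(S, len(tokens) - end)                # S new tokens per forward pass
        window = tokens[end + n_new - L:end + n_new]     # re-encodes L - n_new old tokens
        logps.extend(score_block(window)[-n_new:])       # keep only the new outputs
        end += n_new
    return math.exp(-sum(logps) / len(logps))

tokens = list(range(10_000))
print(nonoverlapping_ppl(tokens, L=1024, score_block=uniform_scorer))
print(sliding_window_ppl(tokens, L=1024, S=512, score_block=uniform_scorer))
# With the uniform dummy both print 100; with a real model the sliding window gives
# every output at least L - S tokens of context, at a much higher computational cost.
```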
Experimental Setup Our baseline is the Baevski and Auli (2018) model, henceforth B&A, trained and evaluated on WikiText-103 (Merity et al., 2016). We use this baseline because of its prominent role in recent language modeling developments (Khandelwal et al., 2020; Press et al., 2020). The training set contains 103.2 million tokens from English Wikipedia. The B&A model has 16 transformer layers of dimension 1,024, with 8 heads in each self-attention sublayer, and feedforward sublayers with an inner dimension of 4,096. This model ties the word embedding and softmax matrices (Press and Wolf, 2017; Inan et al., 2017) and uses sinusoidal position embeddings. It has a subsequence length of 3,072 tokens and achieves a perplexity of 18.65 ± 0.24 (std. dev.) on the development set. In our experiments, other than varying the subsequence length, we modify no other hyperparameters, including the random seed and number of training epochs (205).

3 How Does Context Window Size Affect Transformers?

Segmenting a corpus into subsequences results in different effective context windows for different timesteps, depending on where they fall in a segment. Subsequence length L is an upper bound on the effective context window at each timestep. When making the first prediction, the model attends only to the first input token. When making the second prediction, the model attends to the first two inputs, and so on, up to the Lth timestep, where the model can attend to all input tokens when making the Lth prediction.

3.1 Context Window Size Matters

Table 1 explores the effect of subsequence length in the B&A model on training runtime and on dev. set perplexity and runtime.[4] We fix the number of tokens in each batch to 9,216 but vary the subsequence length L and batch size (so the product of the batch size and subsequence length remains at 9,216). We report results for both nonoverlapping inference and sliding window inference with stride S = 1, which generates only one new token per forward pass; it thus has the maximal effective context window for each generated token. We find that performance increases as S decreases until it reaches a peak and then stops improving (not shown in Table 1).[5]

[4] For consistency, throughout the paper we run inference with a batch size of one. This causes models shorter than L = 512 to run slowly (in N.o. eval.), although during batched N.o. eval. they are slightly faster than the L = 512 model.

[5] For example, the L = 3,072 model's performance peaked at S = 512 (used in Baevski and Auli (2018)) and then stopped improving. Thus, the result shown in Table 1 for that model with S = 1 can also be achieved with S = 512, even though that runs 500 times faster, at 2.5k tok./sec.

Subseq.   Train      Nonoverlapping        Sliding Window (Token-by-token)
Length    Speed ↑    PPL ↓     Speed ↑     PPL ↓     Speed ↑
32        28.3k      35.37     2.4k        24.98     74
64        28.5k      28.03     4.8k        21.47     69
128       28.9k      23.81     9.2k        19.76     70
256       28.1k      21.45     14.8k       18.86     63
512       26.1k      20.10     18.1k       18.41     37
1024      22.9k      19.11     18.3k       17.97     18
1536      18.4k      19.05     17.1k       18.14     11
3072      13.9k      18.65     14.7k       17.92     5

Table 1: Subsequence length's effects on performance of the B&A model on the WikiText-103 dev. set. The baseline is the last row. Token-by-token inference was computed with a sliding window stride S = 1 to output one token at a time; see §2. We measure speed in tok./sec. per GPU and use a batch size of 1 for inference.

We derive the following conclusions:

Training on long sequences is expensive. Models trained on subsequences of length 256 are twice as fast as models trained on subsequences of 3,072 tokens, but gains for even shorter lengths are negligible (Tab. 1, col. 2).

Long subsequence lengths can improve results. When using the naïve approach, nonoverlapping evaluation, we see a monotonic decrease in dev. perplexity when increasing L (Tab. 1, col. 3).

Increasing the minimum effective context window size is more important than increasing the maximum one. Using a sliding window for token-by-token evaluation substantially improves results for all models (Tab. 1, col. 5). Here, we see negligible improvement between the models trained with subsequence lengths of 1,024 and 3,072 tokens (0.05 perplexity). This approach improves results by increasing the minimum amount of context available at each timestep, which indicates that long contexts may not be beneficial to transformer models, but very short contexts are harmful. However, sliding window inference can be expensive since each token is encoded many times. For example, token-by-token inference for the L = 3,072 model is almost 3,000 times slower than nonoverlapping inference.
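As a back-of-the-envelope check on the training-cost conclusion above (our arithmetic, not the paper's): with the tokens per batch fixed at 9,216, a batch holds 9,216/L subsequences, each paying an attention cost that grows quadratically in L, so the attention work per batch grows linearly with L. The measured slowdown in Table 1 is milder because the feedforward sublayers' cost does not depend on L.

```python
# Sketch (not from the paper): relative self-attention cost per training batch
# when the token budget per batch is held fixed at 9,216, as in Table 1.
TOKENS_PER_BATCH = 9_216

def attention_cost(L: int) -> int:
    n_subseqs = TOKENS_PER_BATCH // L   # batch size shrinks as L grows
    return n_subseqs * L * L            # each subsequence pays O(L^2) attention work

for L in (256, 1024, 3072):
    print(L, attention_cost(L) / attention_cost(256))
# 256 -> 1.0, 1024 -> 4.0, 3072 -> 12.0: attention work grows linearly with L.
```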
4 Training Subsequence Length

§3 results show that models trained on shorter subsequences can be effective at test time, and are much faster to train. We further explore this below.

4.1 Staged Training

We propose a two-stage training routine that initially uses short input subsequences followed by long subsequences.[6] This method was previously applied to speed up the training of BERT (Devlin et al., 2019), but we show that it also improves perplexity.

[6] Curriculum learning (Bengio et al., 2009) trains on easier inputs before progressing to harder ones. Our approach does not change the order in which the training examples are given to the model, but instead modifies their lengths.

We use sinusoidal position embeddings; learned position embeddings, which we do not consider, create a dependency between the parameterization and subsequence length. In our experiments, we neither modify nor reset the state of the optimization algorithm between the two stages.
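A minimal sketch of the two-stage routine (our pseudocode-style illustration, not the authors' implementation; `make_batches` and `train_one_epoch` are assumed helpers, and the 128-token, 50-epoch first stage mirrors the best configuration reported in §4.2). The two points to preserve are that only the subsequence length and batch size change between stages, and that the optimizer state carries over.

```python
# Sketch (not the authors' code): staged training with a fixed token budget.
TOKENS_PER_BATCH = 9_216
STAGES = [(128, 50),      # stage 1: short subsequences
          (3072, 155)]    # stage 2: full-length subsequences (205 epochs total)

def staged_training(model, optimizer, corpus, make_batches, train_one_epoch):
    # make_batches(corpus, L, batch_size) is assumed to split the corpus into
    # nonoverlapping length-L subsequences and group them into batches;
    # train_one_epoch runs one ordinary LM training epoch over those batches.
    for L, n_epochs in STAGES:
        batches = make_batches(corpus, L, batch_size=TOKENS_PER_BATCH // L)
        for _ in range(n_epochs):
            train_one_epoch(model, optimizer, batches)
        # The optimizer state is neither modified nor reset between stages.
    return model
```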
4.2 Experiments

Our experimental setup is described in §2. We do not change any hyperparameters other than reducing subsequence length while correspondingly increasing batch size to keep the number of tokens per batch constant. As in the baseline, all models are trained for 205 epochs. All models are trained in two stages; the second stage always uses a subsequence length of 3,072, since that led to the best performance (discussed at the end of this subsection).

Initial Stage      Initial Stage Epochs
Subseq. Length     25      50      75      100     125     150     175     200
32                 17.94   17.81   17.93   18.14   18.61   19.45   21.16   35.38
64                 17.57   17.59   17.61   17.67   17.88   18.37   19.51   28.03
128                17.58   17.52   17.55   17.62   17.70   17.98   18.57   23.80
256                18.19   18.08   18.01   18.00   18.00   18.01   18.23   21.45
512                18.06   18.01   18.05   18.10   18.13   18.15   18.20   19.63
1024               18.20   18.14   18.03   18.00   17.98   18.00   18.08   18.56
1536               18.77   18.62   18.57   18.51   18.49   18.49   18.57   18.84

Table 2: Each model's perplexity at the end of training (dev. set, nonoverlapping eval.). All models have a subsequence length of 3,072 tokens at the end of training. The B&A baseline achieves 18.65 ± 0.24 perplexity.

Appendix Table 6 shows the time each training routine takes to match the baseline model's performance on the validation set of WikiText-103.[7] Many configurations match this performance in less than half the time it takes to train the baseline itself; some reach baseline performance in only 37% of the time needed to train the baseline.

[7] Table 7 in the appendix shows the epoch at which every model matched the baseline's performance.

Although all models take less time to train than the baseline, Table 2 shows that many outperform it. For example, the best model—trained with subsequence length L = 128 until epoch 50—outperforms the baseline by 1.1 perplexity despite completing training in 87% of the time the baseline takes to do so. The model that trains with L = 128 until epoch 100 achieves similarly strong results (17.62 perplexity) and finishes training in 74% of the time it takes the baseline.[8]

[8] Table 8 in the appendix shows the total time it took to train each model.

These results are very robust to the choice of initial stage subsequence length and number of epochs. Table 2 shows that all models with an initial stage of L = 1,024 tokens or less that switch to the second stage at epoch 125 or before beat the baseline by a large margin at the end of training. Additionally, Appendix Table 6 shows that those models match the baseline's perplexity in at most 71% of the time it takes to train the baseline.

When we use nonoverlapping evaluation, the B&A baseline obtains 18.65 perplexity on the development set; our best model obtains 17.52. When we use sliding window evaluation (following Baevski & Auli, we use stride S = 512), our best model obtains 16.89 perplexity, a large improvement on the 17.92 B&A result in that setting. On the test set, using the same sliding window evaluation, our model obtains 17.56 perplexity, a substantial gain over the baseline's 18.70 test-set perplexity.

Appendix Table 10 shows that our best model uses almost five times less memory during the first stage than the baseline.

We also found that setting L to less than 3,072 tokens in the second stage degraded performance. (Appendix Table 9 shows staged training results with an initial stage length of 128 for 50 epochs, as in the best model, and varying lengths for the second stage. We found this to also be true for other initial stage lengths and epoch counts.) Unlike results in Table 1, where we show that models with L larger than 1,024 do not substantially improve token-by-token generation perplexity, models trained using staged training improve when given longer inputs (Appendix Table 9). Further, we explored using more than two stages (up to six), but this did not outperform our two-stage curriculum. Finally, Appendix A.5 shows that staged training substantially improves perplexity on the Toronto Book Corpus (Zhu et al., 2015).

5 Repositioning Position Embeddings

Sliding window inference substantially improves performance by increasing the minimum effective context window size. But it is very slow. We could solve this by letting the model attend to representations of the previous subsequence during inference on the current one. In this case, the same token representations would be used in different positions, since a token generated near the end of one subsequence would be cached and reused near the start of the next one. However, transformer model representations entangle positional and content information, so a cached token representation would encode an incorrect position when reused in a new position.

TransformerXL (Dai et al., 2019) uses relative position embeddings to solve this problem. However, that approach is slower and uses more parameters and memory than the baseline transformer.[9] We solve this using no extra parameters, memory, or runtime. We also show that our method can use much shorter input subsequences and still achieve superior performance.

[9] The self-attention coefficients between q queries and k keys in TransformerXL are the sum of two dot products of size q · k; the unmodified attention sublayer and our PIA method both compute only one dot product of size q · k. We also benchmarked the TransformerXL model using its publicly released code and found that their relative position embeddings slow inference by 22% and require 26% more parameters than their implementation of the unmodified self-attention sublayer.

Figure 2: Inputs to the self-attention sublayer, conventionally (left) and with position-infused attention (right), for L = 3, at timestep 3. The numbers denote the position embeddings.

Transformer Language Models The baseline transformer LM, given a token list T of length L and a tensor P containing the first L position embeddings, produces L next-token predictions using the following procedure:

1. Embed each token in T, producing tensor X.
2. Add the position embedding of each index to the token at that index: X = X + P.
3. Feed X through each transformer layer. The self-attention sublayer in each transformer layer is invoked as follows: self-attention(key=X, query=X, value=X).
4. Transform the outputs of the last transformer layer using the softmax layer, giving L next-token probability distributions.
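A compact PyTorch rendering of the four steps above (a sketch under our own naming and simplifications; residual connections, layer norm, and feedforward sublayers are omitted, and it is not the authors' code). The detail that matters for §5.1 is that the position embeddings P are added to X once, at the bottom, so every later representation entangles content and position.

```python
import torch
import torch.nn as nn

# Sketch (not the authors' code) of the baseline procedure, steps 1-4.
class BaselineLM(nn.Module):
    def __init__(self, vocab_size, d_model=1024, n_layers=16, n_heads=8, max_len=3072):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)              # step 1
        self.register_buffer("P", torch.randn(max_len, d_model))    # stand-in for the
                                                                    # sinusoidal table P
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers))
        self.out = nn.Linear(d_model, vocab_size)                   # softmax layer

    def forward(self, tokens):                                      # tokens: (B, L)
        L = tokens.size(1)
        x = self.embed(tokens) + self.P[:L]                         # step 2: X = X + P
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool,
                                       device=tokens.device), diagonal=1)
        for attn in self.attn:                                      # step 3:
            x, _ = attn(query=x, key=x, value=x, attn_mask=causal)  #   (key=X, query=X, value=X)
        return self.out(x)                                          # step 4: L distributions
```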
5.1 Position-Infused Attention (PIA)

We propose to let the model reuse previous outputs by making each output contain no explicit positional information. To do this, we modify the model so that it does not add position embeddings at the beginning of the computation (step 2), but rather adds them to the query and key vectors at each layer (but not to the value vectors). The outputs at each layer are the transformed, weighted sums of the value vectors, and, since the value vectors in our model do not contain explicit positional information, the outputs also do not. Formally, steps 1 and 4 do not change, step 2 is omitted, and step 3 is modified to invoke the self-attention sublayer as follows: self-attention(key=X+P, query=X+P, value=X).

Figure 2(b) depicts this method. Although PIA sublayer outputs contain no explicit positioning information, the attention mechanism can still compute position-dependent outputs because positional information is added to the query and key vectors. Our method is implementable in just a few lines of code.

5.2 PIA Enables Caching

In the unmodified transformer, to generate a string whose length exceeds L, it would have to be split into separate subsequences, and the model would be unable to attend to the previous subsequence when generating the current one. Using PIA, we can store and attend to representations of the previous subsequence since they no longer contain any explicit positioning information. Therefore, all our PIA models use a cache, where representations from the previous forward pass are stored and attended to in the next forward pass.

Caching makes generation faster. The complexity of the attention mechanism is O(q·k), where q is the number of queries (outputs) and k is the number of key-value pairs (inputs). To generate a sequence whose length exceeds L using token-by-token generation in the unmodified transformer (with subsequence length L), attention takes O(L²) time (since there are L queries and L keys). Using PIA and caching, we can reuse L − 1 of the previous outputs at every layer. Thus, our attention sublayer takes O(L) time (because now there is a single query and L keys).

Our approach is useful in scenarios where we need to evaluate or generate sequences that are longer than the model's subsequence length. Therefore, it would not be applicable to sequence-to-sequence tasks such as sentence-level translation, where sequence lengths are short.

Most language models, including B&A, train on their data as nonoverlapping subsequences. This means that training subsequences can be shuffled at each epoch and consumed in random order. However, when using PIA, we would like the cache to contain the previous subsequence. We therefore do not shuffle the data, making the cached subsequence the previously occurring one. Figure 1(c) depicts training with a cache that contains representations of the previous subsequence.
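The following sketch shows one position-infused attention sublayer with a cache (our illustration, under the same simplifications as the previous snippet; it is not the authors' implementation). Position embeddings are added only to the queries and keys, so the values, and hence the cached representations, carry no explicit positional information; the previous subsequence is simply prepended to the keys and values and, as in Figure 1(c), occupies the first positions.

```python
import torch
import torch.nn as nn

# Sketch (not the authors' code): one PIA self-attention sublayer with a cache.
class PIASelfAttention(nn.Module):
    def __init__(self, d_model=1024, n_heads=8, max_len=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.register_buffer("P", torch.randn(max_len, d_model))  # stand-in for the
                                                                  # sinusoidal table P
    def forward(self, x, cache=None):
        # x:     (B, L, d)  this layer's input for the L new tokens (no positions added)
        # cache: (B, L', d) this layer's input for the previous subsequence, saved
        #                   (detached) from the last forward pass
        B, L, d = x.shape
        if cache is None:
            cache = x.new_zeros(B, 0, d)
        kv = torch.cat([cache, x], dim=1)      # L' + L key/value entries
        total = kv.size(1)
        pos = self.P[:total]                   # cache gets P[0:L'], new tokens P[L':L'+L]
        q = x + pos[-L:]                       # step 3 becomes:
        k = kv + pos                           #   attention(key=X+P, query=X+P, value=X)
        v = kv                                 # values stay position-free
        # New token i may attend to every cached token and to new tokens 0..i.
        mask = torch.triu(torch.ones(L, total, dtype=torch.bool, device=x.device),
                          diagonal=1 + total - L)
        out, _ = self.attn(query=q, key=k, value=v, attn_mask=mask)
        return out  # position-free, so x can be cached and reused in the next pass
```

With a cache of L′ = L entries, a full pass scores L queries against 2L keys, and generating a single new token costs one query against at most 2L keys, which is the O(L) behaviour described above.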
5.3 Experiments

We use the experimental setup described in §2. The B&A baseline achieves 18.65 perplexity on the development set.

We train two additional baselines: the first uses PIA without caching, and the second uses caching but no PIA. If just PIA is used (without caching), performance degrades to 19.35 perplexity, but the model's speed and memory usage do not change. Using caching without PIA severely hurts performance, obtaining 41.59 perplexity.

Disabling data shuffling in the PIA-only model achieves similar performance to that model when it does use data shuffling, at 19.44 perplexity. Not shuffling the data is necessary for recurrent-style training that caches previously computed subsequence representations.

Our next experiments use the recurrent-style training of Dai et al. (2019), where we receive L new tokens at every training iteration and attend to L′ cached representations (of the subsequence of tokens that came immediately prior to the L new tokens). As before, we output L predictions at every training iteration. This means that the maximal and minimal effective context window sizes are L′ + L and L′ + 1, respectively. In all our models with PIA and caching, we set L′ = L because a manual exploration of different models where L′ ≠ L did not yield better results.

Table 3 compares the results of our models that use PIA and caching to the baseline on the WikiText-103 dev. set. Evaluation and generation speeds are shown in the nonoverlapping (N.o.) and sliding window (S.W., with stride S = 1) speed columns, respectively.[10] Unlike in the baseline, token-by-token evaluation in our model achieves the same perplexity as nonoverlapping evaluation, since in both cases the predictions for each input subsequence are conditioned not only on the current input, but also on the previous input, making the context window the same in both inference modes (in both cases, at every timestep, the context window is all tokens up to that timestep).

[10] Note that Baevski and Auli (2018) show that the baseline model can also achieve 17.92 during S.W. evaluation, when S = 512, with a speed of 2.5k tokens per second.

Subseq. Length    Train Speed ↑   PPL ↓           N.o. Speed ↑   S.W. Speed ↑
32                22.0k           20.53           2.0k           49
64                23.8k           19.07           4.1k           51
128               24.4k           18.37           7.9k           50
256               23.5k           17.92           12.8k          48
512               21.5k           17.85           14.5k          46
768               17.6k           18.16           13.8k          43
1024              16.6k           18.19           13.9k          39
1536              12.9k           19.11           7.9k           34
Baseline (3072)   13.9k           18.65 / 17.92   14.7k          5

Table 3: Dev. perplexity and speed for PIA models trained with different subsequence lengths (L). PIA models attend to L new and L cached tokens at each inference pass. N.o. is nonoverlapping eval.; S.W. is sliding window eval., where we always use S = 1 (token-by-token) here. The baseline is evaluated with both evaluation methods (its two perplexities are N.o. / S.W.). We measure speed in tok./sec. per GPU and use a batch size of 1 for inference.

Table 3 shows that as we increase subsequence length, perplexity improves, peaking at 512 before starting to degrade. Our best model obtains 17.85 perplexity, which is multiple standard deviations better than the baseline (18.65, N.o.). Table 5 in §6 shows a similar gain on the test set. The best model runs 1% slower than the baseline during N.o. eval. (since caching reduces the speed gain from smaller attention matrices in this mode). Table 10 (appendix) shows that it uses less than half of the memory the baseline does during training. Our best model trains 55% faster than the baseline.

Our best model, with subsequence length 512, has attention matrices of size 512 · 1,024 (since we have 512 queries—one per every new token—and 1,024 keys and 1,024 values—one per every new token and every cached token). In the baseline, all attention matrices are of size 3,072 · 3,072.
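For concreteness, the arithmetic behind these sizes (ours, not the paper's):

```python
# Sketch (not from the paper): attention score-matrix sizes from the paragraph above.
baseline_entries = 3072 * 3072        # L queries x L keys, recomputed every pass
pia_entries = 512 * 1024              # 512 new queries x (512 new + 512 cached) keys
print(baseline_entries / pia_entries) # 18.0 -> the cached model's matrices are 18x smaller

# Generating one new token, token by token:
print(3072 * 3072)   # baseline re-encodes the whole window: O(L^2) scores per layer
print(1 * 1024)      # PIA + caching: one query against the cached and new keys: O(L)
# The measured end-to-end speedup (about 9x, Table 3) is smaller because attention
# is only part of the per-token cost.
```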
Caching previously computed representations lets us do token-by-token generation efficiently when generating more than L tokens. Our model is nine times faster than the baseline at token-by-token generation even as it achieves better perplexity and uses much less memory (Tab. 3, col. 5).

PIA and caching also greatly improve perplexity on the Toronto Book Corpus; see A.5 in the appendix.

6 Shortformer Results

To assess whether the gains from staged training, PIA, and caching are additive, we take our best caching PIA model, with subsequence length 512, and apply staged training to it, training it with a subsequence length of between 32 and 256 for the first half of training.[11] Table 4 shows the results.

[11] We picked 50% of epochs as the length of the first stage since that produced near-optimal results at a fast speed in §4.

First Stage Subseq. Length        Train Speed ↑   PPL ↓
32                                21.6k           17.66
64                                22.6k           17.56
128                               22.9k           17.47
256                               22.5k           17.50
PIA + Cache w/o Staged Training   21.5k           17.85

Table 4: Dev. perplexity for models that use PIA, caching, and staged training (with a final subseq. length of 512). We measure speed in tok./sec. per GPU. Evaluation speed is the same for all models, at 14.5k tok./sec.

As in §4.2, where staged training was applied to the unmodified baseline, the results are very robust to the choice of initial stage subsequence length, with all the different choices improving perplexity over the model that does not use staged training. The best model (with initial subsequence length 128), which we call Shortformer, achieves 17.47 dev. set perplexity and trains 65% faster than the baseline. Since its attention matrices are of dimension 512 · 1,024 (the baseline's are 3,072 · 3,072), our model uses less memory (§A.4, appendix). It has the same number of parameters as the baseline.

Figure 3 (appendix) compares our best models using each method we presented (and their combination) to the baseline. It shows that combining caching, PIA, and staged training (Shortformer) yields the quickest training and best perplexity when using nonoverlapping evaluation. Evaluation speed is similar for all of these models.

Finally, Table 5 compares our best models on the test set of WikiText-103 to the state of the art.[12] Shortformer is almost twice as fast to train as the baseline and achieves superior results. Like the best model from §5.3, it is nine times faster than the baseline for token-by-token generation. Since it uses a cache, sliding window evaluation does not increase Shortformer's performance.

[12] We benchmarked speed, on V100 GPUs, for all models that had publicly available code.

Model             Param. ↓   Train Speed ↑   Mode   Inference Speed ↑   Test PPL ↓
Baseline          247M       13.9k           N.o.   14.7k               19.40
                                             S.W.   2.5k                18.70
TransformerXL*    257M       6.0k            N.o.   3.2k                18.30
Sandwich T.       247M       13.9k           S.W.   2.5k                17.96
Compressive T.    329M       –               N.o.   –                   17.1
Routing T.        –          –               N.o.   –                   15.8
kNN-LM**          247M       13.9k           S.W.   145                 15.79
PIA + Caching     247M       21.5k           N.o.   14.5k               18.55
Staged Training   247M       17.6k           S.W.   2.5k                17.56
Shortformer       247M       22.9k           N.o.   14.5k               18.15

Table 5: Comparison of our best models to other strong LMs (see text for citations and explanations), evaluating the WikiText-103 test set, where S = 512. We measure speed in tok./sec. per GPU, and use a batch size of 1 for inference. *TransformerXL runs on an older version of PyTorch, which might affect speed. **kNN-LM requires a 400GB datastore.
By training the baseline with staged training (and no PIA or caching), we obtain a model (our best model from §4.2) that, with sliding window evaluation, obtains even better results, but that model is much slower than Shortformer (Table 5, second-to-last row).

Shortformer outperforms the baseline's perplexity and performs within a standard deviation of the Sandwich Transformer (Press et al., 2020) and TransformerXL. It does not outperform the Compressive Transformer (Rae et al., 2020), Routing Transformer (Roy et al., 2020), and kNN-LM (Khandelwal et al., 2020), which make orthogonal improvements that can be applied to any language model, at the price of slower decoding. Combining them with our approach may yield further gains. These results are similar to those we obtain on the Toronto Book Corpus (§A.5 in the appendix).

7 Related Work

Staged Training Devlin et al. (2019) used a staged training routine for BERT by performing the first 90% of training on short subsequences (of length 128) before moving on to longer ones (of length 512). They use this method to speed training, but we show that it also improves perplexity, and we analyze different configurations of this method.

Many recent papers have explored improving transformer efficiency by reducing the quadratic cost of self-attention, motivated by scaling to longer sequences (Kitaev et al., 2020; Roy et al., 2020; Tay et al., 2020). We instead demonstrate improved results with shorter sequences, which naturally also improve efficiency.

One way to reduce transformer memory usage is to sparsify the attention matrix by letting the model attend only to a subset of nearby tokens at each timestep (Child et al., 2019; Beltagy et al., 2020; Roy et al., 2020). Training on shorter subsequence lengths is much more efficient: we use multiple, but much smaller, attention matrices. Since attention uses memory and computation in a way that scales quadratically with input size, splitting the inputs into multiple subsequences, each processed independently, lets us use less memory and run faster.

Like our method, Beltagy et al. (2020) attend at each timestep to a growing number of neighbors as training progresses, but they use five stages, which we found not to be superior to our two-stage method. The adaptive attention span model of Sukhbaatar et al. (2019) learns the maximum effective context window sizes for each head at each layer independently. As in our method, context window sizes are smaller at the start of training and lengthen as training progresses. We show that a simple approach of manually choosing two subsequence lengths is highly effective. In addition, keeping subsequence lengths equal across all heads and layers lets us save memory and runtime.

Position-Infused Attention TransformerXL (Dai et al., 2019) caches and attends to previous representations using an attention sublayer that uses relative positioning (Shaw et al., 2018). It runs much slower than the unmodified attention sublayer, requires extra parameters, and requires internally modifying the self-attention sublayer, while our PIA method (§5) does not. In parallel with our work, Ke et al. (2020) compute attention coefficients by summing two attention matrices, one based on position-position interactions and the other on content-content interactions. As in PIA, they do not add position embeddings at the bottom of the model. They present results only for BERT, which uses much smaller subsequences than our models.
8 Conclusion

Our results challenge the conventional wisdom that longer subsequences are always better. By first training on shorter subsequences and then progressing to longer ones via staged training, we improve perplexity and reduce training time. We additionally propose position-infused attention, which enables caching and efficiently attending to previous outputs; we show that models using this method do not require large input subsequences. We finally show that these two methods can be combined to produce a speedier and more accurate model.

Acknowledgments

We thank Tim Dettmers, Jungo Kasai, Gabriel Ilharco, Hao Peng, Sewon Min, Mandar Joshi, Omer Levy, Luke Zettlemoyer, Julian Michael, Edward Misback, Sofia Serrano, Nikolaos Pappas, Jesse Dodge, Myle Ott, and Sam Shleifer for their valuable feedback and fruitful discussions.

References

Alexei Baevski and Michael Auli. 2018. Adaptive input representations for neural language modeling. CoRR, abs/1809.10853.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 41–48, New York, NY, USA. Association for Computing Machinery.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. URL https://openai.com/blog/sparse-transformers.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Hakan Inan, Khashayar Khosravi, and Richard Socher. 2017. Tying word vectors and word classifiers: A loss framework for language modeling. In ICLR.

Guolin Ke, Di He, and Tie-Yan Liu. 2020. Rethinking positional encoding in language pre-training.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through Memorization: Nearest Neighbor Language Models. In International Conference on Learning Representations (ICLR).

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In International Conference on Learning Representations.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models.

Ofir Press, Noah A. Smith, and Omer Levy. 2020. Improving transformer models by reordering their sublayers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2996–3005, Online. Association for Computational Linguistics.

Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163, Valencia, Spain. Association for Computational Linguistics.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. 2020. Efficient content-based sparse attention with routing transformers.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468, New Orleans, Louisiana. Association for Computational Linguistics.

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive attention span in transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 331–335, Florence, Italy. Association for Computational Linguistics.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient transformers: A survey.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.

A Appendix

A.1 Additional Staged Training Results

Table 6 shows the time each staged training model needs to match baseline performance, as a fraction of the time it takes to train the baseline. The fastest three configurations each match the baseline's performance in just 37% of the time it takes to train the baseline. This result is very robust to hyperparameter changes, as all models trained with an initial subsequence length of between 64 and 512, that switch to the second stage at epoch 50 to 150, manage to match the baseline's performance in at most 59% of the time it takes to train it.
Initial Stage      Initial Stage Epochs
Subseq. Length     25     50     75     100    125    150    175    200
32                 0.60   0.53   0.51   0.52   0.61   –      –      –
64                 0.54   0.48   0.43   0.40   0.41   0.48   –      –
128                0.53   0.47   0.42   0.38   0.37   0.39   0.48   –
256                0.65   0.54   0.48   0.41   0.37   0.37   0.43   –
512                0.64   0.59   0.53   0.47   0.42   0.40   0.45   –
1024               0.71   0.63   0.56   0.50   0.46   0.44   0.51   0.59
1536               –      0.81   0.79   0.73   0.69   0.66   0.70   –

Table 6: Time needed to match baseline performance (dev. set, nonoverlapping eval.) as a fraction of the time needed to train the baseline (smaller is better). Models that never match the baseline are marked with –. All models have a subsequence length of 3,072 tokens at the end of training.

Initial Stage      Initial Stage Epochs
Subseq. Length     25     50     75     100    125    150    175    200
32                 136    135    143    158    190    –      –      –
64                 123    124    128    135    149    176    –      –
128                122    122    125    130    141    160    191    –
256                146    136    136    136    140    154    178    –
512                144    144    144    145    146    153    177    –
1024               155    149    145    142    144    151    176    202
1536               –      179    181    175    174    174    189    –

Table 7: Epoch at which each model matches the baseline. Some models never match the baseline; those cells are marked with –.

Initial Stage      Initial Stage Epochs
Subseq. Length     25     50     75     100    125    150    175    200
32                 0.94   0.87   0.81   0.75   0.68   0.62   0.56   0.49
64                 0.94   0.87   0.81   0.75   0.68   0.62   0.56   0.49
128                0.94   0.87   0.81   0.74   0.68   0.61   0.55   0.48
256                0.94   0.87   0.81   0.75   0.69   0.62   0.56   0.50
512                0.94   0.88   0.82   0.77   0.71   0.65   0.59   0.53
1024               0.95   0.90   0.85   0.80   0.75   0.70   0.66   0.61
1536               0.97   0.94   0.90   0.87   0.84   0.81   0.78   0.75

Table 8: Total time needed to train each model as a fraction of the time needed for baseline training.

A.2 Staged Training with Shorter Final Stage L

In Section 4, all models presented used staged training with a final input subsequence length L of 3,072 tokens. In Table 9, we show the results of training with a first stage with L = 128 for 50 epochs, and using varying subsequence lengths for the second stage. The best result is obtained when the second stage uses L = 3,072. In addition, in all of our other experiments (not presented here) with different L and epoch number values for the first stage, we observed that using L = 3,072 for the second stage always achieved the best perplexities.

Models trained with staged training and evaluated with a sliding window sometimes perform slightly worse when S is decreased, but this difference is much smaller than the standard deviation. The L = 1,536 and L = 3,072 models peaked at S = 512, and then as S was decreased perplexity started slightly degrading.[13] We conjecture that this is because of a train-test mismatch: the average effective context length during training is 3,072/2 = 1,536, so the model focuses on learning how to make predictions for the tokens in the center of the input, and does not perform as well when making predictions for tokens at the end of the input (which is what we use when using sliding window evaluation).

[13] Tables 7 and 8 show the epoch at which each model matched the baseline's performance and the total time it took to train each of our staged training models.
Final Subseq.   Inference PPL ↓
Length          Nonoverlapping   S.W., S = 512   S.W., S = 1
256             21.26            –               18.72
512             19.69            19.69           18.04
1024            18.64            17.60           17.58
1536            18.10            17.28           17.30
3072            17.52            16.89           17.01

Table 9: Inference perplexity for staged training models trained with an initial stage subsequence length of 128 for 50 epochs and varying second stage subsequence length L (for the second stage's 155 epochs). S is the stride. To see how these models perform without staged training, refer to Table 1.

A.3 Training Speed vs. Performance

Figure 3 compares the validation performance and training speed of the baseline to our models.

Figure 3: Dev. perplexity vs. training speed for the baseline and our best staged training model, our best PIA and caching model, and our best combined model (Shortformer). All models are evaluated using nonoverlapping evaluation.

A.4 Memory Usage

To understand how much memory our models and the baseline use during training, we find the largest batch size that we can load into memory for both our models and the baseline. Models that can simultaneously make more predictions are more memory efficient.

Model                       Max Batch Size ↑   Max Predictions ↑
Baseline                    2                  6,144
Staged Training, Stage 1    230                29,440
Staged Training, Stage 2    2                  6,144
PIA + Caching               26                 13,312
Shortformer, Stage 1        160                20,480
Shortformer, Stage 2        26                 13,312

Table 10: Memory usage of the baseline and our models during WikiText-103 training. For each model we show the maximal batch size that it could fit on one GPU at once during training. The max predictions column denotes the number of tokens predicted at each feedforward pass, which we calculate by multiplying the batch size by the number of predictions per subsequence (which is equivalent to L). We benchmarked all models on a V100 GPU with 32GB of memory. Note that the second stage in the staged training model matches the performance of the baseline model, because those architectures are identical. The same is true for the second stage of the Shortformer and the PIA + Caching model.

Table 10 shows the memory usage for the baseline model and our models. Since our Shortformer model has much smaller attention matrices, during training it can make more than double the next-token predictions that the baseline can in each feedforward pass.

During inference, we use a batch size of 1 throughout the paper, following Dai et al. (2019), and in our experiments, the PIA + Caching model, the final staged training model, and the baseline all use a similar amount of memory during nonoverlapping evaluation. During token-by-token inference, the maximum number of predictions for the baseline model is 7, whereas our model can fit a batch size of 39 (so 39 predictions are made during token-by-token inference), making our model more than 5 times more memory efficient than the baseline. Using a batch size of one is a realistic benchmarking scenario: in large models such as GPT-3, a batch size of one is used during inference.

A.5 Toronto Book Corpus

To verify that our results transfer to other datasets, we ran our models on the Toronto Book Corpus (TBC) (Zhu et al., 2015), a 700M-token collection of books that has previously been used in the training corpus of BERT (along with English Wikipedia). We use the same train/development/test split as Khandelwal et al. (2020) and Press et al. (2020), as well as their tokenization, which uses BERT's vocabulary of 29K BPE subwords. As in Khandelwal et al. (2020) and Press et al. (2020), since the vocabulary is much smaller than WikiText-103's, we use a tied word embedding and softmax matrix (Press and Wolf, 2017; Inan et al., 2017), instead of using the adaptive word embeddings (Baevski and Auli, 2018) as in the WikiText-103 models.
To fairly compare our models to the ones from Khandelwal et al. (2020) and Press et al. (2020), our initial set of experiments on the TBC use a maximum subsequence length of 1,024 (for staged training), train for 59 epochs, and for all other hyperparameters we use the same values as the ones we used for WikiText-103 (see the experimental setup in Section 2). In this setting, the baseline achieves a perplexity of 15.38 ± 0.39 (standard deviation) on the development set.

We do not tune the hyperparameters of our methods on the TBC; we simply use the same values as the best ones that we found on the WikiText-103 dataset. For staged training, our best model trained for 50/205 of the epochs with L = 128 and spent the rest of training with the same subsequence size as the baseline. For the TBC, we again trained the staged training model with L = 128 for the first 50/205 of training, and then moved on to L = 1,024, to match the Sandwich Transformer (Press et al., 2020) and kNN-LM (Khandelwal et al., 2020), which used 1,024 as the subsequence length. For the PIA + Caching model, we set L = 512, as we did for our best PIA + Caching model on the WikiText-103 dataset. For the Toronto Book Corpus Shortformer, we trained for the first half of training with L = 128 before moving on to training with L = 512, as in our WikiText-103 models (Section 6).

Model             Train Speed ↑   Mode   Inference Speed ↑   Dev. PPL ↓   Test PPL ↓
Baseline          24.0k           N.o.   19.2k               15.38        12.73
                                  S.W.   9.6k                14.75        11.89
kNN-LM*           24.0k           S.W.   –                   14.20        10.89
Sandwich T.       24.0k           S.W.   9.6k                –            10.83
PIA + Caching     20.5k           N.o.   15.0k               13.86        11.20
Staged Training   25.5k           N.o.   19.2k               13.81        11.18
                                  S.W.   9.6k                13.13        10.72
Shortformer       21.3k           N.o.   15.5k               13.40        10.88

Table 11: Comparison of our best models to other strong LMs trained on the Toronto Book Corpus (TBC). Following Khandelwal et al. (2020) and Press et al. (2020), for the baseline and our staged training model, we set L = 1,024 and S = 512 when using sliding window (S.W.) evaluation on the TBC dataset. All models have 261M parameters. *kNN-LM requires a 400GB datastore.

Table 11 shows that staged training and the Shortformer improve over the baseline by a wide margin and match the results of the Sandwich Transformer and the kNN-LM. As noted in Section 6, those contributions are orthogonal to ours, and combining them might yield further gains. Since in Table 11 the final stage of the staged training model (and the baseline) both have L = 1,024, Shortformer lacks a speed advantage in this scenario.

Table 12 shows results for our staged training model trained with a final stage subsequence length of 3,072 tokens, as in our WikiText-103 experiments in Section 4. This model trains faster than the L = 3,072 baseline and also achieves much better perplexity scores (the baseline in this setting achieves a perplexity of 14.52 ± 0.15 (standard deviation) on the development set). In addition, note that the Shortformer model from Table 11 achieves better perplexity than even the baseline with L = 3,072, although the Shortformer is much faster to train and uses much smaller attention matrices during inference (of size 512 · 1,024; the baseline has attention matrices of size 3,072 · 3,072, as in Section 6).

Model                         Train Speed ↑   Mode   Inference Speed ↑   Dev. PPL ↓   Test PPL ↓
Baseline (L = 3,072)          14.2k           N.o.   15.1k               14.52        11.69
                                              S.W.   2.5k                14.14        11.43
Staged Training (L = 3,072)   18.1k           N.o.   15.1k               13.19        10.76
                                              S.W.   2.5k                12.80        10.48

Table 12: Comparison of the staged training model to the baseline, when the subsequence length L is set to 3,072. In this table, S = 512.