CURRENT LIMITATIONS OF LANGUAGE MODELS: WHAT YOU NEED IS RETRIEVAL

arXiv:2009.06857v1 [cs.CL] 15 Sep 2020

Aran Komatsuzaki
Georgia Institute of Technology
EleutherAI
akomatsuzaki3@gatech.edu

ABSTRACT

We classify and re-examine some of the current approaches to improve the performance-computes trade-off of language models, including (1) non-causal models (such as masked language models), (2) extension of batch length with efficient attention, (3) recurrence, (4) conditional computation and (5) retrieval. We identify some limitations that (1)–(4) suffer from. For example, (1) currently struggles with open-ended text generation where the output is loosely constrained by the input, as well as with performing general textual tasks like GPT-2/3, due to its need for a specific fine-tuning dataset. (2) and (3) do not improve the prediction of the first ~10^3 tokens. Scaling up a model size (e.g., efficiently with (4)) still results in poor performance scaling for some tasks. We argue that (5) would resolve many of these limitations, as it can (a) reduce the amount of supervision and (b) efficiently extend the context over the entire training dataset and the entire past of the current sample. We speculate how to modify MARGE to perform unsupervised causal modeling that achieves (b) with the retriever jointly trained.

1 Classification of Recent Language Model Approaches

We consider some of the successful recent language model approaches to improve the performance-computes trade-off of language models (Kaplan et al., 2020) and classify them for the ease of our argument. The categories tend to be orthogonal to one another, so they can often be used in combination for further gain. Fig. 1 summarizes our classification. There are four major categories: non-causal models, extension of batch length with efficient attention, memory and retrieval. We note that batch length in this paper is defined to be the length of a minibatch across the timestep dimension.
We avoid calling it context length to be more specific. In this paper, the default modality of our consideration is text unless specified otherwise. Also, memory and retriever are defined to possess a certain property addressed later in this section, so their usage may differ from the usual one. Efficient attention also refers to non-attention alternatives with better complexity, such as variants of convolution, for the convenience of our argument. Non-causal models in this paper are defined to be models that predict a token using future information. Examples include various BERT variants, including BART and T5 (Devlin et al., 2018; Raffel et al., 2019; Yang et al., 2019; Liu et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020; Lewis et al., 2019; Hendrycks et al., 2020). On the other hand, causal models include the original Transformer encoder-decoder (Vaswani et al., 2017) and GPT-2/3. Though we usually mean the decoder-only model (the latter) when we say causal model, it should be clear from the context whether a causal model refers to the former, the latter or both. Most long-range language modeling approaches that rely on efficient attention without recurrence or memory, including Reformer (Kitaev et al., 2020), Routing Transformer (Roy et al., 2020), Big Bird (Zaheer et al., 2020) and many others (Beltagy et al., 2020; Tay et al., 2020; Sukhbaatar et al., 2019; Child et al., 2019), extend batch length with efficient attention without suffering from quadratic complexity in terms of length. Notably, approaches such as Transformer-XL (Dai et al., 2019) and Compressive Transformer (Rae et al., 2019) are not included in this category, since they are recurrent and their goal is to extend the context rather than batch length per se. It is reasonable to assume that simply extending batch length may not be enough to cover all the information needed for prediction.
To extend the context beyond batch length, one needs to resort to either memory or a retriever, with which the model can attend to stored relevant information without increased batch length. Essentially, these methods enable attention over the entire training dataset and the context of the current sample. In this paper, memory-based methods and retrieval-based methods refer to models equipped with memory and a retriever, respectively. Memory is defined to retrieve relevant information from its own dense tensor that encodes the training dataset and the context of the current sample, whereas a retriever is defined to retrieve relevant segments of data from the training dataset and the context of the current sample in the original, raw format. Notably, memory is more straightforward, since the parameters of a neural network can be considered a form of memory, as they are updated through backpropagation so as to incorporate the input information to improve prediction in subsequent iterations. We refer to this type of memory as implicit memory, and to any method of improving the capacity to store the input information in this way beyond the baseline as an implicit memory approach. On the other hand, if a model combines the input information with the stored activations from the previous iteration and uses the result in the next iteration's forward pass, the stored activations are called explicit memory. In other words, the memory used in recurrence at the level of a TBPTT segment (e.g. Transformer-XL) is referred to as explicit memory. It is also possible to store the activations and keep them as trainable parameters, which can be called hybrid memory; this shares the properties of both types.
For example, the explicit memory approach includes Transformer-XL, Compressive Transformer and many others (Yang et al., 2019; Burtsev & Sapunov, 2020), while the implicit memory approach includes simply increasing the parameter count and the conditional computation approach, such as Sparsely-Gated MoE (MoE) (Shazeer et al., 2017; Lepikhin et al., 2020), Product Key Memory (PKM) (Lample et al., 2019) and others (Dehghani et al., 2018; Lugosch et al., 2020). On the other hand, it is less clear how to design a good retriever, since the data to be retrieved is stored in a raw format, unlike for memory. Retrieval-based approaches can be distinguished by whether the retriever and the language model are trained jointly (end-to-end differentiably) or not, where the former refers to models such as MARGE, KIF (Fan et al., 2020), REALM (Guu et al., 2020) and CRISS (Tran et al., 2020), and the latter refers to the rest of the retrieval-based approaches. If they are not trained jointly, the retriever is usually either a tf-idf variant such as BM25 (Robertson & Zaragoza, 2009) (e.g. Wang & McAllester (2020)), DPR (Karpukhin et al., 2020) (e.g. Fusion-in-Decoder (Izacard & Grave, 2020), RAG (Lewis et al., 2020b)) or a pre-trained language model (e.g. knn-LM and BERT-kNN (Kassner & Schütze, 2020)). While many of these models have not been applied to unsupervised causal modeling, we will present a general idea for converting them to unsupervised causal modeling. For discussion of some relatively successful attempts to improve the performance-computes trade-off outside of the categories mentioned, the reader can refer to Appendix 6.1.

2 Non-Causal Models

For various tasks in NLP, non-causal models tend to show a better performance-computes trade-off than causal models. For example, T5 outperforms the causal baseline not only in discriminative tasks but also in generative tasks such as summarization and translation.
BART achieves lower perplexity than the causal baseline when fine-tuned on a conversational dataset. Furthermore, it was recently reported that XLNet achieves superior performance to the GPT-3 of the corresponding size on Winogrande with few-shot learning (Davison, 2020). UnifiedQA, a T5-like pre-trained model fine-tuned with a general QA dataset, is evaluated with few-shot learning examples a la GPT-3, and it performs nearly on par with the about 30 times larger GPT-3 (which is not fine-tuned) (Hendrycks et al., 2020). However, the current approach of non-causal models suffers from some limitations. Firstly, many real-life tasks are open-ended text generation with the output loosely constrained by the input, in a way similar to many of the prompt-output generation tasks a la GPT-2 and long-range unsupervised text generation. However, non-causal models are rarely evaluated on this sort of task.

Figure 1: Classification of recent language model approaches to improve the performance-computes trade-off. Blue indicates a major category. Orange indicates an example. Batch length in this paper is defined to be the length of a minibatch across the timestep dimension.

Figure 2: (Raffel et al., 2019). The performance of the baseline T5 with and without fine-tuning on various datasets. Pre-training is beneficial for all fine-tuning datasets except WMT EnFr, which is the largest dataset.

This practice unfairly favors a certain family of models, including masked language models, since they tend to perform poorly on such tasks compared with causal models. More details can be found in Appendix 6.2.1. While non-causal models excel at reading/encoder-intensive tasks, they perform poorly on some writing/decoder-intensive tasks. There needs to be more effort spent on seamlessly combining causal and non-causal models to achieve the best of both worlds. Secondly, it is unclear how to fine-tune a non-causal model to perform an arbitrary textual task as GPT-2/3 does, at least as efficiently as GPT-2/3.
Unlike GPT-2/3, non-causal models rely on the availability of a fine-tuning dataset relevant to a given task, which is usually not known beforehand. Hence, non-causal models are likely to perform poorly on tasks for which no fine-tuning dataset is available. This cannot be naively resolved by using a very general or large fine-tuning dataset such as WebText to deal with various tasks, since the specificity of the fine-tuning dataset is crucial for the power of fine-tuning. For example, Raffel et al. (2019) show that conventionally fine-tuned T5 substantially outperforms the baseline fine-tuned with all the fine-tuning datasets combined, which implies that the fine-tuning dataset needs to be specific. They also show that, when a fine-tuning dataset is large enough, the fine-tuned model degenerates into a causal model in terms of its performance, as the benefit of pre-training vanishes (Fig. 2). Thus, there needs to be more investigation into how non-causal models would perform on previously unknown tasks compared with their causal counterparts and how to improve the performance if necessary.

3 Extension of Batch Length With Efficient Attention

Before beginning this section, we have a few caveats. Since the modality of our primary concern in this paper is text, our conclusion regarding the approach of extending batch length with efficient attention would differ if the modality of interest is different, especially if it has a structure more exploitable with an inductive bias, such as image and video. In particular, we believe that such exploitable modalities will benefit substantially more from better efficient attention or related architectural modifications than text does. However, this is out of the scope of this paper.
There have been many recent works that propose a language model that simply replaces full attention with an efficient attention to improve the computational complexity, in the hope of extending the batch length, i.e., from O(L·d_model^2 + L^2·d_model) to O(L·d_model^2 + f(L)·d_model), where L is the batch length and f(L) is some function such that f(L) = o(L^2). Notable examples of f(L) include linear, log-linear and O(L^{3/2}). Since we have L·d_model^2 > L^{3/2}·d_model whenever L < d_model^2 ~ 10^6, in practice even f(L) = O(L^{3/2}) is sufficient to keep the contribution of attention from dominating that of the linear layers. Thus, the computational complexity of many efficient attention models is already empirically on par with that of the linear layers. In other words, one can no longer practically gain from improving the complexity of efficient attention any further. Most naturally available samples, as well as the reasonable outputs of most tasks, have rather limited length, though some (e.g. books) do not. For example, the average sample length of WebText is only about 1000 tokens, and likewise for the webpages in Common Crawl. Also, when batch length is extended from L (e.g. 1024) to L', this does not tend to result in substantial improvement in the loss of the first L tokens of a sample; Fig. 3 (left) demonstrates this. On the other hand, other approaches such as increasing the number of parameters (implicit memory) do improve the loss for early tokens, as shown in Fig. 3 (right). More detail on how each approach improves or does not improve the loss of the early (the first L) tokens of each sample can be found in Appendix 6.3.4. In particular, Appendix 6.3.4 concludes that any approach that satisfies a certain condition (including the retrieval and implicit memory approaches) and improves the loss does improve early token prediction.
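The crossover claimed above, namely that the linear-layer term L·d_model^2 dominates an O(L^{3/2}) attention term until L reaches d_model^2, can be checked numerically. The sketch below uses illustrative per-layer cost formulas and an illustrative width d_model = 1024; it is a back-of-the-envelope check, not a measurement of any specific model:

```python
# Compare the per-layer cost of the linear (feed-forward) term, L * d_model^2,
# with an O(L^{3/2}) efficient-attention term, L^{3/2} * d_model.
# The linear term dominates exactly while L < d_model^2.

def linear_cost(L: int, d_model: int) -> float:
    return L * d_model ** 2

def attention_cost(L: int, d_model: int) -> float:
    return L ** 1.5 * d_model

d_model = 1024  # illustrative width, so the crossover is at L = d_model^2 ~ 10^6
for L in (1024, 65536, 1048576):
    lin, att = linear_cost(L, d_model), attention_cost(L, d_model)
    # Prints True for L below the crossover; at L = d_model**2 the two are equal.
    print(L, lin > att)
```

This supports the point in the text: for realistic batch lengths the linear layers already dominate, so driving f(L) below O(L^{3/2}) buys little in practice.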
Since much text data has rather limited sample length, the gain from longer batch length due to efficient attention can be observed only on a small subset of the entire text data. Even for long samples, the prediction is poor for their early tokens, which results in poor generation at the beginning and cannot be mitigated by a better loss on later tokens. The approach of simply replacing full attention with efficient attention to extend the batch length has been successful to some extent. We summarize our observations about this approach in this section as well as in Appendix 6.3 (including the subsections not mentioned above) as follows:

1. In practice, one cannot improve the effective computational complexity of efficient attention models significantly further, due to the contribution from the linear layers.

2. Appendix 6.3.1 shows that improvement from extending the batch length would be limited at some point by the trade-off between model size and batch size that follows from Kaplan et al. (2020).

3. This approach is unlikely to result in any improvement for modeling the first L ~ 10^3 tokens of a sample, i.e., the entirety of many samples of our interest, which tend to be short, and the early tokens of long samples, whose suboptimal generation would also affect the generation of the later tokens.

4. Appendix 6.3.3 argues that it is possible to combine efficient attention with a retrieval-based approach to increase the k of the kNNs retrieved, for possibly further improvement of performance. However, it is uncertain whether further improvement in efficient attention would bring a notable gain for this use over existing efficient attention.

4 Memory-Based Approach

There are pros and cons to each of the explicit and implicit approaches. Explicit memory (recurrence) is more sensitive to the recent past, in contrast to implicit memory.
For example, the perplexity of a Transformer does not improve whether or not the minibatch of the previous iteration contains the immediate past of a sample in the current minibatch, while the performance of Transformer-XL (explicit memory) improves if its memory contains the activations of the immediate past of the current sample. This property is an advantage of explicit memory. However, explicit memory often cannot utilize information from the distant past, such as information from samples processed many iterations ago. While implicit memory (e.g. increasing the parameter count) can utilize such information by storing it in more parameters, as can be seen from the fact that it improves the loss of early tokens (Fig. 3), this is not the case for existing explicit memory approaches such as Transformer-XL. The issue that recurrent models cannot utilize information well beyond the range of batch length through recurrence has been known for a long time, and there has been little progress in this direction. Thus, many of our conclusions for the approach of extending batch length with efficient attention apply to explicit memory as well. One may counter the above argument by noting that self-attention is known to perform poorly on some tasks recurrence excels at. We believe this is not a valid counter-argument, as addressed in Appendix 6.4.1.

Figure 3: (Kaplan et al., 2020). (Left): n-th per-token loss for various n on models with various parameter counts and batch lengths. Training runs with shorter batch length L = 8 (dashed lines) perform better on early tokens, which implies that a model with longer batch length does not improve the loss of early tokens that have shorter context. (Right): n-th per-token loss for various n on models with various parameter counts with L = 1024. Even for early tokens, increasing the parameter count results in a significant improvement in their loss.
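Explicit memory of the Transformer-XL kind can be sketched as caching the previous segment's hidden states and letting the current segment attend over the concatenation of cache and current states. The numpy sketch below is a single-head toy illustration (names and shapes are assumptions, and the real model applies this per layer with no gradient flowing into the cache):

```python
import numpy as np

def attend_with_memory(h_cur, memory):
    """Single-head attention where queries come from the current segment and
    keys/values span [cached previous-segment states ; current segment]."""
    d_k = h_cur.shape[-1]
    kv = h_cur if memory is None else np.concatenate([memory, h_cur], axis=0)
    scores = h_cur @ kv.T / np.sqrt(d_k)          # (L_cur, L_mem + L_cur)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv                           # (L_cur, d)

rng = np.random.default_rng(0)
seg1 = rng.normal(size=(4, 8))   # previous segment's hidden states (the cache)
seg2 = rng.normal(size=(4, 8))   # current segment
out = attend_with_memory(seg2, memory=seg1)   # attends over 8 positions, not 4
```

The limitation discussed above is visible in the sketch: the cache only ever holds the immediately preceding segment, so information older than one recurrence step is gone unless it has been absorbed into the parameters (implicit memory).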
The implicit memory approach, by definition, aims to compress as much information as possible into the parameters in a way such that the model can retrieve it efficiently. Since having more parameters allows storing more information, conditional computation is advantageous here. The most successful conditional computation approach thus far is presumably MoE. Both MoE and PKM essentially expand the effective size of d_ff and sparsely access each unit, as discussed in Appendix 6.4.2. Obviously, there are other dimensions of the Transformer that can be expanded with conditional computation, including depth and heads. We can also regard, for example, Routing Transformer as a conditional computation approach that lets each head attend to some timesteps only, though we separate these categories in this paper for the ease of argument. Though we have tried to conditionally expand depth and heads, this has not improved the performance-computes trade-off thus far. There are many works that have tried to expand depth conditionally (Dehghani et al., 2018; Elbayad et al., 2019), but no such work has demonstrated a robust scaling and a substantial improvement in the performance-computes trade-off. We hope this problem will be resolved soon. Since conditional computation aims to save computes while adding parameters, parameter GPU memory may eventually become a bottleneck if the model size keeps increasing. However, L2L, proposed in Pudipeddi et al. (2020), reduces the parameter GPU memory from O(L) to O(1), where L is the number of layers, by keeping the parameters in CPU memory while maintaining comparable speed. Despite the success of conditional computation, the models that are made larger and more capacious with conditional computation do not necessarily outperform retrieval-based methods, as argued in Appendix 6.4.3. This motivates the use of a retrieval-based approach in conjunction with conditional computation. This section can be summarized as follows: 1.
The explicit memory approach suffers from many of the same problems as extending batch length with efficient attention. Hence, most of our conclusions in the previous section apply to this method as well.

2. Conditional computation, as a branch of the implicit memory approach, brings a dramatic improvement in the performance-computes trade-off, usually by efficiently approximating a Transformer with larger hyperparameters such as d_ff.

3. Given that vanilla models that spend more computes perform poorly at some tasks compared with substantially smaller retrieval-based models, a retrieval-based approach is also needed, though there is still room for conditional computation to bring more gains.

5 Retrieval-Based Approach

5.1 Retrieval as Self-supervision

Supervision such as few-shot learning and fine-tuning may not always be possible in general real-life tasks, since there are numerous, diverse tasks, and naively finding a set of relevant samples for each task accurately may not be feasible. This necessitates the use of self-supervision to reduce the required amount of supervision. The conventional process of constructing a dataset or feeding samples into a model can be thought of as a form of retrieval. Retrieval with a language model would partially automate this process and reduce the amount of supervision required to solve a given task. For example, CRISS, a retrieval-based model and the current state-of-the-art in unsupervised machine translation (even without back-translation) over various language directions, with substantial improvement over its baseline, retrieves the target for a given input sequence. The retriever is essentially trained to learn what humans would feed the model for a given input sequence. Due to the ubiquity of this procedure in machine learning, retrieval should be a crucial component.
5.2 Retrieval Resolves Many Limitations Other Approaches Suffer From

As stated at the end of the previous section and in Appendix 6.4.3, retrieval-based models, such as Fusion-in-Decoder and MARGE, often substantially outperform Transformer language models that either have a larger model size or spend more computes on various tasks, which implies that it is not sufficient to merely scale up a model; equipping it with a retriever further improves the performance of the language model. Hence, retrieval complements conditional computation in some tasks. Retrieval also improves early token prediction, as argued in Appendix 6.3.4. Retrieval may also help non-causal models perform general textual tasks through self-supervision. As argued previously, efficient attention and recurrence have a limited range of context, both within the same sample and outside of it. However, retrieval can achieve indefinite context length within the same sample and over the entire training dataset. For example, knn-LM achieves the state-of-the-art on Wikitext-103 for the amount of computes spent, thanks to its context over the entire training dataset enabled by kNN search with faiss. While there has been no demonstration of achieving indefinite context length within the same sample with retrieval, in principle this can be done similarly, as suggested later. Retrieval can also self-supervise the few-shot learning of GPT-3 by feeding retrieved relevant samples. In fact, our proposed modified MARGE presented below trains the retriever that retrieves few-shot learning examples jointly with the language model, so that both components can benefit from the improvement of each other.

5.3 MARGE and Its Significance

For retrieval-based models, it is more desirable to train the retriever and the language model jointly, in a way such that the objective of the retriever is to maximize the performance of the language model, since models with a separately trained retriever, such as knn-LM, cannot optimize the retriever for the language model's performance.
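knn-LM, mentioned above as an example of a separately trained retriever, interpolates the base model's next-token distribution with a distribution induced by nearest-neighbor search over cached (context embedding, next token) pairs. The following is a toy numpy sketch of that interpolation; the datastore, embeddings and the interpolation weight lam are illustrative stand-ins (the real system builds the datastore from a trained LM's hidden states and searches it with faiss):

```python
import numpy as np

def knnlm_interpolate(p_lm, query, keys, values, vocab_size, k=2, lam=0.5):
    """Mix the LM distribution with a kNN distribution built from the k closest
    datastore keys, weighted by a softmax over negative distances."""
    dists = np.linalg.norm(keys - query, axis=1)
    nn = np.argsort(dists)[:k]                      # indices of k nearest keys
    w = np.exp(-dists[nn] + dists[nn].min())
    w /= w.sum()
    p_knn = np.zeros(vocab_size)
    for weight, idx in zip(w, nn):
        p_knn[values[idx]] += weight                # mass on stored next tokens
    return lam * p_knn + (1 - lam) * p_lm

vocab_size = 5
p_lm = np.full(vocab_size, 0.2)                     # uniform base LM (toy)
keys = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])  # cached context vectors
values = np.array([3, 3, 1])                        # next tokens observed there
p = knnlm_interpolate(p_lm, np.array([0.1, 0.1]), keys, values, vocab_size)
```

Note that nothing in this pipeline trains the retriever toward the LM's objective, which is exactly the limitation that jointly trained models like MARGE address.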
There are various models with a jointly trained retriever, but we focus on MARGE below, as MARGE also incorporates the trait of Fusion-in-Decoder: an attention bias for efficiently finding the similarity between source segments and target segments. MARGE performs respectably even without fine-tuning or back-translation on unsupervised zero-shot machine translation with relatively small computes spent, not to mention that it also achieves state-of-the-art performance on some translation and summarization tasks with smaller computes spent. MARGE has an added attention bias set such that it becomes larger between a source segment and a target segment that are closer to each other in terms of the cosine similarity of their embeddings, and it computes the prediction loss of the target segments conditioned on the source segments. This design allows the gradient update to make the embedder embed a target segment and a source segment closer in terms of cosine similarity if the perplexity of the target segment conditioned on the source segment is lower. Thus, the model can find the source segments that give a lower perplexity to a given target segment, and simultaneously the language model conditioned on retrieved contexts can be trained. After many iterations, the clusters are no longer up-to-date with respect to the trained embedder. Hence, the embedder is used to find more up-to-date clusters of related segments using kNN search over the newly computed embeddings. We outline the notable aspects of MARGE, while skipping the details irrelevant to our subsequent discussion:

• We first divide each sample from the dataset into segments, each of which consists of N consecutive tokens, where N = 512 in the paper, and treat them as samples.

• We heuristically create a set of shards of segments, so that each segment is closely related to the others in the same shard.
• MARGE uses the conventional Transformer encoder-decoder architecture with the usual prediction loss on the decoder. The normalized output of the first half of the encoder layers at the first timestep is used as the embedding of each segment, for weight sharing.

• After every I iterations, where I = 10000 in the paper, we compute the embedding of each segment and the similarity between embeddings within each shard, and sample clusters of k-nearest neighbors for batch construction.

• Each batch consists of sub-batches, each of which consists of the same number of evidence (source) segments and target segments, which are similar to each other within the same shard and are processed by the encoder and the decoder, respectively.

• The encoder-decoder attention goes from every source segment to every target segment within the same sub-batch. The encoder-decoder attention from the source segment z_i to the target segment x_j is modified such that q·k → q·k + β·e(x_j)·e(z_i) for a query q, a key k, the embedder e and a trainable scalar β, with the softmax taken over all the tokens in the source segments within the same sub-batch.

5.4 Modified MARGE

To combine retrieval with unsupervised causal modeling, the general idea is that, as in Wang & McAllester (2020), the current context is used (embedded, in this case) to retrieve similar segments across the dataset and the past of the current sample, which are added to the context for the prediction. Unlike Wang & McAllester (2020), we would like to retrieve the relevant contexts multiple times per sample, so as to retrieve contexts more relevant to the most recently available context. Hence, when we divide a sample into multiple segments of length N (i.e., a sequence S_0, S_1, ..., S_t, ..., where S_t is the t-th segment of the sample in timestep order), we want to make the value of N reasonably small, e.g., N ≪ 512, but large enough that each segment makes sense by itself.
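The segmentation and causal retrieval step just described can be sketched as follows. Random vectors stand in for the trained embedder and brute-force cosine similarity stands in for faiss (both are illustrative assumptions); the only substantive point is that candidates are restricted to segments strictly before the current one, which preserves causality:

```python
import numpy as np

def split_into_segments(tokens, N):
    """Divide a token sequence into consecutive segments S_0, S_1, ... of length N."""
    return [tokens[i:i + N] for i in range(0, len(tokens) - N + 1, N)]

def retrieve_past_segments(embeddings, t, k):
    """Return indices of the k past segments most cosine-similar to segment t.
    Only segments with index < t are candidates, preserving causality."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e[:t] @ e[t]                       # cosine similarity to each past segment
    return [int(i) for i in np.argsort(sims)[::-1][:k]]

tokens = list(range(64))
segments = split_into_segments(tokens, N=8)   # 8 segments of 8 tokens each
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(segments), 16))    # stand-in for encoder embeddings
context_ids = retrieve_past_segments(emb, t=7, k=2)
assert all(i < 7 for i in context_ids)        # retrieved context is strictly past
```

At scale, the brute-force search over `e[:t]` would be replaced by a faiss index over the whole training corpus, as the text proposes.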
If S_t is a source segment, S_{t+1} is used as its target segment. Using faiss, this allows unsupervised causal modeling over the entire training dataset as well as the entire past of the current sample. Fig. 4 compares conventional unsupervised causal modeling and our modified MARGE.

Figure 4: Conventional unsupervised causal modeling (left) vs. our modified MARGE (right). Note that each block consists of multiple tokens. If a model tries to predict the tokens in the last two blocks (red), attention is paid to the past tokens in the red region as well as the tokens in context (green). The modified MARGE reads the immediate context (purple) to find the relevant parts across the dataset and the past of the current sample in order to use them as context.

While the original MARGE uses a heuristic exploiting the metadata of each document to collect similar segments into a shard, for the more general case without metadata, or to find similarity among samples with not necessarily similar metadata, the use of kNN search with faiss (Johnson et al., 2017) is preferred. At the beginning of training, we can let each sub-batch consist of the segments from the same sample (or from similar samples), since poorly initialized clusters are known to result in slow improvement. The model also needs to mask the encoder-decoder attention from a source segment to any target segment in its past within the same sample, in order to preserve causality.

Acknowledgments

I wish to record my deep sense of gratitude and profound thanks to Loren Lugosch, Madison May, Lukasz Kaiser and Phillip Wang for their helpful, inspiring, well-thought feedback.

References

Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The Long-Document Transformer. arXiv e-prints, art. arXiv:2004.05150, April 2020.

Mikhail S. Burtsev and Grigory V. Sapunov. Memory Transformer. arXiv e-prints, art. arXiv:2006.11527, June 2020.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever.
Generating Long Sequences with Sparse Transformers. arXiv e-prints, art. arXiv:1904.10509, April 2019.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. arXiv e-prints, art. arXiv:1901.02860, January 2019.

Joe Davison. Twitter post: Comparison of GPT-3 with XLNet with Few-Shot Learning. 2020. URL https://twitter.com/joeddav/status/1285238992724787200.

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal Transformers. arXiv e-prints, art. arXiv:1807.03819, July 2018.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv e-prints, October 2018.

Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-Adaptive Transformer. arXiv e-prints, art. arXiv:1910.10073, October 2019.

Angela Fan, Claire Gardent, Chloe Braud, and Antoine Bordes. Augmenting Transformers with KNN-Based Composite Memory for Dialogue. arXiv e-prints, art. arXiv:2004.12744, April 2020.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-Augmented Language Model Pre-Training. arXiv e-prints, art. arXiv:2002.08909, February 2020.

Michael Hahn. Theoretical Limitations of Self-Attention in Neural Sequence Models. arXiv e-prints, art. arXiv:1906.06755, June 2019.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Language Understanding. arXiv e-prints, art. arXiv:2009.03300, September 2020.

Gautier Izacard and Edouard Grave. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. arXiv e-prints, art. arXiv:2007.01282, July 2020.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. arXiv e-prints, art. arXiv:1702.08734, February 2017.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B.
Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models. arXiv e-prints, art. arXiv:2001.08361, January 2020.

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. arXiv e-prints, art. arXiv:2004.04906, April 2020.

Nora Kassner and Hinrich Schütze. BERT-kNN: Adding a kNN Search Component to Pretrained Language Models for Better QA. arXiv e-prints, art. arXiv:2005.00766, May 2020.

Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The Efficient Transformer. arXiv e-prints, art. arXiv:2001.04451, January 2020.

Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Large Memory Layers with Product Keys. arXiv e-prints, art. arXiv:1907.05242, July 2019.

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv e-prints, art. arXiv:2006.16668, June 2020.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv e-prints, art. arXiv:1910.13461, October 2019.

Mike Lewis, Marjan Ghazvininejad, Gargi Ghosh, Armen Aghajanyan, Sida Wang, and Luke Zettlemoyer. Pre-training via Paraphrasing. arXiv e-prints, art. arXiv:2006.15020, June 2020a.

Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv e-prints, art. arXiv:2005.11401, May 2020b.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv e-prints, art. arXiv:1907.11692, July 2019. Loren Lugosch, Derek Nowrouzezahrai, and Brett H. Meyer. Surprisal-Triggered Conditional Computation with Neural Networks. arXiv e-prints, art. arXiv:2006.01659, June 2020. Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. DeLighT: Very Deep and Light-weight Transformer. arXiv e-prints, art. arXiv:2008.00623, August 2020. Bharadwaj Pudipeddi, Maral Mesmakhosroshahi, Jinwen Xi, and Sujeeth Bharadwaj. Training Large Neural Networks with Constant Memory using a New Execution Algorithm. arXiv e-prints, art. arXiv:2002.05645, February 2020. 9 Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, and Timothy P. Lillicrap. Compressive Transformers for Long-Range Sequence Modelling. arXiv e-prints, art. arXiv:1911.05507, November 2019. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the Limits of Transfer Learning with a Unified Text-toText Transformer. arXiv e-prints, art. arXiv:1910.10683, Oct 2019. Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: Bm25 and beyond. Found. Trends Inf. Retr., 3(4):333389, April 2009. ISSN 1554-0669. doi: 10.1561/1500000019. URL https://doi.org/10.1561/1500000019. Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient Content-Based Sparse Attention with Routing Transformers. arXiv e-prints, art. arXiv:2003.05997, March 2020. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv e-prints, art. arXiv:1701.06538, Jan 2017. Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. 
Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. arXiv e-prints, art. arXiv:2009.01325, September 2020.
Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. Adaptive Attention Span in Transformers. arXiv e-prints, art. arXiv:1905.07799, May 2019.
Bowen Tan, Zichao Yang, Maruan Al-Shedivat, Eric P. Xing, and Zhiting Hu. Progressive Generation of Long Text. arXiv e-prints, art. arXiv:2006.15720, June 2020.
Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer: Rethinking Self-Attention in Transformer Models. arXiv e-prints, art. arXiv:2005.00743, May 2020.
Chau Tran, Yuqing Tang, Xian Li, and Jiatao Gu. Cross-lingual Retrieval for Iterative Self-Supervised Training. arXiv e-prints, art. arXiv:2006.09526, June 2020.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention Is All You Need. arXiv e-prints, June 2017.
Hai Wang and David McAllester. On-The-Fly Information Retrieval Augmentation for Language Models. arXiv e-prints, art. arXiv:2007.01528, July 2020.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv e-prints, art. arXiv:1906.08237, June 2019.
Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big Bird: Transformers for Longer Sequences. arXiv e-prints, art. arXiv:2007.14062, July 2020.

6 Appendix

6.1 Some Other Approaches

6.1.1 Learning to Summarize from Human Feedback (Stiennon et al., 2020)

This work achieves super-human-level summarization on the TL;DR dataset by training a reward function on human feedback and fine-tuning a pre-trained causal model with PPO.
This approach is very general, as it is applicable to any task whose outputs can be ranked and for which human evaluation is easily available. Since modeling human-generated content with MLE may be merely an imitation of human behavior and may not allow super-human performance, it is possible that this approach is more efficient for achieving super-human performance than MLE modeling. However, due to its need for a large amount of supervision, we do not investigate this approach further in our paper.

6.1.2 Progressive Generation of Long Text (Tan et al., 2020)

Progressive Generation of Long Text achieves substantially better (human-evaluated) text generation quality and diversity than GPT-2 by pre-training multiple Seq2Seq models, each of which "super-resolves" a text, ultimately from a set of keywords into a complete text. However, since the evaluation is based on fine-tuning on a domain-specific dataset, the same remark as for non-causal models applies: more investigation is needed into how it would cope with general textual tasks.

6.1.3 DeLighT: Very Deep and Light-weight Transformer (Mehta et al., 2020)

DeLighT is a Transformer language model each layer of which consists of a smaller feedforward network, a single-head self-attention and DExTra, a module consisting of many layers of thin group linear layers. It performs on par with a baseline Transformer language model with twice as many parameters. Since group linear layers are slow without a custom CUDA kernel, the model would become faster than the baseline for a given performance if a custom CUDA kernel were implemented for DExTra. However, since we often observe that a deep, thin model is slower than a shallow, wide model with the same parameter count on GPUs regardless of whether a custom CUDA kernel is used, we are not certain whether this would improve the performance-computes trade-off.
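The group linear layers at the core of DExTra can be sketched as follows. This is only a minimal illustration of the grouped transform (shapes, names and the block-diagonal check are our own assumptions), not the actual DeLighT implementation:

```python
import numpy as np

def group_linear(x, weights):
    """Grouped linear transform: the feature dimension is split into
    len(weights) groups, each mapped by its own small weight matrix.
    A dense d x d layer has d*d parameters; with g groups we only
    store g * (d/g)**2 = d*d/g parameters, hence "thin" layers."""
    g = len(weights)
    groups = np.split(x, g, axis=-1)                  # g chunks of size d // g
    outs = [chunk @ w for chunk, w in zip(groups, weights)]
    return np.concatenate(outs, axis=-1)

rng = np.random.default_rng(0)
d, g = 8, 4
x = rng.standard_normal((2, d))                       # batch of 2 vectors
weights = [rng.standard_normal((d // g, d // g)) for _ in range(g)]
y = group_linear(x, weights)
print(y.shape)  # (2, 8): same output shape, 1/g the parameters of a dense layer
```

A group linear layer is equivalent to a dense layer whose weight matrix is block-diagonal, which is why stacking many of them (with mixing in between, as DExTra does) is needed to recover full feature interaction.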
Furthermore, since the source of the improvement seems to stem from the use of DExTra, a block that has no interaction between different timesteps, we can assume that this module augments the feedforward network, likely without any effect on the attention layer. Hence, DExTra is in direct competition with MoE. While it is likely that they can be combined for further gain, the attraction of DExTra may diminish once MoE is also taken into consideration.

6.2 More Details of Sec. 2: Non-causal Models

6.2.1 Open-ended text generation with the output loosely constrained by the input

Let us consider the task of open-ended text generation in the sense that the input provided (e.g. a prompt) is short and not very informative, while the expected output is substantially longer and more informative, in a way similar to how GPT-2 generates an article given an input. Since we expect a model to have the same autonomy as humans, the model has to be able to generate without detailed specifications from humans. This is in contrast to many of the tasks non-causal models are evaluated on (BART being a notable exception), which are often either discriminative tasks or generative tasks whose input sequence contains substantial information about the output, such as summarization and translation. In fact, Lewis et al. (2019) show that BART and other BERT variants perform poorly relative to the causal model only on the causal generative modeling of the ELI5 dataset, and they argue that this is because the task on this dataset belongs to the aforementioned family of tasks.

6.3 More Details of Sec. 3: Extension of Batch Length With Efficient Attention

6.3.1 Improvement from extending the batch length would be limited at some point by the trade-off between model size and batch size

According to Kaplan et al.
(2020), the compute-optimal training strategy for a causal Transformer language model is to allocate most of the increase in available compute (C) to the parameter count (N) rather than the batch size (B) or the number of iterations (S):

N ∝ C^0.73, B ∝ C^0.24, S ∝ C^0.03.

Note that we have B = bL, where b is the number of sequences per minibatch. In practice, b has to be sufficiently large; otherwise, it is often observed that the resulting perplexity degrades due to a lack of diversity in the minibatch. Since the optimal B for even the largest model explored in Kaplan et al. (2020) is about several million tokens, it is reasonable to assume that L = B/b cannot be made arbitrarily large; it is possibly at most on the order of a million tokens. Thus, it is safe to argue that the improvement from extending the batch length would be limited by the trade-off between model size and batch size. However, a caveat follows. The aforementioned compute-optimal strategy is derived under the assumption that the batch length is fixed (L = 1024) and that WebText is used as the dataset. If L can be enlarged and a dataset with a longer average sample length is used, the scaling exponents for compute-optimal training should change, though given the importance of batch size to performance, it is reasonable to assume that the change is small.

6.3.2 Why does extension of batch length with efficient attention not improve the early token prediction?

To our assertion that extension of batch length does not improve the early token prediction, there is a good counter-argument: the loss on early tokens may be substantially improved if the batch length is extended from L ∼ 1024 to L′ ≫ L, so that numerous samples are covered and the model can utilize inter-sample information, in the manner of the implicit memory approach and the retrieval-based approach. As addressed above, however, it is very likely that L′ is bounded above (possibly at the order of a million).
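The bound on L can be illustrated with a back-of-the-envelope sketch. The exponents follow Kaplan et al. (2020); the minimum number of sequences per minibatch, b, is our own illustrative assumption:

```python
# Compute-optimal scaling exponents from Kaplan et al. (2020):
#   N (params) ~ C^0.73, B (batch tokens) ~ C^0.24, S (steps) ~ C^0.03.
def optimal_allocation(compute_ratio):
    """Relative growth of each factor when available compute grows
    by compute_ratio (e.g. 1000x)."""
    return {
        "params": compute_ratio ** 0.73,
        "batch_tokens": compute_ratio ** 0.24,
        "steps": compute_ratio ** 0.03,
    }

alloc = optimal_allocation(1000.0)  # compute grows 1000x; params grow fastest

# Bound on batch length L = B / b: if the optimal batch size B is a few
# million tokens and minibatch diversity requires at least b sequences
# (illustrative assumption: b = 4), L is at most on the order of a million.
B = 4_000_000   # tokens per batch, order of magnitude from the text
b_min = 4       # assumed minimum sequences per minibatch
L_max = B // b_min
print(L_max)    # 1000000
```

The point of the sketch is that most of the extra compute goes to parameters, so B (and hence L = B/b) grows only slowly with compute, keeping L bounded in practice.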
Hence, if samples are ordered randomly, the samples that fall within the same context are unlikely to be near neighbors of each other. Therefore, the performance gain from inter-sample information is likely to be negligible compared with the gain from a retrieval-based approach, which can actually find the k-nearest neighbors.

6.3.3 Combination of retrieval-based approach with the approach of extending batch length with efficient attention

It is possible to combine efficient attention with a retrieval-based approach in order to increase the k of the kNN samples used for prediction with a retriever, since a larger k notably results in better performance up to some point, according to Izacard & Grave (2020). However, there are two caveats. First, existing efficient attention may already be efficient enough that any further improvement of the method may not be beneficial for this use, as the empirical complexity of efficient attention would not be improved further according to Sec. 3. Second, the design of such efficient attention may depend on how the retrieval-based model is designed. For example, MARGE embeds each sample so that the samples with the closest embeddings can attend to each other through attention. In this case, instead of applying a method such as Routing Transformer, one can simply approximate the full attention with a top-k′ approximation at the sample level for some reasonable k′, which does not require any sophisticated efficient attention method.

6.3.4 How to assess if a given method improves early token prediction

In this paper, we define "the improvement of early token prediction" to be the improvement of the loss on the first L tokens of each sample for a small enough L. We set L ∼ 1024, since the average length of a sample in the currently used gigantic text datasets is about 1000 tokens (as they are subsets of Common Crawl) and since setting L ∼ d_model is a reasonable choice for making the full-attention computes comparable to those of the linear layers (as d_model ∼ 1024).
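The sample-level top-k′ approximation mentioned in Sec. 6.3.3 can be sketched as follows. This is a simplified illustration with random embeddings and cosine similarity, not MARGE's actual retrieval mechanism:

```python
import numpy as np

def topk_samples(query_emb, sample_embs, k):
    """Select the k samples whose embeddings are most similar to the
    query (cosine similarity); only these would enter full attention,
    instead of attending over all candidate samples."""
    q = query_emb / np.linalg.norm(query_emb)
    S = sample_embs / np.linalg.norm(sample_embs, axis=1, keepdims=True)
    sims = S @ q
    top = np.argsort(-sims)[:k]      # indices of the k nearest samples
    return top, sims[top]

rng = np.random.default_rng(0)
query = rng.standard_normal(16)               # embedding of the current sample
corpus = rng.standard_normal((1000, 16))      # embeddings of 1000 candidates
idx, scores = topk_samples(query, corpus, k=8)
print(idx.shape)  # (8,)
```

Because the selection happens at the sample level, the token-level attention over the selected samples can remain ordinary full attention, which is the point made above: no sophisticated sparse-attention machinery is required.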
The most straightforward way to verify whether a given model improves early token prediction is to compare the n-th per-token loss of early tokens for various n, which we have performed on various models and which led to our conclusion. However, in some cases we can also demonstrate that a given method improves early token prediction without comparing per-token losses. Models without recurrence tend to divide each sample into segments, each of length L, and they tend to be evaluated on a segment without being conditioned on the past tokens outside of the segment within the same sample. In this case, each segment is essentially treated as a sample, due to the lack of dependence between segments. Hence, the usual loss on the test dataset is equal to the loss on the first L tokens of each sample. In other words, for a model without recurrence, if the segment length of the model is equal to that of its baseline, it suffices to compare the loss of the model with that of the baseline to see whether the model improves early token prediction, since the "early tokens" constitute the entire evaluated segment. For example, kNN-LM and its baseline are the same vanilla Transformer, except that the former is also conditioned on the distribution of the relevant tokens from the training dataset. The computes spent on the former are only marginally larger than those of the latter, while the improvement in loss is substantial; hence, kNN-LM improves early token prediction for the given computes. A similar conclusion can be drawn for GPT-2 with a retriever as in Wang & McAllester (2020), an augmented GPT-2 fine-tuned to predict conditioned on the sentences fed by a retriever. Since the fine-tuning computes are marginal relative to those of pre-training and since the improvement in performance is large enough, we can argue that it, too, improves early token prediction for the given computes.
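The per-token loss comparison described above can be sketched as follows, using toy loss arrays; a real comparison would use the models' actual per-position log-likelihoods, and the function names here are our own:

```python
import numpy as np

def per_position_loss(token_losses):
    """Average loss at each position n across samples.
    token_losses: array of shape (num_samples, seq_len)."""
    return token_losses.mean(axis=0)

def improves_early_tokens(model_losses, baseline_losses, n_early=1024):
    """True if the model's mean loss on the first n_early tokens is lower
    than the baseline's (the criterion defined in Sec. 6.3.4)."""
    m = per_position_loss(model_losses)[:n_early].mean()
    b = per_position_loss(baseline_losses)[:n_early].mean()
    return m < b

rng = np.random.default_rng(0)
n_samples, seq_len = 32, 2048
baseline = 3.0 + 0.1 * rng.standard_normal((n_samples, seq_len))
model = baseline - 0.2   # toy model: uniformly 0.2 nats better everywhere
print(improves_early_tokens(model, baseline))  # True
```

For a non-recurrent model whose evaluated segment length equals L, this early-token criterion reduces to comparing the ordinary test losses, which is the shortcut used in the kNN-LM argument above.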
As discussed later, many implicit memory models are the same as their baselines except for being an efficient approximation to a Transformer with a larger model size. Hence, such implicit memory models that improve the performance-computes trade-off, including MoE and PKM, also improve early token prediction for the given computes.

6.4 More Details of Sec. 4: Memory-Based Approach

6.4.1 The tasks recurrence excels at and self-attention struggles with

Self-attention is known to perform poorly on some tasks that recurrence excels at. For example, Hahn (2019) theoretically shows that the former struggles with predicting the parity of a given long bitstring, which the latter excels at. However, TBPTT-segment-wise recurrence (an explicit memory approach) with self-attention usually needs a long enough batch length (e.g. 128 tokens) to make its performance-computes trade-off reasonably high on various tasks of practical interest. With this batch length, the local behavior of an explicit memory model should be identical to that of self-attention. Hence, it should struggle just as self-attention does. Furthermore, the tasks self-attention struggles with and recurrence excels at tend to be difficult even for humans. For example, if humans, without being provided any instruction, are forced to find a pattern in an extremely long bitstring in order to figure out that the problem asks for the parity, they too would struggle with the task.

6.4.2 More details about MoE and PKM

Let us consider a feedforward layer with a very large d_ff. MoE can imitate this by dividing the layer into many feedforward layers with smaller d_ff, each of which is an expert of the MoE, while the gating of the MoE controls which expert is activated for a given input vector.
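The way MoE imitates a wide feedforward layer can be sketched as follows. This is a minimal hard top-1 routing illustration (without the load-balancing and noisy gating of Shazeer et al. (2017)), with shapes of our own choosing:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def moe_ffn(x, experts, gate_w):
    """Top-1 mixture-of-experts feedforward layer.
    Each expert is a small (W1, W2) two-layer FFN; the gate picks one
    expert per input vector, so only d_ff / num_experts hidden units are
    computed per token instead of all d_ff."""
    logits = x @ gate_w                       # (batch, num_experts)
    choice = logits.argmax(axis=-1)           # hard top-1 routing
    out = np.empty_like(x)
    for e, (W1, W2) in enumerate(experts):
        mask = choice == e
        out[mask] = relu(x[mask] @ W1) @ W2   # only the chosen expert runs
    return out

rng = np.random.default_rng(0)
d_model, d_ff, n_exp = 16, 64, 4
experts = [(rng.standard_normal((d_model, d_ff // n_exp)),
            rng.standard_normal((d_ff // n_exp, d_model))) for _ in range(n_exp)]
gate_w = rng.standard_normal((d_model, n_exp))
x = rng.standard_normal((8, d_model))
y = moe_ffn(x, experts, gate_w)
print(y.shape)  # (8, 16)
```

The total parameter count matches a dense layer of width d_ff, but each token only pays the compute of a layer of width d_ff / n_exp, which is the approximation-to-a-larger-Transformer view taken in the text.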
On the other hand, PKM can imitate a large feedforward network by exploiting the following structural sparsity of the feedforward layer that we have observed: masking every unit but the top-k of the d_ff units to zero does not substantially affect the training of the model, even if k ≪ d_model (we observed that setting k = 10 did not affect the perplexity at all in some cases). Hence, PKM tries to predict which units are the top-k units for a given input vector with its product-key mechanism and efficiently computes the output with a fused gather-matmul, which in practice results in slower processing than MoE due to a memory bottleneck. Scaling up the model size by evenly increasing d_ff, d_model, h (the number of heads) and depth, without increasing d_k (the dimension of each head), tends to beat other approaches. This is presumably why GPT-2 and GPT-3 were scaled up in a similar manner. Let us call this the optimal model scaling. Hence, increasing only d_ff for scaling is suboptimal, and it is likely that the scaling law with respect to parameter count under this scaling does not hold as long as that of the optimal scaling, not to mention that its power-law exponent is likely worse. However, the gain from MoE and PKM is still substantial.

Figure 5: (Izacard & Grave, 2020). Comparison of the state-of-the-art QA models.

6.4.3 Limitations of conditional computation

The more successful conditional computation methods, such as MoE, are mostly an efficient approximation to a Transformer architecture with larger model hyperparameters, such as d_ff. Hence, their performance is, at best, equal to that of the larger Transformer. However, scaling up model hyperparameters alone does not seem to resolve all the problems. Fig. 5 shows that T5 performs significantly worse than some open-domain QA models despite the substantially larger computes spent on its training. Likewise, Lewis et al.
(2020a) show that MARGE outperforms various baselines that consume substantially more computes and use more task-specific human intervention. Thus, we may need not only conditional computation but also retrieval-based approaches in order to further improve the performance-computes trade-off.