arXiv:2501.09223v1 [cs.CL] 16 Jan 2025

Foundations of Large Language Models

Tong Xiao and Jingbo Zhu

January 17, 2025

NLP Lab, Northeastern University & NiuTrans Research

Copyright © 2021-2025 Tong Xiao and Jingbo Zhu
Natural Language Processing Lab, Northeastern University & NiuTrans Research

Licensed under the Creative Commons Attribution-NonCommercial 4.0 Unported License (the “License”). You may not use this file except in compliance with the License. You may obtain a copy of the License at http://creativecommons.org/licenses/by-nc/4.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Preface

Large language models originated from natural language processing, but they have undoubtedly become one of the most revolutionary technological advancements in the field of artificial intelligence in recent years. An important insight brought by large language models is that knowledge of the world and languages can be acquired through large-scale language modeling tasks, and in this way, we can create a universal model that handles diverse problems. This discovery has profoundly impacted the research methodologies in natural language processing and many related disciplines. We have shifted from training specialized systems from scratch using a large amount of labeled data to a new paradigm of using large-scale pre-training to obtain foundation models, which are then fine-tuned, aligned, and prompted.

This book aims to outline the basic concepts of large language models and introduce the related techniques. As the title suggests, the book focuses more on the foundational aspects of large language models rather than providing comprehensive coverage of all cutting-edge methods. The book consists of four chapters:

• Chapter 1 introduces the basics of pre-training. This is the foundation of large language models, and common pre-training methods and model architectures will be discussed here.

• Chapter 2 introduces generative models, which are the large language models we commonly refer to today. After presenting the basic process of building these models, we will also explore how to scale up model training and handle long texts.

• Chapter 3 introduces prompting methods for large language models. We will discuss various prompting strategies, along with more advanced methods such as chain-of-thought reasoning and automatic prompt design.

• Chapter 4 introduces alignment methods for large language models. This chapter focuses on instruction fine-tuning and alignment based on human feedback.

If readers have some background in machine learning and natural language processing, along with a certain understanding of neural networks like Transformers, reading this book will be quite easy. However, even without this prior knowledge, it is still perfectly fine, as we have made the content of each chapter as self-contained as possible, ensuring that readers will not be burdened with too much reading difficulty.

In writing this book, we have gradually realized that it is more like a compilation of "notes" we have taken while learning about large language models. Through this note-taking writing style, we hope to offer readers a flexible learning path.
Whether they wish to dive deep into a specific area or gain a comprehensive understanding of large language models, they will find the knowledge and insights they need within these "notes".

We would like to thank the students in our laboratory and all our friends who have shared with us their views on large language models and helped with corrections of errors in writing. In particular, we wish to thank Weiqiao Shan, Yongyu Mu, Chenglong Wang, Kaiyan Chang, Yuchun Fan, Hang Zhou, Xinyu Liu, Huiwen Bao, Tong Zheng, Junhao Ruan, and Qing Yang.

Notation

a                a variable
a (boldface)     a row vector or matrix
f(a)             function of a
max f(a)         maximum value of f(a)
argmax_a f(a)    value of a that maximizes f(a)
x                input token sequence to a model
xj               input token at position j
y                output token sequence produced by a model
yi               output token at position i
θ                model parameters
Pr(a)            probability of a
Pr(a|b)          conditional probability of a given b
Pr(·|b)          probability distribution of a variable given b
Prθ(a)           probability of a as parameterized by θ
ht               hidden state at time step t in sequential models
H                matrix of all hidden states over time in a sequence
Q, K, V          query, key, and value matrices in attention mechanisms
Softmax(A)       Softmax function that normalizes the input vector or matrix A
L                loss function
D                dataset used for training or fine-tuning a model
∂L/∂θ            gradient of the loss function L with respect to the parameters θ
KL(p || q)       KL divergence between distributions p and q

Contents

1 Pre-training
  1.1 Pre-training NLP Models
    1.1.1 Unsupervised, Supervised and Self-supervised Pre-training
    1.1.2 Adapting Pre-trained Models
  1.2 Self-supervised Pre-training Tasks
    1.2.1 Decoder-only Pre-training
    1.2.2 Encoder-only Pre-training
    1.2.3 Encoder-Decoder Pre-training
    1.2.4 Comparison of Pre-training Tasks
  1.3 Example: BERT
    1.3.1 The Standard Model
    1.3.2 More Training and Larger Models
    1.3.3 More Efficient Models
    1.3.4 Multi-lingual Models
  1.4 Applying BERT Models
  1.5 Summary

2 Generative Models
  2.1 A Brief Introduction to LLMs
    2.1.1 Decoder-only Transformers
    2.1.2 Training LLMs
    2.1.3 Fine-tuning LLMs
    2.1.4 Aligning LLMs with the World
    2.1.5 Prompting LLMs
  2.2 Training at Scale
    2.2.1 Data Preparation
    2.2.2 Model Modifications
    2.2.3 Distributed Training
    2.2.4 Scaling Laws
  2.3 Long Sequence Modeling
    2.3.1 Optimization from HPC Perspectives
    2.3.2 Efficient Architectures
    2.3.3 Cache and Memory
    2.3.4 Sharing across Heads and Layers
    2.3.5 Position Extrapolation and Interpolation
    2.3.6 Remarks
  2.4 Summary

3 Prompting
  3.1 General Prompt Design
    3.1.1 Basics
    3.1.2 In-context Learning
    3.1.3 Prompt Engineering Strategies
    3.1.4 More Examples
  3.2 Advanced Prompting Methods
    3.2.1 Chain of Thought
    3.2.2 Problem Decomposition
    3.2.3 Self-refinement
    3.2.4 Ensembling
    3.2.5 RAG and Tool Use
  3.3 Learning to Prompt
    3.3.1 Prompt Optimization
    3.3.2 Soft Prompts
    3.3.3 Prompt Length Reduction
  3.4 Summary

4 Alignment
  4.1 An Overview of LLM Alignment
  4.2 Instruction Alignment
    4.2.1 Supervised Fine-tuning
    4.2.2 Fine-tuning Data Acquisition
    4.2.3 Fine-tuning with Less Data
    4.2.4 Instruction Generalization
    4.2.5 Using Weak Models to Improve Strong Models
  4.3 Human Preference Alignment: RLHF
    4.3.1 Basics of Reinforcement Learning
    4.3.2 Training Reward Models
    4.3.3 Training LLMs
  4.4 Improved Human Preference Alignment
    4.4.1 Better Reward Modeling
    4.4.2 Direct Preference Optimization
    4.4.3 Automatic Preference Data Generation
    4.4.4 Step-by-step Alignment
    4.4.5 Inference-time Alignment
  4.5 Summary

Bibliography

Chapter 1

Pre-training

The development of neural sequence models, such as Transformers [Vaswani et al., 2017], along with the improvements in large-scale self-supervised learning, has opened the door to universal language understanding and generation. This achievement is largely motivated by pre-training: we separate common components from many neural network-based systems, and then train them on huge amounts of unlabeled data using self-supervision. These pre-trained models serve as foundation models that can be easily adapted to different tasks via fine-tuning or prompting. As a result, the paradigm of NLP has been enormously changed. In many cases, large-scale supervised learning for specific tasks is no longer required, and instead, we only need to adapt pre-trained foundation models.

While pre-training has gained popularity in recent NLP research, this concept dates back decades to the early days of deep learning. For example, early attempts to pre-train deep learning systems include unsupervised learning for RNNs, deep feedforward networks, autoencoders, and others [Schmidhuber, 2015]. In the modern era of deep learning, we experienced a resurgence of pre-training, caused in part by the large-scale unsupervised learning of various word embedding models [Mikolov et al., 2013b; Pennington et al., 2014]. During the same period, pre-training also attracted significant interest in computer vision, where the backbone models were trained on relatively large labeled datasets such as ImageNet, and then applied to different downstream tasks [He et al., 2019; Zoph et al., 2020].

Large-scale research on pre-training in NLP began with the development of language models using self-supervised learning. This family of models covers several well-known examples like BERT [Devlin et al., 2019] and GPT [Brown et al., 2020], all with a similar idea that general language understanding and generation can be achieved by training the models to predict masked words in a huge amount of text. Despite the simple nature of this approach, the resulting models show remarkable capability in modeling linguistic structure, though they are not explicitly trained to achieve this. The generality of the pre-training tasks leads to systems that exhibit strong performance in a large variety of NLP problems, even outperforming previously well-developed supervised systems. More recently, pre-trained large language models have achieved a greater success, showing the exciting prospects for more general artificial intelligence [Bubeck et al., 2023].

This chapter discusses the concept of pre-training in the context of NLP. It begins with a general introduction to pre-training methods and their applications. BERT is then used as an example to illustrate how a sequence model is trained via a self-supervised task, called masked language modeling. This is followed by a discussion of methods for adapting pre-trained sequence models for various NLP tasks. Note that in this chapter, we will focus primarily on the pre-training paradigm in NLP, and therefore, we do not intend to cover details about generative large language models.
A detailed discussion of these models will be left to subsequent chapters. 1.1 Pre-training NLP Models The discussion of pre-training issues in NLP typically involves two types of problems: sequence modeling (or sequence encoding) and sequence generation. While these problems have different 1 Pre-training 2 forms, for simplicity, we describe them using a single model defined as follows: o = g(x0 , x1 , ..., xm ; θ) = gθ (x0 , x1 , ..., xm ) (1.1) where {x0 , x1 , ..., xm } denotes a sequence of input tokens1 , x0 denotes a special symbol (hsi or [CLS]) attached to the beginning of a sequence, g(·; θ) (also written as gθ (·)) denotes a neural network with parameters θ, and o denotes the output of the neural network. Different problems can vary based on the form of the output o. For example, in token prediction problems (as in language modeling), o is a distribution over a vocabulary; in sequence encoding problems, o is a representation of the input sequence, often expressed as a real-valued vector sequence. There are two fundamental issues here. • Optimizing θ on a pre-training task. Unlike standard learning problems in NLP, pre-training does not assume specific downstream tasks to which the model will be applied. Instead, the goal is to train a model that can generalize across various tasks. • Applying the pre-trained model gθ̂ (·) to downstream tasks. To adapt the model to these tasks, we need to adjust the parameters θ̂ slightly using labeled data or prompt the model with task descriptions. In this section, we discuss the basic ideas in addressing these issues. 1.1.1 Unsupervised, Supervised and Self-supervised Pre-training In deep learning, pre-training refers to the process of optimizing a neural network before it is further trained/tuned and applied to the tasks of interest. This approach is based on an assumption that a model pre-trained on one task can be adapted to perform another task. As a result, we do not need to train a deep, complex neural network from scratch on tasks with limited labeled data. Instead, we can make use of tasks where supervision signals are easier to obtain. This reduces the reliance on task-specific labeled data, enabling the development of more general models that are not confined to particular problems. During the resurgence of neural networks through deep learning, many early attempts to achieve pre-training were focused on unsupervised learning. In these methods, the parameters of a neural network are optimized using a criterion that is not directly related to specific tasks. For example, we can minimize the reconstruction cross-entropy of the input vector for each layer [Bengio et al., 2006]. Unsupervised pre-training is commonly employed as a preliminary step before supervised learning, offering several advantages, such as aiding in the discovery of better local minima and adding a regularization effect to the training process [Erhan et al., 2010]. These benefits make the subsequent supervised learning phase easier and more stable. A second approach to pre-training is to pre-train a neural network on supervised learning tasks. For example, consider a sequence model designed to encode input sequences into some 1 Here we assume that tokens are basic units of text that are separated through tokenization. Sometimes, we will use the terms token and word interchangeably, though they have closely related but slightly different meanings in NLP. 1.1 Pre-training NLP Models 3 representations. 
In pre-training, this model is combined with a classification layer to form a classification system. This system is then trained on a pre-training task, such as classifying sentences based on sentiment (e.g., determining if a sentence conveys a positive or negative sentiment). Then, we adapt the sequence model to a downstream task. We build a new classification system based on this pre-trained sequence model and a new classification layer (e.g., determining if a sequence is subjective or objective). Typically, we need to fine-tune the parameters of the new model using task-specific labeled data, ensuring the model is optimally adjusted to perform well on this new type of data. The fine-tuned model is then employed to classify new sequences for this task. An advantage of supervised pre-training is that the training process, either in the pretraining or fine-tuning phase, is straightforward, as it follows the well-studied general paradigm of supervised learning in machine learning. However, as the complexity of the neural network increases, the demand for more labeled data also grows. This, in turn, makes the pre-training task more difficult, especially when large-scale labeled data is not available. A third approach to pre-training is self-supervised learning. In this approach, a neural network is trained using the supervision signals generated by itself, rather than those provided by humans. This is generally done by constructing its own training tasks directly from unlabeled data, such as having the system create pseudo labels. While self-supervised learning has recently emerged as a very popular method in NLP, it is not a new concept. In machine learning, a related concept is self-training where a model is iteratively improved by learning from the pseudo labels assigned to a dataset. To do this, we need some seed data to build an initial model. This model then generates pseudo labels for unlabeled data, and these pseudo labels are subsequently used to iteratively refine and bootstrap the model itself. Such a method has been successfully used in several NLP areas, such as word sense disambiguation [Yarowsky, 1995] and document classification [Blum and Mitchell, 1998]. Unlike the standard self-training method, self-supervised pre-training in NLP does not rely on an initial model for annotating the data. Instead, all the supervision signals are created from the text, and the entire model is trained from scratch. A well-known example of this is training sequence models by successively predicting a masked word given its preceding or surrounding words in a text. This enables large-scale self-supervised learning for deep neural networks, leading to the success of pre-training in many understanding, writing, and reasoning tasks. Figure 1.1 shows a comparison of the above three pre-training approaches. Self-supervised pre-training is so successful that most current state-of-the-art NLP models are based on this paradigm. Therefore, in this chapter and throughout this book, we will focus on self-supervised pre-training. We will show how sequence models are pre-trained via self-supervision and how the pre-trained models are applied. 1.1.2 Adapting Pre-trained Models As mentioned above, two major types of models are widely used in NLP pre-training. • Sequence Encoding Models. Given a sequence of words or tokens, a sequence encoding model represents this sequence as either a real-valued vector or a sequence of vectors, and obtains a representation of the sequence. 
This representation is typically used as input to another model, such as a sentence classification system.

[Figure 1.1: Illustration of (a) unsupervised, (b) supervised, and (c) self-supervised pre-training. In unsupervised pre-training, the pre-training is performed on large-scale unlabeled data. It can be viewed as a preliminary step to have a good starting point for the subsequent optimization process, though considerable effort is still required to further train the model with labeled data after pre-training. In supervised pre-training, the underlying assumption is that different (supervised) learning tasks are related. So we can first train the model on one task, and transfer the resulting model to another task with some training or tuning effort. In self-supervised pre-training, a model is pre-trained on large-scale unlabeled data via self-supervision. The model can be well trained in this way, and we can efficiently adapt it to new tasks through fine-tuning or prompting.]

• Sequence Generation Models. In NLP, sequence generation generally refers to the problem of generating a sequence of tokens based on a given context. The term context has different meanings across applications. For example, it refers to the preceding tokens in language modeling, and refers to the source-language sequence in machine translation.²

(Footnote 2: More precisely, in auto-regressive decoding of machine translation, each target-language token is generated based on both its preceding tokens and the source-language sequence.)

We need different techniques for applying these models to downstream tasks after pre-training. Here we are interested in the following two methods.

1.1.2.1 Fine-tuning of Pre-trained Models

For sequence encoding pre-training, a common method of adapting pre-trained models is fine-tuning. Let Encodeθ(·) denote an encoder with parameters θ; for example, Encodeθ(·) can be a standard Transformer encoder. Provided we have pre-trained this model in some way and obtained the optimal parameters θ̂, we can employ it to model any sequence and generate the corresponding representation, like this

H = Encodeθ̂(x)    (1.2)

where x is the input sequence {x0, x1, ..., xm}, and H is the output representation, which is a sequence of real-valued vectors {h0, h1, ..., hm}. Because the encoder does not work as a standalone NLP system, it is often integrated as a component into a bigger system. Consider, for example, a text classification problem in which we identify the polarity (i.e., positive, negative, and neutral) of a given text. We can build a text classification system by stacking a classifier on top of the encoder. Let Classifyω(·) be a neural network with parameters ω. Then, the text classification model can be expressed in the form

Prω,θ̂(·|x) = Classifyω(H) = Classifyω(Encodeθ̂(x))    (1.3)

Here Prω,θ̂(·|x) is a probability distribution over the label set {positive, negative, neutral}, and the label with the highest probability in this distribution is selected as output. To keep the notation uncluttered, we will use Fω,θ̂(·) to denote Classifyω(Encodeθ̂(·)).
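To make the composition Fω,θ̂(·) = Classifyω(Encodeθ̂(·)) concrete, here is a minimal PyTorch-style sketch of an encoder with a classification head stacked on top. The class names, layer sizes, and the toy stand-in encoder are illustrative assumptions rather than the architecture of any particular pre-trained model.

```python
import torch
import torch.nn as nn

class EncoderClassifier(nn.Module):
    """Classify_omega(Encode_theta(x)): a pre-trained encoder plus a small
    classification head (names and sizes are illustrative)."""

    def __init__(self, encoder: nn.Module, hidden_size: int = 768, num_labels: int = 3):
        super().__init__()
        self.encoder = encoder                                 # parameters theta (pre-trained)
        self.classifier = nn.Linear(hidden_size, num_labels)   # parameters omega

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        H = self.encoder(x)        # H = Encode_theta(x), shape [batch, m+1, hidden]
        h0 = H[:, 0, :]            # state at the start symbol, used as the sequence representation
        return torch.softmax(self.classifier(h0), dim=-1)   # Pr(.|x) over the label set

# Toy usage with a stand-in encoder; in practice a pre-trained encoder
# (e.g., a BERT-like model) would be plugged in instead.
vocab_size, hidden = 1000, 768
encoder = nn.Sequential(
    nn.Embedding(vocab_size, hidden),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
        num_layers=2,
    ),
)
model = EncoderClassifier(encoder, hidden_size=hidden, num_labels=3)
probs = model(torch.randint(0, vocab_size, (1, 12)))   # a batch with one 12-token sequence
```

During fine-tuning, one can either update both θ̂ and ω, or freeze the encoder (e.g., by setting requires_grad to False on its parameters) and train only the classification head, as discussed below.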
Because the model parameters ω and θ̂ are not optimized for the classification task, we cannot directly use this model. Instead, we must use a modified version of the model that is adapted to the task. A typical way is to fine-tune the model by giving explicit labeling in downstream tasks. We can train Fω,θ̂ (·) on a labeled dataset, treating it as a common supervised learning task. The outcome of the fine-tuning is the parameters ω̃ and θ̃ that are further optimized. Alternatively, we can freeze the encoder parameters θ̂ to maintain their pre-trained state, and focus solely on optimizing ω. This allows the classifier to be efficiently adapted to work in tandem with the pre-trained encoder. Once we have obtained a fine-tuned model, we can use it to classify a new text. For example, suppose we have a comment posted on a travel website: I love the food here. It’s amazing! We first tokenize this text into tokens3 , and then feed the token sequence xnew into the fine-tuned model Fω̃,θ̃ (·). The model generates a distribution over classes by Fω̃,θ̃ (xnew ) = h Pr(positive|xnew ) Pr(negative|xnew ) Pr(neutral|xnew ) i (1.4) And we select the label of the entry with the maximum value as output. In this example it is positive. In general, the amount of labeled data used in fine-tuning is small compared to that of the pre-training data, and so fine-tuning is less computationally expensive. This makes the adaption of pre-trained models very efficient in practice: given a pre-trained model and a downstream task, we just need to collect some labeled data, and slightly adjust the model parameters on this data. A more detailed discussion of fine-tuning can be found in Section 1.4. 1.1.2.2 Prompting of Pre-trained Models Unlike sequence encoding models, sequence generation models are often employed independently to address language generation problems, such as question answering and machine translation, without the need for additional modules. It is therefore straightforward to fine-tune these models 3 The text can be tokenized in many different ways. One of the simplest is to segment the text into English words and punctuations {I, love, the, food, here, ., It, ’s, amazing, !} Pre-training 6 as complete systems on downstream tasks. For example, we can fine-tune a pre-trained encoderdecoder multilingual model on some bilingual data to improve its performance on a specific translation task. Among various sequence generation models, a notable example is the large language models trained on very large amounts of data. These language models are trained to simply predict the next token given its preceding tokens. Although token prediction is such a simple task that it has long been restricted to “language modeling” only, it has been found to enable the learning of the general knowledge of languages by repeating the task a large number of times. The result is that the pre-trained large language models exhibit remarkably good abilities in token prediction, making it possible to transform numerous NLP problems into simple text generation problems through prompting the large language models. For example, we can frame the above text classification problem as a text generation task I love the food here. It’s amazing! I’m Here indicates the word or phrase we want to predict (call it the completion). If the predicted word is happy, or glad, or satisfied or a related positive word, we can classify the text as positive. 
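The completion-based trick above can be written down in a few lines of Python. In this sketch, generate is a placeholder for whatever function returns the language model's continuation of a prompt, and the word lists are purely illustrative; a real system would use a more careful mapping from completions to labels.

```python
# A minimal sketch of completion-based sentiment classification via prompting.
# `generate` stands for any function that returns an LLM's continuation of a
# prompt; it is not a specific library call.
POSITIVE_WORDS = {"happy", "glad", "satisfied", "delighted"}
NEGATIVE_WORDS = {"sad", "upset", "angry", "disappointed"}

def classify_by_completion(text: str, generate) -> str:
    prompt = text + " I'm"                 # e.g., "I love the food here. It's amazing! I'm"
    completion = generate(prompt)          # e.g., "happy" or "glad"
    first_word = completion.strip().split()[0].lower().strip(".,!")
    if first_word in POSITIVE_WORDS:
        return "positive"
    if first_word in NEGATIVE_WORDS:
        return "negative"
    return "neutral"
```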
This example shows a simple prompting method in which we concatenate the input text with I’m to form a prompt. Then, the completion helps decide which label is assigned to the original text. Given the strong performance of language understanding and generation of large language models, a prompt can instruct the models to perform more complex tasks. Here is a prompt where we prompt the LLM to perform polarity classification with an instruction. Assume that the polarity of a text is a label chosen from {positive, negative, neutral}. Identify the polarity of the input. Input: I love the food here. It’s amazing! Polarity: The first two sentences are a description of the task. Input and Polarity are indicators of the input and output, respectively. We expect the model to complete the text and at the same time give the correct polarity label. By using instruction-based prompts, we can adapt large language models to solve NLP problems without the need for additional training. This example also demonstrates the zero-shot learning capability of large language models, which can perform tasks that were not observed during the training phase. Another method for enabling new capabilities in a neural network is few-shot learning. This is typically achieved through in-context learning (ICT). More specifically, we add some samples that demonstrate how an input corresponds to an output. These samples, known as demonstrations, are used to teach large language models how to perform the task. Below is an example involving demonstrations 1.2 Self-supervised Pre-training Tasks 7 Assume that the polarity of a text is a label chosen from {positive, negative, neutral}. Identify the polarity of the input. Input: The traffic is terrible during rush hours, making it difficult to reach the airport on time. Polarity: Negative Input: The weather here is wonderful. Polarity: Positive Input: I love the food here. It’s amazing! Polarity: Prompting and in-context learning play important roles in the recent rise of large language models. We will discuss these issues more deeply in Chapter 3. However, it is worth noting that while prompting is a powerful way to adapt large language models, some tuning efforts are still needed to ensure the models can follow instructions accurately. Additionally, the fine-tuning process is crucial for aligning the values of these models with human values. More detailed discussions of fine-tuning can be found in Chapter 4. 1.2 Self-supervised Pre-training Tasks In this section, we consider self-supervised pre-training approaches for different neural architectures, including decoder-only, encoder-only, and encoder-decoder architectures. We restrict our discussion to Transformers since they form the basis of most pre-trained models in NLP. However, pre-training is a broad concept, and so we just give a brief introduction to basic approaches in order to make this section concise. 1.2.1 Decoder-only Pre-training The decoder-only architecture has been widely used in developing language models [Radford et al., 2018]. For example, we can use a Transformer decoder as a language model by simply removing cross-attention sub-layers from it. Such a model predicts the distribution of tokens at a position given its preceding tokens, and the output is the token with the maximum probability. The standard way to train this model, as in the language modeling problem, is to minimize a loss function over a collection of token sequences. Let Decoderθ (·) denote a decoder with parameters θ. 
At each position i, the decoder generates a distribution of the next tokens based on its preceding tokens {x0, ..., xi}, denoted by Prθ(·|x0, ..., xi) (or p^θ_{i+1} for short). Suppose we have the gold-standard distribution at the same position, denoted by p^gold_{i+1}. For language modeling, we can think of p^gold_{i+1} as a one-hot representation of the correct predicted word. We then define a loss function L(p^θ_{i+1}, p^gold_{i+1}) to measure the difference between the model prediction and the true prediction. In NLP, the log-scale cross-entropy loss is typically used. Given a sequence of m tokens {x0, ..., xm}, the loss on this sequence is the sum of the loss over the positions {0, ..., m − 1}, given by

Lossθ(x0, ..., xm) = Σ_{i=0}^{m−1} L(p^θ_{i+1}, p^gold_{i+1}) = Σ_{i=0}^{m−1} LogCrossEntropy(p^θ_{i+1}, p^gold_{i+1})    (1.5)

where LogCrossEntropy(·) is the log-scale cross-entropy, and p^gold_{i+1} is the one-hot representation of x_{i+1}. This loss function can be extended to a set of sequences D. In this case, the objective of pre-training is to find the best parameters that minimize the loss on D

θ̂ = argmin_θ Σ_{x∈D} Lossθ(x)    (1.6)

Note that this objective is mathematically equivalent to maximum likelihood estimation, and can be re-expressed as

θ̂ = argmax_θ Σ_{x∈D} log Prθ(x) = argmax_θ Σ_{x∈D} Σ_{i=0}^{m−1} log Prθ(x_{i+1}|x0, ..., xi)    (1.7)

With these optimized parameters θ̂, we can use the pre-trained language model Decoderθ̂(·) to compute the probability Prθ̂(x_{i+1}|x0, ..., xi) at each position of a given sequence.

1.2.2 Encoder-only Pre-training

As defined in Section 1.1.2.1, an encoder Encoderθ(·) is a function that reads a sequence of tokens x = x0 ... xm and produces a sequence of vectors H = h0 ... hm.⁴ Training this model is not straightforward, as we do not have gold-standard data for measuring how good the output of the real-valued function is. A typical approach to encoder pre-training is to combine the encoder with some output layers to receive supervision signals that are easier to obtain.

(Footnote 4: If we view hi as a row vector, H can be written as the vertical stack of these vectors, H = [h0; h1; ...; hm] (1.8).)

Figure 1.2 shows a common architecture for pre-training Transformer encoders, where we add a Softmax layer on top of the Transformer encoder. Clearly, this architecture is the same as that of the decoder-based language model, and the output is a sequence of probability distributions

[p^{W,θ}_1; ...; p^{W,θ}_m] = SoftmaxW(Encoderθ(x))    (1.9)

[Figure 1.2: Pre-training a Transformer encoder (left) and then applying the pre-trained encoder (right). In the pre-training phase, the encoder, together with a Softmax layer, is trained via self-supervision (e.g., by evaluating how well the model reconstructs the masked tokens). In the application phase, the Softmax layer is removed, and the pre-trained encoder is combined with a prediction network to address specific problems. In general, for better adaptation to these tasks, the system is fine-tuned using labeled data.]

Here p^{W,θ}_i is the output distribution Pr(·|x) at position i. We use SoftmaxW(·) to denote that the Softmax layer is parameterized by W, that is, SoftmaxW(H) = Softmax(H · W). For notation simplicity, we will sometimes drop the superscripts W and θ affixed to each probability distribution.
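As a concrete picture of Eq. (1.9), the sketch below applies a Softmax output layer, parameterized by a matrix W, to the encoder states and produces one distribution over the vocabulary per position. The sizes are illustrative, and the random tensor simply stands in for Encoderθ(x).

```python
import torch
import torch.nn as nn

# Minimal sketch of Softmax_W(Encoder_theta(x)) from Eq. (1.9); sizes are illustrative.
hidden_size, vocab_size = 768, 30000
W = nn.Linear(hidden_size, vocab_size, bias=False)   # the output matrix W

def position_wise_distributions(H: torch.Tensor) -> torch.Tensor:
    """Map encoder states H [batch, m+1, hidden] to one probability
    distribution over the vocabulary at every position."""
    return torch.softmax(W(H), dim=-1)                # Softmax(H * W)

H = torch.randn(2, 10, hidden_size)                   # stand-in for Encoder_theta(x)
p = position_wise_distributions(H)                    # shape [2, 10, vocab_size]
```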
The difference between this model and standard language models is that the output pi has different meanings in encoder pre-training and language modeling. In language modeling, pi is the probability distribution for predicting the next word. This follows an auto-regressive decoding process: a language model only observes the words up to position i and predicts the next. By contrast, in encoder pre-training, the entire sequence can be observed at once, and so it makes no sense to predict any of the tokens in this sequence.

1.2.2.1 Masked Language Modeling

One of the most popular methods of encoder pre-training is masked language modeling, which forms the basis of the well-known BERT model [Devlin et al., 2019]. The idea of masked language modeling is to create prediction challenges by masking out some of the tokens in the input sequence and training a model to predict the masked tokens. In this sense, the conventional language modeling problem, which is sometimes called causal language modeling, is a special case of masked language modeling: at each position, we mask the tokens in the right-context, and predict the token at this position using its left-context. However, in causal language modeling we only make use of the left-context in word prediction, while the prediction may also depend on tokens in the right-context. By contrast, in masked language modeling, all the unmasked tokens are used for word prediction, leading to a bidirectional model that makes predictions based on both left and right-contexts.

More formally, for an input sequence x = x0 ... xm, suppose that we mask the tokens at positions A(x) = {i1, ..., iu}. Hence we obtain a masked token sequence x̄ where the token at each position in A(x) is replaced with a special symbol [MASK]. For example, for the following sequence

The early bird catches the worm

we may have a masked token sequence like this

The [MASK] bird catches the [MASK]

where we mask the tokens early and worm (i.e., i1 = 2 and i2 = 6). Now we have two sequences x and x̄. The model is then optimized so that we can correctly predict x based on x̄. This can be thought of as an autoencoding-like process, and the training objective is to maximize the reconstruction probability Pr(x|x̄).

Note that there is a simple position-wise alignment between x and x̄. Because an unmasked token in x̄ is the same as the token in x at the same position, there is no need to consider the prediction for this unmasked token. This leads to a simplified training objective which only maximizes the probabilities for masked tokens. We can express this objective in a maximum likelihood estimation fashion

(Ŵ, θ̂) = argmax_{W,θ} Σ_{x∈D} Σ_{i∈A(x)} log Pr^{W,θ}(xi|x̄)    (1.10)

or alternatively express it using the cross-entropy loss

(Ŵ, θ̂) = argmin_{W,θ} Σ_{x∈D} Σ_{i∈A(x)} LogCrossEntropy(p^{W,θ}_i, p^gold_i)    (1.11)

where Pr^{W,θ}(xk|x̄) is the probability of the true token xk at position k given the corrupted input x̄, and p^{W,θ}_k is the probability distribution at position k given the corrupted input x̄.

To illustrate, consider the above example where two tokens of the sequence "the early bird catches the worm" are masked. For this example, the objective is to maximize the sum of log-scale probabilities

Loss = log Pr(x2 = early | x̄ = [CLS] The [MASK] bird catches the [MASK])
     + log Pr(x6 = worm | x̄ = [CLS] The [MASK] bird catches the [MASK])    (1.12)

where the first and second [MASK] symbols occupy positions x̄2 and x̄6, respectively.
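A hedged sketch of how such training examples and the masked-position loss in Eq. (1.11) can be computed is shown below. The masking rate, the [MASK] token id, and all tensor shapes are illustrative assumptions, and the logits are assumed to be the unnormalized scores H · W produced by the encoder and the output layer for the corrupted input x̄.

```python
import torch
import torch.nn.functional as F

MASK_ID, IGNORE = 103, -100     # illustrative [MASK] id; IGNORE marks positions that are not scored

def make_mlm_example(token_ids: torch.Tensor, mask_prob: float = 0.15):
    """Randomly mask tokens and return (corrupted input x_bar, labels).
    Labels keep the original ids at masked positions and IGNORE elsewhere,
    so the loss is accumulated over the masked positions only."""
    labels = token_ids.clone()
    is_masked = torch.rand(token_ids.shape) < mask_prob
    is_masked[..., 0] = False                # never mask the start symbol [CLS]
    corrupted = token_ids.clone()
    corrupted[is_masked] = MASK_ID
    labels[~is_masked] = IGNORE
    return corrupted, labels

def mlm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: [batch, m+1, |V|]; cross_entropy skips positions labeled IGNORE,
    # which matches summing LogCrossEntropy over i in A(x) as in Eq. (1.11).
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                           ignore_index=IGNORE)
```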
Once we obtain the optimized parameters Ŵ and θ̂, we can drop Ŵ. Then, we can further fine-tune the pre-trained encoder Encoderθ̂(·) or directly apply it to downstream tasks.

1.2.2.2 Permuted Language Modeling

While masked language modeling is simple and widely applied, it introduces new issues. One drawback is the use of a special token, [MASK], which is employed only during training but not at test time. This leads to a discrepancy between training and inference. Moreover, the autoencoding process overlooks the dependencies between masked tokens. For example, in the above example, the prediction of x2 (i.e., the first masked token) is made independently of x6 (i.e., the second masked token), though x6 should be considered in the context of x2.

These issues can be addressed using the permuted language modeling approach to pre-training [Yang et al., 2019]. Similar to causal language modeling, permuted language modeling involves making sequential predictions of tokens. However, unlike causal modeling, where predictions follow the natural sequence of the text (like left-to-right or right-to-left), permuted language modeling allows for predictions in any order. The approach is straightforward: we determine an order for token predictions and then train the model in a standard language modeling manner, as described in Section 1.2.1. Note that in this approach, the actual order of tokens in the text remains unchanged, and only the order in which we predict these tokens differs from standard language modeling.

For example, consider a sequence of 5 tokens x0 x1 x2 x3 x4. Let ei represent the embedding of xi (i.e., the combination of the token embedding and positional embedding). In standard language modeling, we would generate this sequence in the order of x0 → x1 → x2 → x3 → x4. The probability of the sequence can be modeled via a generation process

Pr(x) = Pr(x0) · Pr(x1|x0) · Pr(x2|x0, x1) · Pr(x3|x0, x1, x2) · Pr(x4|x0, x1, x2, x3)
      = Pr(x0) · Pr(x1|e0) · Pr(x2|e0, e1) · Pr(x3|e0, e1, e2) · Pr(x4|e0, e1, e2, e3)    (1.13)

Now, let us consider a different order for token prediction: x0 → x4 → x2 → x1 → x3. The sequence generation process can then be expressed as follows:

Pr(x) = Pr(x0) · Pr(x4|e0) · Pr(x2|e0, e4) · Pr(x1|e0, e4, e2) · Pr(x3|e0, e4, e2, e1)    (1.14)

This new prediction order allows for the generation of some tokens to be conditioned on a broader context, rather than being limited to just the preceding tokens as in standard language models. For example, in generating x3, the model considers both its left-context (i.e., e0, e1, e2) and right-context (i.e., e4). The embeddings e0, e1, e2, e4 incorporate the positional information of x0, x1, x2, x4, preserving the original order of the tokens. As a result, this approach is somewhat akin to masked language modeling: we mask out x3 and use its surrounding tokens x0, x1, x2, x4 to predict this token.

The implementation of permuted language models is relatively easy for Transformers. Because the self-attention model is insensitive to the order of inputs, we do not need to explicitly reorder the sequence to have a factorization like Eq. (1.14). Instead, permutation can be done by setting appropriate masks for self-attention. For example, consider the case of computing Pr(x1|e0, e4, e2).
We can place x0, x1, x2, x3, x4 in order and block the attention from x3 to x1 in self-attention, as illustrated below.

[Illustration: a mask matrix for self-attention over x0, x1, x2, x3, x4, where blue cells mark valid attention and gray cells mark blocked attention.]

For a more illustrative example, we compare the self-attention masking results of causal language modeling, masked language modeling and permuted language modeling in Figure 1.3.

[Figure 1.3: Comparison of the self-attention masking results of (a) causal language modeling (order: x0 → x1 → x2 → x3 → x4), (b) masked language modeling (order: x0, [MASK], x2, [MASK], x4 → x1, x3), and (c) permuted language modeling (order: x0 → x4 → x2 → x1 → x3). A gray cell (i, j) denotes that the token at position j does not attend to the token at position i. A blue cell (i, j) denotes that the token at position j attends to the token at position i. emask represents the embedding of the symbol [MASK], which is a combination of the token embedding and the positional embedding.]

1.2.2.3 Pre-training Encoders as Classifiers

Another commonly-used idea to train an encoder is to consider classification tasks. In self-supervised learning, this is typically done by creating new classification challenges from the unlabeled text. There are many different ways to design the classification tasks. Here we present two popular tasks.

A simple method, called next sentence prediction (NSP), is presented in BERT's original paper [Devlin et al., 2019]. The assumption of NSP is that a good text encoder should capture the relationship between two sentences. To model such a relationship, in NSP we can use the output of encoding two consecutive sentences SentA and SentB to determine whether SentB is the next sentence following SentA. For example, suppose SentA = 'It is raining .' and SentB = 'I need an umbrella .'. The input sequence of the encoder could be

[CLS] It is raining . [SEP] I need an umbrella . [SEP]

where [CLS] is the start symbol (i.e., x0) which is commonly used in encoder pre-training, and [SEP] is a separator that separates the two sentences. The processing of this sequence follows a standard procedure of Transformer encoding: we first represent each token xi as its corresponding embedding ei, and then feed the embedding sequence {e0, ..., em} into the encoder to obtain the output sequence {h0, ..., hm}. Since h0 is generally considered as the representation of the entire sequence, we add a Softmax layer on top of it to construct a binary classification system. This process is illustrated as follows.

[Illustration: each token of "[CLS] It is raining . [SEP] I need an umbrella . [SEP]" is mapped to its embedding (e0, ..., e11), the encoder produces the hidden states (h0, ..., h11), and a Softmax layer on top of h0 outputs the "is next or not" decision.]

In order to generate training samples, we need two sentences each time, one for SentA and the other for SentB. A simple way to do this is to utilize the natural sequence of two consecutive sentences in the text.
For example, we obtain a positive sample by using actual consecutive sentences, and a negative sample by using randomly sampled sentences. Consequently, training this model is the same as training a classifier. Typically, NSP is used as an additional training loss Pre-training 14 function for pre-training based on masked language modeling. A second example of training Transformer encoders as classifiers is to apply classificationbased supervision signals to each output of an encoder. For example, Clark et al. [2019] in their ELECTRA model, propose training a Transformer encoder to identify whether each input token is identical to the original input or has been altered in some manner. The first step of this method is to generate a new sequence from a given sequence of tokens, where some of the tokens are altered. To do this, a small masked language model (call it the generator) is applied: we randomly mask some of the tokens, and train this model to predict the masked tokens. For each training sample, this masked language model outputs a token at each masked position, which might be different from the original token. At the same time, we train another Transformer encoder (call it the discriminator) to determine whether each predicted token is the same as the original token or altered. More specifically, we use the generator to generate a sequence where some of the tokens are replaced. Below is an illustration. original: masked: replaced: [CLS] The boy spent hours working on toys . ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ [CLS] The boy spent [MASK] working on [MASK] ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ [CLS] The Generator (small masked language model) boy spent decades working on . toys . Then, we use the discriminator to label each of these tokens as orginal or replaced, as follows replaced: [CLS] spent decades working The boy ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ on toys . ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ Discriminator (the model we want) label: original original original original replaced original original original original For training, the generator is optimized as a masked language model with maximum likelihood estimation, and the discriminator is optimized as a classifier using a classification-based loss. In ELECTRA, the maximum likelihood-based loss and the classification-based loss are combined for jointly training both the generator and discriminator. An alternative approach is to use generative adversarial networks (GANs), that is, the generator is trained to fool the discriminator, and the discriminator is trained to distinguish the output of the generator from the true distribution. However, GAN-style training complicates the training task and is more difficult to scale up. Nevertheless, once training is complete, the generator is discarded, and the encoding part of the discriminator is applied as the pre-trained model for downstream tasks. 1.2 Self-supervised Pre-training Tasks 15 1.2.3 Encoder-Decoder Pre-training In NLP, encoder-decoder architectures are often used to model sequence-to-sequence problems, such as machine translation and question answering. In addition to these typical sequence-tosequence problems in NLP, encoder-decoder models can be extended to deal with many other problems. A simple idea is to consider text as both the input and output of a problem, and so we can directly apply encoder-decoder models. For example, given a text, we can ask a model to output a text describing the sentiment of the input text, such as positive, negative, and neutral. 
Such an idea allows us to develop a single text-to-text system to address any NLP problem. We can formulate different problems into the same text-to-text format. We first train an encoderdecoder model to gain general-purpose knowledge of language via self-supervision. This model is then fine-tuned for specific downstream tasks using targeted text-to-text data. 1.2.3.1 Masked Encoder-Decoder Pre-training In Raffel et al. [2020]’s T5 model, many different tasks are framed as the same text-to-text task. Each sample in T5 follows the format Source Text → Target Text Here → separates the source text, which consists of a task description or instruction and the input given to the system, from the target text, which is the response to the input task. As an example, consider a task of translating from Chinese to English. A training sample can be expressed as [CLS] Translate from Chinese to English: 你好! → hsi Hello! where [CLS] and hsi are the start symbols on the source and target sides, respectively5 . Likewise, we can express other tasks in the same way. For example [CLS] Answer: when was Albert Einstein born? → hsi He was born on March 14, 1879. [CLS] Simplify: the professor, who has has published numerous papers in his field, will be giving a lecture on the topic next week. → hsi The experienced professor will give a lecture next week. [CLS] Score the translation from English to Chinese. English: when in Rome, do as the Romans do. Chinese: 人 在 罗马 就 像 罗马 人 一样 做事 。 → hsi 0.81 where instructions are highlighted in gray. An interesting case is that in the last example we 5 We could use the same start symbol for different sequences. Here we use different symbols to distinguish the sequences on the encoder and decoder-sides. Pre-training 16 reframe the scoring problem as the text generation problem. Our goal is to generate a text representing the number 0.81, rather than outputting it as a numerical value. The approach described above provides a new framework of universal language understanding and generation. Both the task instructions and the problem inputs are provided to the system in text form. The system then follows the instructions to complete the task. This method puts different problems together, with the benefit of training a single model that can perform many tasks simultaneously. In general, fine-tuning is necessary for adapting the pre-trained model to a specific downstream task. In this process, one can use different ways to instruct the model for the task, such as using a short name of the task as the prefix to the actual input sequence or providing a detailed description of the task. Since the task instructions are expressed in text form and involved as part of the input, the general knowledge of instruction can be gained through learning the language understanding models in the pre-training phase. This may help enable zero-shot learning. For example, pretrained models can generalize to address new problems where the task instructions have never been encountered. There have been several powerful methods of self-supervised learning for either Transformer encoders or decoders. Applying these methods to pre-train encoder-decoder models is relatively straightforward. One common choice is to train encoder-decoder models as language models. For example, the encoder receives a sequence prefix, while the decoder generates the remaining sequence. However, this differs from standard causal language modeling, where the entire sequence is autoregressively generated from the first token. 
In our case, the encoder processes the prefix at once, and then the decoder predicts subsequent tokens in the manner of causal language modeling. Put more precisely, this is a prefix language modeling problem: a language model predicts the subsequent sequence given a prefix, which serves as the context for prediction. Consider the following example [CLS] The puppies are frolicking → hsi outside the house . | {z Prefix } | {z Subsequent Sequence } We can directly train an encoder-decoder model using examples like this. Then, the encoder learns to understand the prefix, and the decoder learns to continue writing based on this understanding. For large-scale pre-training, it is easy to create a large number of training examples from unlabeled text. It is worth noting that for pre-trained encoder-decoder models to be effective in multi-lingual and cross-lingual tasks, such as machine translation, they should be trained with multi-lingual data. This typically requires that the vocabulary includes tokens from all the languages. By doing so, the models can learn shared representations across different languages, thereby enabling capabilities in both language understanding and generation in a multi-lingual and cross-lingual context. A second approach to pre-training encoder-decoder models is masked language modeling. In this approach, as discussed in Section 1.2.2, tokens in a sequence are randomly replaced with a mask symbol, and the model is then trained to predict these masked tokens based on the entire masked sequence. As an illustration, consider the task of masking and reconstructing the sentence 1.2 Self-supervised Pre-training Tasks 17 The puppies are frolicking outside the house . By masking two tokens (say, frolicking and the), we have the BERT-style input and output of the model, as follows [CLS] The puppies are [MASK] outside [MASK] house . → hsi frolicking the Here denotes the masked position at which we do not make token predictions. By varying the percentage of the tokens in the text, this approach can be generalized towards either BERT-style training or language modeling-style training [Song et al., 2019]. For example, if we mask out all the tokens, then the model is trained to generate the entire sequence [CLS] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] → hsi The puppies are frolicking outside the house . In this case, we train the decoder as a language model. Note that, in the context of the encoder-decoder architecture, we can use the encoder to read the masked sequence, and use the decoder to predict the original sequence. With this objective, we essentially have a denoising autoencoder: the encoder transforms a corrupted input into some hidden representation, and the decoder reconstructs the uncorrupted input from this hidden representation. Here is an example of input and output for denoising training. [CLS] The puppies are [MASK] outside [MASK] house . → hsi The puppies are frolicking outside the house . By learning to map from this corrupted sequence to its uncorrupted counterpart, the model gains the ability to understand on the encoder side and to generate on the decoder side. See Figure 1.4 for an illustration of how an encoder-decoder model is trained with BERT-style and denoising autoencoding objectives. As we randomly select tokens for masking, we can certainly mask consecutive tokens [Joshi et al., 2020]. Here is an example. [CLS] The puppies are [MASK] outside [MASK] [MASK] . → hsi The puppies are frolicking outside the house . 
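Training pairs of this kind are easy to create from raw text. Below is a small Python sketch, under simplified assumptions (whitespace tokenization, a fixed masking probability, and illustrative special symbols), of how a corrupted source sequence and its reconstruction target might be built for this denoising style of encoder-decoder pre-training.

```python
import random

def make_denoising_pair(tokens, mask_prob=0.3, mask_symbol="[MASK]", seed=None):
    """Build an encoder-decoder training pair: a corrupted source sequence for
    the encoder and the original sequence as the decoder's reconstruction target."""
    rng = random.Random(seed)
    source = [mask_symbol if rng.random() < mask_prob else t for t in tokens]
    target = list(tokens)                      # the decoder reconstructs the original
    return ["[CLS]"] + source, ["<s>"] + target

# Example (whitespace tokenization for illustration only):
sentence = "The puppies are frolicking outside the house .".split()
src, tgt = make_denoising_pair(sentence, seed=0)
# src might look like: ['[CLS]', 'The', 'puppies', 'are', '[MASK]', 'outside', '[MASK]', 'house', '.']
# tgt is always:       ['<s>', 'The', 'puppies', 'are', 'frolicking', 'outside', 'the', 'house', '.']
```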
Another way to consider consecutive masked tokens is to represent them as spans. Here we follow Raffel et al. [2020]'s work, and use [X], [Y] and [Z] to denote sentinel tokens that cover one or more consecutive masked tokens. Using this notation, we can re-express the above training example as

[CLS] The puppies are [X] outside [Y] . → ⟨s⟩ [X] frolicking [Y] the house [Z]

[Figure 1.4: Training an encoder-decoder model with (a) BERT-style masked language modeling and (b) denoising autoencoding. In both methods, the input to the encoder is a corrupted token sequence where some tokens are masked and replaced with [MASK] (or [M] for short). The decoder predicts these masked tokens, but in different ways. In BERT-style training, the decoder only needs to compute the loss for the masked tokens, while the remaining tokens in the sequence can be simply treated as [MASK] tokens. In denoising autoencoding, the decoder predicts the sequence of all tokens in an autoregressive manner. As a result, the loss is obtained by accumulating the losses of all these tokens, as in standard language modeling.]

The idea is that we represent the corrupted sequence as a sequence containing placeholder slots. The training task is to fill these slots with the correct tokens using the surrounding context. An advantage of this approach is that the sequences used in training would be shorter, making the training more efficient. Note that masked language modeling provides a very general framework for training encoder-decoder models. Various settings can be adjusted to have different training versions, such as altering the percentage of tokens masked and the maximum length of the masked spans.

1.2.3.2 Denoising Training

If we view the problem of training encoder-decoder models as a problem of training denoising autoencoders, there will typically be many different methods for introducing input corruption and reconstructing the input. For instance, beyond randomly masking tokens, we can also alter some of them or rearrange their order. Suppose we have an encoder-decoder model that can map an input sequence x to an output sequence y

y = Decodeω(Encodeθ(x)) = Modelθ,ω(x)    (1.15)

where θ and ω are the parameters of the encoder and the decoder, respectively. In denoising autoencoding problems, we add some noise to x to obtain a noisy, corrupted input x_noise. By feeding x_noise into the encoder, we wish the decoder to output the original input. The training objective can be defined as

(θ̂, ω̂) = argmin_{θ,ω} Loss(Modelθ,ω(x_noise), x)    (1.16)

Here the loss function Loss(Modelθ,ω(x_noise), x) evaluates how well the model Modelθ,ω(x_noise) reconstructs the original input x. We can choose the cross-entropy loss as usual.

As the model architecture and the training approach have been developed, the remaining issue is the corruption of the input. Lewis et al. [2020], in their BART model, propose corrupting the input sequence in several different ways.

• Token Masking. This is the same masking method that we used in masked language modeling.
The tokens in the input sequence are randomly selected and masked.

• Token Deletion. This method is similar to token masking. However, rather than replacing the selected tokens with a special symbol [MASK], these tokens are removed from the sequence. See the following example for a comparison of the token masking and token deletion methods.

Original (x): The puppies are frolicking outside the house .
Token Masking (x_noise): The puppies are [MASK] outside [MASK] house .
Token Deletion (x_noise): The puppies are outside house .

where the tokens frolicking and the in the original sequence are the ones that are masked or deleted.

• Span Masking. Non-overlapping spans are randomly sampled over the sequence. Each span is masked by [MASK]. We also consider spans of length 0, and, in such cases, [MASK] is simply inserted at a position in the sequence. For example, we can use span masking to corrupt the above sequence as

Original (x): The 0 puppies are frolicking outside the house .
Span Masking (x_noise): The [MASK] puppies are [MASK] house .

Here the span frolicking outside the is replaced with a single [MASK]. 0 indicates a length-0 span, and so we insert a [MASK] between The and puppies. Span masking introduces new prediction challenges in which the model needs to know how many tokens are generated from a span. This problem is very similar to fertility modeling in machine translation [Brown et al., 1993].

If we consider a sequence consisting of multiple sentences, additional methods of corruption can be applied. In the BART model, there are two such methods.

• Sentence Reordering. This method randomly permutes the sentences so that the model can learn to reorder sentences in a document. Consider, for example, two consecutive sentences

Hard work leads to success . Success brings happiness .

We can reorder the two sentences to have a corrupted input sequence

Success brings happiness . Hard work leads to success .

• Document Rotation. The goal of this task is to identify the start token of the sequence. First, a token is randomly selected from the sequence. Then, the sequence is rotated so that the selected token is the first token. For example, suppose we select the token leads from the above sequence. The rotated sequence is

leads to success . Success brings happiness . Hard work

where the subsequence Hard work before leads is appended to the end of the sequence.

For pre-training, we can apply multiple corruption methods to learn robust models; for example, we can randomly choose one of them for each training sample. In practice, the outcome of encoder-decoder pre-training depends heavily on the input corruption methods used, and so we typically need to choose appropriate training objectives through careful experimentation.
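As a rough illustration of these corruption operations, the sketch below implements token deletion, sentence reordering, and document rotation on a whitespace-tokenized text. The function names, the deletion rate, and the sentence-splitting convention are assumptions made for this example; BART's actual preprocessing differs in its details.

import random

def token_deletion(tokens, rate=0.15, rng=random):
    # Drop a random subset of tokens; no [MASK] placeholder is left behind.
    return [t for t in tokens if rng.random() >= rate]

def sentence_reordering(tokens, rng=random):
    # Split on '.' into sentences and shuffle their order.
    sentences, cur = [], []
    for t in tokens:
        cur.append(t)
        if t == ".":
            sentences.append(cur)
            cur = []
    if cur:
        sentences.append(cur)
    rng.shuffle(sentences)
    return [t for s in sentences for t in s]

def document_rotation(tokens, rng=random):
    # Rotate the sequence so that a randomly chosen token becomes the first token.
    start = rng.randrange(len(tokens))
    return tokens[start:] + tokens[:start]

rng = random.Random(0)
doc = "Hard work leads to success . Success brings happiness .".split()
print(" ".join(token_deletion(doc, rate=0.2, rng=rng)))
print(" ".join(sentence_reordering(doc, rng=rng)))
print(" ".join(document_rotation(doc, rng=rng)))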
1.2.4 Comparison of Pre-training Tasks

So far, we have discussed a number of pre-training tasks. Since the same training objective can apply to different architectures (e.g., using masked language modeling for both encoder-only and encoder-decoder pre-training), categorizing pre-training tasks based solely on model architecture does not seem ideal. Instead, we summarize these tasks based on the training objectives.

• Language Modeling. Typically, this approach refers to an auto-regressive generation procedure of sequences. At each step, it predicts the next token based on its previous context.

• Masked Language Modeling. Masked language modeling belongs to a general mask-predict framework. It randomly masks tokens in a sequence and predicts these tokens using the entire masked sequence.

• Permuted Language Modeling. Permuted language modeling follows a similar idea to masked language modeling, but considers the order of (masked) token prediction. It reorders the input sequence and predicts the tokens sequentially. Each prediction is based on some context tokens that are randomly selected.

• Discriminative Training. In discriminative training, supervision signals are created from classification tasks. Models for pre-training are integrated into classifiers and trained together with the remaining parts of the classifiers to enhance their classification performance.

• Denoising Autoencoding. This approach is applied to the pre-training of encoder-decoder models. The input is a corrupted sequence and the encoder-decoder models are trained to reconstruct the original sequence.

Table 1.1 illustrates these methods and their variants using examples. The use of these examples does not distinguish between models, but we mark the model architectures where the pre-training tasks can be applied. In each example, the input consists of a token sequence, and the output is either a token sequence or some probabilities. For generation tasks, such as language modeling, superscripts are used to indicate the generation order on the target side. If the superscripts are omitted, it indicates that the output sequence can be generated either autoregressively or simultaneously. On the source side, we assume that the sequence undergoes a standard Transformer encoding process, meaning that each token can see the entire sequence in self-attention. The only exception is in permuted language modeling, where an autoregressive generation process is implemented by setting attention masks on the encoder side. To simplify the discussion, we remove the token hsi from the target side of each example. While these pre-training tasks are different, it is possible to compare them in the same framework and experimental setup [Dong et al., 2019; Raffel et al., 2020; Lewis et al., 2020]. Note that we cannot list all the pre-training tasks here as there are many of them. For more discussions on pre-training tasks, the interested reader may refer to some surveys on this topic [Qiu et al., 2020; Han et al., 2021].

Table 1.1: Comparison of pre-training tasks, including language modeling, masked language modeling, permuted language modeling, discriminative training, and denoising autoencoding. [C] = [CLS], [M] = [MASK], [X], [Y] = sentinel tokens. Enc, Dec and E-D indicate whether the approach can be applied to encoder-only, decoder-only, and encoder-decoder models, respectively. For generation tasks, superscripts are used to represent the order of the tokens. (Tasks compared: causal LM, prefix LM, MASS-style and BERT-style masked LM, permuted LM, next sentence prediction, sentence comparison, token classification, token reordering, token deletion, span masking, sentinel masking, sentence reordering, and document rotation, illustrated on the example sentence "The kitten is chasing the ball .")

1.3 Example: BERT

In this section, we introduce BERT models, which are among the most popular and widely used pre-trained sequence encoding models in NLP.

1.3.1 The Standard Model

The standard BERT model, which is proposed in Devlin et al. [2019]’s work, is a Transformer encoder trained using both masked language modeling and next sentence prediction tasks. The loss used in training this model is a sum of the losses of the two tasks.

LossBERT = LossMLM + LossNSP   (1.17)

As is regular in training deep neural networks, we optimize the model parameters by minimizing this loss. To do this, a number of training samples are collected. During training, a batch of training samples is randomly selected from this collection at a time, and LossBERT is accumulated over these training samples.
Then, the model parameters are updated via gradient descent or its variants. This process is repeated many times until some stopping criterion is satisfied, such as when the training loss converges.

1.3.1.1 Loss Functions

In general, BERT models are used to represent a single sentence or a pair of sentences, and thus can handle various downstream language understanding problems. In this section we assume that the input representation is a sequence containing two sentences SentA and SentB, expressed as

[CLS] SentA [SEP] SentB [SEP]

Here we follow the notation in BERT’s paper and use [SEP] to denote the separator. Given this sequence, we can obtain LossMLM and LossNSP separately. For masked language modeling, we predict a subset of the tokens in the sequence. Typically, a certain percentage of the tokens are randomly selected; for example, in the standard BERT model, 15% of the tokens in each sequence are selected. Then the sequence is modified in three ways

• Token Masking. 80% of the selected tokens are masked and replaced with the symbol [MASK]. For example

Original: [CLS] It is raining . [SEP] I need an umbrella . [SEP]
Masked: [CLS] It is [MASK] . [SEP] I need [MASK] umbrella . [SEP]

where the selected tokens here are raining and an. Predicting masked tokens makes the model learn to represent tokens from their surrounding context.

• Random Replacement. 10% of the selected tokens are changed to a random token. For example,

Original: [CLS] It is raining . [SEP] I need an umbrella . [SEP]
Random Token: [CLS] It is raining . [SEP] I need an hat . [SEP]

This helps the model learn to recover a token from a noisy input.

• Unchanged. 10% of the selected tokens are kept unchanged. For example,

Original: [CLS] It is raining . [SEP] I need an umbrella . [SEP]
Unchanged Token: [CLS] It is raining . [SEP] I need an umbrella . [SEP]

This is not a difficult prediction task, but can guide the model to use easier evidence for prediction.
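The selection-and-modification procedure can be sketched as follows. This is a simplified illustration of the 15%/80%/10%/10% scheme described above; the toy vocabulary and the way special tokens are skipped are assumptions of the example rather than BERT's exact implementation.

import random

SPECIAL = {"[CLS]", "[SEP]"}

def bert_mask(tokens, vocab, select_rate=0.15, rng=random):
    # Return (modified tokens, prediction positions) for BERT-style masked LM.
    modified = list(tokens)
    positions = []
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_rate:
            continue
        positions.append(i)                  # the model must predict the original token here
        r = rng.random()
        if r < 0.8:
            modified[i] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            modified[i] = rng.choice(vocab)  # 10%: replace with a random token
        # else: 10%: keep the token unchanged
    return modified, positions

rng = random.Random(42)
seq = "[CLS] It is raining . [SEP] I need an umbrella . [SEP]".split()
toy_vocab = ["hat", "dog", "happy", "run"]   # toy vocabulary for random replacement
masked, pos = bert_mask(seq, toy_vocab, rng=rng)
print(" ".join(masked), pos)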
Let A(x) be the set of selected positions of a given token sequence x, and x̄ be the modified sequence of x. The loss function of masked language modeling can be defined as

LossMLM = −∑_{i ∈ A(x)} log Pr_i(x_i | x̄)   (1.18)

where Pr_i(x_i | x̄) is the probability of predicting x_i at the position i given x̄. Figure 1.5 shows a running example of computing LossMLM.

For next sentence prediction, we follow the method described in Section 1.2.2.3. Each training sample is classified into a label set {IsNext, NotNext}, for example,

Sequence: [CLS] It is raining . [SEP] I need an umbrella . [SEP]
Label: IsNext
Sequence: [CLS] The cat sleeps on the windowsill . [SEP] Apples grow on trees . [SEP]
Label: NotNext

Fig. 1.5: A running example of BERT-style masked language modeling. First, 15% of the tokens are randomly selected. These selected tokens are then processed in one of three ways: replaced with a [MASK] token (80% of the time), replaced with a random token (10% of the time), or kept unchanged (10% of the time). The model is trained to predict these selected tokens based on the modified sequence. ei represents the embedding of the token at the position i. Gray boxes represent the Softmax layers.

The output vector of the encoder for the first token [CLS] is viewed as the sequence representation, denoted by hcls (or h0). A classifier is built on top of hcls. Then, we can compute the probability of a label c given hcls, i.e., Pr(c|hcls). There are many loss functions one can choose for classification problems. For example, in maximum likelihood training, we can define LossNSP as

LossNSP = − log Pr(cgold | hcls)   (1.19)

where cgold is the correct label for this sample.

1.3.1.2 Model Setup

As shown in Figure 1.6, BERT models are based on the standard Transformer encoder architecture. The input is a sequence of embeddings, each being the sum of the token embedding, the positional embedding, and the segment embedding.

e = x + epos + eseg   (1.20)

Both the token embedding (x) and positional embedding (epos) are regular, as in Transformer models. The segment embedding (eseg) is a new type of embedding that indicates whether a token belongs to SentA or SentB. This can be illustrated by the following example.

Token: [CLS] It is raining . [SEP] I need an umbrella . [SEP]
x:     x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11
epos:  PE(0) PE(1) PE(2) PE(3) PE(4) PE(5) PE(6) PE(7) PE(8) PE(9) PE(10) PE(11)
eseg:  eA eA eA eA eA eA eB eB eB eB eB eB
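To illustrate how the three embeddings are combined, here is a small sketch of the input layer. The toy dimensionality, the use of learned positional embeddings, and the segment-id convention are assumptions made for this example.

import numpy as np

rng = np.random.default_rng(0)
vocab = {"[CLS]": 0, "[SEP]": 1, "It": 2, "is": 3, "raining": 4, ".": 5,
         "I": 6, "need": 7, "an": 8, "umbrella": 9}
d_e, max_len = 8, 32                           # toy embedding size and max sequence length

tok_emb = rng.normal(size=(len(vocab), d_e))   # token embeddings (x)
pos_emb = rng.normal(size=(max_len, d_e))      # positional embeddings (epos)
seg_emb = rng.normal(size=(2, d_e))            # segment embeddings (eseg): SentA = 0, SentB = 1

tokens = "[CLS] It is raining . [SEP] I need an umbrella . [SEP]".split()
ids = [vocab[t] for t in tokens]
# every token up to and including the first [SEP] belongs to SentA, the rest to SentB
first_sep = tokens.index("[SEP]")
segs = [0 if i <= first_sep else 1 for i in range(len(tokens))]

# e = x + epos + eseg, one row per input position
e = tok_emb[ids] + pos_emb[np.arange(len(tokens))] + seg_emb[segs]
print(e.shape)   # (12, 8)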
The main part of BERT models is a multi-layer Transformer network. A Transformer layer consists of a self-attention sub-layer and an FFN sub-layer. Both of them follow the post-norm architecture: output = LNorm(F(input) + input), where F(·) is the core function of the sub-layer (either a self-attention model or an FFN), and LNorm(·) is the layer normalization unit. Typically, a number of Transformer layers are stacked to form a deep network. At each position of the sequence, the output representation is a real-valued vector which is produced by the last layer of the network.

There are several aspects one may consider in developing BERT models.

• Vocabulary Size (|V|). In Transformers, each input token is represented as an entry in a vocabulary V. Large vocabularies can cover more surface form variants of words, but may lead to increased storage requirements.

• Embedding Size (de). Every token is represented as a de-dimensional real-valued vector. As presented above, this vector is the sum of the token embedding, positional embedding, and segment embedding, all of which are also de-dimensional real-valued vectors.

• Hidden Size (d). The input and output of a sub-layer are of d dimensions. Besides, most of the hidden states of a sub-layer are d-dimensional vectors. In general, d can be roughly viewed as the width of the network.

• Number of Heads (nhead). In self-attention sub-layers, one needs to specify the number of heads used in multi-head self-attention. The larger this number is, the more sub-spaces in which attention is performed. In practical systems, we often set nhead ≥ 4.

• FFN Hidden Size (dffn). The size of the hidden layer of the FFNs used in Transformers is typically larger than d. For example, a typical setting is dffn = 4d. For larger Transformers, such as recent large models, dffn may be set to a very large value.

• Model Depth (L). Using deep networks is an effective way to improve the expressive power of Transformers. For BERT models, L is typically set to 12 or 24. However, networks with even greater depth are also feasible and can be applied for further enhancements.

Fig. 1.6: The model architecture of BERT (Transformer encoder). The input tokens are first represented as embeddings, each of which is the sum of the corresponding token embedding, positional embedding and segment embedding. Then, the embedding sequence is processed by a stack of Transformer layers. Each layer in this stack includes a self-attention sub-layer and an FFN sub-layer. The output of the BERT model is a sequence of vectors produced by the final Transformer layer.

Different settings of these hyper-parameters lead to different model sizes. There are two widely-used BERT models.

• BERTbase: d = 768, L = 12, nhead = 12, total number of parameters = 110M.
• BERTlarge: d = 1,024, L = 24, nhead = 16, total number of parameters = 340M.

Training BERT models follows the standard training process of Transformers.
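The hyper-parameters above can be gathered into a small configuration object, as sketched below for the two standard settings. The vocabulary size (roughly 30K WordPiece tokens), the 4d FFN size, and the maximum sequence length follow common BERT setups and are assumptions here rather than values given in the text.

from dataclasses import dataclass

@dataclass
class BertConfig:
    vocab_size: int = 30522   # |V|: assumed WordPiece vocabulary size of English BERT
    d_e: int = 768            # embedding size (equal to the hidden size in standard BERT)
    d: int = 768              # hidden size of each sub-layer
    n_head: int = 12          # number of attention heads
    d_ffn: int = 3072         # FFN hidden size, here set to 4d
    depth: int = 12           # number of stacked Transformer layers (L)
    max_len: int = 512        # maximum sequence length for positional embeddings

base = BertConfig()
large = BertConfig(d_e=1024, d=1024, n_head=16, d_ffn=4096, depth=24)
print(base)
print(large)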
Training larger models such as BERTlarge requires more training effort and time. This is a common problem for pre-training, especially when a model is trained on a very large amount of data. In practice, there are often considerations of training efficiency. For example, a common practice is to first train a BERT model on relatively short sequences for a large number of training steps, and then continue training it on full-length sequences for the remaining training steps.

1.3.2 More Training and Larger Models

BERT is a milestone model in NLP, sparking many subsequent efforts to improve it. One direction is to scale up the model itself, including increasing the amount of training data and developing larger models. RoBERTa, an extension of the standard BERT model, is an example of such efforts [Liu et al., 2019]. It introduces two major findings. First, simply using more training data and more compute can improve BERT models without the need to change the model architecture. Second, removing the NSP loss does not decrease the performance on downstream tasks if the training is scaled up. These findings suggest a general direction for pre-training: we can continue to improve pre-training by scaling it up on simple pre-training tasks.

A second approach to improving BERT models is to increase the number of model parameters. For example, in He et al. [2021]’s work, a 1.5 billion-parameter BERT-like model is built by increasing both the model depth and the hidden size. However, scaling up BERT and various other pre-trained models introduces new challenges in training; for example, training very large models often becomes unstable and difficult to converge. This makes the problem more complicated, and requires careful consideration of various aspects, including model architecture, parallel computation, parameter initialization, and so on. In another example, Shoeybi et al. [2019] successfully trained a 3.9 billion-parameter BERT-like model, where hundreds of GPUs were used to manage the increased computational demands.

1.3.3 More Efficient Models

Compared to its predecessors, BERT was a relatively large model for the time it was proposed. This increase in model size results in larger memory requirements and a consequent slowdown in system performance. Developing smaller and faster BERT models is part of the broader challenge of building efficient Transformers, which has been extensively discussed in Tay et al. [2020]’s work and Xiao and Zhu [2023]’s work. However, a deeper discussion of this general topic is beyond the scope of our current discussion. Here we instead consider a few efficient variants of BERT.

Several threads of research are of interest to NLP researchers in developing efficient BERT models. First, work on knowledge distillation, such as training student models with the output of well-trained teacher models, shows that smaller BERT models can be obtained by transferring knowledge from larger BERT models. Given that BERT models are multi-layer networks with several different types of layers, knowledge distillation can be applied at different levels of representation. For example, beyond distilling knowledge from the output layers, it is also possible to incorporate a training loss that measures the difference in the outputs of hidden layers between teacher models and student models [Sun et al., 2020; Jiao et al., 2020]. Indeed, knowledge distillation has been one of the most widely-used techniques for learning small pre-trained models.

Second, conventional model compression methods can be directly applied to compress BERT models. One common approach is to use general-purpose pruning methods to prune the Transformer encoding networks [Gale et al., 2019].
This generally involves removing entire layers [Fan et al., 2019] or a certain percentage of parameters in the networks [Sanh et al., 2020; Chen et al., 2020]. Pruning is also applicable to multi-head attention models. For example, Michel et al. [2019] show that removing some of the heads does not significantly decrease the performance of BERT models, but speeds up the inference of these models. Another approach to compressing BERT models is quantization [Shen et al., 2020]. By representing model parameters as low-precision numbers, the models can be greatly compressed. While this method is not specific to BERT models, it proves effective for large Transformer-based architectures.

Third, considering that BERT models are relatively deep and large networks, another thread of research uses dynamic networks to adapt these models for efficient inference. An idea in this paradigm is to dynamically choose the layers for processing a token; for example, in depth-adaptive models we exit at some optimal depth and thus skip the rest of the layers in the layer stack [Xin et al., 2020; Zhou et al., 2020]. Similarly, we can develop length-adaptive models in which the length of the input sequence is dynamically adjusted. For example, we can skip some of the tokens in the input sequence so that the model can reduce the computational load on less important tokens, enhancing overall efficiency.

Fourth, it is also possible to share parameters across layers to reduce the size of BERT models. A simple way to do this is to share the parameters of a whole Transformer layer across the layer stack [Dehghani et al., 2018; Lan et al., 2020]. In addition to reducing the number of parameters, this enables reuse of the same layer in a multi-layer Transformer network, leading to savings in memory footprint at test time.

1.3.4 Multi-lingual Models

The initial BERT model was primarily focused on English. Soon after this model was proposed, it was extended to many languages. One simple way to do this is to develop a separate model for each language. Another approach, which has become more popular in recent work on large language models, is to train multi-lingual models directly on data from all the languages. In response, multi-lingual BERT (mBERT) models were developed by training them on text from 104 languages6. The primary difference from monolingual BERT models is that mBERT models use larger vocabularies to cover tokens from multiple languages. As a result, the representations of tokens from different languages are mapped into the same space, allowing for the sharing of knowledge across languages via this universal representation model.

One important application of multi-lingual pre-trained models is cross-lingual learning. In the cross-lingual setting, we learn a model on tasks in one language, and apply it to the same tasks in another language. In cross-lingual text classification, for example, we fine-tune a multi-lingual pre-trained model on English annotated documents. Then, we use the fine-tuned model to classify Chinese documents.

An improvement to multi-lingual pre-trained models like mBERT is to introduce bilingual data into pre-training. Rather than training solely on monolingual data from multiple languages, bilingual training explicitly models the relationship between tokens in two languages. The resulting model will have innate cross-lingual transfer abilities, and thus can be easily adapted to different languages.

6 https://github.com/google-research/bert/
Lample and Conneau [2019] propose an approach to pre-training cross-lingual language models (XLMs). In their work, a cross-lingual language model can be trained in either the causal language modeling or masked language modeling manner. For masked language modeling pre-training, the model is treated as an encoder. The training objective is the same as BERT: we maximize the probabilities of some randomly selected tokens which are either masked, replaced with random tokens, or kept unchanged in the input. If we consider bilingual data in pre-training, we sample a pair of aligned sentences each time. Then, the two sentences are packed together to form a single sequence used for training. For example, consider an English-Chinese sentence pair

鲸鱼 是 哺乳 动物 。 ↔ Whales are mammals .

We can pack them to obtain a sequence, like this

[CLS] 鲸鱼 是 哺乳 动物 。 [SEP] Whales are mammals . [SEP]

We then select a certain percentage of the tokens and replace them with [MASK].

[CLS] [MASK] 是 [MASK] 动物 。 [SEP] Whales [MASK] [MASK] . [SEP]

The goal of pre-training is to maximize the product of the probabilities of the masked tokens given the above sequence. By performing training in this way, the model can learn to represent both the English and Chinese sequences, as well as to capture the correspondences between tokens in the two languages. For example, predicting the Chinese token 鲸鱼 may require the information from the English token Whales. Aligning the representations of the two languages essentially transforms the model into a “translation” model. So this training objective is also called translation language modeling. Figure 1.7 shows an illustration of this approach.

Fig. 1.7: An illustration of translation language modeling. For ease of understanding, we present a simple example where all the selected tokens are masked. The model is trained to predict these masked tokens. As the sequence contains tokens in two languages, predicting a token in one language allows access to tokens in the other language, thereby enabling cross-lingual modeling. In Lample and Conneau [2019]’s work, an input embedding (i.e., ei) is the sum of the token embedding, positional embedding, and language embedding. This requires that each token is assigned a language label. Thus we can distinguish tokens in different languages. In multi-lingual pre-training, particularly in work using shared vocabularies, specifying the language to which a token belongs is not necessary. The use of language embeddings in turn makes it difficult to handle code-switching. Therefore, we assume here that all token representations are language-independent.

A benefit of multi-lingual pre-trained models is their inherent capability of handling code-switching. In NLP and linguistics, code-switching refers to switching among languages in a text. For example, the following is a mixed-language text containing both Chinese and English:

周末 我们 打算 去 做 hiking , 你 想 一起 来 吗 ?
(We plan to go hiking this weekend, would you like to join us?)

For multi-lingual pre-trained models, we do not need to identify whether a token is Chinese or English. Instead, every token is just an entry of the shared vocabulary. This can be imagined as creating a “new” language that encompasses all the languages we want to process.

The result of multi-lingual pre-training is influenced by several factors. Given that the model architecture is fixed, one needs to specify the size of the shared vocabulary, the number (or percentage) of samples in each language, the size of the model, and so on. Conneau et al. [2020] point out several interesting issues regarding large-scale multi-lingual pre-training for XLM-like models. First, as the number of supported languages increases, a larger model is needed to handle these languages. Second, a larger shared vocabulary is helpful for modeling the increased diversity in languages. Third, low-resource languages more easily benefit from cross-lingual transfer from high-resource languages, particularly when similar high-resource languages are involved in pre-training.
However, interference may occur if the model is trained for an extended period, meaning the overall performance of the pre-trained model starts decreasing at a certain point during pre-training. Thus, in practical systems, one may need to stop the pre-training early to prevent interference.

1.4 Applying BERT Models

Once a BERT model is pre-trained, it can then be used to solve NLP problems. But BERT models are not immediately ready for performing specific downstream tasks. In general, additional fine-tuning work is required to make them adapt. As a first step, we need a predictor to align the output of the model with the problem of interest. Let BERT_θ̂(·) be a BERT model with pre-trained parameters θ̂, and Predict_ω(·) be a prediction network with parameters ω. By integrating the prediction network with the output of the BERT model, we develop a model to tackle the downstream tasks. This model can be expressed as

y = Predict_ω(BERT_θ̂(x))   (1.21)

where x is the input and y is the output that fits the problem. For example, in classification problems, the model outputs a probability distribution over labels. Then, we collect a set of labeled samples D, and fine-tune the model by

(ω̃, θ̃) = arg min_{ω, θ̂⁺} ∑_{(x, ygold) ∈ D} Loss(y_{ω, θ̂⁺}, ygold)   (1.22)

where (x, ygold) represents a tuple of an input and its corresponding output. The notation of this equation seems a bit complicated, but the training/tuning process is standard. We optimize the model by minimizing the loss over the tuning samples. The outcome is the optimized parameters ω̃ and θ̃. The optimization starts with the pre-trained parameters θ̂. Here we use θ̂⁺ to indicate that the parameters are initialized with θ̂, and use y_{ω, θ̂⁺} to denote the model output computed using the parameters ω and θ̂⁺. With the fine-tuned parameters ω̃ and θ̃, we can apply the model Predict_ω̃(BERT_θ̃(·)) to new data of the same tasks for which the model was fine-tuned.
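The fine-tuning setup of Eqs. (1.21) and (1.22) can be sketched with a few lines of PyTorch. The encoder below is only a stand-in for a pre-trained BERT model, and the hyper-parameters and toy batch are assumptions made for illustration.

import torch
import torch.nn as nn

d, num_labels = 768, 2

# Stand-in for a pre-trained BERT encoder: it maps token ids to a sequence of
# d-dimensional vectors. In practice this would be the pre-trained model BERT_θ̂.
encoder = nn.Sequential(nn.Embedding(30522, d), nn.Linear(d, d), nn.Tanh())

# Prediction network Predict_ω: a classifier on top of the [CLS] representation hcls.
predictor = nn.Linear(d, num_labels)

optimizer = torch.optim.AdamW(list(encoder.parameters()) + list(predictor.parameters()), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

# A toy labeled batch (token ids and gold labels); real data would come from D.
x = torch.randint(0, 30522, (4, 16))
y_gold = torch.tensor([0, 1, 1, 0])

h = encoder(x)                  # (batch, length, d)
h_cls = h[:, 0]                 # representation of the first token ([CLS])
logits = predictor(h_cls)       # scores over the label set
loss = loss_fn(logits, y_gold)  # Loss(y, ygold)
loss.backward()
optimizer.step()                # update both ω and the encoder parameters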
The form of the downstream tasks determines the input and output formats of the model, as well as the architecture of the prediction network. In the following we list some tasks to which BERT models are generally suited.

• Classification (Single Text). One of the most widely-used applications of BERT models is text classification. In this task, a BERT model receives a sequence of tokens and encodes it as a sequence of vectors. The first output vector hcls (or h0) is typically used as the representation of the entire text. The prediction network takes hcls as input to produce a distribution of labels. Let [CLS] x1 x2 ... xm be an input text. The architecture of BERT-based text classification can be illustrated as

[CLS] x1 x2 ... xm [SEP] → BERT → hcls h1 ... hm+1 → prediction network (on hcls) → Class

Many NLP problems can be categorized as text classification tasks, and there have been several text classification benchmarks for evaluating pre-trained models. For example, we can classify texts by their grammatical correctness (grammaticality) or emotional tone (sentiment) [Socher et al., 2013; Warstadt et al., 2019]. Note that the prediction network could be any classification model, such as a deep neural network or a more traditional classification model. The entire model can then be trained or fine-tuned in the manner of a standard classification model. For example, the prediction network can be simply a Softmax layer and the model parameters can be optimized by maximizing the probabilities of the correct labels.

• Classification (Pair of Texts). Classification can also be performed on a pair of texts. Suppose we have two texts, x1 ... xm and y1 ... yn. We can concatenate these texts to form a single sequence with a length len. Then, we predict a label for this combined text sequence based on the hcls vector, as follows

[CLS] x1 ... xm [SEP] y1 ... yn [SEP] → BERT → hcls h1 ... hlen → prediction network (on hcls) → Class

where len = n + m + 2. Text pair classification covers several problems, including semantic equivalence judgement (determine whether two texts are semantically equivalent) [Dolan and Brockett, 2005], text entailment judgement (determine whether a hypothesis can be logically inferred or entailed from a premise) [Bentivogli and Giampiccolo, 2011; Williams et al., 2018], grounded commonsense inference (determine whether an event is likely to happen given its context) [Zellers et al., 2018], and question-answering inference (determine whether an answer corresponds to a given question).

• Regression. Instead of generating a label distribution, we can have the prediction network output a real-valued score. For example, by adding a Sigmoid layer to the prediction network, the system can be employed to compute the similarity between two given sentences. The architecture is the same as that of BERT-based classification systems, with only a change in the output layer, which now produces a number (a similarity, an evaluation score, etc.) instead of a class label. For training or fine-tuning, we can minimize the regression loss of the model output as usual.

• Sequence Labeling. Sequence labeling is a machine learning approach applicable to a wide range of NLP problems. This approach assigns a label to each token in an input sequence, and some linguistic annotations can then be derived from this sequence of labels. An example of sequence labeling in NLP is part-of-speech (POS) tagging. We label each word in a sentence with its corresponding POS tag. Another example is named entity recognition (NER) in which we label each word with an NER tag, and named entities are identified using these tags. The model architecture for NER can be illustrated as

[CLS] x1 x2 ... xm [SEP] → BERT → h1 h2 ... hm → prediction network (per position) → tag in {B, I, O}

Here {B, I, O} is the tag set of NER.
For example, B-ORG means the beginning of an organization, I-ORG means the word is inside an organization, and O means the word does not belong to any named entity. This NER model can output a distribution over the tag set at each position, denoted as p_i. The training or fine-tuning of the model can be performed over these distributions {p_1, ..., p_m}. For example, suppose p_i(tag_i) is the probability of the correct tag at position i. The training loss can be defined to be the negative log-likelihood

Loss = −(1/m) ∑_{i=1}^{m} log p_i(tag_i)   (1.23)

Finding the best label sequence given a trained NER model is a well-studied issue in NLP. This is often achieved via dynamic programming, which, in the context of path finding over a lattice, has linear complexity [Huang, 2009].

• Span Prediction. Some NLP tasks require predicting a span in a text. A common example is reading comprehension. In this task, we are given a query x1 ... xm and a context text y1 ... yn. The goal is to identify a continuous span in y1 ... yn that best answers the query. This problem can be framed as a sequence labeling-like task in which we predict a label for each yj to indicate the beginning or ending of the span. Following Seo et al. [2017], we add two networks on top of the BERT output for yj: one for generating the probability of yj being the beginning of the span (denoted by p^beg_j), and one for generating the probability of yj being the ending of the span (denoted by p^end_j). The resulting model architecture can be illustrated as

[CLS] x1 ... xm [SEP] y1 ... yn [SEP] → BERT → outputs for y1 ... yn → two prediction networks → (p^beg_1, p^end_1), ..., (p^beg_n, p^end_n)

where x1 ... xm is the query and y1 ... yn is the context text. We pack the query and context text together to obtain the input sequence. The prediction networks are only applied to outputs for the context text, generating the probabilities p^beg_j and p^end_j at each position. The loss can be computed by summing the log likelihoods of the two models across the entire context text.

Loss = −(1/n) ∑_{j=1}^{n} (log p^beg_j + log p^end_j)   (1.24)

At test time, we search for the best span by

(ĵ1, ĵ2) = arg max_{1 ≤ j1 ≤ j2 ≤ n} (log p^beg_{j1} + log p^end_{j2})   (1.25)

• Encoding for Encoder-decoder Models. While our focus in this section has been primarily on language understanding problems, it is worth noting that BERT models can be applied to a broader range of NLP tasks. In fact, BERT models can be used in all the scenarios where we need to encode a piece of text. One application that we have not mentioned is text generation, which includes a range of tasks such as machine translation, summarization, question answering, and dialogue generation. These tasks can be formulated as sequence-to-sequence problems: we use an encoder to represent the source text, and a decoder to generate the corresponding target text. A straightforward method to apply BERT models is to consider them as encoders. Before fine-tuning, we can initialize the parameters of the encoder with those from a pre-trained BERT model. Then, the encoder-decoder model can be fine-tuned on pairs of texts as usual. The following shows the architecture of a neural machine translation system where a BERT model is applied on the source side.

[CLS] x1 ... xm [SEP] (source text) → BERT (encoder) → adapter → decoder, which reads hsi y1 y2 ... yn−1 and outputs the target text y1 y2 ... yn
Here x1 ... xm denotes the source sequence, y1 ... yn denotes the target sequence, ex1 ... exm denotes the embedding sequence of x1 ... xm, and ey1 ... eyn denotes the embedding sequence of y1 ... yn. The adapter, which is optional, maps the output of the BERT model to a form that is better suited to the decoder.

Fine-tuning BERT models is a complicated engineering problem, influenced by many factors, such as the amount of fine-tuning data, the model size, and the optimizer used in fine-tuning. In general, we wish to fine-tune these models sufficiently so that they can perform well in the downstream tasks. However, fine-tuning BERT models for specific tasks may lead to overfitting, which in turn reduces their ability to generalize to other tasks. For example, suppose we have a BERT model that performs well on a particular task. If we then fine-tune it for new tasks, this may decrease its performance on the original task. This problem is related to the catastrophic forgetting problem in continual training, where a neural network forgets previously learned information when updated on new samples. In practical applications, a common way to alleviate catastrophic forgetting is to add some old data into fine-tuning and train the model with more diverse data. Also, one may use methods specialized to catastrophic forgetting, such as experience replay [Rolnick et al., 2019] and elastic weight consolidation [Kirkpatrick et al., 2017]. The interested reader can refer to some surveys for more detailed discussions of this issue in continual learning [Parisi et al., 2019; Wang et al., 2023a;e].

1.5 Summary

In this chapter we have discussed the general idea of pre-training in NLP. In particular, we have discussed self-supervised pre-training and its application to encoder-only, decoder-only, and encoder-decoder architectures. Moreover, we have presented and compared a variety of pre-training tasks for these architectures. As an example, BERT is used to illustrate how sequence models are pre-trained via masked language modeling and applied to different downstream tasks.

Recent years have shown remarkable progress in NLP, led by the large-scale use of self-supervised pre-training. And sweeping advances are being made across many tasks, not only in NLP but also in computer vision and other areas of AI. One idea behind these advances is that a significant amount of knowledge about the world can be learned by simply training these AI systems on huge amounts of unlabeled data. For example, a language model can learn some general knowledge of a language by repeatedly predicting masked words in large-scale text. As a result, this pre-trained language model can serve as a foundation model, which can be easily adapted to address specific downstream NLP tasks. This paradigm shift in NLP has enabled the development of incredibly powerful systems for language understanding, generation, and reasoning [Manning, 2022]. However, it is important to recognize that we are still in the early stages of creating truly intelligent systems, and there is a long way to go. Nevertheless, large-scale pre-training has opened a door to intelligent systems that researchers have long aspired to develop, though several key research areas remain open for exploration, such as learning intelligence efficiently using reasonably small-sized data and acquiring complex reasoning and planning abilities.

Note that this chapter is mostly introductory and cannot cover all aspects of pre-training.
For example, there are many methods to fine-tune a pre-trained model, offering different ways to better adapt the model to diverse situations. Moreover, large language models, which are considered one of the most significant achievements in AI in recent years, are skipped in this chapter. We leave the discussion of these topics to the following chapters.

Chapter 2: Generative Models

One of the most significant advances in NLP in recent years might be the development of large language models (LLMs). This has helped create systems that can understand and generate natural languages like humans. These systems have even been found to be able to reason, which is considered a very challenging AI problem. With these achievements, NLP made big strides and entered a new era of research in which difficult problems are being solved, such as building conversational systems that can communicate with humans smoothly.

The concept of language modeling or probabilistic language modeling dates back to early experiments conducted by Shannon [1951]. In his work, a language model was designed to estimate the predictability of English — how well the next letter of a text can be predicted when the preceding N letters are known. Although Shannon’s experiments were preliminary, the fundamental goals and methods of language modeling have remained largely unchanged over the decades since then. For quite a long period, particularly before 2010, the dominant approach to language modeling was the n-gram approach [Jurafsky and Martin, 2008]. In n-gram language modeling, we estimate the probability of a word given its preceding n − 1 words, and thus the probability of a sequence can be approximated by the product of a series of n-gram probabilities. These probabilities are typically estimated by collecting smoothed relative counts of n-grams in text. While such an approach is straightforward and simple, it has been extensively used in NLP. For example, the success of modern statistical speech recognition and machine translation systems has largely depended on the utilization of n-gram language models [Jelinek, 1998; Koehn, 2010].

Applying neural networks to language modeling has long been attractive, but a real breakthrough appeared as deep learning techniques advanced. A widely cited study is Bengio et al. [2003]’s work where n-gram probabilities are modeled via a feed-forward network and learned by training the network in an end-to-end fashion. A by-product of this neural language model is the distributed representations of words, known as word embeddings. Rather than representing words as discrete variables, word embeddings map words into low-dimensional real-valued vectors, making it possible to compute the meanings of words and word n-grams in a continuous representation space. As a result, language models are no longer burdened with the curse of dimensionality, but can represent exponentially many n-grams via a compact and dense neural model.

The idea of learning word representations through neural language models inspired subsequent research in representation learning in NLP. However, this approach did not attract significant interest in developing NLP systems in the first few years after its proposal. Starting in about 2012, though, advances were made in learning word embeddings from large-scale text via simple word prediction tasks. Several methods, such as Word2Vec, were proposed to effectively learn such embeddings, which were then successfully applied in a variety of NLP systems [Mikolov et al., 2013a;b].
As a result of these advances, researchers began to think of learning representations of sequences using more powerful language models, such as LSTM-based models [Sutskever et al., 2014; Peters et al., 2018]. And further progress and interest in sequence representation exploded after Transformer was proposed. Alongside the rise of Transformer, the concept of language modeling was generalized to encompass models that learn to predict words in various ways. Many powerful Transformer-based models were pre-trained using these word prediction tasks, and successfully applied to a variety of downstream tasks [Devlin et al., 2019].

Indeed, training language models on large-scale data has led NLP research to exciting times. While language modeling has long been seen as a foundational technique with no direct link to the goals of artificial intelligence that researchers had hoped for, it helps us see the emergence of intelligent systems that can learn a certain degree of general knowledge from repeatedly predicting words in text. Recent research demonstrates that a single, well-trained LLM can handle a large number of tasks and generalize to perform new tasks with a small adaptation effort [Bubeck et al., 2023]. This suggests a step towards more advanced forms of artificial intelligence, and inspires further exploration into developing more powerful language models as foundation models.

In this chapter, we consider the basic concepts of generative LLMs. For simplicity, we use the terms large language models or LLMs to refer to generative models like GPT, though this term can broadly cover other types of models like BERT. We begin by giving a general introduction to LLMs, including the key steps of building such models. We then discuss two scaling issues of LLMs: how LLMs are trained at scale, and how LLMs can be improved to handle very long texts. Finally, we give a summary of these discussions.

2.1 A Brief Introduction to LLMs

In this section we give an introduction to the basic ideas of LLMs as required for the rest of this chapter and the following chapters. We will use the terms word and token interchangeably. Both of them refer to the basic units used in language modeling, though their original meanings are different.

Before presenting details, let us first consider how language models work. The goal of language modeling is to predict the probability of a sequence of tokens occurring. Let {x0, x1, ..., xm} be a sequence of tokens, where x0 is the start symbol hsi (or hSOSi)1. The probability of this sequence can be defined using the chain rule

Pr(x0, ..., xm) = Pr(x0) · Pr(x1|x0) · Pr(x2|x0, x1) · · · Pr(xm|x0, ..., xm−1) = ∏_{i=0}^{m} Pr(xi|x0, ..., xi−1)   (2.1)

or alternatively in a logarithmic form

log Pr(x0, ..., xm) = ∑_{i=0}^{m} log Pr(xi|x0, ..., xi−1)   (2.2)

Here Pr(xi|x0, ..., xi−1) is the probability of the token xi given all its previous tokens {x0, ..., xi−1}2. In the era of deep learning, a typical approach to language modeling is to estimate this probability using a deep neural network. Neural networks trained to accomplish this task receive a sequence of tokens x0, ..., xi−1 and produce a distribution over the vocabulary V (denoted by Pr(·|x0, ..., xi−1)). The probability Pr(xi|x0, ..., xi−1) is the value of the entry of Pr(·|x0, ..., xi−1) corresponding to xi.

1 The start symbol can also be [CLS] following BERT models.
2 We assume that when i = 0, Pr(xi|x0, ..., xi−1) = Pr(x0) = 1. Hence Pr(x0, ..., xm) = Pr(x0) · Pr(x1, ..., xm|x0) = Pr(x1, ..., xm|x0).
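As a small illustration of the chain rule in Eq. (2.2) and of next-token prediction, the sketch below scores a sequence by accumulating log-probabilities and picks the most likely next token. The toy distribution over a tiny vocabulary is purely an assumption made for the example; a real LLM would supply Pr(·|x0, ..., xi−1).

import math

VOCAB = ["a", "b", "c", "d", "</s>"]

def next_token_probs(prefix):
    # Toy stand-in for a language model: returns Pr(.|prefix) over VOCAB.
    # Here we simply prefer the token that follows alphabetically; a real model
    # would run a neural network over the prefix.
    last = prefix[-1]
    probs = {t: 0.1 for t in VOCAB}
    favorite = {"<s>": "a", "a": "b", "b": "c", "c": "d", "d": "</s>"}.get(last, "</s>")
    probs[favorite] = 1.0 - 0.1 * (len(VOCAB) - 1)
    return probs

def sequence_logprob(tokens):
    # log Pr(x1, ..., xm | x0) accumulated with the chain rule.
    total = 0.0
    for i in range(1, len(tokens)):
        total += math.log(next_token_probs(tokens[:i])[tokens[i]])
    return total

def greedy_next(prefix):
    # Pick the token with the maximum probability given the prefix.
    probs = next_token_probs(prefix)
    return max(probs, key=probs.get)

print(sequence_logprob(["<s>", "a", "b", "c"]))
print(greedy_next(["<s>", "a", "b"]))   # -> "c" under the toy model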
When applying a trained language model, a common task is to find the most likely token given its previous context tokens. This token prediction task can be described as

x̂_i = arg max_{x_i ∈ V} Pr(x_i | x_0, ..., x_{i−1})   (2.3)

We can perform word prediction multiple times to generate a continuous text: each time we predict the best token x̂_i, and then add this predicted token to the context for predicting the next token x̂_{i+1}. This results in a left-to-right generation process implementing Eqs. (2.1) and (2.2). To illustrate, consider the generation of the following three words given the prefix ‘hsi a’, as shown in Table 2.1.

Table 2.1: Illustration of generating the three tokens b c d given the prefix hsi a via a language model. In each step, the model picks a token xi from V so that Pr(xi|x0, ..., xi−1) is maximized. This token is then appended to the end of the context sequence. In the next step, we repeat the same process, but based on the new context.

Context    | Predict | Decision Rule                      | Sequence Probability
hsi a      | b       | arg max_{x2∈V} Pr(x2|hsi a)        | Pr(hsi) · Pr(a|hsi) · Pr(b|hsi a)
hsi a b    | c       | arg max_{x3∈V} Pr(x3|hsi a b)      | Pr(hsi) · Pr(a|hsi) · Pr(b|hsi a) · Pr(c|hsi a b)
hsi a b c  | d       | arg max_{x4∈V} Pr(x4|hsi a b c)    | Pr(hsi) · Pr(a|hsi) · Pr(b|hsi a) · Pr(c|hsi a b) · Pr(d|hsi a b c)

Now we discuss how LLMs are constructed, trained, and applied.

2.1.1 Decoder-only Transformers

As is standard practice, the input of a language model is a sequence of tokens (denoted by {x0, ..., xm−1}). For each step, an output token is generated, shifting the sequence one position forward for the next prediction. To do this, the language model outputs a distribution Pr(·|x0, ..., xi−1) at each position i, and the token xi is selected according to this distribution. This model is trained by maximizing the log-likelihood ∑_{i=1}^{m} log Pr(xi|x0, ..., xi−1)3.

3 Note that ∑_{i=1}^{m} log Pr(xi|x0, ..., xi−1) = ∑_{i=0}^{m} log Pr(xi|x0, ..., xi−1) since log Pr(x0) = 0.

Here, we focus on the decoder-only Transformer architecture, as it is one of the most popular model architectures used in LLMs. The input sequence of tokens is represented by a sequence of de-dimensional vectors {e0, ..., em−1}. ei is the sum of the token embedding of xi and the positional embedding of i. The major body of the model is a stack of Transformer blocks (or layers). Each Transformer block has two stacked sub-layers, one for self-attention modeling and one for FFN modeling. These sub-layers can be defined using the post-norm architecture

output = LNorm(F(input) + input)   (2.4)

or the pre-norm architecture

output = LNorm(F(input)) + input   (2.5)

where input and output denote the input and output, both being an m × d matrix. The i-th rows of input and output can be seen as contextual representations of the i-th token in the sequence. F(·) is the core function of a sub-layer. For FFN sub-layers, F(·) is a multi-layer FFN. For self-attention sub-layers, F(·) is a multi-head self-attention function.

In general, self-attention is expressed in a form of QKV attention

Att_qkv(Q, K, V) = Softmax(QK^T / √d + Mask) V   (2.6)

where Q, K and V ∈ R^{m×d} are the queries, keys, and values, respectively. It is important to note that only previous tokens are considered when predicting a token. So a masking variable Mask ∈ R^{m×m} is incorporated into self-attention to achieve this. The entry (i, k) of Mask has a value of 0 if k ≤ i, and a value of −∞ otherwise. Given a representation H ∈ R^{m×d}, the multi-head self-attention function can be defined as

F(H) = Merge(head_1, ..., head_τ) W_head   (2.7)

where Merge(·) represents a concatenation of its inputs, and W_head ∈ R^{d×d} represents a parameter matrix. head_j is the output of QKV attention on a sub-space of the representation

head_j = Att_qkv(Q^{[j]}, K^{[j]}, V^{[j]})   (2.8)

Q^{[j]}, K^{[j]}, and V^{[j]} are the queries, keys, and values projected onto the j-th sub-space via linear transformations

Q^{[j]} = H W_q^j   (2.9)
K^{[j]} = H W_k^j   (2.10)
V^{[j]} = H W_v^j   (2.11)

where W_q^j, W_k^j, and W_v^j ∈ R^{d×(d/τ)} are the parameter matrices of the transformations.
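A small numerical sketch of Eq. (2.6) is given below, using NumPy. It implements single-head QKV attention with the causal mask described above (0 where a key position does not lie after the query position, −∞ otherwise); the sizes and random projection matrices are toy values assumed for the example.

import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 8                     # sequence length and hidden size (toy values)

H = rng.normal(size=(m, d))     # token representations, one row per position
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = H @ W_q, H @ W_k, H @ W_v

# Causal mask: entry (i, k) is 0 if k <= i and -inf otherwise,
# so position i can only attend to itself and earlier positions.
mask = np.where(np.tril(np.ones((m, m))) == 1, 0.0, -np.inf)

scores = Q @ K.T / np.sqrt(d) + mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise Softmax

out = weights @ V               # Att_qkv(Q, K, V), shape (m, d)
print(np.triu(weights, k=1).sum())   # ~0: no attention to future positions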
Suppose we have L Transformer blocks. A Softmax layer is built on top of the output of the last block. The Softmax layer outputs a sequence of m distributions over the vocabulary, like this

[ Pr(·|x0)
  Pr(·|x0, x1)
  ...
  Pr(·|x0, ..., xm−1) ] = Softmax(H^L W_o)   (2.12)

where H^L is the output of the last Transformer block, and W_o ∈ R^{d×|V|} is the parameter matrix. Figure 2.1 shows the Transformer architecture for language modeling.

Fig. 2.1: The Transformer-decoder architecture for language modeling. The central components are L stacked Transformer blocks, each comprising a self-attention sub-layer and an FFN sub-layer. To prevent the model from accessing the right-context, a masking variable is incorporated into self-attention. The output layer uses a Softmax function to generate a probability distribution for the next token, given the sequence of previous tokens. During inference, the model takes the previously predicted token to predict the next one, repeating this process until the end of the sequence is reached. {z0, ..., zm−1} denote the inputs of a Transformer block, and {h^L_0, ..., h^L_{m−1}} denote the outputs of the last Transformer block.

Applying this language model follows an autoregressive process. Each time the language model takes a token xi−1 as input and predicts a token xi that maximizes the probability Pr(xi|x0, ..., xi−1). It is important to note that, despite different implementation details, many LLMs share the same architecture described above. These models are called large because both their depth and width are significant. Table 2.2 shows the model sizes for a few LLMs, as well as their model setups.

2.1.2 Training LLMs

Now suppose that we are given a training set D comprising K sequences. The log-likelihood of each sequence x = x0 ... xm in D can be calculated using a language model

L_θ(x) = ∑_{i=1}^{m} log Pr_θ(xi|x0, ..., xi−1)   (2.13)

Here the subscript θ affixed to L(·) and Pr(·) denotes the parameters of the language model.
Then, the objective of maximum likelihood training is defined as

θ̂ = arg max_θ ∑_{x ∈ D} L_θ(x)   (2.14)

Training Transformer-based language models with the above objective is commonly viewed as a standard optimization process for neural networks. This can be achieved using gradient descent algorithms, which are widely supported by off-the-shelf deep learning toolkits. Somewhat surprisingly, better results were continuously yielded as language models evolved into more computationally intensive models and were trained on larger datasets [Kaplan et al., 2020]. These successes have led NLP researchers to continue increasing both the training data and model size in order to build more powerful language models.

Table 2.2: Comparison of some LLMs in terms of model size, model depth, model width, and number of heads (a/b means a heads for queries and b heads for both keys and values).

LLM                                | # of Parameters | Depth L | Width d | # of Heads (Q/KV)
GPT-1 [Radford et al., 2018]       | 0.117B          | 12      | 768     | 12/12
GPT-2 [Radford et al., 2019]       | 1.5B            | 48      | 1,600   | 25/25
GPT-3 [Brown et al., 2020]         | 175B            | 96      | 12,288  | 96/96
LLaMA2 [Touvron et al., 2023b]     | 7B              | 32      | 4,096   | 32/32
                                   | 13B             | 40      | 5,120   | 40/40
                                   | 70B             | 80      | 8,192   | 64/64
LLaMA3/3.1 [Dubey et al., 2024]    | 8B              | 32      | 4,096   | 32/8
                                   | 70B             | 80      | 8,192   | 64/8
                                   | 405B            | 126     | 16,384  | 128/8
Gemma2 [Team et al., 2024]         | 2B              | 26      | 2,304   | 8/4
                                   | 9B              | 42      | 3,584   | 16/8
                                   | 27B             | 46      | 4,608   | 32/16
Qwen2.5 [Yang et al., 2024]        | 0.5B            | 24      | 896     | 14/2
                                   | 7B              | 28      | 3,584   | 28/4
                                   | 72B             | 80      | 8,192   | 64/8
DeepSeek-V3 [Liu et al., 2024a]    | 671B            | 61      | 7,168   | 128/128
Falcon [Penedo et al., 2023]       | 7B              | 32      | 4,544   | 71/71
                                   | 40B             | 60      | 8,192   | 128/128
                                   | 180B            | 80      | 14,848  | 232/232
Mistral [Jiang et al., 2023a]      | 7B              | 32      | 4,096   | 32/32

However, as language models become larger, we confront new training challenges, which significantly change the problem compared to training relatively small models. One of these challenges arises from the need for large-scale distributed systems to manage the data, model parameters, training routines, and so on. Developing and maintaining such systems requires a significant amount of work in both software and hardware engineering, as well as expertise in deep learning. A related issue is that when the training is scaled up, we need more computing resources to ensure the training process can be completed in an acceptable time. For example, it generally requires hundreds or thousands of GPUs to train an LLM with tens of billions of parameters from scratch. This requirement drastically increases the cost of training such models, especially considering that many training runs are needed as these models are developed. Also, from the perspective of deep learning, the training process can become unstable if the neural networks are very deep and/or the model size is very large. In response, we typically need to modify the model architecture to adapt LLMs to large-scale training. In Section 2.2 we will present more discussions on these issues.
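The maximum likelihood objective in Eqs. (2.13) and (2.14) corresponds to the usual next-token cross-entropy training loop. Below is a compact PyTorch sketch; the tiny stand-in model, batch, and optimizer settings are assumptions made for illustration, not a recipe for training a real LLM.

import torch
import torch.nn as nn

vocab_size, d = 1000, 64
# A toy stand-in for a decoder-only LM: embeddings followed by a linear output layer.
# A real LLM would place a stack of Transformer blocks in between.
model = nn.Sequential(nn.Embedding(vocab_size, d), nn.Linear(d, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A batch of token sequences x_0 ... x_m; inputs are positions 0..m-1, targets are 1..m.
batch = torch.randint(0, vocab_size, (8, 33))
inputs, targets = batch[:, :-1], batch[:, 1:]

logits = model(inputs)                                   # (batch, m, vocab)
# Negative log-likelihood of the next tokens, i.e. -L_theta(x) averaged over the batch.
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()                                         # one gradient-descent update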
This is a standard text generation task where we continue or complete the text starting from a given context. More formally, let x = x_0...x_m denote the token sequence of the context given by users, and y = y_1...y_n denote the token sequence following the context. Then, the inference of LLMs can be defined as a problem of finding the most likely sequence y based on x:

ŷ = argmax_y log Pr(y|x)
  = argmax_y Σ_{i=1}^{n} log Pr(y_i | x_0, ..., x_m, y_1, ..., y_{i−1})    (2.15)

Here Σ_{i=1}^{n} log Pr(y_i | x_0, ..., x_m, y_1, ..., y_{i−1}) essentially expresses the same thing as the right-hand side of Eq. (2.2); it models the log probability of predicting tokens from position m + 1, rather than position 0. Throughout this chapter and subsequent ones, we will employ separate variables x and y to distinguish the input and output of an LLM, though they can be seen as sub-sequences of the same sequence. By adopting such notation, we see that the form of the above equation closely resembles those used in other text generation models in NLP, such as neural machine translation models.

To illustrate how LLMs are applied, consider the problem of determining the grammaticality of a given sentence. We can define a template like this

{*sentence*}
Question: Is this sentence grammatically correct?
Answer: ___

Here the blank following “Answer:” represents the text we intend to generate. {*sentence*} is a placeholder variable that will be replaced by the actual sentence provided by the users. For example, suppose we have the sentence “John seems happy today.”. We can replace the {*sentence*} in the template with this sentence to obtain an input to the language model

John seems happy today.
Question: Is this sentence grammatically correct?
Answer:

To perform the task, the language model is given the context x = “John seems happy today .\n Question : Is this sentence grammatically correct ?\n Answer :”4. It then generates the following text as the answer, based on the context. For example, the language model may output “Yes” (i.e., y = “Yes”) if this text is the one with the maximum probability of prediction given this context.

4 \n is a special character used for line breaks.

Likewise, we can define more templates to address other tasks. For example, we can translate an English sentence into Chinese using the following template

{*sentence*}
Question: What is the Chinese translation of this English sentence?
Answer:

or using an instruction-like template

{*sentence*}
Translate this sentence from English into Chinese.

or using a code-like template

[src-lang] = English
[tgt-lang] = Chinese
[input] = {*sentence*}
[output] =

The above templates provide a simple but effective method to “prompt” a single LLM to perform various tasks without adapting the structure of the model. However, this approach requires that the LLM can recognize and follow the instructions or questions. One way to achieve this is to incorporate training samples with instructions and their corresponding responses into the pre-training dataset. While this method is straightforward, building and training LLMs from scratch is computationally expensive. Moreover, making instruction-following data effective for pre-training requires a significant amount of such data, but collecting large-scale labeled data for all tasks of interest is very difficult. A second method, which has become a de facto standard in recent research, is to adapt LLMs via fine-tuning. In this way, the token prediction ability learned in the pre-training phase can be generalized to accomplish new tasks.
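Before turning to fine-tuning, note that the templates above are simply text with a placeholder. As a minimal illustration (a hypothetical Python helper, not part of any particular LLM toolkit), filling such a template might look like this.

    def fill_template(template, **slots):
        # Replace each {*name*} placeholder with the user-provided text.
        for name, value in slots.items():
            template = template.replace("{*" + name + "*}", value)
        return template

    grammar_template = ("{*sentence*}\n"
                        "Question: Is this sentence grammatically correct?\n"
                        "Answer:")
    print(fill_template(grammar_template, sentence="John seems happy today."))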
The idea behind fine-tuning is that some general knowledge of language has been acquired in pre-training, but we need a mechanism to activate this knowledge for applying it to new tasks. To achieve this, we can slightly fine-tune the model parameters using instruction-following data. This approach is called instruction fine-tuning. An instruction fine-tuning sample, which is represented by a sequence of tokens, can be seen as a tuple consisting of an input and the desired output. Here, the input includes instructions, system information (or system prefix), and any other user-provided information5 . To illustrate, consider the following examples (blue text = input and underlined text = output). 5 System information refers to a sequence of tokens added at the beginning of an input in order to guide the behavior of an LLM, such as, you are a helpful assistant and should not output toxic content. Generative Models 44 Read the following sentence, and answer whether it is grammatically correct. LLMs are powerful models but are expensive to build. Yes LLMs are powerful models but are expensive to build. Does this sentence make sense grammatically? Answer Option - Yes - No Yes Yes or No? Yes = the following sentence is grammatically correct. No = it contains grammatical errors. LLMs are powerful models but are expensive to build. Yes All these samples describe the same binary classification task, but with different instructions. To increase the diversity in the fine-tuning data and make LLMs generalize, we can define more tasks using instructions. Write a short story about three characters. Character 1: A boy whose name is Andy. Character 2: Andy’s dog. Character 3: Andy’s friend, Jane. In the heart of a bustling city, Andy, his loyal dog Rocket, and his friend Jane discovered a hidden garden beneath the glow of twilight. Surrounded by ancient trees and whispering winds, they shared dreams and secrets, forging an unbreakable bond. As the night deepened, they made a pact under the stars to always find magic in the mundane. Together, they realized that true adventure lay in the moments spent with friends, and every day was a treasure waiting to be discovered. 2.1 A Brief Introduction to LLMs 45 Show me a recipe for making ice cream. Ingredients: 2 cups heavy cream; 1 cup milk; 3/4 cup sugar; 2 tsp vanilla extract; Pinch of salt. Steps: 1. Whisk all ingredients until sugar dissolves. 2. Chill mixture for 2 hours in the fridge. 3. Freeze in a shallow dish, stirring every 30 minutes, 3-4 times. 4. Freeze solid for 2 hours. 5. Serve your homemade ice cream. If you buy 5 apples and each apple costs $1.20, how much do you spend in total? $6.00 Write a Python program to calculate the sum of squares of the following numbers. 1 , 2 , 10 , -9 , 78 numbers = [1,2,10,-9 ,78] sum_of_squares = sum(x**2 for x in numbers) print(sum_of_squares) To acquire instruction-following abilities, a certain amount of fine-tuning data is required. This data may include diverse instructions and possible responses. It has been found that scaling the number of fine-tuning tasks is beneficial for improving the performance of LLMs [Chung et al., 2022]. Note that although more fine-tuning data is favorable, the amount of this data is generally orders of magnitude smaller than that of the pre-training data. 
For example, LLMs can be fine-tuned with tens or hundreds of thousands of samples, or even fewer if these samples are of high quality [Zhou et al., 2023a; Chen et al., 2023b], whereas pre-training such models may require billions or trillions of tokens, resulting in significantly larger computational demands and longer training times [Touvron et al., 2023a]. It is also worth noting that we should not expect the fine-tuning data to cover all the downstream tasks to which we intend to apply LLMs. A common understanding of how the pre-training + fine-tuning approach works is that LLMs have gained the knowledge for understanding instructions and generating responses in the pre-training phase. However, these abilities are not fully activated until we introduce some form of supervision. The general instruction-following behavior emerges as we fine-tune the models with a relatively small amount of labeled data. As a result, we can achieve some level of zero-shot learning: the fine-tuned models can handle new tasks that they have not been explicitly trained or fine-tuned for [Sanh et al., 2022; Wei et al., 2022a]. This zero-shot learning ability distinguishes generative LLMs from earlier pre-trained models like BERT, which are primarily fine-tuned for specific tasks.

Once we have prepared a collection of instruction-described data, the fine-tuning process is relatively simple. This process can be viewed as the same standard training process as in pre-training, but on a much smaller training dataset. Let D_tune be the fine-tuning dataset and θ̂ be the model parameters optimized via pre-training. We can modify Eq. (2.14) to obtain the objective of fine-tuning

θ̃ = argmax_{θ̂+} Σ_{sample ∈ D_tune} L_{θ̂+}(sample)    (2.16)

Here θ̃ denotes the optimal parameters. The use of the notation θ̂+ means that the fine-tuning starts with the pre-trained parameters θ̂. For each sample ∈ D_tune, we divide it into an input segment x_sample and an output segment y_sample, that is,

sample = [x_sample, y_sample]    (2.17)

We then define the loss function to be

L_{θ̂+}(sample) = − log Pr_{θ̂+}(y_sample | x_sample)    (2.18)

In other words, we compute the loss over the sub-sequence y_sample, rather than the entire sequence. In a practical implementation of back-propagation for this equation, the sequence [x_sample, y_sample] is constructed in the forward pass as usual. However, in the backward pass, error gradients are propagated back only through the parts of the network that correspond to y_sample, leaving the rest of the network unchanged. As an example, consider the sequence

⟨s⟩ Square this number . 2 .   The result is 4 .

where “⟨s⟩ Square this number . 2 .” is the context (input) and “The result is 4 .” is the prediction (output). The loss is calculated and back-propagated only for “The result is 4 .”.

Instruction fine-tuning also requires substantial engineering work. In order to achieve satisfactory results, one may experiment with different settings of the learning rate, batch size, number of fine-tuning steps, and so on. This typically requires many fine-tuning runs and evaluations. The cost and experimental effort of fine-tuning remain critical and should not be overlooked, though they are much lower than those of the pre-training phase. While we focus on instruction fine-tuning as an illustrative example here, fine-tuning techniques play an important role in developing various LLMs and are more widely used. Examples include fine-tuning LLMs as chatbots using dialog data, and adapting these models to handle very long sequences.
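As a rough illustration of the loss masking described in Eq. (2.18), the following Python sketch (with toy values; real implementations typically operate on logits inside a deep learning framework) computes the negative log-likelihood only over positions whose target tokens belong to y_sample.

    import numpy as np

    def sft_loss(log_probs, labels, loss_mask):
        # Eq. (2.18) in practice: the negative log-likelihood is accumulated only
        # over positions whose target token belongs to the output segment y_sample.
        # log_probs: (m, |V|) log-probabilities predicted at each position;
        # labels:    (m,) the target (next) token at each position;
        # loss_mask: (m,) 1 if the target token is part of y_sample, else 0.
        picked = log_probs[np.arange(len(labels)), labels]
        return -(picked * loss_mask).sum()

    # Toy example: 6 positions; only the last 3 targets belong to the response.
    rng = np.random.default_rng(0)
    logits = rng.standard_normal((6, 10))
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    labels = np.array([3, 1, 4, 1, 5, 9])
    loss_mask = np.array([0, 0, 0, 1, 1, 1])
    print(sft_loss(log_probs, labels, loss_mask))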
The wide application of fine-tuning has led researchers to improve these techniques, such as designing more efficient fine-tuning algorithms. While the research on fine-tuning is fruitful, in this section we just give a flavour of the key steps involved. We will see more detailed discussions on this topic in the following chapters. 2.1.4 Aligning LLMs with the World Instruction fine-tuning provides a simple way to adapt LLMs to tasks that can be well defined. This problem can broadly be categorized as an alignment problem. Here, alignment is referred to as a process of guiding LLMs to behave in ways that align with human intentions. The guidance can come from labeled data, human feedback, or any other form of human preferences. For example, 2.1 A Brief Introduction to LLMs 47 we want LLMs not only to be accurate in following instructions, but also to be unbiased, truthful, and harmless. So we need to supervise the models towards human values and expectations. A common example is that when we ask an LLM how to build a weapon, it may provide a list of key steps to do so if it is not carefully aligned. However, a responsible model should recognize and avoid responding to requests for harmful or illegal information. Alignment in this case is crucial for ensuring that LLMs act responsibly and in accordance with ethical guidelines. A related concept to alignment is AI safety. One ultimate goal of AI is to build intelligent systems that are safe and socially beneficial. To achieve this goal we should keep these systems robust, secure, and subjective, in any conditions of real-world use, even in conditions of misuse or adverse use. For LLMs, the safety can be increased by aligning them with appropriate human guidance, such as human labeled data and interactions with users during application. Alignment is difficult as human values and expectations are diverse and shifting. Sometimes, it is hard to describe precisely what humans want, unless we see the response of LLMs to user requests. This makes alignment no longer a problem of tuning LLMs on predefined tasks, but a bigger problem of training them with the interactions with the real world. As a result of the concerns with controlling AI systems, there has been a surge in research on the alignment issue for LLMs. Typically, two alignment steps are adopted after LLMs are pre-trained on large-scale unlabeled data. • Supervised Fine-tuning (SFT). This involves continuing the training of pre-trained LLMs on new, task-oriented, labelled data. A commonly used SFT technique is instruction finetuning. As described in the previous subsection, by learning from instruction-response annotated data, LLMs can align with the intended behaviors for following instructions, thereby becoming capable of performing various instruction-described tasks. Supervised fine-tuning can be seen as following the pre-training + fine-tuning paradigm, and offers a relatively straightforward method to adapt LLMs. • Learning from Human Feedback. After an LLM finishes pre-training and supervised finetuning, it can be used to respond to user requests if appropriately prompted. But this model may generate content that is unfactual, biased, or harmful. To make the LLM more aligned with the users, one simple approach is to directly learn from human feedback. For example, given some instructions and inputs provided by the users, experts are asked to evaluate how well the model responds in accordance with their preferences and interests. 
This feedback is then used to further train the LLM for better alignment. A typical method for learning from human feedback is to consider it as a reinforcement learning (RL) problem, known as reinforcement learning from human feedback (RLHF) [Ouyang et al., 2022]. The RLHF method was initially proposed to address general sequential decision-making problems [Christiano et al., 2017], and was later successfully employed in the development of the GPT series models [Stiennon et al., 2020]. As a reinforcement learning approach, the goal of RLHF is to learn a policy by maximizing some reward from the environment. Specifically, two components are built in RLHF:

• Agent. An agent, also called an LM agent, is the LLM that we want to train. This agent operates by interacting with its environment: it receives a text from the environment and outputs another text that is sent back to the environment. The policy of the agent is the function defined by the LLM, that is, Pr(y|x).

• Reward Model. A reward model is a proxy of the environment. Each time the agent produces an output sequence, the reward model assigns this output sequence a numerical score (i.e., the reward). This score tells the agent how good the output sequence is.

In RLHF, we need to perform two learning tasks: 1) reward model learning, which involves training a reward model using human feedback on the output of the agent, and 2) policy learning, which involves optimizing a policy guided by the reward model using reinforcement learning algorithms. Here is a brief outline of the key steps involved in RLHF.

• Build an initial policy using pre-training and instruction fine-tuning.
• Use the policy to generate multiple outputs for each input, and then collect human feedback on these outputs (e.g., comparisons of the outputs).
• Learn a reward model from the human feedback.
• Fine-tune the policy with the supervision from the reward model.

Figure 2.2 shows an overview of RLHF. Given that this section serves only as a brief introduction to the concepts of LLMs, a detailed discussion of RLHF techniques will not be included. We instead illustrate the basic ideas behind RLHF using a simple example. Suppose we have trained an LLM via pre-training and instruction fine-tuning. This LLM is deployed to respond to requests from users. For example, a user may input

How can I live a more environmentally friendly life?

We use the LLM to generate 4 different outputs (denoted by {y_1, ..., y_4}) by sampling the output space

Output 1 (y_1): Consider switching to an electric vehicle or bicycle instead of traditional cars to reduce carbon emissions and protect our planet.

Output 2 (y_2): Adopt a minimalist lifestyle. Own fewer possessions to reduce consumption and the environmental impact of manufacturing and disposal.

Output 3 (y_3): Go off-grid. Generate your own renewable energy and collect rainwater to become completely self-sufficient and reduce reliance on non-renewable resources.

Output 4 (y_4): Support local farm products to reduce the carbon footprint of transporting food, while enjoying fresh, healthy food.
Fig. 2.2: An overview of RLHF. There are 4 key steps involved: a) training an initial LLM (i.e., policy) using pre-training and supervised fine-tuning; b) collecting human preference data by ranking the outputs of the LLM; c) training a reward model using the ranking results; d) RL fine-tuning of the policy based on the reward model. Double line arrows mean training or fine-tuning.

We then ask annotators to evaluate these outputs. One straightforward way is to assign a rating score to each output. In this case, the reward model learning problem can be framed as a task of training a regression model. But giving numerical scores to LLM outputs is not an easy task for annotators. It is usually difficult to design an annotation standard that all annotators can agree on and easily follow. An alternative method, which is more popular in the development of LLMs, is to rank these outputs. For example, a possible ranking of the above outputs is

y_1 ≻ y_4 ≻ y_2 ≻ y_3

A reward model is then trained using this ranking result. In general, a reward model in RLHF is a language model that shares the same architecture as the target LLM, but with a smaller model size. Given the input x and output y_k, we concatenate them to form a sequence seq_k = [x, y_k]. This sequence is processed from left to right using forced decoding. Since each position can only access its left context in language modeling, the output of the top-most Transformer layer at the first position cannot be used as the representation of the sequence. Instead, a special symbol (e.g., ⟨\s⟩) is added to the end of the sequence, and the corresponding output of the Transformer layer stack is considered as the representation of the entire sequence. An output layer, such as a linear transformation layer, is built on top of this representation to generate the reward, denoted by R(seq_k) or R(x, y_k).

We train this reward model using a ranking loss. For example, a pair-wise ranking loss function can be written in the form

Loss_ω(D_r) = −E_{(x, y_{k1}, y_{k2}) ∼ D_r} log(Sigmoid(R_ω(x, y_{k1}) − R_ω(x, y_{k2})))    (2.19)

where ω represents the parameters of the reward model, and D_r represents a set of tuples of an input and a pair of outputs. (x, y_{k1}, y_{k2}) ∼ D_r is a sampling operation which draws a sample (x, y_{k1}, y_{k2}) from D_r with some probability. As an example, suppose we first draw a model input x with a uniform distribution and then draw a pair of model outputs with the probability that y_{k1} ≻ y_{k2} given x (denoted by Pr(y_{k1} ≻ y_{k2}|x)). The corresponding loss function is given by

Loss_ω(D_r) = − Σ_{(x, y_{k1}, y_{k2})} Pr(x) · Pr(y_{k1} ≻ y_{k2}|x) · log(Sigmoid(R_ω(x, y_{k1}) − R_ω(x, y_{k2})))
            = − (1/K) Σ_{(x, y_{k1}, y_{k2})} Pr(y_{k1} ≻ y_{k2}|x) · log(Sigmoid(R_ω(x, y_{k1}) − R_ω(x, y_{k2})))    (2.20)

where K represents the number of model inputs involved in sampling. While the form of these functions may seem complex, their idea is simple: we penalize the model if the predicted ranking of two outputs differs from the human-labeled ranking. By contrast, the model receives a bonus if the predicted ranking matches the human-labeled ranking.
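The pair-wise ranking loss of Eq. (2.19) is straightforward to compute once the reward model has scored both outputs of each pair. A minimal Python sketch (with made-up reward scores, not from any trained model) is:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def pairwise_ranking_loss(r_preferred, r_rejected):
        # Eq. (2.19): -log(Sigmoid(R(x, y_win) - R(x, y_lose))), averaged over pairs.
        # r_preferred / r_rejected hold the reward-model scores of the preferred
        # and rejected outputs for the same inputs.
        return -np.mean(np.log(sigmoid(r_preferred - r_rejected)))

    # The loss is small when preferred outputs already receive higher rewards.
    print(pairwise_ranking_loss(np.array([2.0, 1.5]), np.array([0.5, 1.0])))
    print(pairwise_ranking_loss(np.array([0.2, 0.1]), np.array([1.2, 0.9])))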
We can train the reward model by minimizing the above ranking loss

ω̂ = argmin_ω Loss_ω(D_r)    (2.21)

The resulting model R_ω̂(·) can be employed to evaluate any given pair of input and output. Note that although the reward model is trained using a ranking-based objective, it is used for scoring. This allows it to provide continuous supervision signals, which is very beneficial for training other models.

We now turn to the policy learning problem. A commonly adopted objective is to maximize the reward on a set of input-output pairs. Following an analogous form of Eq. (2.16), we obtain a simple training objective for RL fine-tuning

θ̃ = argmax_{θ̂+} E_{(x, y_{θ̂+}) ∼ D_rlft} R_ω̂(x, y_{θ̂+})    (2.22)

where the optimal parameters θ̃ are obtained by fine-tuning the pre-trained parameters θ̂. D_rlft is the RL fine-tuning dataset. For each sample (x, y_{θ̂+}), x is sampled from a prepared dataset of input sequences, and y_{θ̂+} is sampled from the distribution Pr_{θ̂+}(y|x) given by the policy. In practice, more advanced reinforcement learning algorithms, such as proximal policy optimization (PPO), are often used for achieving more stable training, as well as better performance. We leave the detailed discussion of reinforcement learning algorithms to the following parts of this book where RLHF is extensively used for alignment.

An interesting question arises here: why not consider learning from human preferences as a standard supervised learning problem? This question is closely related to our aforementioned discussion on the difficulty of data annotation. Often, describing human values and goals is challenging, and it is even more difficult for humans to provide outputs that are well aligned. As an alternative, annotating preferences over a given list of model outputs is a simpler task. By doing so, we can create a model that understands human preferences, which can then be used as a reward model for training policies. From the perspective of machine learning, RLHF is particularly useful for scenarios where the desired behavior of an agent is difficult to demonstrate but can be easily recognized by humans. Another advantage of RLHF is its ability to explore the sample space. By employing sampling techniques, models trained with reinforcement learning can venture beyond the annotated dataset to explore additional samples. This exploratory ability allows RLHF to discover potentially beneficial policies that are not immediately apparent from the labeled data alone.

2.1.5 Prompting LLMs

We have so far shown that LLMs can be used to perform various tasks by giving them appropriate prompts. There are no restrictions on these prompts, which can include any information we wish to ask or communicate with LLMs, such as natural language instructions and the context of conversations. Since this approach requires no additional training or tuning, adapting LLMs becomes highly efficient once they are developed. This somewhat influences the paradigms in NLP: we no longer need to develop specific systems for individual tasks but can instead use a single, well-trained LLM to perform different tasks by prompting it. An appealing aspect of LLM prompting arises as a result: users can easily have “customized” systems by designing their own prompts for LLMs. Given the important role played by prompting in LLMs, prompt engineering has become a very active area of research in NLP.

The term prompt is used in many different ways in the literature.
In this chapter, this term refers to the entire input to LLMs, and so we use the terms prompt and model input interchangeably. Before discussing prompting further, let us first see a few examples where the prompts are more complex than those presented in the previous subsections. Note that this subsection is not aimed at writing high-quality prompts but rather at highlighting some interesting issues in prompting LLMs.

One popular way to use LLMs is to assign them a “role” to play in generating responses. For example, an LLM can act as a psychologist when answering questions.

Please explain what delayed gratification is? Note, you are a researcher with a deep background in psychology. Your own field of study is developmental psychology in children, with a focus on aspects of control and intelligence. You are very familiar with the concept of delayed gratification and have applied it to your own research. You have published papers on delayed gratification and conducted related experiments.

Delayed gratification is the process of resisting an immediate reward in anticipation of receiving a more valuable reward in the future. It is a significant concept in developmental psychology, especially in relation to how children learn self-control and develop cognitive abilities. Research shows that the ability to delay gratification is associated with positive outcomes in academic performance, social competence, and emotional well-being. Studies and experiments in this area highlight the importance of cultivating this skill early in life to support long-term success and resilience.

Another example is the use of LLMs in detecting and correcting errors such as syntactic or semantic mistakes in text. For an LLM which is trained on both code and natural language data, we may use it for code debugging6.

Fix the bugs in this C language program.

#include <stdio.h>
int main() {
    printg("Hello, World!")
    return 0;
}

1. The function name ’printg’ is incorrect. It should be ’printf’.
2. There is a missing semicolon at the end of the ’printf’ function call.
Here is the corrected version of the program:

#include <stdio.h>
int main() {
    printf("Hello, World!");
    return 0;
}

6 In this example, the code is not tokenized for easier reading.

Also, LLMs can be fine-tuned to act as a conversation partner7.

7 To fine-tune an LLM for multi-turn dialogue, one needs to consider the conversation history in the context for predicting the response in the current round of conversation. This makes the actual prompt used in response generation relatively longer than that used in single-turn dialogue.

Hi! I’m an assistant. How can I help you?
Who won the FIFA World Cup 2022?
Argentina won the FIFA World Cup 2022.
Where was it held?
The 2022 FIFA World Cup was held in Qatar.
How many times has Argentina won the World Cup?
Argentina has won the FIFA World Cup three times.
...

These examples and previous ones have shown that appropriate responses can be generated via prompts involving clear instructions and questions. However, when problem solving requires knowledge that is not explicitly specified, LLMs may make mistakes, even though the instructions are sufficiently clear and precise. A family of challenging tasks for LLMs involves arithmetic reasoning and commonsense reasoning. For example, we can ask an LLM to solve primary school math problems presented in natural language.

Jack has 7 apples. He ate 2 of them for dinner, but then his mom gave him 5 more apples.
The next day, Jack gave 3 apples to his friend John. How many apples does Jack have left in the end? The answer is 10. The correct answer should be 7, so the model output is incorrect. One approach to addressing such issues is to incorporate learning into prompts, called incontext learning or (ICL). The idea of ICL is to demonstrate the ways to solve problems in prompts, and condition predictions on these demonstrations. Here is an example where a similar problem and the corresponding answer are presented in the prompt (green = demonstrations). Tom has 12 marbles. He wins 7 more marbles in a game with his friend but then loses 5 marbles the next day. His brother gives him another 3 marbles as a gift. How many marbles does Tom have now? The answer is 17. Jack has 7 apples. He ate 2 of them for dinner, but then his mom gave him 5 more apples. The next day, Jack gave 3 apples to his friend John. How many apples does Jack have left in the end? The answer is 12. But the LLM still made mistakes this time. A reason for this might be that solving math problems does not only involve problem-answer mappings but also, to a larger extent, the underlying logical inference in multiple steps. A method to improve the inference abilities of LLMs is chain-of-thought prompting (COT prompting) [Wei et al., 2022c]. In COT prompting, we decompose complex reasoning problems into multiple problem-solving intermediate steps. These Generative Models 54 steps are demonstrated in prompts so that LLMs can be prompted to learn to mimic the underlying thought processes in solving the problems. COT prompting has been found to be very useful for achieving good results on challenging tasks, such as the GSM8K mathematical reasoning benchmark. Consider the above demonstration example Tom has 12 marbles. He wins 7 more marbles in a game with his friend but then loses 5 marbles the next day. His brother gives him another 3 marbles as a gift. How many marbles does Tom have now ?. The corresponding logical reasoning steps can be structured: 1. Tom starts with 12 marbles. 2. He wins 7 more marbles in a game, so we add those to his initial count: 12 marbles + 7 marbles = 19 marbles. 3. The next day, Tom loses 5 marbles, so we subtract those from his current total: 19 marbles - 5 marbles = 14 marbles. 4. His brother gives him 3 more marbles, so we add those to his current total: 14 marbles + 3 marbles = 17 marbles. 5. After all these events, Tom has 17 marbles. We can add these reasoning steps into the prompt, and get the correct answer. Tom has 12 marbles. He wins 7 more marbles in a game with his friend but then loses 5 marbles the next day. His brother gives him another 3 marbles as a gift. How many marbles does Tom have now? Tom initially has 12 marbles. After winning a game, he gains 7 more, increasing his total to 19 marbles. He loses 5 marbles the next day, bringing his total down to 14 marbles. His brother gifts him 3 more marbles, increasing his total to 17 marbles. Therefore, Tom now has 17 marbles. So the answer is 17. Jack has 7 apples. He ate 2 of them for dinner, but then his mom gave him 5 more apples. The next day, Jack gave 3 apples to his friend John. How many apples does Jack have left in the end? Jack starts with 7 apples. He eats 2 apples for dinner, so we subtract 2 from 7, leaving him with 5 apples. His mom gives him 5 more apples, adding 5 to his current total. Now he has 5 + 5 = 10 apples. The next day, Jack gives 3 apples to his friend John. Subtracting 3 from 10, Jack is left with 7 apples. 
Therefore, Jack has 7 apples left. So the answer is 7. This approach is also called one-shot COT prompting. Likewise, we can add more demonstrations to have a few-shot COT prompting approach. Furthermore, a special kind of COT, zero-shot COT, introduces instructions that provoke LLMs to think, without presenting any intermediate reasoning steps in the prompts [Kojima et al., 2.1 A Brief Introduction to LLMs 55 2022]. An interesting example is that by appending a simple sentence “Let’s think step by step.” to the end of a prompt, LLMs can generate reasoning steps to reach the final answer. Jack has 7 apples. He ate 2 of them for dinner, but then his mom gave him 5 more apples. The next day, Jack gave 3 apples to his friend John. How many apples does Jack have left in the end? Let’s think step by step. 1. Initial Quantity: Jack starts with 7 apples. 2. After Dinner: He eats 2 apples, so 7 − 2 = 5 apples remain. 3. His Mom Gives More: His mom gives him 5 more apples, so 5 + 5 = 10 apples are now with him. 4. Giving to John: The next day, Jack gives 3 apples to his friend John, so 10 − 3 = 7 apples are left. In the end, Jack has 7 apples left. Zero-shot, one-shot, and few-shot learning are common concepts in the area of in-context learning for LLMs and are not restricted to COT prompting. Broadly speaking, any prompting that involves only simple instructions without any demonstrations can be considered a form of zero-shot learning. This zero-shot learning ability emerges as LLMs are pre-trained and/or finetuned. Also, one-shot and few-shot learning methods are more often considered when LLMs do not acquire the corresponding zero-shot learning ability. These methods are therefore important for in-context learning when addressing new tasks. Examples include those for performing various NLP tasks by demonstrating task-formatted samples. See the following examples for sentiment sentence classification and phrase translation via few-shot learning. Given the following text snippets, classify their sentiment as Positive, Negative, or Neutral. Example 1: “I had an amazing day at the park!” Sentiment: Positive Example 2: “The service at the restaurant was terrible.” Sentiment: Negative Example 3: “I think it’s going to rain today.” Sentiment: Neutral Text: “This movie was a fantastic journey through imagination.” Sentiment: Positive Generative Models 56 Translate the following Chinese phrases into English. Example 1: “你好” Translation: “Hello” Example 2: “谢谢你” Translation: “Thank you” Phrase to translate: “早上好” Translation: “Good Morning” Above, we have presented examples to illustrate the fundamental in-context learning capabilities of prompting LLMs. This section, however, does not include more advanced prompting techniques in order to keep the content concise and compact. More discussions on prompting can be found in Chapter 3. 2.2 Training at Scale As a first step in developing LLMs, we need to train these models on large amounts of data. The training task is itself standard: the objective is to maximize the likelihood, which can be achieved via gradient descent. However, as we scale up both the model size and the amount of data, the problem becomes very challenging, for example, large models generally make the training unstable. In this section, we discuss several issues of large-scale training for LLMs, including data preparation, model modification, and distributed training. We also discuss the scaling laws for LLMs, which help us understand their training efficiency and effectiveness. 
2.2.1 Data Preparation

The importance of data cannot be overstated in NLP. As larger neural networks are developed, the demand for data continues to increase. For example, developing LLMs may require trillions of tokens in pre-training (see Table 2.3), orders of magnitude more than those used in training conventional NLP models. In general, we may want to gather as much training data as possible. However, larger training datasets do not necessarily mean better training results, and the development of LLMs raises new issues in creating or collecting these datasets.

A first issue is the quality of data. High-quality data has long been seen as crucial for training data-driven NLP systems. Directly using raw text from various sources is in general undesirable. For example, a significant portion of the data used to train recent LLMs comes from web scraping, which may contain errors and inappropriate content, such as toxic information and fabricated facts. Also, the internet is flooded with machine-generated content due to the widespread use of AI, presenting further challenges for processing and using web-scraped data. Researchers have found that training LLMs on unfiltered data is harmful [Raffel et al., 2020]. Improving data quality typically involves incorporating filtering and cleaning steps in the data processing workflow. For example, Penedo et al. [2023] show that by adopting a number of data processing techniques, 90% of their web-scraped data can be removed for LLM training. In addition to large-scale web-scraped data, LLM training data often includes books, papers, user-generated data on social media, and so on. Most of the latest LLMs are trained on such combined datasets, which are found to be important for the strong performance of the resulting models.

LLM                                      # of Tokens     Data
GPT3-175B [Brown et al., 2020]           0.5T            Webpages, Books, Wikipedia
Falcon-180B [Almazrouei et al., 2023]    3.5T            Webpages, Books, Conversations, Code, Technical Articles
LLaMA-65B [Touvron et al., 2023a]        1.0T ∼ 1.4T     Webpages, Code, Wikipedia, Books, Papers, Q&As
PaLM-540B [Chowdhery et al., 2022]       0.78T           Webpages, Books, Conversations, Code, Wikipedia, News
Gemma-7B [Gemma Team, 2024]              6T              Webpages, Mathematics, Code

Table 2.3: Amounts of training data used in some LLMs in terms of the number of tokens.

A second issue is the diversity of data. We want the training data to cover as many types of data as possible, so that the trained models can adapt to different downstream tasks easily. It has been widely recognized that the quality and diversity of training data both play very important roles in LLMs. An interesting example is that incorporating programming code into training data has been found to be beneficial for LLMs. The benefits are demonstrated not only in enhancing the programming abilities of LLMs, but also in improving reasoning for complex problems, especially those requiring COT prompting. The concept of “diversity” can be extended to include language diversity as well. For example, many LLMs are trained on multi-lingual data, and therefore we can handle multiple languages using a single model. While this approach shows strong abilities in multi-lingual and cross-lingual tasks, its performance on specific languages largely depends on the volume and quality of the data for those languages. It has been shown in some cases to provide poor results for low-resource languages.

A third issue is the bias in training data. This is not a problem that is specific to LLMs but exists in many NLP systems.
A common example is gender bias, where LLMs show a preference for one gender over another. This can partly be attributed to class imbalance in the training data; for example, the term nurse is more often associated with women. In order to debias the data, it is common practice to balance the categories of different language phenomena, such as gender, ethnicity, and dialects. The bias in data is also related to the diversity issue mentioned above. For example, since many LLMs are trained and aligned with English-centric data, they are biased towards the cultural values and perspectives prevalent among English-speaking populations. Increasing language diversity in training data can somewhat mitigate this bias.

Another issue with collecting large-scale data is the privacy concern. If LLMs are trained on data from extensive sources, this potentially leads to risks regarding the exposure of sensitive information, such as intellectual property and personal data. This is particularly concerning given the capacity of LLMs to represent patterns from the data they are trained on, which might inadvertently involve memorizing and reproducing specific details. A simple approach to privacy protection is to remove or anonymize sensitive information. For example, anonymization techniques can be applied to remove personally identifiable information from training data to prevent LLMs from learning from such data. However, in practice, erasing or redacting all sensitive data is difficult. Therefore, many LLMs, particularly those launched for public service, typically work with systems that can detect the potential exposure of sensitive data, or are fine-tuned to reject certain requests that could lead to information leakage.

2.2.2 Model Modifications

Training LLMs is difficult. A commonly encountered problem is that the training process becomes more unstable as LLMs get bigger. For example, one needs to choose a small learning rate to achieve stable training with gradient descent, but this in turn results in much longer training times. Sometimes, even when the training configuration is carefully designed, training may diverge at certain points during optimization. The training of LLMs is generally influenced by many factors, such as parameter initialization, batching, and regularization. Here, we focus on common modifications and improvements to the standard Transformer architecture, which are considered important in developing trainable LLMs.

2.2.2.1 Layer Normalization with Residual Connections

Layer normalization is used to stabilize training for deep neural networks. It is a process of subtracting the mean and dividing by the standard deviation. By normalizing layer output in this way, we can effectively reduce the covariate shift problem and improve training stability. In Transformers, layer normalization is typically used together with residual connections. As described in Section 2.1.1, a sub-layer can be based on either the post-norm architecture, in which layer normalization is performed right after a residual block, or the pre-norm architecture, in which layer normalization is performed inside a residual block. While both of these architectures are widely used in Transformer-based systems [Wang et al., 2019], the pre-norm architecture has proven to be especially useful in training deep Transformers. Given this, most LLMs are based on the pre-norm architecture, expressed as output = F(LNorm(input)) + input.
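As a minimal sketch of the two architectures (in Python with NumPy, using a generic sub-layer function F; this is an illustration rather than a full Transformer implementation), the post-norm and pre-norm sub-layers can be written as follows.

    import numpy as np

    def layer_norm(h, alpha, beta, eps=1e-5):
        # Standard layer normalization: subtract the mean, divide by the standard
        # deviation, then apply the gain (alpha) and bias (beta) terms.
        mu = h.mean(axis=-1, keepdims=True)
        sigma = h.std(axis=-1, keepdims=True)
        return alpha * (h - mu) / (sigma + eps) + beta

    def post_norm_sublayer(x, F, alpha, beta):
        # Post-norm: layer normalization is performed right after the residual block.
        return layer_norm(F(x) + x, alpha, beta)

    def pre_norm_sublayer(x, F, alpha, beta):
        # Pre-norm: layer normalization is applied to the input of F inside the
        # residual block; the residual connection bypasses the normalization.
        return F(layer_norm(x, alpha, beta)) + x

    # Toy usage with a simple non-linear sub-layer function F.
    d = 8
    x = np.random.default_rng(0).standard_normal((4, d))
    F = lambda h: np.tanh(h)
    alpha, beta = np.ones(d), np.zeros(d)
    print(post_norm_sublayer(x, F, alpha, beta).shape, pre_norm_sublayer(x, F, alpha, beta).shape)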
A widely used form of the layer normalization function is given by

LNorm(h) = α · (h − μ)/(σ + ε) + β    (2.23)

where h is a d-dimensional real-valued vector, μ is the mean of all the entries of h, and σ is the corresponding standard deviation. ε is introduced for the sake of numerical stability. α ∈ R^d and β ∈ R^d are the gain and bias terms.

A variant of layer normalization, called root mean square (RMS) layer normalization, only re-scales the input vector but does not re-center it [Zhang and Sennrich, 2019]. The RMS layer normalization function is given by

LNorm(h) = α · h/(σ_rms + ε) + β    (2.24)

where σ_rms is the root mean square of h, that is, σ_rms = (1/d Σ_{k=1}^{d} h_k^2)^{1/2}. This layer normalization function is used in LLMs like the LLaMA series.

2.2.2.2 Activation Functions in FFNs

In Transformers, FFN sub-layers are designed to introduce non-linearities into representation learning, and are found to be useful for preventing the representations learned by self-attention from degeneration8 [Dong et al., 2021]. A standard form of the FFNs used in these sub-layers can be expressed as

FFN(h) = σ(hW_h + b_h)W_f + b_f    (2.25)

where W_h ∈ R^{d×d_h}, b_h ∈ R^{d_h}, W_f ∈ R^{d_h×d}, and b_f ∈ R^d are the parameters, and d_h is the hidden size. σ(·) is the activation function of the hidden layer. A common choice for σ(·) is the rectified linear unit (ReLU), given by

σ_relu(h) = max(0, h)    (2.26)

8 Here degeneration refers to the phenomenon in which the rank of a matrix is reduced after some processing.

In practical implementations, increasing d_h is helpful, and thus it is often set to a large number in LLMs. But a very large hidden size poses challenges for both training and deployment. In this case, the design of the activation function plays a relatively more important role in wide FFNs.

There are several alternatives to the ReLU in LLMs. One of these is the Gaussian error linear unit (GeLU), which can be seen as a smoothed version of the ReLU. Rather than controlling the output by the sign of the input, the GeLU function weights its input by the percentile Pr(h ≤ h)9, where the former h denotes a d-dimensional random vector whose entries are drawn from the standard normal distribution Gaussian(0, 1). Specifically, the GeLU function is defined to be

σ_gelu(h) = h ⊙ Pr(h ≤ h) = h ⊙ Φ(h)    (2.27)

where Φ(h) is the cumulative distribution function of Gaussian(0, 1), which can be implemented in convenient ways [Hendrycks and Gimpel, 2016]. The GeLU function has been adopted in several LLMs, such as BERT, GPT-3, and BLOOM.

9 Pr(h ≤ h) is an informal notation. It refers to a vector, with each entry representing the percentile for the corresponding entry of h.

Another family of activation functions which is popular in LLMs is gated linear unit (GLU)-based functions. The basic form of GLUs is given by

σ_glu(h) = σ(hW_1 + b_1) ⊙ (hW_2 + b_2)    (2.28)

where W_1 ∈ R^{d×d}, b_1 ∈ R^d, W_2 ∈ R^{d×d}, and b_2 ∈ R^d are model parameters. Different choices of σ(·) result in different versions of GLU functions. For example, if σ(·) is defined to be the GeLU function, we will have the GeGLU function

σ_geglu(h) = σ_gelu(hW_1 + b_1) ⊙ (hW_2 + b_2)    (2.29)

This activation function has been successfully applied in LLMs like Gemma. As another example, consider σ(·) to be the Swish function σ_swish(h) = h ⊙ Sigmoid(c·h) [Ramachandran et al., 2017]. Then, the SwiGLU function is given by

σ_swiglu(h) = σ_swish(hW_1 + b_1) ⊙ (hW_2 + b_2)    (2.30)

Both the PaLM and LLaMA series are based on the SwiGLU function.
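A minimal Python sketch of RMS layer normalization (Eq. (2.24)) and a SwiGLU-based FFN is given below; the matrix shapes follow the FFN of Eq. (2.25) rather than the square matrices of Eq. (2.30), and all values are illustrative.

    import numpy as np

    def rms_norm(h, alpha, beta, eps=1e-5):
        # Eq. (2.24): re-scale h by its root mean square; no re-centering is applied.
        sigma_rms = np.sqrt(np.mean(h ** 2, axis=-1, keepdims=True))
        return alpha * h / (sigma_rms + eps) + beta

    def swish(h, c=1.0):
        # Swish activation: h * Sigmoid(c * h).
        return h / (1.0 + np.exp(-c * h))

    def swiglu_ffn(h, W1, b1, W2, b2, Wf, bf):
        # A SwiGLU-style hidden layer (as in Eq. (2.30)) followed by the usual
        # output transformation of the FFN in Eq. (2.25).
        hidden = swish(h @ W1 + b1) * (h @ W2 + b2)
        return hidden @ Wf + bf

    # Toy usage: model width d = 8, hidden size d_h = 16.
    rng = np.random.default_rng(0)
    d, dh = 8, 16
    h = rng.standard_normal((4, d))
    W1, W2 = rng.standard_normal((d, dh)), rng.standard_normal((d, dh))
    b1, b2 = np.zeros(dh), np.zeros(dh)
    Wf, bf = rng.standard_normal((dh, d)), np.zeros(d)
    print(rms_norm(h, np.ones(d), np.zeros(d)).shape, swiglu_ffn(h, W1, b1, W2, b2, Wf, bf).shape)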
For more discussions of GLUs, the reader can refer to Shazeer [2020]’s work.

2.2.2.3 Removing Bias Terms

Another popular model design is to remove the bias terms in the affine transformations used in LLMs. This treatment can be applied to layer normalization, the transformations of the inputs to QKV attention, and FFNs. For example, we can modify Eq. (2.25) to obtain an FFN with no bias terms

FFN(h) = σ(hW_h)W_f    (2.31)

Chowdhery et al. [2022] report that removing bias terms helps improve the training stability of LLMs. This method has been used in several recent LLMs, such as LLaMA and Gemma.

2.2.2.4 Other Issues

Many LLMs also involve modifications to their positional embedding models. For example, one can replace sinusoidal positional encodings with rotary position embeddings so that the learned LLMs can handle long sequences better. These models will be discussed in Section 2.3.

Note that while model modifications are common in training LLMs, the stability of training can be improved in many different ways. For example, increasing the batch size as the training proceeds has been found to be useful for some LLMs. In general, achieving stable and efficient large-scale LLM training requires carefully designed setups, including learning schedules, optimizer choices, training parallelism, mixed precision training, and so on. Some of these issues are highly engineered, and therefore, we typically need a number of training runs to obtain satisfactory LLMs.

2.2.3 Distributed Training

Training LLMs requires significant amounts of computational resources. A common approach to improving training efficiency is to use large-scale distributed systems. Fortunately, alongside the rise of neural networks in AI, deep learning-oriented software and hardware have been developed, making it easier to implement LLMs and perform computations. For example, one can now easily fine-tune an LLM using deep learning software frameworks and a machine with multiple GPUs. However, scaling up the training of LLMs is still challenging, and requires significant efforts in developing hardware and software systems for stable and efficient distributed training.

An important consideration in distributed training is parallelism. There are several forms of parallelism: data parallelism, model parallelism, tensor parallelism, and pipeline parallelism. Despite different ways of distributing computations across devices, these parallelism methods are based on a similar idea: the training problem can be divided into smaller tasks that can be executed simultaneously. The issue of parallelism in training LLMs has been extensively studied [Narayanan et al., 2021; Fedus et al., 2022]. Here we sketch the basic concepts.

• Data Parallelism. This method is one of the most widely used parallelism methods for training neural networks. To illustrate, consider the simplest case where the standard delta rule is used in gradient descent

θ_{t+1} = θ_t − lr · ∂L_{θ_t}(D_mini)/∂θ_t    (2.32)

where the new parameters θ_{t+1} are obtained by updating the latest parameters θ_t with a small step lr in the direction of the negative loss gradient. ∂L_{θ_t}(D_mini)/∂θ_t is the gradient of the loss with respect to the parameters θ_t, and is computed on a minibatch of training samples D_mini. In data parallelism, we divide D_mini into N smaller batches, denoted by {D^1, ..., D^N}. Then, we distribute these batches to N workers, each with a corresponding batch. Once the data is distributed, these workers can work at the same time.
The gradient of the entire minibatch is obtained by aggregating the gradients computed by the workers, like this

∂L_{θ_t}(D_mini)/∂θ_t = ∂L_{θ_t}(D^1)/∂θ_t + ∂L_{θ_t}(D^2)/∂θ_t + ... + ∂L_{θ_t}(D^N)/∂θ_t    (2.33)

where the k-th term on the right-hand side is computed by worker k. In ideal cases where the workers coordinate well and the communication overhead is small, data parallelism can achieve nearly an N-fold speed-up for training.

• Model Parallelism. Although data parallelism is simple and effective, it requires each worker to run the entire LLM and perform the complete forward and backward process. As LLMs grow larger, it sometimes becomes unfeasible to load and execute an LLM on a single device. In this case, we can decouple the LLM into smaller components and run these components on different devices. One simple way to do this is to group consecutive layers in the layer stack and assign each group to a worker. The workers operate in the order of the layers in the stack, that is, in the forward pass we process the input from lower-level to upper-level layers, and in the backward pass we propagate the error gradients from upper-level to lower-level layers. Consider, for example, a Transformer decoder with L stacked blocks. To distribute the computation load, each block is assigned to a worker. The following illustrates a single run of the forward and backward passes of this model, where time flows from left to right, B_l denotes the computation of block l, and ↑ and ↓ denote the forward and backward passes, respectively:

Worker L:                        B_L(↑) B_L(↓)
...
Worker 2:          B_2(↑)                        B_2(↓)
Worker 1:  B_1(↑)                                        B_1(↓)

Note that this parallelism method forces the workers to run in sequence, so a worker has to wait for the previous worker to finish its job. This results in the devices being idle for most of the time. In practical systems, model parallelism is generally used together with other parallelism mechanisms to maximize the use of devices.

• Tensor Parallelism. Parallelism can also be performed within a single computation step. A common example is splitting a large parameter matrix into chunks, multiplying an input tensor with each of these chunks separately, and then concatenating the results of these multiplications to form the output. For example, consider the multiplication of the representation h ∈ R^d with the parameter matrix W_h ∈ R^{d×d_h} in an FFN sub-layer (see Eq. (2.25)). We can slice the matrix W_h vertically into a sequence of M sub-matrices

W_h = [ W_h^1  W_h^2  ...  W_h^M ]    (2.34)

where each sub-matrix W_h^k has a shape of d × (d_h/M). The multiplication of h with W_h can then be expressed as

hW_h = h [ W_h^1  W_h^2  ...  W_h^M ] = [ hW_h^1  hW_h^2  ...  hW_h^M ]    (2.35)

We can perform the matrix multiplications {hW_h^1, hW_h^2, ..., hW_h^M} on M devices separately. As a result, we distribute a large matrix multiplication across multiple devices, each of which may have relatively small memory. From the perspective of the design of modern GPUs, tensor parallelism over GPUs provides a two-level, tile-based approach to parallel computing. First, at a higher level, we decompose a matrix multiplication into sub-matrix multiplications that can directly fit into the memory of GPUs. Then, at a lower level, we execute these sub-matrix multiplications on GPUs using tile-based parallel algorithms that are specifically optimized for GPUs.

• Pipeline Parallelism. Above, in model parallelism, we have described a simple approach to spreading groups of model components across multiple devices.
But this method is inefficient because only one device is active at a time during processing. Pipeline parallelism addresses this issue by introducing overlaps between computations on different devices [Harlap et al., 2018; Huang et al., 2019]. To do this, a batch of samples is divided into a number of micro-batches, and these micro-batches are processed by each worker as usual. Once a micro-batch has been processed by a worker and passed to the next one, the following micro-batch immediately occupies the same worker. In other words, we create a pipeline in which different computation steps can overlap if multiple jobs are given to the pipeline. The following shows an illustration of pipeline parallelism for processing 3 micro-batches, where B_{l,k} represents the processing of the k-th micro-batch by the l-th worker and time flows from left to right:

Worker L:                      B_{L,1} B_{L,2} B_{L,3}
...
Worker 2:          B_{2,1} B_{2,2} B_{2,3}
Worker 1:  B_{1,1} B_{1,2} B_{1,3}

Ideally we would like to maximize the number of micro-batches, and thus minimize the idle time of the workers. However, in practice, using small micro-batches often reduces GPU utilization and increases task-switching costs. This may, in turn, decrease the overall system throughput.

The ultimate goal of parallel processing is to achieve linear growth in efficiency, that is, the number of samples that can be processed per unit of time increases linearly with the number of devices. However, distributed training is complicated, and influenced by many factors in addition to the parallelism method we choose. One problem, which is often associated with distributed systems, is the cost of communication. We can think of a distributed system as a group of networked nodes. Each of these nodes can perform local computation or pass data to other nodes. If there are a large number of such nodes, it will be expensive to distribute and collect data across them. Sometimes, the time savings brought about by parallelism are offset by the communication overhead of a large network. Another problem with large-scale distributed systems is that the synchronization of nodes introduces additional costs. As is often the case, some nodes may take longer to finish their work, causing others to wait for the slowest ones. While we can use asynchronous training to handle heterogeneity in computational resources, this may lead to stale gradients and non-guaranteed convergence. Moreover, as more nodes are added to the network, there is a higher chance of nodes crashing during training. In this case, we need to ensure that the whole system is fault tolerant. In many practical settings, to increase scalability, one needs to take into account additional issues, including architecture design, data transfer and computation overlap, load balancing, memory bandwidth, and so on.

Training LLMs is so computationally expensive that, even though distributed training is already in use, researchers and engineers often still employ various model compression and speed-up methods to improve training efficiency [Weng, 2021]. One example is mixed precision training, in which low-precision data (such as FP16 and FP8 data) is used for gradient computation on each individual node, and single or double precision data (such as FP32/FP64 data) is used for updating the model [Micikevicius et al., 2018]. A key operation in this approach is gradient accumulation, where gradients need to be accumulated and synchronized across nodes.
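Because gradient accumulation repeatedly adds floating-point numbers, the result can depend on the order of the additions and on the precision used. The following small Python demonstration (with made-up per-node gradient values chosen to make the effect visible) accumulates the same values in two different orders at half precision.

    import numpy as np

    # Made-up per-node gradient contributions for a single parameter; the large
    # first value makes the effect of limited precision visible.
    grads = np.float16([2048.0] + [1.0] * 100)

    acc_forward = np.float16(0.0)
    for g in grads:                  # accumulate in one node order
        acc_forward = acc_forward + g
    acc_reversed = np.float16(0.0)
    for g in grads[::-1]:            # accumulate in the reversed order
        acc_reversed = acc_reversed + g

    acc_fp64 = np.float64(grads).sum()            # a higher-precision accumulator
    print(acc_forward, acc_reversed, acc_fp64)    # the two half-precision sums differ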
However, due to the non-associativity of floating-point addition, this can lead to slight numerical differences in the accumulated gradients on different nodes, which may affect model convergence and final performance. This problem is more pronounced if there are a large number of nodes involved in distributed training, especially given that low-precision numerical computations may encounter overflow and underflow issues, as well as inconsistencies across different hardware devices. Therefore, the design of distributed systems needs to take these numerical computation issues into account to ensure satisfactory results and convergence.

2.2.4 Scaling Laws

The success of LLMs reveals that training larger language models using more resources can lead to improved model performance. Researchers have explained this through scaling laws of LLMs. More specifically, scaling laws describe the relationships between the performance of LLMs and the attributes of LLM training, such as the model size, the amount of computation used for training, and the amount of training data. For example, Hestness et al. [2017] show that the performance of deep neural networks is a power-law-like function of the training data size. In the beginning, when the amount of training data is not large, the performance of the model improves slowly. Afterward, when more training data is used, the model enters a phase of rapid performance improvement, and the performance curve resembles a power-law curve. Ultimately, the improvement in performance becomes slow again, and more data does not lead to significant gains. Figure 2.3 shows an example of such curves.

Fig. 2.3: A scaling law of test error against a variable of interest (e.g., training dataset size) [Hestness et al., 2017]. The curve of the scaling law can be divided into three phases. At the beginning, the number of test errors decreases slowly when more training data is used, but this only lasts for a short period. In the second phase, the number of test errors decreases drastically, and the curve becomes a power-law curve. After that, the error reduction slows down again in the third phase. Note that there are irreducible errors that cannot be eliminated, regardless of the amount of training data.

In NLP, a traditional view holds that the performance gains will disappear at a certain point as the training is scaled up. However, recent results show that, if we consider the problem on a larger scale, scaling up training is still a very effective method for obtaining stronger LLMs. For example, both closed-source and open-source LLMs can benefit from more data, even though trillions of tokens have already been used for training. With the increase in the scale of model training, LLMs exhibit new capabilities, known as the emergent abilities of LLMs. For example, Wei et al. [2022b] studied the scaling properties of LLMs across different model sizes and amounts of computational resources. Their work shows that some abilities emerge when we scale the model size to a certain level. The appearance of emergent abilities has demonstrated the role of scaled training in enhancing the performance of LLMs, and it has also, to some extent, motivated researchers to continuously attempt to train larger models. As larger and stronger LMs continue to appear, our understanding of the scaling laws continues to mature.
This helps researchers predict the performance of LLMs during training and estimate the minimal computational resources required to achieve a given level of performance.

To understand how model performance scales with various factors considered during training, it is common to express the model performance as a function of these factors. For example, in the simplest case, we can express the loss or error of an LLM as a function of a single variable of interest. However, there are no universal scaling laws that can describe this relationship. Instead, different functions are proposed to fit the learning curves of LLMs. Let $x$ be the variable of interest (such as the number of model parameters) and $L(x)$ be the loss of the model given $x$ (such as the cross-entropy loss on test data). The simplest form of $L(x)$ is a power law

$L(x) = a x^b \qquad (2.36)$

where $a$ and $b$ are parameters that are estimated empirically. Despite its simplicity, this function has successfully interpreted the scaling ability of language models and machine translation systems in terms of model size (denoted by $N$) and training dataset size (denoted by $D$) [Gordon et al., 2021; Hestness et al., 2017]. For example, Kaplan et al. [2020] found that the performance of their language model improves as a power law of either $N$ or $D$ after an initial transient period, and expressed these relationships using $L(N) = \big(\frac{N}{8.8 \times 10^{13}}\big)^{-0.076}$ and $L(D) = \big(\frac{D}{5.4 \times 10^{13}}\big)^{-0.095}$ (see Figure 2.4).

Fig. 2.4: Test loss against model size ($N$) and training dataset size ($D$) (data points are plotted for illustrative purposes). We plot test loss as a function of $N$, which is defined as $L(N) = \big(\frac{N}{8.8 \times 10^{13}}\big)^{-0.076}$, and as a function of $D$, which is defined as $L(D) = \big(\frac{D}{5.4 \times 10^{13}}\big)^{-0.095}$ [Kaplan et al., 2020].

An improvement to this scaling law is to add an irreducible error term to the power law. The form of $L(x)$ is then given by

$L(x) = a x^b + \epsilon_{\infty} \qquad (2.37)$

where $\epsilon_{\infty}$ is the irreducible error that accounts for the error due to unknown variables, which is present even as $x \to \infty$. Eq. (2.37) is one of the most widely used forms for designing scaling laws of LLMs. For example, Rosenfeld et al. [2020] developed a scaling law that involves both model scaling and dataset scaling, like this

$L(N, D) = a N^b + c D^d + \epsilon_{\infty} \qquad (2.38)$

An example of such a formulation is the Chinchilla scaling law. It states that the test loss per token is the sum of inverse proportion functions of $N$ and $D$, with an additional irreducible error term. Hoffmann et al. [2022] express this scaling law as

$L(N, D) = \underbrace{\frac{406.4}{N^{0.34}}}_{\text{model scaling}} + \underbrace{\frac{410.7}{D^{0.28}}}_{\text{dataset scaling}} + \underbrace{1.69}_{\text{irreducible error}} \qquad (2.39)$

All the scaling laws mentioned above are based on monotonic functions. So they cannot cover functions with inflection points, such as double descent curves. In response, researchers have explored more sophisticated functions to fit the learning curves. Examples of such functions can be found in the work of Alabdulmohsin et al. [2022] and Caballero et al. [2023]. The significance of scaling laws lies in providing directional guidance for LLM research: if we are still in the region of the power-law curve, using more resources to train larger models is a very promising direction.
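As a purely illustrative example of working with such formulations, the sketch below fits a Chinchilla-style surface $L(N, D) = a/N^b + c/D^d + \epsilon_{\infty}$ to synthetic loss measurements generated from the coefficients in Eq. (2.39); in practice, the data points would come from training runs at different scales, and the fitted law could then be used to extrapolate.

```python
import numpy as np
from scipy.optimize import curve_fit

# Fit a Chinchilla-style loss surface L(N, D) = a / N^b + c / D^d + eps
# to a handful of (N, D, loss) measurements. The data below is synthetic,
# generated from the published coefficients purely for illustration.
def chinchilla_loss(ND, a, b, c, d, eps):
    N, D = ND
    return a / N**b + c / D**d + eps

rng = np.random.default_rng(0)
N = np.logspace(7, 10, 20)                       # model sizes (parameters)
D = np.logspace(9, 12, 20)                       # dataset sizes (tokens)
loss = 406.4 / N**0.34 + 410.7 / D**0.28 + 1.69
loss += rng.normal(0.0, 0.01, loss.shape)        # add measurement noise

popt, _ = curve_fit(chinchilla_loss, (N, D), loss,
                    p0=[400.0, 0.3, 400.0, 0.3, 1.5], maxfev=20000)
print("fitted a, b, c, d, eps:", np.round(popt, 3))
```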
While this result “forces” big research groups and companies to invest more in computational resources to train larger models, which is very expensive, scaling laws continuously push the boundaries of AI further away. On the other hand, understanding scaling laws helps researchers make decisions in training LLMs. For example, given the computational resources at hand, the performance of LLMs may be predicted. One last note on scaling laws in this section. For LLMs, a lower test loss does not always imply better performance on all downstream tasks. To adapt LLMs, there are several steps such as fine-tuning and prompting that may influence the final result. Therefore, the scaling laws for different downstream tasks might be different in practice. 2.3 Long Sequence Modeling We have already seen that, in large-scale training, larger language models can be developed by using more data and computational resources. However, scaling up can also occur in other directions. For instance, in many applications, LLMs are adapted to process significantly long sequences. An interesting example is that we pre-train an LLM on extensive texts of normal length and then apply it to deal with very long token sequences, far beyond the length encountered in pre-training. Here we use Pr(y|x) to denote the text generation probability where x is the context and y is the generated text. There are broadly three types of long sequence modeling problems. • Text generation based on long context (i.e., x is a long sequence). For example, we generate a short summary for a very long text. • Long text generation (i.e., y is a long sequence). For example, we generate a long story based on a few keywords. • Long text generation based on long context (i.e., both x and y are long sequences). For example, we translate a long document from Chinese to English. Recently, NLP researchers have been more interested in applying and evaluating LLMs on tasks where extremely long input texts are involved. Imagine an LLM, which reads a C++ source file containing tens of thousands of lines, and outlines the functionality of the program corresponding to the source file. Such models, capable of handling extensive textual contexts, are sometimes called long-context LLMs. In this section we will restrict ourselves to long-context LLMs, but the methods discussed here can be applicable to other problems. For Transformers, dealing with long sequences is computationally expensive, as the computational cost of self-attention grows quadratically with the sequence length. This makes it infeasible to train and deploy such models for very long inputs. Two strands of research have tried to adapt Transformers to long-context language modeling. • The first explores efficient training methods and model architectures to learn self-attention models from long-sequence data. 2.3 Long Sequence Modeling 67 • The other adapts pre-trained LLMs to handle long sequences with modest or no fine-tuning efforts. Here, we will discuss the former briefly since it can be found in general discussions of efficient Transformer architectures [Tay et al., 2020; Xiao and Zhu, 2023]. We will focus on the latter, highlighting popular methods in recent LLMs. We will also discuss the strengths and limitations of these long-sequence models. 2.3.1 Optimization from HPC Perspectives We begin our discussion by considering improvements to standard Transformer models from the perspectives of high-performance computing. 
Most of these improvements, though not specifically designed for LLMs, have been widely applied across various deep learning models [Kim et al., 2023]. A commonly used approach is to adopt a low-precision implementation of Transformers. For example, we can use 8-bit or 16-bit fixed-point data types for arithmetic operations, instead of 32-bit or 64-bit floating-point data types. Using these low-precision data types can increase the efficiency and memory throughput, so that longer sequences can be processed more easily. An alternative approach is to improve Transformers by using hardware-aware techniques. For example, on modern GPUs, the efficiency of Transformers can be improved by using IO-aware implementations of the self-attention function [Dao et al., 2022; Kwon et al., 2023]. Another way to handle long sequences is through sequence parallelism [Li et al., 2023b; Korthikanti et al., 2023]. Specifically, consider the general problem of attending the query qi at the position i to the keys K and values V. We can divide K by rows and obtain a set of submatrices {K[1] , ..., K[nu ] }, each corresponding to a segment of the sequence. Similarly, we can obtain the sub-matrices of V, denoted by {V[1] , ..., V[nu ] }. Then, we assign each pair of K[u] and V[u] to a computing node (e.g., a GPU of a GPU cluster). The assigned nodes can run in parallel, thereby parallelizing the attention operation. Recall that the output of the self-attention model can be written as Attqkv (qi , K, V) = m−1 X αi,j vj (2.40) j=0 where αi,j is the attention weight between positions i and j. In Transformers, αi,j is obtained by normalizing the rescaled version of the dot product between qi and kj . Let βi,j denote the attention score between qi and kj . We have βi,j = qi · kj √ + Mask(i, j) d (2.41) where Mask(i, j) is the masking variable for (i, j). Then, we define the attention weight αi,j to be αi,j = Softmax(βi,j ) exp(βi,j ) = P j ′ exp(βi,j ′ ) (2.42) Generative Models 68 On each computing node, we need to implement these equations. Given the keys and values assigned to this node, computing the numerator of the right-hand side of Eq. (2.42) (i.e., exp(βi,j )) is straightforward, as all the required information is stored on the node. However, computing the denominator of the right-hand side of Eq. (2.42) involves a sum of exp(βi,j ′ ) over all j ′ s, which requires transferring data to and from other nodes. To illustrate, suppose that vj and kj are placed on node u. We can rewrite Eq. (2.42) as αi,j = }| { exp(βi,j ) X kj ′ ∈K[1] | node u z exp(βi,j ′ ) + · · · + {z node 1 kj ′ ∈K[u] | } X exp(βi,j ′ ) + · · · + {z node u exp(βi,j ′ ) kj ′ ∈K[nu ] | } X {z (2.43) } node nu where the notation kj ′ ∈ K[u] represents that kj ′ is a row vector of K[u] . In a straightforward P implementation, we first perform the summations { kj′ ∈K[u] exp(βi,j ′ )} separately on the corresponding nodes. Then, we collect these summation results from different nodes to combine them into a final result. This corresponds to a collective operation in the context of parallel processing. There are many efficient implementations of such operations, such as the all-reduce algorithms. Hence the sum of all exp(βi,j ) values can be computed using optimized routines in collective communication toolkits. Given the attention weights {αi,j }, we then compute the attention results using Eq. (2.40). 
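As a minimal illustration of this scheme, the NumPy sketch below simulates sequence parallelism for a single query: the keys and values are split into chunks that stand in for nodes, each chunk computes its partial sums of $\exp(\beta_{i,j'})$ together with the corresponding partial weighted sums of values, and the partial results are then combined just as a collective (all-reduce style) operation would combine them. Masking is omitted, and the nodes are simulated within one process.

```python
import numpy as np

# Sequence-parallel attention for one query, simulated on a single machine.
# K and V are split into chunks ("nodes"); each node computes partial softmax
# statistics, which are then combined as an all-reduce would combine them.
d, m, n_nodes = 64, 512, 4
rng = np.random.default_rng(0)
q = rng.standard_normal(d)
K = rng.standard_normal((m, d))
V = rng.standard_normal((m, d))

partial_num, partial_den = [], []
for K_u, V_u in zip(np.array_split(K, n_nodes), np.array_split(V, n_nodes)):
    beta = K_u @ q / np.sqrt(d)       # attention scores for the keys on this node
    e = np.exp(beta)
    partial_num.append(e @ V_u)       # node-local sum of exp(beta) * v
    partial_den.append(e.sum())       # node-local sum of exp(beta)

out_parallel = sum(partial_num) / sum(partial_den)

# Reference: the same attention computed on a single node.
alpha = np.exp(K @ q / np.sqrt(d))
out_ref = (alpha / alpha.sum()) @ V
print(np.allclose(out_parallel, out_ref))   # True
```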
The problem can be re-expressed as Attqkv (qi , K, V) = X vj ′ | ∈V[1] αi,j ′ vj ′ + · · · + {z node 1 } X vj ′ | ∈V[u] αi,j ′ vj ′ + · · · + {z node u } X αi,j ′ vj ′ vj ′ ∈V[nu ] | {z node nu (2.44) } Like Eq. (2.43), Eq. (2.44) can be implemented as a summation program in parallel processing. First, perform the weighted summations of values on different nodes simultaneously. Then, we collect the results from these nodes via collective operations. Note that, although this section primarily focuses on long sequence modeling, much of the motivation for sequence parallelism comes from the distributed training methods of deep networks, as discussed in Section 2.2.3. As a result, the implementation of these methods can be based on the same parallel processing library. 2.3.2 Efficient Architectures One difficulty of applying Transformers to long sequences is that self-attention has a quadratic time complexity with respect to the sequence length. Moreover, a key-value cache (or KV cache for short) is maintained during inference, and its size increases as more tokens are processed. Although the KV cache grows linearly with the sequence length, for extremely long input sequences, the memory footprint becomes significant and it is even infeasible to deploy LLMs for such tasks. As a result, the model architecture of long-context LLMs generally moves away from the standard 2.3 Long Sequence Modeling 69 Transformer, turning instead to the development of more efficient variants and alternatives. One approach is to use sparse attention instead of standard self-attention. This family of models is based on the idea that only a small number of tokens are considered important when attending to a given token, and so most of the attention weights between tokens are close to zero. As a consequence, we can prune most of the attention weights and represent the attention model in a compressed form. To illustrate, consider the self-attention model (2.45) Attqkv (Q, K, V) = α(Q, K)V where the attention weight matrix α(Q, K) ∈ Rm×m is obtained by QKT α(Q, K) = Softmax( √ + Mask) d  α0,0 0 0  α α 0 1,1  1,0  α2,0 α2,1 α2,2 =   .. ..  ..  . . . αm−1,0 αm−1,1 αm−1,2 h ... ... ... .. . 0 0 0 .. . ... αm−1,m−1         (2.46) i Each row vector αi,0 ... αi,i 0 ... 0 corresponds to a distribution of attending the i-th token to every token of the sequence. Since language models predict next tokens only based on their left-context, we normally write the output of the attention model at position i as Attqkv (qi , K≤i , V≤i ) = = v i  0  ... αi,i  ...   h αi,0 i X   vi (2.47) αi,j vj j=0     v0 k0  .   .     where K≤i =  ..  and V≤i =  ..   are the keys and values up to position i. ki vi h i In the original version of self-attention αi,0 ... αi,i is assumed to be dense, that is, most of h i the values are non-zero. In sparse attention, some of the entries of αi,0 ... αi,i are considered non-zero, and the remaining entries are simply ignored in computation. Suppose G ⊆ {0, ..., i} is the set of indices of the non-zero entries. For language models, the output of the sparse attention model at position i is given by Attsparse (qi , K≤i , V≤i ) = X α′i,j vj (2.48) j∈G Here {α′i,j } are normalized over G. Hence their values are different from the original attention weights (in fact we have α′i,j > αi,j ). The sparsity of the model is determined by how large G is. Sparse attention models differ in the way we define G. 
One simple approach is to define G based Generative Models 70 on heuristically designed patterns. For example, a widely-used pattern involves having G cover a window of tokens located near position i [Parmar et al., 2018]. While sparse attention reduces the computation through the use of sparse operations, such models still have significant limitations as we must keep the entire KV cache (i.e., K≤i and V≤i ) during inference. If the sequence is very long, storing this cache will become highly memoryintensive. To address this, we can consider a different form of attention models where the KV cache is not explicitly retained. Linear attention is one such approach [Katharopoulos et al., 2020]. It uses a kernel function φ(·) to project each query and key onto points qi′ = φ(qi ) and ki′ = φ(ki ), respectively. By removing the Softmax function under such transformations10 , the form of the resulting attention model is given by Attqkv (qi , K≤i , V≤i ) ≈ Attlinear (qi′ , K′≤i , V≤i ) = qi′ µi qi′ νi (2.49) where µi and νi are variables that are computed in the recurrent forms T (2.50) T νi−1 + k′ i (2.51) µi = µi−1 + k′ i vi νi = µi and νi can be seen as representations of the history up to position i. A benefit of this model is that we need not keep all past queries and values. Instead only the latest representations µi and νi are used. So the computational cost of each step is a constant, and the model can be easily extended to deal with long sequences. In fact, this sequential approach to long sequence modeling arises naturally when we adopt a viewpoint of recurrent models. Such models read one token (or a small number of tokens) at a time, update the recurrent state using these inputs, and then discard them before the next token arrives. The output at each step is generated based only on the recurrent state, rather than on all the previous states. The memory footprint is determined by the recurrent state which has a fixed size. Recurrent models can be used in real-time learning scenarios where data arrives in a stream and predictions can be made at any time step. In NLP, applying recurrent models to language modeling is one of the earliest successful attempts to learn representations of sequences. Although Transformer has been used as the foundational architecture in LLMs, recurrent models are still powerful models, especially for developing efficient LLMs. More recently, recurrent models have started their resurgence in language modeling and have been reconsidered as a promising alternative to Transformers [Gu and Dao, 2023]. Figure 2.5 shows a comparison of the models discussed in this subsection. 10 In the new space after this transformation, the Softmax normalization can be transformed into the simple scaling normalization. 2.3 Long Sequence Modeling 71 Attqkv (qi , K≤i , V≤i ) k0 k1 ··· ki−2 ki−1 ki v0 v1 ··· vi−2 vi−1 vi qi (a) Standard Self-attention Attqkv (qi , {k1 , ki }, {v1 , vi }) k0 k1 ··· ki−2 ki−1 ki v0 v1 ··· vi−2 vi−1 vi qi (b) Sparse Attention T ⇒ µi νi = νi−1 + k′ i ⇒ νi µi = µi−1 + k′ i vi T q′ µ Attlinear (qi , K≤i , V≤i ) = qi′ νii i k0 k1 ··· ki−2 ki−1 ki v0 v1 ··· vi−2 vi−1 vi qi (c) Linear Attention hi = f (hi−1 , inputi ) h0 h1 ··· hi−3 hi−2 (d) Recurrent Models hi−1 hi inputi Fig. 2.5: Illustrations of self-attention, sparse attention, linear attention and recurrent models. Blue boxes = cached states for producing the output at position i. f (·) = a recurrent cell. 2.3.3 Cache and Memory LLMs based on the standard Transformer architecture are global models. 
The inference for these models involves storing the entire left-context in order to make predictions for future tokens. This requires a KV cache where the representations (i.e., keys and values) of all previously-generated Generative Models 72 tokens are kept, and the cost of caching grows as the inference proceeds. Above, we have discussed methods for optimizing this cache via efficient attention approaches, such as sparse attention and linear attention. Another idea, which may have overlap with the previous discussion, is to explicitly encode the context via an additional memory model. 2.3.3.1 Fixed-size KV Cache A straightforward approach is to represent the keys and values using a fixed-size memory model. Suppose we have a memory Mem which retains the contextual information. We can write the attention operation at position i in a general form (2.52) Att(qi , Mem) = Attqkv (qi , K≤i , V≤i ) In this model, Mem is simply the KV cache, i.e., Mem = (K≤i , V≤i ). Thus the size of Mem is determined by i. If we define Mem as a fixed-size variable, then the cost of performing Att(qi , Mem) will be fixed. There are several alternative ways to design Mem. • One of the simplest methods is to consider a fixed-size window of previous keys and values. Mem is therefore given by (2.53) Mem = (K[i−nc +1,i] , V[i−nc +1,i] ) where nc denotes the size of the window. The notation K[i−nc +1,i] and V[i−nc +1,i] denote the keys and values over positions from i − nc + 1 to i.11 This model can be seen as a type of local attention model. • It is also possible to define Mem as a pair of summary vectors, which leads to a more compressed representation of the history. A simple way to summarize the previous keys and values is to use the moving average of them. For example, Mem can be defined as the unweighted moving average of the previous nc keys and values  Pi j=i−nc +1 kj Mem = nc , Pi j=i−nc +1 vj nc  (2.54) Alternatively, we can use a weighted version of moving average Mem =  Pi j=i−nc +1 βj−i+nc kj P nc , j=1 βj Pi j=i−nc +1 βj−i+nc vj Pnc j=1 βj  (2.55) Here {β1 , ..., βnc } are the coefficients, which can be either learned as model parameters or determined via heuristics. For example, they can be set to increasing coefficients (i.e., β1 < β2 < ... < βnc −1 < βnc ) in order to give larger weight to positions that are closer to     vi−nc +1 ki−nc +1     . . 11 .. More formally, we write K[i−nc +1,i] =   and V[i−nc +1,i] =  .. . Sometimes we denote vi ki K[i−nc +1,i] by {ki−nc +1 , ..., ki } and V[i−nc +1,i] by {vi−nc +1 , ..., vi } for notation simplicity. 2.3 Long Sequence Modeling 73 i. We can extend the moving average to include all the positions up to i. This leads to the cumulative average of the keys and values, given in the form Mem =  Pi j=0 kj i+1 , Pi j=0 vj i+1  (2.56) In general, the cumulative average can be written using a recursive formula Memi = (ki , vi ) + i · Memi−1 i+1 (2.57) where Memi and Memi−1 denote the cumulative averages of the current and previous positions, respectively. An advantage of this model is that we only need to store a single key-value pair during inference, rather than storing all the key-value pairs. Note that the above memory models are related to recurrent models, and more advanced techniques have been used to develop alternatives to self-attention mechanisms in Transformers [Ma et al., 2023]. • The memory Mem can also be a neural network. 
At each step, it takes both the previous output of the memory and the current states of the model as input, and produces the new output of the memory. This neural network can be formulated as the function Mem = Update(Skv , Mempre ) (2.58) Here Mem and Mempre represent the outputs of the memory at the current step and the previous step, respectively. Skv is a set of key-value pairs, representing the recent states of the model. This formulation is general and allows us to develop various memory models by selecting different Update(·) and Skv configurations. For example, if Skv only contains the latest key-value pair (ki , vi ) and Update(·) is defined as a recurrent cell, then Eq. (2.58) can be expressed as an RNN-like model Mem = f ((ki , vi ), Mempre ) (2.59) where f (·) is a recurrent cell. Recurrence can also be applied to segment-level modeling for efficiency consideration. A simple approach is that we can divide the sequence into segments, and treat Skv as a segment. Applying recurrent models to Update(·) will result in memory models that operate on segments. A special example is that we define Update(·) as an FIFO function that adds Skv into the memory and removes the oldest key-value segment from the memory, given by Mem = FIFO(Skv , Mempre ) (2.60) Consider a memory which includes two segments, one for current segment, and one for the previous segment. In the attention operation, each position can access the history key-value pairs in two closest consecutive segments. This essentially defines a local memory, but it and its variants have been widely used segment-level recurrent models [Dai et al., 2019; Hutchins et al., 2022; Bulatov et al., 2022]. • The above memory models can be extended to involve multiple memories. An example Generative Models 74 of this approach is compressive Transformer [Rae et al., 2019]. It employs two distinct fixed-size memories: one for modeling local context (denoted by Mem), and the other for modeling and compressing long-term history (denoted by CMem). The KV cache in this model is the combination of Mem and CMem. The attention function can be written as Attcom (qi , Mem, CMem) = Attqkv (qi , [Mem, CMem]) (2.61) where [Mem, CMem] is a combined memory of Mem and CMem. As with other segmentlevel models, the compressive Transformer model operates on segments of the sequence. k as the key-value Each segment is a sequence of ns consecutive tokens, and we denote Skv pairs corresponding to the tokens of the k-th segment. When a new segment arrives, Mem k to Mem, and then is updated in an FIFO fashion: we append the nc key-value pairs in Skv pop the ns oldest key-value pairs from Mem, which is given by k Mem = FIFO(Skv , Mempre ) (2.62) The popped key-value pairs are then used to update the compressive memory CMem. These ns key-value pairs are compressed into ncs key-value pairs via a compression network. CMem is an FIFO which appends the compressed ncs key-value pairs to the tail of the queue, and drops the first ncs key-value pairs of the queue. It is given by k CMem = FIFO(Ckv , CMempre ) (2.63) k represents the set of compressed key-value pairs. Implicit in the compressive where Ckv Transformer model is that local context should be represented explicitly with minimal information loss, while long-range context can be more compressed. • We have already seen that both global and local contexts are useful and can be modeled using attention models. 
This view motivates the extension to attention models for combining both local and long-term memories [Ainslie et al., 2020; Zaheer et al., 2020; Gupta and Berant, 2020]. A simple but widely-used approach is to involve the first few tokens of the sequence in attention, serving as global tokens. This approach is usually applied along with other sparse attention models. An advantage of incorporating global tokens of the sequence is that it helps smooth the output distribution of the Softmax function used in attention weight computation, and thus stabilizes model performance when the context size is very large [Xiao et al., 2024]. One drawback, however, is that using a fixed-size global memory may result in information loss. When dealing with long sequences, we need to enlarge the KV cache for sufficient representations of the context, but this in turn increases the computational cost. Figure 2.6 shows illustrations of the above approaches. Note that, while we focus on optimization of the KV cache here, this issue is closely related to those discussed in the previous section. All of the methods we have mentioned so far can broadly be categorized as efficient attention approaches, which are widely used in various Transformer variants. 2.3 Long Sequence Modeling 75 Memory Size = 4 × 2 ··· Keys ··· Values i−7 i−6 i−5 i−4 i−3 i−2 i−1 i (a) Window-based Cache ki−3 +ki−2 +ki−1 +ki 4 ⇒ vi−3 +vi−2 +vi−1 +vi 4 ⇒ Memory Size = 1 × 2 ··· Keys ··· Values i−7 i−6 i−5 i−4 i−3 i−2 i−1 i (b) Moving Average-based Cache Mem = Update( Skv , Mempre ) Memory Size = 1 × 2 ⇒ ··· Keys ··· Values i−7 i−6 i−5 i−4 i−3 i−2 i−1 i (c) Recurrent Network as Cache Compressed Memory Size = 2 × 2 Memory Size = 4 × 2 ··· Keys ··· Values i−7 i−6 i−5 i−4 i−3 i−2 i−1 i (d) Hybrid Cache (Compressed Memory + Local Memory) Fig. 2.6: Illustrations of fixed-size KV caches in LLMs. Blue boxes represent the keys and values generated during LLM inference, green boxes represent the keys and values stored or encoded in the primary memory, and orange boxes represent the keys and values stored or encoded in the compressed memory. Generative Models 76 2.3.3.2 Memory-based Models The modeling of memories discussed above was based on updates to the KV cache, and the resulting models are typically referred to as internal memories. We now consider another family of models, called external memories, which operate as independent models to access large-scale contexts for LLMs. Many such models are based on memory-based methods which have been extensively discussed in machine learning [Bishop, 2006]. A common example is nearest neighbor algorithms: we store context representations in a datastore, and try to find the most similar stored representations to match a given query. The retrieved context representations are then used to improve attention for this query. Here, we consider the k-nearest neighbors (k-NN) method which is one of the most popular memory-based methods. Since our focus is language modeling in this section, we define a sample in the datastore as a key-value pair corresponding to some context state. Note that “context” is a broad concept here, not just a sequence prefix in text generation. One might, for example, view the entire dataset as the context for predicting tokens. This allows us to retrieve the closest context situation in a set of sequences, rather than a given sequence prefix. Although we will restrict ourselves to context modeling for a single sequence, in this subsection, we discuss a relatively more general case. 
Suppose we have a set of keys {kj } with corresponding values {vj }, and suppose we store these key-value pairs in a vector database12 . For each query qi , we find its k nearest neighbours by growing the radius of the sphere centered as qi until it contains k data points in {kj }. This results in a set of k keys along with their corresponding values, denoted by Memknn . As before, we denote Mem as the local memory for the query, such as the KV cache of neighboring tokens. Our goal is to attend query qi to both the local memory Mem and the long-term memory Memknn . There are, of course, several ways to incorporate Mem and Memknn into the attention model. For example, we might simply combine them to form a single KV cache [Mem, Memknn ], and attend qi to [Mem, Memknn ] via standard QKV attention. Or we might use Mem and Memknn in separate attention steps. An example of such approaches is the model developed by Wu et al. [2021]. It linearly combines the two types of attention, given by Att(qi , Mem, Memknn ) = g ⊙ Attlocal + (1 − g) ⊙ Attknn (2.64) Attlocal = Att(qi , Mem) (2.65) Attknn = Att(qi , Memknn ) (2.66) Here g ∈ Rd is the coefficient vector, which can be the output of a learned gate. Given the k-NN-based memory model described above, the remaining task is to determine which key-value pairs are retained in the datastore. For standard language modeling tasks, we consider the previously seen tokens in a sequence as the context, so we can add the keys and values of all these tokens into the datastore. In this case, the resulting k-NN-based attention model is essentially equivalent to a sparse attention model [Gupta et al., 2021]. Alternatively, we can extend the context from one sequence to a collection of sequences. For example, we might collect all key-value pairs across the sequences in a training dataset and add them to the datastore to model a larger context. Thus, LLMs can predict tokens based on a 12 A vector database, or vector store, is a database that provides highly optimized retrieval interfaces for finding stored vectors that closely match a query vector. 2.3 Long Sequence Modeling 77 generalized context. A problem with this approach is that the computational cost would be large if many sequences are involved. Since these sequences are part of our training data, we can build and optimize an index for the vectors in the datastore before running the LLMs. As a result, the retrieval of similar vectors can be very efficient, as in most vector databases. In fact, all the above-mentioned methods can be viewed as instances of a retrieval-based approach. Instead of using retrieval results to improve attention, we can apply this approach in other ways as well. One application of k-NN-based search is k-NN language modeling (or k-NN LM) [Khandelwal et al., 2020]. The idea is that, although it is attempting to extend the context used in self-attention by incorporating nearest neighbors in representation learning, in practice, similar hidden states in Transformers are often highly predictive of similar tokens in subsequent positions. In k-NN LM, each item in the datastore is a key-value tuple (z, w), where z represents a hidden state of the LLM at a position, and w represents the corresponding prediction. A typical way to create the datastore is to collect the output vector of the Transformer layer stack and the corresponding next token for each position of each sequence in a training dataset. During inference, we have a representation hi given a prefix. 
Given this representation, we first search the datastore for k closest matching data items {(z1 , w1 ), ..., (zk , wk )}. Here {w1 , ..., wk } are thought of as reference tokens for prediction, and thus can be used to guide the token prediction based on hi . One common way to make use of reference tokens is to define a distribution over the vocabulary V, h i Prknn (·|hi ) = Softmax( −d0 · · · −d|V | ) (2.67) where dv equals the distance between hi and zj if wj equals the v-th entry of V , and equals 0 otherwise. We use a linear function with a coefficient λ that interpolates between the retrievalbased distribution Prknn (·|hi ) and the LLM output distribution Prlm (·|hi ) Pr(·|hi ) = λ · Prknn (·|hi ) + (1 − λ) · Prlm (·|hi ) (2.68) Then, as usual, we can choose the next token y by maximizing the probability Pr(y|hi ). As with information retrieval (IR) systems, the datastore can also manage texts and provide access to relevant texts for a query. For example, we can store a collection of text documents in a search engine with full-text indexing, and then search it for documents that match a given text-based query. Applying IR techniques to LLMs leads to a general framework called retrievalaugmented generation (RAG). The RAG framework works as follows. We use the context x as the query and find the k most relevant document pieces {c1 , ..., ck } from the datastore via efficient IR techniques13 . These search results are combined with the original context via a prompting 13 In piratical applications, queries are typically generated using a query generation system, which may expand it with variations of tokens and query intent. Generative Models 78 template g(·)14 , resulting in an augmented input for the LLM x′ = g(c1 , ..., ck , x) (2.69) Then, we use x′ as the context and predict the following text using the model Pr(y|x′ ). One advantage of RAG is that we need not modify the architecture of LLMs, but instead augment the input to LLMs via an additional IR system. Figure 2.7 shows a comparison of the use of different external memories in LLMs. 2.3.3.3 Memory Capacity A memory model in LLMs, in the form of a simple key-value cache or a datastore, can broadly be seen as an encoder of contextual information. Ideally, before we say that a memory model is representative of the entire context in token prediction, we need to make sure that the model can accurately represent any part of the context. The standard KV cache is one such model that completely stores all past history. In this case, the model is said to have adequate capacity for memorizing the context. In many practical applications, however, complete memorization is not required. Instead, the goal is to enable LLMs to access important contextual information. As a result, efficient and compressed memory models are developed, as described in this section. Note that, the longer the sequence, the more difficult it becomes for a low-capacity memory model to capture important contextual information. It is therefore common practice to simply increase the model capacity when processing long contexts. While high-capacity models are generally favorable, they are difficult to train and deploy. A challenging scenario is that the tokens arrive in a stream and the context continuously grows. Developing LLMs for such tasks is difficult as we need to train Transformers on extremely long sequences. A possible way to address this difficulty is to use non-parametric methods, such as retrieval-based methods. 
For example, as discussed above, we can use a vector database to store previously generated key-value pairs, and thus represent the context by this external memory model. Although this approach side-steps the challenge of representing long context in Transformers, building and updating external memory models are computationally expensive. These models are more often used in problems where the context is given in advance and fixed during inference, and hence unsuitable for streaming context modeling. In cases where the size of the context continuously grows, applying fixed-size memory models is a commonly used approach. For example, in recurrent models, a sequence of arbitrary length can be summarized into a set of hidden states by which we have a fixed computational cost per step. While recurrent models were initially found to be not very good at handling long-distance dependencies in sequence modeling in early applications of deep learning to NLP, recent advancements have shown that their variants are now effective in modeling extremely long sequences. [Bulatov et al., 2022; Hutchins et al., 2022; Munkhdalai et al., 2024; Ma et al., 2024]. 14 For example, the template could be: message = {*c1 *} ... {*ck *} input: {*x*} output: 2.3 Long Sequence Modeling 79 g ⊙ Att(qi , Mem) + (1 − g) ⊙ Att(qi , Memknn ) Att(qi , Memknn ) Att(qi , Mem) ··· ··· qi KV Cache k Nearest Neighbors Keys/values in LLM Datastore Search Keys/values in Datastore (a) k-NN Search Augmented Attention Output Distribution Distribution Pr(·) Distribution Prknn (·) ··· Att(qi , Mem) Att(qi , Mem) ··· ··· qi KV Cache k Nearest Neighbors Keys/values in LLM Datastore Keys in Datastore Search Predicted Tokens (b) k-NN Language Modeling LLM c1 = Deep network is ... Message: deep network ... machine learning ... c2 = Machine learning is ... What is deep learning? ··· k Nearest Neighbors Datastore Search Input Context: x = What is deep learning? (c) Retrieval-augmented Generation Fig. 2.7: Illustrations of external memories (or datastores) for language modeling. Generative Models 80 There is no general definition of memory capacity in LLMs. A simple approach might consider how much storage is used to retain contextual information. For example, memory capacity could be defined by the size of the KV cache in Transformers or the vector database used in retrievalbased methods. A related concept is model complexity. In machine learning, there are several ways to define the model complexity of a model. One of the simplest methods is by counting the number of parameters. However, it should be emphasized that the memory models discussed here primarily serve to store information, rather than add trainable parameters. Therefore, a model with a large memory capacity is not necessarily more complex. Nevertheless, in practice determining the capacity of a memory model is not straightforward. In general, we need to control the trade-off between maximizing the performance and controlling the memory footprint. 2.3.4 Sharing across Heads and Layers In Transformers, the KV cache is a data structure that can be dynamically adjusted along multiple dimensions, such as heads, layers, and sequence length. For example, consider an LLM with L layers. Each layer has τ attention heads, and each head produces a dh -dimensional output. During inference, we store the keys and values for up to m tokens. The space complexity of this caching mechanism is O(L · τ · dh · m). 
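To get a feel for what this complexity means in practice, the following back-of-the-envelope calculation estimates the KV cache size for an assumed configuration; all numbers are illustrative and do not describe any particular model.

```python
# Rough KV cache size for an assumed configuration (illustrative numbers only).
L, tau, d_h, m = 32, 32, 128, 8192   # layers, heads per layer, head dim, cached tokens
bytes_per_value = 2                  # FP16 storage
kv_bytes = 2 * L * tau * d_h * m * bytes_per_value   # factor 2: keys and values
print(f"{kv_bytes / 2**30:.1f} GiB per sequence")    # 4.0 GiB
```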
As we have seen previously, this complexity can be reduced by caching the keys and values for fewer tokens. For example, in sliding window attention, a fixed-size window is used to cache the keys and values of the local context. This model has a space complexity of $O(L \cdot \tau \cdot d_h \cdot m_w)$, with $m_w$ being the size of the window.

In addition to reducing $m$, we can also decrease the size of the KV cache along other dimensions. A widely-used approach is to enable sharing across heads in multi-head self-attention. Recall from Section 2.1.1 that multi-head self-attention uses multiple sets of queries, keys, and values (each set is called a head), each performing the QKV attention mechanism as usual. This can be expressed as

$\mathrm{Output} = \mathrm{Merge}(\mathrm{head}_1, ..., \mathrm{head}_{\tau}) \mathbf{W}_{head} \qquad (2.70)$

where $\mathrm{head}_j \in \mathbb{R}^{d_h}$ is computed using the standard QKV attention function

$\mathrm{head}_j = \mathrm{Att}_{qkv}(\mathbf{q}_i^{[j]}, \mathbf{K}_{\le i}^{[j]}, \mathbf{V}_{\le i}^{[j]}) \qquad (2.71)$

Here, $\mathbf{q}_i^{[j]}$, $\mathbf{K}_{\le i}^{[j]}$, and $\mathbf{V}_{\le i}^{[j]}$ are the query, keys, and values that are projected onto the $j$-th feature sub-space. So this model can be interpreted as performing attention on a group of feature sub-spaces in parallel (see Figure 2.8 (b)). The KV cache needs to retain the keys and values for all these heads, that is, $\{(\mathbf{K}_{\le i}^{[1]}, \mathbf{V}_{\le i}^{[1]}), ..., (\mathbf{K}_{\le i}^{[\tau]}, \mathbf{V}_{\le i}^{[\tau]})\}$.

One refinement to the multi-head attention model, called multi-query attention (MQA), is to share keys and values across heads, while allowing queries to be unique for each head [Shazeer, 2019]. In MQA, there is a single set of keys and values $(\mathbf{K}_{\le i}, \mathbf{V}_{\le i})$. In addition, there are $\tau$ queries $\{\mathbf{q}_i^{[1]}, ..., \mathbf{q}_i^{[\tau]}\}$, each corresponding to a different head. For each head, we have

$\mathrm{head}_j = \mathrm{Att}_{qkv}(\mathbf{q}_i^{[j]}, \mathbf{K}_{\le i}, \mathbf{V}_{\le i}) \qquad (2.72)$

Figure 2.8 (c) illustrates this model. By sharing keys and values, the size of the KV cache would be $O(L \cdot d_h \cdot m)$.

Fig. 2.8: Illustration of QKV attention based on different multi-head and sharing mechanisms: (a) single-head attention, and (b-e) attention with multiple heads, namely (b) multi-head attention, (c) multi-query attention, (d) grouped query attention, and (e) cross-layer multi-head attention, in which layer $l$ shares the keys and values of layer $l-1$.

Grouped query attention (GQA) is a natural extension to multi-head attention and MQA [Ainslie et al., 2023]. In GQA, heads are divided into $n_g$ groups, each corresponding to a shared set of keys and values. Hence we have $n_g$ sets of keys and values $\{(\mathbf{K}_{\le i}^{[1]}, \mathbf{V}_{\le i}^{[1]}), ..., (\mathbf{K}_{\le i}^{[n_g]}, \mathbf{V}_{\le i}^{[n_g]})\}$. See Figure 2.8 (d) for an illustration. Let $g(j)$ be the group id for the $j$-th head. The GQA model can be expressed as

$\mathrm{head}_j = \mathrm{Att}_{qkv}(\mathbf{q}_i^{[j]}, \mathbf{K}_{\le i}^{[g(j)]}, \mathbf{V}_{\le i}^{[g(j)]}) \qquad (2.73)$

The size of the KV cache of GQA is $O(L \cdot n_g \cdot d_h \cdot m)$. One benefit of GQA is that we can trade off between computational efficiency and model expressiveness by adjusting $n_g$. When $n_g = \tau$, the model becomes the standard multi-head attention model. By contrast, when $n_g = 1$, it becomes the MQA model.

Sharing can also be performed across layers. Such a method falls into the family of shared-weight and shared-activation methods, which have been extensively used in Transformers [Dehghani et al., 2018; Lan et al., 2020]. For example, one can share KV activations or attention weights across layers to reduce both computation and memory footprints [Xiao et al., 2019; Brandon et al., 2024].
Figure 2.8 (e) shows an illustration of this method, where a query in a layer directly accesses the KV cache of a lower-level layer. 2.3.5 Position Extrapolation and Interpolation Since Transformer layers are order-insensitive to input, we need some way to encode positional information in the input tokens. To do this, it is common to add positional embeddings to token embeddings, and then feed these combined embeddings into the Transformer layer stack as input. In this case, the embedding at position i can be expressed as ei = xi + PE(i) (2.74) where xi ∈ Rd denotes the token embedding, and PE(i) ∈ Rd denotes the positional embedding. In general, the token embedding xi is a position-independent vector, and so the positional embedding PE(i) is used to encode the positional context. A straightforward approach is to treat PE(i) as a learnable variable and train it alongside other model parameters. In this way, we can learn a unique representation for each position, and thus distinguish the tokens appearing at different positions of a sequence. Representations of positions using learned vectors can work well in tasks where the sequences at training and test times are of similar lengths. In practice, however, we often impose length restrictions on sequences during training to prevent excessive computational costs, but wish to apply the trained models to much longer sequences during inference. In this case, using learned positional embeddings has obvious drawbacks, as there are no trained embeddings for positions that are not observed in the training phase. An alternative approach to modeling positional information is to develop positional embeddings that can generalize: once trained, the embedding model can be used to handle longer sequences. Suppose that we train a positional embedding model on sequences with a maximum length of ml , and we wish to apply the trained model to a sequence of length m (m >> ml ). If the embedding model is limited in the range of positions that we can observe from training data, then this model will simply fail to deal with new data outside that range. See Figure 2.9 (a) for an illustration where the learned embedding model cannot model data points outside the training domain if it lacks the ability to extrapolate. There are several approaches to making positional embedding models generalize. They can be grouped into two classes. • Extrapolation. The model learned on observed data points (i.e., positions) can be directly employed to assign meaningful values to data points beyond the original range. For example, suppose we have a series of numbers 1, 2, ..., 10, and we want to understand the meaning of a new number, 15. Knowing that these numbers are natural numbers used for ordering, we can easily infer that 15 is a number that follows 10, even though 15 has not 2.3 Long Sequence Modeling 83 Value 1 0 −1 0 1,024 Sequence Length (a) Encoding with No Generalization 2,048 0 1,024 Sequence Length (b) Extrapolation 2,048 0 1,024 Sequence Length (c) Interpolation 2,048 Value 1 0 −1 Value 1 0 −1 Fig. 2.9: Illustrations of different positional embedding methods for a range of positions. Blue points represent the positions that have been observed during training, and red points represent the positions that are newly observed at test time. In sub-figure (a), the encoding model only memorizes the points seen during training, and cannot generalize. In sub-figures (b) and (c), the model can generalize through extrapolation and interpolation. been observed before. 
Figure 2.9 (b) shows an example of this approach, where a function is learned to fit the data points within a specific range and then applied to estimate the values of data points outside that range. • Interpolation. This approach maps a larger range of data points into the original observation range. For example, suppose we have a model designed for numbers in the range [1, 10]. When given a new range of [1, 20], we can scale this down by dividing every number by 2, thereby fitting all numbers into [1, 10]. This scaling allows us to use the model trained on the range [1, 10] to describe data points in the expanded range of [1, 20]. See Figure 2.9 (c) for an illustration of this approach. In fact, positional embeddings in many systems have achieved some level of generalization. For example, sinusoidal encoding, the most common positional embedding method, employs sine and cosine functions that can naturally extend to sequences of any length. Although this approach might seem direct and simple, it does not perform well when we significantly extend the sequences for processing. In this subsection, we will discuss several alternative methods based on either extrapolation or interpolation. Generative Models 84 2.3.5.1 Attention with Learnable Biases One problem with Eq. (2.74) is that the embedding model treats each token independently and therefore ignores the distance between different tokens. A common improvement to this model, called relative positional embedding, is to consider the pairwise relationship between tokens [Shaw et al., 2018]. The general idea behind this is to obtain the offset between any pair of positions and incorporate it into the self-attention model. One of the simplest forms of self-attention with relative positional embedding is given by Attqkv (qi , K≤i , V≤i ) = i X α(i, j)vj (2.75) j=0 α(i, j) = Softmax( qi kjT + PE(i, j) √ + Mask(i, j)) d (2.76) The only difference between this model and the original self-attention model is that a bias term PE(i, j) is added to the query-key product in this new model. Intuitively, PE(i, j) can be interpreted as a distance penalty for the pair of positions i and j. As i moves away from j, the value of PE(i, j) decreases. PE(i, j) can be defined in several different ways. Here, we consider the T5 version of relative positional embedding, called the T5 bias [Raffel et al., 2020]. For each pair of query qi and key kj , the offset between them is defined to be15 d(i, j) = i − j (2.77) A simple design for the bias PE(i, j) is to share the same learnable variable for all query-key pairs with the same offset, i.e., PE(i, j) = ui−j , where ui−j is the variable corresponding to the offset i − j. However, simply assigning a unique value to each offset will restrict this model to observed offsets. When i − j is larger than the maximum trained offset, the model cannot generalize. The T5 bias instead adopts a generalization of this model. Rather than assigning each querykey offset a unique bias term, it groups difference offsets into “buckets”, each corresponding to one learnable parameter. More specifically, the bias terms for nb + 1 buckets are given as follows. • For buckets 0 to nb2+1 − 1, each bucket corresponds to one offset, that is, bucket 0 ↔ offset 0, bucket 1 ↔ offset 1, bucket 2 ↔ offset 2, and so on. We express this as b(i − j) = i − j. • For buckets nb2+1 to nb , the size of each bucket increases logarithmically. 
For example, the bucket number for a given offset i − j ≥ nb2+1 can be defined as b(i − j) = log(i − j) − log( nb2+1 ) nb + 1 nb + 1 +⌊ ⌋ · 2 2 log(distmax ) − log( nb2+1 ) (2.78) where the parameter distmax is typically set to a relatively large number to indicate the 15 For language modeling, a query is only allowed to attend to its left-context, and so we have i − j ≥ 0. In the more general case of self-attention, where a token can attend to all tokens in the sequence, we may have negative offsets when i < j. 2.3 Long Sequence Modeling 85 logarithmically increased bucket size fixed bucket size Bucket 0 1 2 3 Offset 0 1 2 3 ··· 14 15 16 17 14 15 16 ∼ 20 21 ∼ 26 18 27 ∼ 33 ··· 32 802 ∼ ∞ (i − j) Fig. 2.10: Illustration of distributing query-key offsets into buckets in the T5 model (nb = 32 and distmax = 1024). Boxes represent buckets. In the first half of the buckets, we use a fixed bucket size. In the second half of the buckets, we increase the bucket size logarithmically. The last bucket contains all the query-key offsets that are not covered by previous buckets. maximum offset we may encounter. • When i − j > distmax , we place i − j in the last bucket. In other words, bucket nb contains all the offsets that are not assigned to the previous buckets. Together, these can be expressed as the function b(i − j) =   i − j n +1  min(nb , b2 + ⌊ n +1 log(i−j)−log( b2 ) n +1 log(distmax )−log( b2 ) · nb2+1 ⌋) 0 ≤ i − j < nb2+1 i − j ≥ nb2+1 (2.79) Figure 2.10 shows an illustration of these buckets. We see that in the first half of the buckets, each bucket is associated with only one value of i − j, while in the second half, the bucket size increases as i − j grows. The last bucket is designed to handle sequences of arbitrarily long lengths. All PE(i, j)s in a bucket share the same bias term ub(i−j) . Substituting PE(i, j) = ub(i−j) into Eq. (2.76), the attention weight for qi and kj becomes16 α(i, j) = Softmax( qi kjT + ub(i−j) √ + Mask(i, j)) d (2.81) The parameters {u0 , ..., unb } are learned as common parameters during training. It should be emphasized that this model can generalize to long sequences. This is because PE(i, j)s with similar query-key offsets share the same parameter, and this sharing strategy is particularly important for achieving good generalization, given that large query-key offsets are rare in training. In practice, we often set nb to a moderate number, and thus it can help control the overfitting of positional embedding models. 16 Note that, in Raffel et al. [2020]’s T5 model, the rescaling operation for the query-key product is removed. The attention weight α(i, j) is then given by α(i, j) = Softmax(qi kjT + ub(i−j) + Mask(i, j)) (2.80) Generative Models 86 2.3.5.2 Attention with Non-learned Biases Relative positional embedding models are based on a set of learned biases for the query-key product in self-attention. An alternative approach is to give these biases fixed values via heuristics, rather than training them on a particular dataset. One benefit of this heuristics-based approach is that it does not rely on a training process and thus can be directly applied to any sequences once the biases are set. One example of such an approach is Press et al. [2022]’s approach, called attention with linear biases or ALiBi for short. In the ALiBi approach, the bias term is defined as the negative scaled query-key offset PE(i, j) = −β · (i − j) = β · (j − i) (2.82) where β is the scaling factor. 
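To make this bias concrete, the short sketch below builds the matrix of ALiBi biases $-\beta \cdot (i - j)$ for every causal query-key pair of a toy sequence; the value of $\beta$ is an arbitrary example here, and head-specific settings of $\beta$ are discussed next.

```python
import numpy as np

# Sketch: the ALiBi bias PE(i, j) = -beta * (i - j) for all causal query-key
# pairs of a short sequence. beta is an arbitrary example value here.
m, beta = 6, 0.25
i = np.arange(m)[:, None]             # query positions
j = np.arange(m)[None, :]             # key positions
bias = -beta * (i - j).astype(float)  # 0 on the diagonal, growing penalty for distant keys
print(np.round(np.tril(bias), 2))     # only j <= i is used in causal attention
```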
Adding this term to the query-key product, we obtain a new form of attention weights α(i, j) = Softmax( qi kjT + β · (j − i) √ + Mask(i, j)) d (2.83) This model can be interpreted as adding a fixed penalty to qi kjT whenever j moves one step away from i. So we do not need to adapt it to a range of sequence lengths, and can employ it to model arbitrarily long sequences. See Figure 2.11 for a comparison of the T5 bias and the ALiBi bias. In general, the scalar β should be tuned on a validation dataset. However, Press et al. [2022] found that setting β to values decreasing geometrically by a factor of 21a for multi-head attention performs well on a variety of tasks. Specifically, for a self-attention sub-layer involving nhead heads, the scalar for the k-th head is given by βk = 1 8 2k (2.84) The ALiBi approach provides a simple form of relative positional embeddings. There are other similar methods for designing query-key biases using the offset i − j. Table 2.4 shows a comparison of such biases. As an aside it is worth noting that the form of the right-hand side of Eq. (2.82) is very similar to length features used in conventional feature-based systems. For example, in statistical machine translation systems, such features are widely used to model word reordering problems, resulting in models that can generalize well across different translation tasks [Koehn, 2010]. 2.3.5.3 Rotary Positional Embedding As with sinusoidal embeddings, rotary positional embeddings are based on hard-coded values for all dimensions of an embedding [Su et al., 2024]. Recall that in the sinusoidal embedding model, positions are represented as combinations of sine and cosine functions with different frequencies. These embeddings are then added to token embeddings to form the inputs to the Transformer 2.3 Long Sequence Modeling 87 qi kjT Bias (ub(i−j) ) q0 kT 0 u0 T q1 kT 0 q1 k1 u1 u0 T T q2 kT 0 q2 k1 q2 k2 u2 u1 u0 u2 u2 u1 u0 T T T T q4 kT 0 q4 k1 q4 k2 q4 k3 q4 k4 u3 u2 u2 u1 u0 T T T T T q5 kT 0 q5 k1 q5 k2 q5 k3 q5 k4 q5 k5 u3 u3 u2 u2 u1 u0 T T T T T T q6 kT 0 q6 k1 q6 k2 q6 k3 q6 k4 q6 k5 q6 k6 u3 u3 u3 u2 u2 u1 + T T T q3 kT 0 q3 k1 q3 k2 q3 k3 u0 (a) The T5 bias (nb = 3 and distmax = 5) qi kjT Bias (−β(i − j)) q0 kT 0 0 T q1 kT 0 q1 k1 −1β T T q2 kT 0 q2 k1 q2 k2 −2β −1β + T T T q3 kT 0 q3 k1 q3 k2 q3 k3 0 0 −3β −2β −1β 0 T T T T q4 kT 0 q4 k1 q4 k2 q4 k3 q4 k4 −4β −3β −2β −1β 0 T T T T T q5 kT 0 q5 k1 q5 k2 q5 k3 q5 k4 q5 k5 −5β −4β −3β −2β −β 0 T T T T T T q6 kT 0 q6 k1 q6 k2 q6 k3 q6 k4 q6 k5 q6 k6 −6β −5β −4β −3β −2β −β 0 (b) The ALiBi bias Fig. 2.11: Query-key products with biases (above = the T5 bias and below = the ALiBi bias). The color scale of the biases ranges from light blue denoting small absolute values to deep blue denoting large absolute values. layer stack. Rotary positional embeddings instead model positional context as rotations to token embeddings in a complex space. This leads to a model expressed in the form of multiplicative embeddings ei = xi R(i) (2.85) where R(i) ∈ Rd×d is the rotation matrix representing the rotations performed on the token embedding xi ∈ Rd . For simplicity, we will first consider embeddings with only two dimensions and return to a discussion hof the more i general formulation later. Suppose we have a 2-dimensional token embedding x = x1 x2 . We can represent it as a vector in a plane, originating at the origin (0, 0) and terminating at (x1 , x2 ). 
A counterclockwise rotation of this vector refers to an operation of Generative Models 88 Entry Query-Key Bias (PE(i, j)) T5 [Raffel et al., 2020] ALiBi [Press et al., 2022] ub(i−j) −β · ( i − j ) −β1 ( i − j )β2 Kerple [Chi et al., 2022] (power) −β1 log(1 + β2 ( i − j )) ¯ Pd/2 Sandwich [Chi et al., 2023] k=1 cos FIRE [Li et al., 2024] (logarithmic)  ¯ ( i − j )/100002k/d  f ψ( i − j )/ψ(max(mlen , i)) ¯ and mlen are hyper-parameters. In the T5 Table 2.4: Query-key biases as relative positional embeddings. β, β1 , β2 , d, model, b(i − j) denotes the bucket assigned to i − j. In the FIRE model, ψ(·) is a monotonically increasing function such as ψ(x) = log(cx + 1), and f (·) is an FFN. moving the vector around the origin while maintaining its magnitude, as shown in Figure 2.12 (a). The degree of rotation is usually defined by a specific angle, denoted by θ. The rotation can be expressed mathematically in the form Ro(x, θ) = xRθ h = x1 x2 h = " i " cos θ sin θ − sin θ cos θ # cos θ · x1 − sin θ · x2 sin θ · x1 + cos θ · x2 # i (2.86) cos θ sin θ is the rotation matrix. If two or more rotations are performed on the where Rθ = − sin θ cos θ same vector, we can rotate the vector further. This follows from the fact that the composition of successive rotations is itself a rotation. More formally, rotating a vector by an angle θ for t times can be expressed as Ro(x, tθ) = xRtθ = h cos tθ · x1 − sin tθ · x2 sin tθ · x1 + cos tθ · x2 i (2.87) If we interpret t as the position of a token represented by x in a sequence, then we will find that the above equation defines a simple positional embedding model. As shown in Figure 2.12 (b), we start moving the token from position 0. Each time we move one step forward, the vector is rotated by the angle θ. Upon arriving at the position t, the representation of the token with positional context is given by Ro(x, iθ). As the rotations do not change the magnitude of the embedding, the original “meaning” of the token is retained. The positional information is injected into the embedding, when it gets rotated. A popular way to understand h vectori rotation is to define it in complex spaces. It is easy to transform each vector x = x1 x2 in the 2D Euclidean space R2 to a complex number x′ = x1 + ix2 in the complex space C via a bijective linear map. Then, the rotation of x with the angle tθ corresponds to the multiplication by eitθ . Given that eitθ = cos tθ + i sin tθ, the rotation 2.3 Long Sequence Modeling 89 x2 x2 vector x rotated vector xRθ x xRθ θ θ θ x1 xR2θ x1 θ xR3θ (b) Multi-step Rotation (a) Single-step Rotation The1 cat2 is3 sleeping4 peacefully5 in6 the7 warm8 sunlight9 .10 x2 sleeping4 7θ cat2 7θ x1 sleeping11 Every1 afternoon2 ,3 you4 ’ll5 find6 that7 the8 cat9 is10 sleeping11 on12 my13 bed14 .15 cat9 (c) Angles between embeddings of two tokens at different positions Fig. 2.12: Illustrations of vector rotations in a plane. Sub-figures (a) and (b) show rotations of a vector in a single step and multiple steps, respectively. Sub-figure (c) shows the embeddings of tokens cat and sleeping in two different sentences. We show these sentences with a subscript affixed to each token to indicate its position. If we represent tokens as vectors, we can add positional information by rotating these vectors. This rotation preserves the “distances” between the vectors. For example, given that the distance between cat and sleeping is the same in both sentences, the angle between their embeddings also remains the same during rotation. 
Given that e^{itθ} = cos tθ + i sin tθ, the rotation operation can be re-expressed in the form

\mathbf{x}\mathbf{R}_{t\theta} \;\mapsto\; x' e^{it\theta} \;=\; (x_1 + ix_2)(\cos t\theta + i \sin t\theta) \;=\; \cos t\theta \cdot x_1 - \sin t\theta \cdot x_2 + i(\sin t\theta \cdot x_1 + \cos t\theta \cdot x_2)    (2.88)

Here we denote the token representation x′ e^{itθ} by C(x, tθ). The inner product of the representations of the tokens at positions t and s can be written as

\langle \mathrm{C}(\mathbf{x},t\theta), \mathrm{C}(\mathbf{y},s\theta) \rangle \;=\; (x' \overline{y'}) e^{i(t-s)\theta}    (2.89)

where \overline{y'} is the complex conjugate of y′. As can be seen, the result of this inner product involves a term t − s, and so it can model the offset between the two tokens.

Now we go back to representations in the 2D Euclidean space. The dot-product of Ro(x, tθ) and Ro(y, sθ) can be written as a function of (t − s)θ

\mathrm{Ro}(\mathbf{x},t\theta)[\mathrm{Ro}(\mathbf{y},s\theta)]^{\mathrm{T}} \;=\; \mathbf{x}\mathbf{R}_{t\theta}[\mathbf{y}\mathbf{R}_{s\theta}]^{\mathrm{T}} \;=\; \mathbf{x}\mathbf{R}_{t\theta}[\mathbf{R}_{s\theta}]^{\mathrm{T}}\mathbf{y}^{\mathrm{T}} \;=\; \mathbf{x}\mathbf{R}_{(t-s)\theta}\mathbf{y}^{\mathrm{T}}    (2.90)

Given this result, if we consider Ro(x, tθ) and Ro(y, sθ) as the query and the key, then the self-attention operation will implicitly involve the modeling of relative positional context.

This rotary positional embedding can be extended to multi-dimensional embeddings. For a d-dimensional token embedding x = [x1  x2  ...  xd], we can treat it as a d/2-dimensional complex vector x′ = [x′1  x′2  ...  x′_{d/2}] = [x1 + ix2  x3 + ix4  ...  x_{d−1} + ix_d], where each consecutive pair of items forms a complex number. Then, the rotary positional embedding in the complex space is given by

\mathrm{C}(\mathbf{x},t\theta) \;=\; \sum_{k=1}^{d/2} x'_k e^{it\theta_k} \vec{e}_k    (2.91)

where \vec{e}_k is the standard basis vector with a single non-zero value in the k-th coordinate and 0's elsewhere [Biderman et al., 2021]. Although this formula involves a complicated expression, its equivalent form in the d-dimensional Euclidean space is relatively easy to understand. We can write it as

\mathrm{Ro}(\mathbf{x},t\theta) \;=\; \begin{bmatrix} x_1 & x_2 & \ldots & x_d \end{bmatrix} \begin{bmatrix} \mathbf{R}_{t\theta_1} & & & \\ & \mathbf{R}_{t\theta_2} & & \\ & & \ddots & \\ & & & \mathbf{R}_{t\theta_{d/2}} \end{bmatrix}    (2.92)

where \mathbf{R}_{t\theta_k} = \begin{bmatrix} \cos t\theta_k & \sin t\theta_k \\ -\sin t\theta_k & \cos t\theta_k \end{bmatrix}, and θ = [θ1, ..., θ_{d/2}] are the parameters for controlling the angles of rotations in different dimensions. Typically, θ_k is set to 10000^{−2(k−1)/d}, which is analogous to the setting in sinusoidal embeddings.

In a practical implementation, Eq. (2.92) can be rewritten into a form that relies solely on the element-wise product and addition of vectors

\mathrm{Ro}(\mathbf{x},t\theta) \;=\; \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{d-1} \\ x_d \end{bmatrix}^{\mathrm{T}} \odot \begin{bmatrix} \cos t\theta_1 \\ \cos t\theta_1 \\ \vdots \\ \cos t\theta_{d/2} \\ \cos t\theta_{d/2} \end{bmatrix}^{\mathrm{T}} + \begin{bmatrix} -x_2 \\ x_1 \\ \vdots \\ -x_d \\ x_{d-1} \end{bmatrix}^{\mathrm{T}} \odot \begin{bmatrix} \sin t\theta_1 \\ \sin t\theta_1 \\ \vdots \\ \sin t\theta_{d/2} \\ \sin t\theta_{d/2} \end{bmatrix}^{\mathrm{T}}    (2.93)

Finally, we rewrite Eq. (2.85) to obtain the form of the embedding at position i

\mathbf{e}_i \;=\; \mathrm{Ro}(\mathbf{x}_i, i\theta)    (2.94)
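The element-wise form in Eq. (2.93) is what most implementations use. Below is a minimal NumPy sketch (function names are ours). It also checks the property in Eq. (2.90): the query-key product of rotated vectors depends only on the offset t − s.

import numpy as np

def rope(x, t, base=10000.0):
    # Element-wise form of Eq. (2.93); x is a d-dimensional embedding (d even), t is the position.
    d = x.shape[-1]
    k = np.arange(d // 2)
    theta = base ** (-2.0 * k / d)       # theta_k = base^(-2(k-1)/d), with k starting from 0 here
    angles = np.repeat(t * theta, 2)     # [t*theta_1, t*theta_1, t*theta_2, t*theta_2, ...]
    x_swap = np.empty_like(x)            # [-x_2, x_1, -x_4, x_3, ...]
    x_swap[0::2] = -x[1::2]
    x_swap[1::2] = x[0::2]
    return x * np.cos(angles) + x_swap * np.sin(angles)

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Eq. (2.90): the dot-product depends only on the offset t - s (here 7 - 3 = 14 - 10 = 4)
print(np.isclose(rope(q, 7) @ rope(k, 3), rope(q, 14) @ rope(k, 10)))   # True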
2.3.5.4 Position Interpolation

In position interpolation, our goal is to map the positions in the new sequence to match the observed range in training. Suppose the sequence length for training ranges from 0 to m_l. When m > m_l at test time, we represent the positions in [0, m] such that our representations fit [0, m_l].

To illustrate, consider the rotary positional embedding model described above. The embedding of each token is described by a model Ro(x_i, iθ) in which θ = [θ1, ..., θ_{d/2}] are the parameters. Ro(x_i, iθ) can be cast in the form of a linear combination of two periodic functions (see Eq. (2.93))

\cos i\theta \;=\; \begin{bmatrix} \cos i\theta_1 & \ldots & \cos i\theta_{d/2} \end{bmatrix}    (2.95)
\sin i\theta \;=\; \begin{bmatrix} \sin i\theta_1 & \ldots & \sin i\theta_{d/2} \end{bmatrix}    (2.96)

θ_k is an exponential function of k and takes the form

\theta_k \;=\; b^{-\frac{2(k-1)}{d}}    (2.97)

where b is the base. The period of cos iθ_k and sin iθ_k is

T_k \;=\; 2\pi \cdot b^{\frac{2(k-1)}{d}}    (2.98)

The key idea behind position interpolation is to adjust this period so that the new positions can be encoded within the range [0, m_l]. One way to achieve this is to scale up T_k by m/m_l, given by

T'_k \;=\; \frac{m}{m_l} \cdot 2\pi \cdot b^{\frac{2(k-1)}{d}}    (2.99)

Hence all points in [0, m] are compressed into [0, m_l]. This linear scaling can be easily realized by modifying the input to the embedding model [Chen et al., 2023c]. The new model with linear positional interpolation is given by

\mathrm{Ro}'(\mathbf{x}_i, i\theta) \;=\; \mathrm{Ro}\Big(\mathbf{x}_i, \frac{m_l}{m} i\theta\Big)    (2.100)

Another method of positional interpolation is to scale the base17. Suppose that the base b is scaled by λ. We wish the period of this new model in the last dimension of θ (i.e., dimension d/2) to be equal to that of the linear positional interpolation model. This can be expressed as

2\pi \cdot (\lambda b)^{\frac{2(\frac{d}{2}-1)}{d}} \;=\; \frac{m}{m_l} \cdot 2\pi \cdot b^{\frac{2(\frac{d}{2}-1)}{d}}    (2.101)

Solving this equation, we obtain

\lambda \;=\; \Big(\frac{m}{m_l}\Big)^{\frac{d}{2(\frac{d}{2}-1)}} \;=\; \Big(\frac{m}{m_l}\Big)^{\frac{d}{d-2}}    (2.102)

This gives an embedding model

\mathrm{Ro}'(\mathbf{x}_i, i\theta) \;=\; \mathrm{Ro}(\mathbf{x}_i, i\theta')    (2.103)

where

\theta' \;=\; \begin{bmatrix} (\lambda b)^{-\frac{0}{d}} & (\lambda b)^{-\frac{2}{d}} & \ldots & (\lambda b)^{-\frac{d-2}{d}} \end{bmatrix}    (2.104)

Note that scaling the base provides a non-uniform method for scaling the periods across different dimensions of θ. This method has been found to be helpful for extending LLMs to longer sequences, and several improvements have been developed [Peng et al., 2024; Ding et al., 2024].

17 This method was first proposed in https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/

2.3.6 Remarks

In this section, we have presented a variety of methods for long-context language modeling. We close this section by discussing some interesting issues related to these methods.

2.3.6.1 Need for Long Context

One of the ultimate goals of long-context LLMs is that these models can precisely encode infinite context. The so-called infinite context refers more to the fact that an LLM can continuously read words. This motivates LLMs that can handle extremely long context or stream data. As discussed in Section 2.3.3, it is common to use fixed-size memory models to process continuously expanding context. Many such systems are based on recurrent architectures or their variants, because they are inherently suited to model time series problems where the effects of past inputs continue indefinitely. Another way to achieve infinite memory is to develop alternatives to self-attention models; for example, one can use continuous-space attention models to encode context, which removes the dependency on context length [Martins et al., 2022].

When studying long-context LLMs, it is natural to wonder what mechanisms may explain the use of long context in language modeling. Can we compress the representation of infinite context into a relatively small-sized model? Are all context tokens useful for predicting next tokens? How do LLMs prepare for token prediction when they see the context? Can we know in advance which contextual information will be critical for prediction? General answers to all these questions are not obvious, but they inspire follow-on research on explainable models, and some interesting results have been found. For example, Deletang et al. [2024] conducted extensive experiments to show that LLMs are powerful in-context compressors. Although viewing predictive models as compression models has long been studied in machine learning, it also provides insights into our understanding of the LLM scaling laws. Pal et al. [2023] and Wu et al.
[2024] investigated whether the features learned up to the current step, though not intentionally, are already sufficient 2.3 Long Sequence Modeling 93 for predicting tokens at the following steps. Note that the need for long-context in language modeling is highly dependent on the problem that we address. A related issue is where to apply LLMs and how to evaluate them. For example, in summarization tasks we may only need to distill and focus on a few key aspects of the text, while in retrieval-like tasks we need to “memorize” the entire context so that the relevant information can be accessed. We will discuss the evaluation issue later in this subsection. 2.3.6.2 Pre-training or Adapting LLMs? Training LLMs requires significant computational costs. Although it is straightforward to train LLMs on long sequence data, the training becomes computationally unwieldy for large data sets. It is common practice to pre-train LLMs on general datasets, and then adapt them with modest finetuning effort. For example, LLMs with relative or rotary positional embeddings can be directly trained on large-scale data in the pre-training phase. While the resulting models may exhibit some abilities to extrapolate lengths in the inference phase, it may be more effective to fine-tune them on longer sequences. Ideally, we would like to pre-train LLMs with standard Transformer architectures and adapt them to new tasks. This allows us to use many off-the-shelf LLMs and efficiently adapt them to handle long sequences. However, when new architectures are adopted, it seems inevitable that we need to train these models from scratch. This poses practical difficulties for developing longcontext LLMs, as we cannot leverage well-developed, pre-trained models and must instead train them ourselves. On the other hand, fine-tuning is still an effective way to adapt LLMs with certain architectures that are different from those in pre-training. An example is models augmented with external memories. In these models, the pre-trained LLMs are fixed, and the focus is on how to make these LLMs collaborate with the memory models. In RAG, for instance, it is common to fine-tune LLMs to improve their use of retrieval-augmented inputs. Another example of finetuning LLMs for long-context modeling is that we train an LLM with full attention models, and then replace them with sparse attention models in the fine-tuning phase. The pre-trained LLM provides initial values of model parameters used in a different model, and this model is then finetuned as usual. 2.3.6.3 Evaluating Long-context LLMs Evaluating long-context LLMs is important, but it is a new issue in NLP. The general idea is that, if we input a long context to an LLM, then we can check from the output of the LLM whether it understands the entire context and makes use of it in predicting following tokens. In conventional research of NLP, such evaluations are often aimed at examining the ability of NLP models in handling long-range dependencies. However, the size of contexts used in recent LLMs is much larger than that used in NLP systems a few years ago. This motivates researchers to develop new evaluation benchmarks and metrics for long-context LLMs. One approach is to use the perplexity metric. However, in spite of its apparent simplicity, this method tends to reflect more on the LLMs’ ability to make use of local context rather than global context. It is therefore tempting to develop evaluation methods that are specific to long-context LLMs. 
Popular methods include various synthetic tasks where artificially generated or modified Generative Models 94 data is used to evaluate specific capabilities of long-context LLMs. In needle-in-a-haystack18 and passkey retrieval tasks [Mohtashami and Jaggi, 2024; Chen et al., 2023c], for instance, LLMs are required to identify and extract a small, relevant piece of information from a large volume of given text. The assumption here is that an LLM with sufficient memory should remember earlier parts of the text as it processes new information. This LLM can thus pick out the relevant details, which might be sparse and hidden among much irrelevant information, from the text. Alternatively, in copy memory tasks (or copy tasks for short), LLMs are used to repeat the input text or a specific segment multiple times. These tasks were initially proposed to test the extent to which recurrent models can retain and recall previously seen tokens [Hochreiter and Schmidhuber, 1997; Arjovsky et al., 2016], and have been adopted in evaluating recent LLMs [Bulatov et al., 2022; Gu and Dao, 2023]. Another approach to evaluating long-context LLMs is to test them on NLP tasks that involve very long input sequences. Examples include long-document or multi-document summarization, long-document question answering, code completion, and so on. A benefit of this approach is that it can align evaluations with user expectations. Although many methods have been developed, there is still no general way to evaluate longcontext LLMs [Liu et al., 2024c]. One problem is that most of these methods focus on specific aspects of LLMs, rather than their fundamental ability to model very long contexts. Even though an LLM can pick out the appropriate piece of text from the input, we cannot say that it truly understands the entire context. Instead, it might just remember some important parts of the context, or even simply recall the answer via the model learned in pre-training. Moreover, the data used in many tasks is small-scale and relatively preliminary, leading to discrepancies between evaluation results and actual application performance. A more interesting issue is that the results of LLMs are influenced by many other factors and experimental setups, for example, using different prompts can lead to very different outcomes. This makes evaluation even more challenging because improvements may not solely result from better modeling of long contexts, and there is a risk of overclaiming our results. Nevertheless, many open questions remain in the development and evaluation of long-context LLMs. For example, these models still suffer from limitations such as restricted context length and high latency. Studying these issues is likely to prove valuable future directions. 2.4 Summary In this chapter, we have discussed the concept of LLMs and related techniques. This can be considered a general, though not comprehensive, introduction to LLMs, laying the foundation for further discussions on more advanced topics in subsequent chapters. Furthermore, we have explored two ways to scale up LLMs. The first focuses on the large-scale pre-training of LLMs, which is crucial for developing state-of-the-art models. The second focuses on methods for adapting LLMs to long inputs, including optimizing attention models, designing more efficient and compressed KV caches, incorporating memory models, and exploring better positional embeddings. 
The strength of LLMs lies in their ability to break the constraints of training NLP models for a limited number of specific tasks. Instead, LLMs learn from large amounts of text through the simple task of token prediction — we predict the next token in a sentence given its prior tokens. 18 https://github.com/gkamradt/LLMTest_NeedleInAHaystack 2.4 Summary 95 A general view is that, by repeating this token prediction task a large number of times, LLMs can acquire some knowledge of the world and language, which can then be applied to new tasks. As a result, LLMs can be prompted to perform any task by framing it as a task of predicting subsequent tokens given prompts. This emergent ability in language models comes from several dimensions, such as scaling up training, model size, and context size. It is undeniable that scaling laws are currently the fundamental principle adopted in developing large language models, although simply increasing model size has yet to prove sufficient for achieving AGI. These continuously scaled LLMs have been found to show capabilities in general-purpose language understanding, generation, and reasoning. More recently, it has been found that scaling up the compute at inference time can also lead to significant improvements in complex reasoning tasks [OpenAI, 2024]. Given their amazing power, LLMs have attracted considerable interest, both in terms of techniques and applications. As a result, the explosion of research interest in LLMs has also led to a vast number of new techniques and models. However, we do not attempt to provide a comprehensive literature review on all aspects of LLMs, given the rapid evolution of the field. Nevertheless, one can still gain knowledge about LLMs from general reviews [Zhao et al., 2023; Minaee et al., 2024] or more focused discussions on specific topics [Ruan et al., 2024]. C HAPTER 3 Prompting In the context of LLMs, prompting refers to the method of providing an LLM with a specific input or cue to generate a desired output or perform a task. For example, if we want the LLM to translate a sentence from English to Chinese, we can prompt it like this Translate the text from English to Chinese. Text: The early bird catches the worm. Translation: Prompting is crucial for LLMs because it directly influences how effectively these models understand and respond to user queries. A well-crafted prompt can guide an LLM to generate more accurate, relevant, and contextually appropriate responses. Furthermore, this process can be iteratively refined. By analyzing the responses of the LLM, users can adjust their prompts to align more closely with their specific needs. Given the importance of prompting in applying LLMs, prompt design has become an essential skill for users and developers working with LLMs. This leads to an active research area, called prompt engineering, in which we design effective prompts to make better use of LLMs and enhance their practical utility in real-world applications. An important concept related to prompting is in-context learning. When prompting an LLM, we can add new information to the context, such as demonstrations of problem-solving. This allows the LLM to learn from this context how to solve the problem. Here is an example of prompting LLMs with a few demonstrations of how to classify text based on sentiment polarity. Here are some examples of text classification. Example 1: We had a delightful dinner together. → Label: Positive Example 2: I’m frustrated with the delays. 
→ Label: Negative What is the label for “That comment was quite hurtful.”? Label: In-context learning is often seen as an emergent ability of LLMs that arises after pre-training. Though LLMs can be trained or tuned to perform new tasks, in-context learning provides a very efficient way to adapt these models without any training or tuning effort. Perhaps this is one of the most notable features of LLMs: they indeed learn general knowledge about the world and language during pre-training, which we can easily apply to new challenges. Moreover, in-context learning reflects the broader trend of making AI systems more generalizable and user-friendly. Instead of requiring specialized engineers to fine-tune models for every unique task, users can interact with LLMs in a more intuitive way, simply providing examples or adjusting the context as needed. In this chapter, we focus on prompting techniques in LLMs. We begin by considering several interesting prompt designs commonly used in prompt engineering. Then, we discuss a series of 96 3.1 General Prompt Design 97 refinements to these methods. Finally, we explore approaches for automating prompt design. 3.1 General Prompt Design This section presents basic concepts in prompt design, along with examples of how to prompt LLMs for various NLP tasks. Since the effectiveness of prompting is highly dependent on the LLMs being used, prompts often vary across different LLMs, making it difficult to provide a comprehensive list of prompts for all LLMs and downstream tasks. Therefore, this discussion is not focused on any specific LLM. Instead, the goal is to provide guiding principles for prompt design. 3.1.1 Basics The term prompt is used in many different ways. In this chapter we define a prompt as the input text to an LLM, denoted by x. The LLM generates a text y by maximizing the probability Pr(y|x). In this generation process, the prompt acts as the condition on which we make predictions, and it can contain any information that helps describe and solve the problem. A prompt can be obtained using a prompt template (or template for short) [Liu et al., 2023a]. A template is a piece of text containing placeholders or variables, where each placeholder can be filled with specific information. Here are two templates for asking the LLM for weekend suggestions. Please give me some suggestions for a fun weekend. If {∗premise∗}, what are your suggestions for a fun weekend. In the first template, we simply instruct the LLM to return some suggestions. So the template is just a piece of text with no variables. In the second template, the variable {∗premise∗} needs to be specified by the users to provide a premise for making suggestions. For example, if we input premise = the weather is nice this weekend then we can generate a prompt If the weather is nice this weekend, what are your suggestions for a fun weekend. We can also design a template with multiple variables. Here is an example in which we compare the two sentences in terms of their semantic similarity. Prompting 98 Here is a sentence {∗sentence1∗} Here is another sentence {∗sentence2∗} Compute the semantic similarity between the two sentences A popular way to format prompts is to write each input or output in a “name:content” style. For example, we can describe a conversation between two people, named John and David, and use the LLM to continue the conversation. 
A template of such prompts is given by

John: {∗utterance1∗}
David: {∗utterance2∗}
John: {∗utterance3∗}
David: {∗utterance4∗}
John: {∗utterance5∗}
David: {∗utterance6∗}
John: {∗utterance7∗}
David:

The "name:content" format can be used to define the task that we want the LLM to perform. For example, given that "Q" and "A" are commonly used abbreviations for "Question" and "Answer", respectively, we can use the following template to do question-answering.

Q: {∗question∗}
A:

This format can be used to describe more complex tasks. For example, the following is an example of providing a specification for a translation task

Task: Translation
Source language: English
Target language: Chinese
Style: Formal text
Template: Translate the following sentence: {∗sentence∗}

In practical systems, it is common to represent and store such data in key-value pairs, such as the JSON format1.

1 The JSON representation is
{
    "Task": "Translation",
    "Source language": "English",
    "Target language": "Chinese",
    "Style": "Formal text",
    "Template": "Translate the following sentence: {∗sentence∗}"
}

When the problem is difficult to describe in an attribute-based manner, it is more common to instruct LLMs with a clear and detailed description. There are many ways to do this. One example is to assign a role to LLMs and provide sufficient context. The following is a template that instructs an LLM to act as an expert and answer questions from children.

You are a computer scientist with extensive knowledge in the field of deep learning. Please explain the following computer-related concept to a child around 10 years old, using simple examples whenever possible.
{∗concept∗}

Here the text "You are a computer scientist ... deep learning." is sometimes called system information, and is provided to help the LLM understand the context or constraints of the task it is being asked to perform.

3.1.2 In-context Learning

Learning can occur during inference. In-context learning is one such method, where prompts involve demonstrations of problem-solving, and LLMs can learn from these demonstrations how to solve new problems. Since we do not update model parameters in this process, in-context learning can be viewed as a way to efficiently activate and reorganize the knowledge learned in pre-training without additional training or fine-tuning. This enables quick adaptation of LLMs to new problems, pushing the boundaries of what pre-trained LLMs can achieve without task-specific adjustments.

In-context learning can be illustrated by comparing three methods: zero-shot learning, one-shot learning and few-shot learning. Zero-shot learning, as its name implies, does not involve a traditional "learning" process. It instead directly applies LLMs to address new problems that were not observed during training. In practice, we can repetitively adjust prompts to guide the LLMs in generating better responses, without demonstrating problem-solving steps or providing examples. Consider the following example. Suppose we want to use an LLM as an assistant that can help correct English sentences. A zero-shot learning prompt is given by

SYSTEM  You are a helpful assistant, and are great at grammar correction.
USER    You will be provided with a sentence in English. The task is to output the correct sentence.
        Input: She don't like going to the park.
        Output:

Here the gray words are used to indicate different fields of the prompt.
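Templates like the ones above can be filled programmatically before being sent to an LLM. Below is a minimal sketch of this step (written with plain asterisks in the placeholders; the function name and example values are ours).

def fill(template: str, **variables) -> str:
    # Replace each {*name*} placeholder with the supplied value.
    for name, value in variables.items():
        template = template.replace("{*" + name + "*}", value)
    return template

weekend_template = "If {*premise*}, what are your suggestions for a fun weekend."
print(fill(weekend_template, premise="the weather is nice this weekend"))

qa_template = "Q: {*question*}\nA:"
print(fill(qa_template, question="What are some fun things to do this weekend?"))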
In one-shot learning, we extend this prompt by adding a demonstration of how to correct sentences, thereby allowing the LLM to learn from this newly-added experience. SYSTEM DEMO You are a helpful assistant, and are great at grammar correction. You will be provided with a sentence in English. The task is to output the correct sentence. Input: There is many reasons to celebrate. Output: There are many reasons to celebrate. USER You will be provided with a sentence in English. The task is to output the correct sentence. Input: She don’t like going to the park. Output: Furthermore, we can add more demonstrations to enable few-shot learning. SYSTEM You are a helpful assistant, and are great at grammar correction. DEMO1 You will be provided with a sentence in English. The task is to output the correct sentence. Input: There is many reasons to celebrate. Output: There are many reasons to celebrate. DEMO2 You will be provided with a sentence in English. The task is to output the correct sentence. Input: Me and my friend goes to the gym every day. Output: My friend and I go to the gym every day. USER You will be provided with a sentence in English. The task is to output the correct sentence. Input: She don’t like going to the park. Output: In few-shot learning, we essentially provide a pattern that maps some inputs to the corresponding outputs. The LLM attempts to follow this pattern in making predictions, provided that the prompt includes a sufficient number of demonstrations, although generally small. It is also 3.1 General Prompt Design 101 possible to use simpler patterns to achieve this. For example, one can use the following few-shot learning prompt for translating words from Chinese to English. DEMO USER 现在 来 去 男孩 女孩 → → → → → now come go boy If the LLM is powerful enough, few-shot learning can enable it to address complex problems, such as mathematical reasoning. For example, consider the following task of summing two numbers and then dividing the sum by their product. DEMO USER 12 5 3 1 −9 4 15 15 19 73 → → → → → (12 + 5)/(12 × 5) = 0.283 (3 + 1)/(3 × 1) = 1.33 (−9 + 4)/(−9 × 4) = 0.138 (15 + 15)/(15 × 15) = 0.133 In many practical applications, the effectiveness of in-context learning relies heavily on the quality of prompts and the fundamental abilities of pre-trained LLMs. On one hand, we need a significant prompt engineering effort to develop appropriate prompts that help LLMs learn more effectively from demonstrations. On the other hand, stronger LLMs can make better use of incontext learning for performing new tasks. For example, suppose we wish to use an LLM to translate words from Inuktitut to English. If the LLM lacks pre-training on Inuktitut data, its understanding of Inuktitut will be weak, and it will be difficult for the model to perform well in translation regardless of how we prompt it. In this case, we need to continue training the LLM with more Inuktitut data, rather than trying to find better prompts. It might be interesting to explore how in-context learning emerges during pre-training and why it works during inference. One simple understanding is that LLMs have gained some knowledge of problem-solving, but there are many possible predictions, which are hard to distinguish when the models confront new problems. Providing demonstrations can guide the LLMs to follow the “correct” paths. 
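Few-shot prompts of the kind shown above can be assembled mechanically from a list of demonstrations. The following is a small sketch (names are ours) that builds the word-translation prompt by listing input-output pairs in the "input → output" pattern and leaving the final output to be completed by the LLM.

def few_shot_prompt(demos, query):
    lines = [f"{x} → {y}" for x, y in demos]   # one demonstration per line
    lines.append(f"{query} →")                 # the new input to be completed
    return "\n".join(lines)

demos = [("现在", "now"), ("来", "come"), ("去", "go"), ("男孩", "boy")]
print(few_shot_prompt(demos, "女孩"))
# A sufficiently strong LLM is expected to continue the pattern and output "girl".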
Furthermore, some researchers have tried to interpret in-context learning from several different perspectives, including Bayesian inference [Xie et al., 2022], gradient decent [Dai et al., 2023; Von Oswald et al., 2023], linear regression [Akyürek et al., 2023], meta learning [Garg et al., 2022], and so on. 3.1.3 Prompt Engineering Strategies Designing prompts is highly empirical. In general, there are many ways to prompt an LLM for performing the same task, and we need to perform a number of trial-and-error runs to find a satisfactory prompt. To write good prompts more efficiently, one can follow certain strategies. Examples of common prompting principles include Prompting 102 • Describing the task as clearly as possible. When we apply an LLM to solve a problem, we need to provide a precise, specific, and clear description of the problem and instruct the LLM to perform as we expect. This is particularly important when we want the output of the LLM to meet certain expectations. For example, suppose we are curious about climate change. A simple prompt for asking the LLM to provide some information is Tell me about climate change. Since this instruction is too general, the LLM may generate a response that addresses any aspect of climate change, which may not align with our specific interests. In this case, we can instead use prompts that are specific and detailed. One such example is Provide a detailed explanation of the causes and effects of climate change, including the impact on global temperatures, weather patterns, and sea levels. Also, discuss possible solutions and actions being taken to mitigate these effects. Now suppose we intend to explain climate change to a 10-year-old child. We can adjust the above prompt further. Explain the causes and effects of climate change to a 10-year-old child. Talk about how it affects the weather, sea levels, and temperatures. Also, mention some things people are doing to help. Try to explain in simple terms and do not exceed 500 words. • Guiding LLMs to think. LLMs have exhibited surprisingly good capabilities to “think”. A common example is that well-developed LLMs have achieved impressive performance in mathematical reasoning tasks, which are considered challenging. In prompt engineering, the “thinking” ability of LLMs needs to be activated through appropriate prompting, especially for problems that require significant reasoning efforts. In many cases, an LLM that is instructed to “think” can produce completely different results compared with the same LLM that is instructed to perform the task straightforwardly. For example, Kojima et al. [2022] found that simply appending “Let’s think step by step” to the end of each prompt can improve the performance of LLMs on several reasoning tasks. LLMs can be prompted to “think” in a number of ways. One method is to instruct LLMs to generate steps for reasoning about the problem before reaching the final answer. For example, consider a task of solving mathematical problems. See below for a simple prompt for this task. 3.1 General Prompt Design 103 You are a mathematician. You will be provided with a math problem. Please solve the problem. Since solving math problems requires a detailed reasoning process, LLMs would probably make mistakes if they attempted to work out the answer directly. So we can explicitly ask LLMs to follow a given reasoning process before coming to a conclusion. You are a mathematician. You will follow these detailed reasoning steps when solving math problems. 
Step 1: Problem Interpretation. The mathematician carefully listens to your query and understands the intricate details of the mathematical challenge you have presented. Step 2: Strategy Formulation. Drawing upon their extensive knowledge, the mathematician chooses the most effective strategy tailored to the type of math problem, whether it is algebra, calculus, or geometry. Step 3: Detailed Calculation. With precision and expertise, the mathematician performs the necessary calculations step by step, adhering to all mathematical principles. Step 4: Solution Review. Before providing the final answer, the mathematician meticulously checks the calculations for accuracy and offers a concise explanation or rationale for the solution. You will be provided with a math problem. Please solve the problem. {∗problem∗} Another method to guide LLMs to “think” is through multiple rounds of interaction with LLMs. For example, as a first step, we can instruct LLMs to solve the problem directly You will be provided with a math problem. Please solve the problem. {∗problem∗} Now we have an initial answer to the problem. As a second step, we prompt LLMs to evaluate the correctness of the answer and, if necessary, rework it to find a better solution. Prompting 104 You will be provided with a math problem, along with a solution. Evaluate the correctness of this solution, and identify any errors if present. Then, work out your own solution. Problem: {∗problem∗} Solution: {∗solution∗} The prompts presented here are closely related to a long line of research on reasoning problems in LLMs. It is impossible to provide a complete discussion of all related issues because this topic covers a large family of methods. But we will see a relatively more detailed discussion on how to improve prompting through more reasoning in Section 3.2. • Providing reference information. As discussed in the previous section, we can include demonstrations in prompts and allow LLMs to in-context learn from these demonstrations how to perform the task. In fact, given the remarkable ability of language understanding of LLMs, we can add any type of text into the prompts and so these models can predict based on enriched contexts. In many applications, we have various information that is relevant to user queries. Instead of using LLMs to make unconstrained predictions, we often want LLMs to produce outputs that are confined to the relevant text. One such example is RAG, where the relevant text for the user query is provided by calling an IR system, and we prompt LLMs to generate responses based on this provided relevant text. The following prompt shows an example. You are an expert that can generate answers to input queries. You have now been provided with a query and the corresponding context information. Please generate an answer based on this context information. Note that you need to provide the answer in your own words, not just copy from the context provided. Context information: {∗IR-result∗} Query: {∗query∗} If the context information is highly reliable, we can even restrict LLMs to answering using only the provided text. An example prompt is shown as follows 3.1 General Prompt Design 105 You are an expert tasked with generating answers from input queries. You have been provided with a query and corresponding context information, organized in a table where each row represents a useful record. Please generate an answer using only this context information. Ensure that you provide the answer in your own words. 
Context information: {∗table∗} Query: {∗query∗} When dealing with real-world problems, we often have prior knowledge and additional information about the problems that help produce better answers. Considering such information in prompting is generally helpful in improving the result. • Paying attention to prompt formats. In general, the performance of LLMs is highly sensitive to the prompts we input. Sometimes a small modification to a prompt can lead to a big change in model output. An interesting example is that changing the order of sentences in a prompt may cause LLMs to generate different results. To make prompts easy to read and reduce ambiguity, it is common to format them in a way that ensures clarity. One example is that we define several fields for prompts and fill different information in each field. Another example is we can use code-style prompts for LLMs which can understand and generate both natural language and code. See the following for a code-style prompt that performs translation where one demonstration is presented. [English] = [I have an apple.] [German] = [Ich habe einen Apfel.] [English] = [I have an orange.] [German] = LLMs can receive text in various formats. This allows us to use control characters, XML tags, and specific formatting to represent complex data. And it is useful to specify how the input and output should be formatted or structured. For example, we can delimit sections of text using quotes and prompt LLMs accordingly (e.g., adding a sentence like “the input text is delimited by double quotes” to the prompt). Above, we have discussed only a few strategies for writing good prompts. There are, of course, many such methods, and one needs to develop their own through practice. Interested readers can refer to various online documents for more information, such as OpenAI’s manual on the GPT series models2 . 2 See https://platform.openai.com/docs/guides/prompt-engineering/ six-strategies-for-getting-better-results. Prompting 106 3.1.4 More Examples In this subsection, we consider more examples of prompting LLMs to perform various NLP tasks. The motivation here is not to give standard prompts for these tasks, but rather to use simple examples to illustrate how LLMs can be prompted to deal with NLP problems. 3.1.4.1 Text Classification Text classification is perhaps one of the most common problems in NLP. Many tasks can be broadly categorized as assigning pre-defined labels to a given text. Here we consider the polarity classification problem in sentiment analysis. We choose polarity classification for illustration because it is one of the most popular and well-defined text classification tasks. In a general setup of polarity classification, we are required to categorize a given text into one of three categories: negative, positive, or neutral. Below is a simple prompt for doing this (for easy reading, we highlight the task description in the prompt). Analyze the polarity of the following text and classify it as positive, negative, or neutral. Text: The service at the restaurant was slower than expected, which was a bit frustrating. The polarity of the text can be classified as positive. To make the example complete, we show the response generated by the LLM (underlined text). Although the answer is correct, the LLM gives this answer not in labels but in text describing the result. The problem is that LLMs are designed to generate text but not to assign labels to text and treat classification problems as text generation problems. 
As a result, we need another system to map the LLM's output to the label space (call it label mapping), that is, we extract "positive" from "The polarity of the text can be classified as positive". This is trivial in most cases because we can identify label words via simple heuristics. But occasionally, LLMs may not express the classification results using these label words. In this case, the problem becomes more complicated, as we need some way to map the generated text or words to predefined label words.

One method to induce output labels from LLMs is to reframe the problem as a cloze task. For example, the following shows a cloze-like prompt for polarity classification.

Analyze the polarity of the following text and classify it as positive, negative, or neutral.
Text: The service at the restaurant was slower than expected, which was a bit frustrating.
The polarity of the text is positive

We can use LLMs to complete the text and fill the blank with the most appropriate word. Ideally, we wish the filled word would be positive, negative, or neutral. However, LLMs are not guaranteed to generate these label words. One method to address this problem is to constrain the prediction to the set of label words and select the one with the highest probability. Then, the output label is given by

\mathrm{label} \;=\; \arg\max_{y \in Y} \mathrm{Pr}(y|\mathbf{x})    (3.1)

where y denotes the word filled in the blank, and Y denotes the set of label words {positive, negative, neutral}. Another method of using LLMs to generate labels is to constrain the output with prompts. For example, we can prompt LLMs to predict within a controlled set of words. Here is an example.

Analyze the polarity of the following text and classify it as positive, negative, or neutral.
Text: The service at the restaurant was slower than expected, which was a bit frustrating.
What is the polarity of the text? Just answer: positive, negative, or neutral.
Positive

Sentiment analysis is a common NLP problem that has probably been well understood by LLMs through pre-training or fine-tuning. Thus we can prompt LLMs using simple instructions to perform the task. However, for new classification problems, it may be necessary to provide additional details about the task, such as the classification standards, so that the LLMs can perform correctly. To do this, we can add a more detailed description of the task and/or demonstrate classification examples in the prompts. To illustrate, consider the following example.

Analyze the polarity of the following text and classify it as positive, negative, or neutral. Here's what each category represents:
Positive: This indicates that the text conveys a positive emotion or attitude. For example, texts expressing happiness, satisfaction, excitement, or admiration are considered positive.
Negative: This refers to a text that expresses a negative emotion or attitude. It encompasses feelings of sadness, anger, frustration, or criticism.
Neutral: Neutral sentiment is used to describe texts that do not exhibit clear positive or negative emotions but instead convey informational, factual, or indifferent tones.
Text: The service at the restaurant was slower than expected, which was a bit frustrating.
What is the polarity of the text?
Positive

While it seems straightforward to use LLMs for classification problems, there are still issues that have not been well addressed. For example, when dealing with a large number of categories, it remains challenging to effectively prompt LLMs.
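The constrained prediction in Eq. (3.1) can be implemented by scoring each label word under the model and taking the argmax. Below is a minimal sketch; label_log_prob is a stand-in for whatever interface returns the log-probability the LLM assigns to a label word given the prompt, and the dummy scorer is only for illustration.

LABELS = ["positive", "negative", "neutral"]

def classify(prompt: str, label_log_prob) -> str:
    # label = argmax_{y in Y} Pr(y | x), with Y the set of label words
    scores = {y: label_log_prob(prompt, y) for y in LABELS}
    return max(scores, key=scores.get)

# Example with a dummy scorer (a real system would query the LLM instead).
dummy = lambda prompt, y: {"positive": -0.2, "negative": -2.1, "neutral": -1.5}[y]
prompt = ("Analyze the polarity of the following text and classify it as positive, negative, or neutral.\n"
          "Text: The service at the restaurant was slower than expected, which was a bit frustrating.\n"
          "The polarity of the text is")
print(classify(prompt, dummy))   # positive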
Note that if we face a very difficult classification problem and have a certain amount of labeled data, fine-tuning LLMs or adopting “BERT + classifier”-like architectures is also desirable. 3.1.4.2 Information Extraction Many NLP problems can be regarded as information extraction problems, involving the identification or extraction of specific pieces of information from unstructured text. This information can include named entities, relationships, events, and other relevant data points. The goal of information extraction is to transform raw data into a format that can be easily analyzed and used in various downstream applications. As information extraction covers a wide range of problems, we cannot discuss them all here. Instead, we start with the task of named entity recognition — a task that has long been a concern in NLP. Named entity recognition is a process that detects and classifies key information in text into specific groups. These key pieces of information, known as named entities, typically include proper names and are categorized into distinct classes such as people, locations, organizations, dates, monetary values, and percentages. Consider first a simple example of extracting person names from a text3 . 3 The text is from https://edition.cnn.com/travel 3.1 General Prompt Design 109 Identify all person names in the provided text. Text: Is the UK really doing that badly or have travel trends shifted, post-pandemic? For Tom Jenkins, CEO of the European Tourism Organisation, it’s the latter. “I think the UK is doing perfectly well but we’ll see more people going to Europe,” he says of 2024, adding that there’s “allegedly a slight plateau of American demand for the UK.” The person name identified in the provided text is: Tom Jenkins Then, we can extract all named entities. Identify and classify all named entities in the provided text into categories such as person names, locations, dates, and organizations. List each entity with its type on one line. Text: Is the UK really doing that badly or have travel trends shifted, post-pandemic? For Tom Jenkins, CEO of the European Tourism Organisation, it’s the latter. “I think the UK is doing perfectly well but we’ll see more people going to Europe,” he says of 2024, adding that there’s “allegedly a slight plateau of American demand for the UK.” 1. Tom Jenkins - Person Name 2. European Tourism Organisation - Organization 3. UK - Location 4. Europe - Location 5. 2024 - Date Given these identified named entities, we can further process the text using other information extraction tasks. For example, we can identify the relationships between these named entities (call it relation extraction). The corresponding prompt is shown as follows. Prompting 110 Given a text and a list of named entities identified within it, analyze and describe the relationships between these entities. Explain how each entity is contextually related to others. Text: Is the UK really doing that badly or have travel trends shifted, post-pandemic? For Tom Jenkins, CEO of the European Tourism Organisation, it’s the latter. “I think the UK is doing perfectly well but we’ll see more people going to Europe,” he says of 2024, adding that there’s “allegedly a slight plateau of American demand for the UK.” Named Entity: Tom Jenkins, European Tourism Organisation, UK, Europe, 2024 Tom Jenkins - European Tourism Organisation: Tom Jenkins is the CEO of the European Tourism Organisation. 
This indicates a professional association where Tom Jenkins likely has a significant role in directing the organization’s activities, especially those related to tourism in Europe. ... If LLMs have been fine-tuned with instruction following for information extraction tasks, it is relatively easy to perform various information extraction tasks. For example, the following is a prompt template for information extraction. You will be provided with a text. Your task is to {∗task-description∗} Text: {∗text∗} Here task-description is the instruction for the task. Examples of the instruction include extract the keywords from the text. extract the key events from the text. detect the coreference and anaphoric links between entities from the text. and so on. Of course, we can provide a more detailed description of the task to prompt LLMs more accurately. 3.1.4.3 Text Generation In NLP, the term text generation is used broadly to cover a wide range of different problems, many of which focus on generating text based on given text. These problems can be roughly categorized into two classes: 1) continual writing based on the input text (call it text completion), and 2) transformation of the input text into another text (call it text transformation). Here, we illustrate the prompts for the former and consider the prompts for the latter in the following discussion. 3.1 General Prompt Design 111 A simple example of text completion is continuing to write a story starting with a few given sentences. You are a writer who is great at writing stories. You are provided with a few sentences to start with. Please continue the story from the provided sentences. Start of the story: One sunny afternoon, Lily discovered a tiny, golden key hidden under a rock in her backyard. Curious, she picked it up and wondered what it might unlock. Just then, her cat, Whiskers, began to meow excitedly, leading her towards the old oak tree at the edge of the garden. Lily followed Whiskers, her heart pounding with excitement and a touch of nervousness. The old oak tree had always seemed mysterious to her, its gnarled branches casting long shadows even on the brightest days. As they reached the tree, Whiskers darted ahead and started pawing at something buried beneath a pile of autumn leaves ... Likewise, we can create a conversation given the conversational context. You are provided with a conversation between a tourist and a taxi driver. Please complete the conversation. Tourist: Could you take me to the downtown museum, please? Taxi Driver: Of course! First time in the city? Tourist: Yes, any must-see places around? Taxi Driver: Definitely check out the botanical gardens after the museum. They’re nearby and quite beautiful. Tourist: Great, thanks for the tip! Taxi Driver: No problem. Enjoy your visit! Another application of text completion is generating text based on user requirements. The requirements can include style, tone, length, and any specific content that the text should contain. For example, users may need a formal report, a creative story, or a piece of programming code. Here is a prompt for generating a five-character regulated Chinese poem. Prompting 112 You are a poet. Please write a traditional Chinese poem. Here is the requirement. Theme: Spring’s rejuvenation Structure: Five-character regulated poem Emotion: Convey a sense of happiness and renewal through vivid spring imagery. 春意盎然 翠柳风中舞, 红花雨后新。 山明水又绿, 天宽鸟自频。 If the LLM is trained to generate language and code, we can prompt it to perform code completion tasks. 
Here is an example.

Please write a Python function to calculate the average of a list of numbers.

def calculate_average(numbers):
    if numbers:  # Check if the list is not empty
        return sum(numbers) / len(numbers)
    else:
        return 0  # Return 0 if the list is empty

3.1.4.4 Text Transformation

Typical text transformation tasks include machine translation, summarization, text style transfer, and so on. The prompts used to perform these tasks are not significantly different from the ones presented previously. Here, we present a few examples to showcase these prompts.

Translate the following text from English to Spanish.
Text: The quick brown fox jumps over the lazy dog.
El rápido zorro marrón salta sobre el perro perezoso.

Summarize the following article in no more than 50 words:
Article: In recent years, urban areas have been facing significant challenges related to sustainability and environmental management. Cities are experiencing higher rates of pollution, increased traffic congestion, and greater demands on infrastructure. This growth has led to numerous environmental issues, including elevated levels of air and water pollution, increased waste production, and strained public services ...
Urban areas are grappling with sustainability challenges, such as rising pollution, traffic congestion, and infrastructure demands ...

Rewrite this text in a formal tone.
Text: Hey, what's up? Long time no see!
Hello, how have you been? It has been quite some time since we last met!

3.1.4.5 Question Answering

The question-answering format is inherently simple. For a given question, there is an answer that corresponds to it. For example, in open-domain question answering, we expect the system to return an answer in response to a user-submitted question. Prompt templates for general-purpose question answering can be:

{∗question∗}

Question: {∗question∗}
Answer:

Question answering is important in NLP because many problems can be framed as question-answering tasks. In particular, many recent reasoning tasks are defined in the form of question answering. For example, in the MMLU benchmark [Hendrycks et al., 2021], each example consists of a multiple-choice question, and LLMs are required to select the correct answer. See the following for an example prompt for answering a question in this dataset.

(Z, ∗) is a group with a ∗ b = a + b + 1 for all a, b in Z. The inverse of a is
(A) 0  (B) −2  (C) a − 2  (D) (2 + a) ∗ −1
D

Another widely-used benchmark is the GSM8K dataset [Cobbe et al., 2021]. It consists of thousands of grade school math word problems. For each problem, we use LLMs to return a solution in natural language. For example, the following is a prompt used for solving a problem in GSM8K.

DEMO  Q: Jerry's two daughters play softball on different teams. They each have 8 games this season. Each team practices 4 hours for every game they play. If each game lasts for 2 hours, how many hours will Jerry spend at the field watching his daughters play and practice altogether?
A: Jerry will spend 8 games * 2 hours per game = ≪ 8 ∗ 2 = 16 ≫ 16 hours watching one daughter play her games. He will spend 16 ∗ 2 = ≪ 16 ∗ 2 = 32 ≫ 32 hours watching both daughters play their games. He will spend 8 games * 4 hours of practice = ≪ 8 ∗ 4 = 32 ≫ 32 hours watching one daughter practice. He will spend 32 ∗ 2 = ≪ 32 ∗ 2 = 64 ≫ 64 hours watching both daughters practice. He will spend a total of 32 hours watching games + 64 hours watching practice = ≪ 32 + 64 = 96 ≫ 96 hours.
#### 96 Q: Mary bought six apples from the store. From the apples she bought, for each that Mary ate, she planted two trees from the remaining ones. How many apples did Mary eat? A: She planted eight trees. This means she used half of that amount of apples, which is 8 trees /2 trees/apple =≪ 8/2 = 4 ≫ 4 apples. That means that she planted four of the six apples she bought, leaving only 6 apples − 4 apples =≪ 6 − 4 = 2 ≫ 2 apples to be eaten. #### 2 USER Q: Boris has 100 apples. Beck has 23 fewer apples than Boris. If Boris gives Beck 10 apples, how many fewer apples does Beck have than Boris now? A: Boris starts with 100 apples and gives Beck 10 apples, leaving him with 100 apples −10 apples =≪ 100 − 10 = 90 ≫ 90 apples. Initially, Beck has 23 fewer apples than Boris, giving him a starting count of 100 − 23 =≪ 100 − 23 = 77 ≫ 77 apples. After receiving 10 apples from Boris, Beck’s total comes to 77 apples +10 apples =≪ 77 + 10 = 87 ≫ 87 apples. Consequently, Beck now has 90 − 87 =≪ 90 − 87 = 3 ≫ 3 fewer apples than Boris. #### 3 3.2 Advanced Prompting Methods 115 Here a few-shot prompt is adopted. The LLM learns from these demonstrations of problemsolution pairs not only the way of problem-solving but also the way of formatting the output. For example, the final result of calculation follows the #### token, and ≪ ... ≫ annotates the detailed calculation steps (called calculation annotation)4 . 3.2 Advanced Prompting Methods So far in this chapter, we have introduced the basic concepts related to LLM prompting and presented a number of prompts for NLP tasks. We now consider several techniques for enhancing the effectiveness of prompting. 3.2.1 Chain of Thought We have encountered the concept of chain of thought (CoT) several times in this chapter and previous ones [Wei et al., 2022c; Chowdhery et al., 2022]. CoT methods provide a simple way to prompt LLMs to generate step-by-step reasoning for complex problems, thereby approaching tasks in a more human-like manner. Rather than coming to a conclusion directly, the CoT methods instruct LLMs to generate reasoning steps or to learn from demonstrations of detailed reasoning processes provided in the prompts. To illustrate CoT, we consider the problem of algebraic calculation, as commonly described in the literature. Suppose we are given a algebraic problem Calculate the average of the numbers 2, 4, and 6. We can consider it as the question and prompt an LLM to answer it. Q: Please calculate the average of the numbers 2, 4, and 9. A: The answer is 6. It seems difficult for the LLM to directly give a correct answer. A simple improvement is to add demonstrations of similar problems in the prompt, and thus the LLM can learn from these demonstrations. Q: Please calculate the average of the numbers 1, 3, 5, and 7. A: The answer is 4. Q: Please calculate the average of the numbers 2, 4, and 9. A: The answer is 7. The problem here is that, although we have shown a similar question-answer pair, it remains difficult for the LLM to reason out the correct answer. In CoT, not only can LLMs learn from the 4 During prediction, a calculator is used when we see ≪ ... ≫. More specifically, once the LLM encounters “=” in a ≪ ... ≫, then the calculator calculates the expression on the left-hand side of “=”. This method helps reduce the calculation errors made by LLMs. Prompting 116 correspondence between questions and answers but they may gain more from detailed problemsolving steps that used to derive the answers. 
To do this, we can incorporate some reasoning steps into the prompt to obtain a CoT prompt. Q: Please calculate the mean square of the numbers 1, 3, 5, and 7. A: Calculate the square of each number: 12 = 1, 32 = 9, 52 = 25, and 72 = 49. Sum the squares, 1 + 9 + 25 + 49 = 84. There are 4 numbers in total. Divide the sum by the number of items, 84/4 = 21. The answer is 21. Q: Please calculate the average of the numbers 2, 4, and 9. A: Calculate 2 + 4 + 9, which equals 15. There are three numbers. Divide the total sum by the count, resulting in 15/3 = 5. The answer is 5. Here we highlight the reasoning steps in green. By providing a detailed reasoning process for a similar problem, the LLM learns to reason and thereby generates a problem-solving path that leads to the correct answer. There are several benefits of using CoT prompting. First, CoT allows LLMs to decompose complex problems into smaller, sequential reasoning steps. This somewhat mirrors human problem-solving behaviors, making it particularly effective for tasks requiring detailed, multi-step reasoning. Second, CoT makes the reasoning process more transparent and interpretable. Since all reasoning steps are visible, we can understand and interpret how a conclusion was reached. Third, if users can follow the logic behind the reasoning process, they will be more likely to trust the predictions of an LLM. This is particularly important when applying LLMs in fields like medicine, education, and finance. Fourth, CoT is an in-context learning approach, and thus, it is applicable to most well-trained, off-the-shelf LLMs. Moreover, CoT provides efficient ways to adapt LLMs to different types of problems. It can even inspire more creative solutions by exploring various alternative reasoning paths, which might not be obvious when arriving at a conclusion directly. The method described above requires providing one or more examples of CoT reasoning, typically called the few-shot CoT method. By contrast, the zero-shot CoT method does not require such examples. It instead prompts LLMs to reason step-by-step by incorporating specific instructions in prompts. For example, below is a zero-shot CoT prompt. Q: Please calculate the average of the numbers 2, 4, and 9. A: Let’s think step-by-step. We have three numbers: 2, 4, and 9. Add these numbers together, 2 + 4 + 9 = 15. Determine how many numbers there are, which in this case is three. The average is calculated by dividing the total sum by the number of elements. Completing the division gives 15/3 = 5. So the answer is 5. Following the instruction “Let’s think step by step”, the LLM is prompted to generate detailed reasoning steps. As discussed in Kojima et al. [2022]’s work, prompting with such instructions may result in LLMs generating only the reasoning steps without a clear conclusion. In this case, a second round of prompting can be used to extract the answer from these reasoning steps. For example, Kojima et al. [2022] create a second prompt which combines both the input and output 3.2 Advanced Prompting Methods 117 in the first round of prompting. Using this combined input, the LLM can continue its reasoning process and then generate the correct answer. Furthermore, it is possible to prompt LLMs to reason using instructions other than “Let’s think step by step”, such as “Let’s think logically” and “Please show me your thinking steps first”. While we have illustrated CoT methods using an algebraic reasoning problem, these methods can be applied to a variety of different problems. 
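Before turning to the range of tasks CoT applies to, the two-stage zero-shot CoT procedure just described can be sketched in a few lines of code. Here generate is a stand-in for a call to the LLM, and the trigger and extraction phrases follow the spirit of Kojima et al. [2022]; the exact wording is illustrative.

def zero_shot_cot(question: str, generate) -> str:
    # Stage 1: elicit step-by-step reasoning.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = generate(reasoning_prompt)
    # Stage 2: feed the reasoning back and ask for the final answer.
    answer_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return generate(answer_prompt)

# Usage (with a real LLM behind generate):
# answer = zero_shot_cot("Please calculate the average of the numbers 2, 4, and 9.", generate)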
Typical problem-solving scenarios for CoT include mathematical reasoning, logical reasoning, commonsense reasoning, symbolic reasoning, code generation, and so on. See Figure 3.1 for more examples of applying CoT in various tasks.

CoT today is one of the most active fields of prompt engineering. This has not only led to improved performance in LLM prompting but has also opened the door to a wide range of methods for studying and verifying the reasoning capabilities of LLMs. Although we have focused on the basic idea of CoT in this section, it can be improved in several ways. For example, we can consider the reasoning process as a problem of searching through many possible paths, each of which may consist of multiple intermediate states (i.e., reasoning steps). In general, we wish the search space to be well-defined and sufficiently large, so that we are more likely to find the optimal result. For this reason, an area of current LLM research aims at designing better structures for representing reasoning processes, allowing LLMs to tackle more complex reasoning challenges. These structures include tree-based structures [Yao et al., 2024], graph-based structures [Besta et al., 2024], and so on. By using these compact representations of reasoning paths, LLMs can explore a wider range of decision-making paths, analogous to System 2 thinking5.

Another line of research focuses on prompting LLMs with multi-round interactions. This involves decomposing complex problems into sub-problems, verifying and refining model outputs, employing model ensembling, and so on. Note that these methods and the issues involved are not limited to CoT. In fact, they are often used as more general approaches to improving LLMs, while CoT can be seen as a way to test the capabilities of LLMs. We will see discussions of some of these issues in the following subsections.

Before leaving our discussion of CoT, we should consider its practical limitations. One of them is the need for detailed, multi-step reasoning demonstrations in few-shot CoT scenarios, which may be difficult to obtain, either automatically or manually. Also, there is no standard method for breaking down complex problems into simpler problem-solving steps; this often depends heavily on the user’s experience. In addition, errors in intermediate steps can affect the accuracy of the final conclusion. For further discussion on the pros and cons of CoT, the interested reader can refer to recent surveys on this topic [Chu et al., 2023; Yu et al., 2023; Zhang et al., 2023a].

5 System 1 and System 2 thinking, as described by Kahneman [2011], represent two different modes of cognitive processing. System 1 is fast, automatic, intuitive, and emotional. This mode of thinking operates effortlessly and quickly, and is often what guides our daily decisions, judgments, and impressions. System 2 is slow, deliberate, and analytical. It is activated when we need to perform complex computations.

3.2.2 Problem Decomposition

We have seen that LLMs can benefit from solving a complex problem by breaking it down into simpler problem-solving tasks. Such an approach can be seen as an example of a broader paradigm known as problem decomposition, which has been extensively explored and discussed in psychology and computer science. From the psychological perspective, complex problem-solving refers to a process of addressing a problem using knowledge that helps overcome the barriers of the problem6.
Q: What home entertainment equipment requires cable?
Answer Choices: (a) radio shack (b) substation (c) television (d) cabinet
A: The answer must require cable. Of the above choices, only television requires cable. So the answer is (c).

Q: What type of person typically contracts illness?
Answer Choices: (a) hospital (b) head (c) sick person (d) elderly person (e) doctor’s office
A: The answer must be a person susceptible to illness. Of the choices, "elderly person" is correct because they typically have weaker immune systems. So the answer is (d).

(a) CSQA

Q: Do hamsters provide food for any animals?
A: Hamsters are prey animals. Prey are food for predators. Thus, hamsters provide food for some animals. So the answer is yes.

Q: Can you use Microsoft Office without internet?
A: Yes, you can use Microsoft Office applications, although features like cloud storage and online collaboration require internet. So the answer is yes.

(b) StrategyQA

Q: Complete the rest of the sequence, making sure that the parentheses are closed properly. Input: [ {
A: Let’s think step by step.
0: empty stack
1: [ ; stack: [
2: { ; stack: [ {
So the answer is } ].

Q: Complete the rest of the sequence, making sure that the parentheses are closed properly. Input: < [ [
A: Let’s think step by step.
0: empty stack
1: < ; stack: <
2: [ ; stack: < [
3: [ ; stack: < [ [
So the answer is ] ] >.

(c) Dyck languages

Q: Take the last letters of the words in "Elon Musk" and concatenate them.
A: The last letter of "Elon" is "n". The last letter of "Musk" is "k". Concatenating them is "nk". The answer is nk.

Q: Take the last letters of the words in "Renata Mariela Mona Kristin" and concatenate them.
A: The last letter of "Renata" is "a". The last letter of "Mariela" is "a". The last letter of "Mona" is "a". The last letter of "Kristin" is "n". Concatenating them is "aaan". The answer is aaan.

(d) Last Letter Concatenation

Fig. 3.1: CoT in four different reasoning tasks, including CSQA, StrategyQA, Dyck languages, and Last Letter Concatenation. The CoT parts are highlighted in green.

6 A relatively formal definition can be found in Frensch and Funke [2014]’s book: complex problem-solving occurs to overcome barriers between a given state and a desired goal state by means of behavioral and/or cognitive, multi-step activities.

There are generally no standard or clear paths to a solution for a complex problem. However, it is often advantageous to employ strategies that decompose the problem, thereby making it easier to tackle the corresponding sub-problems with less effort. For example, consider writing a blog about the risks of AI. If we simply prompt an LLM with the instruction “Please write a blog about the risks of AI”, the LLM may generate a blog with arbitrary structures and writing styles. A better method, instead, could be to outline the blog and provide more detailed information about each section. Consider the following prompt.

You are a blog writer. Please follow the provided outline below to write a blog about the risks of AI.
• Introduction
Introduce AI, its relevance, and the importance of understanding its risks for youth.
• Privacy Concerns
Discuss how AI might compromise personal privacy through interactions online.
• Misinformation
Explore AI’s role in spreading misinformation and influencing young people’s decisions.
• Cyberbullying
Highlight how AI tools can be utilized in cyberbullying and the impact on mental health.
• Tips for Safe AI Use
Offer guidelines for responsible AI usage and promote critical thinking.
• Conclusion
Recap main points and encourage proactive engagement with AI ethics.

Here we give the title and major points for each section. Then, the LLM can use this structure to break down the writing task by filling in content for these sections. Note that the way to structure the blog can be provided by humans or even generated automatically. For example, we can use the LLM to first generate the outline, and then ask it to follow this outline to complete the writing.

In computer science, decomposing complex problems is a commonly used strategy in software and hardware system design. A well-known example is the divide-and-conquer paradigm, which is often used to design algorithms for computation problems that can be reduced to simpler, more manageable problems. For example, consider the problem of determining whether a document discusses the risks of AI. We can instruct the LLM with the following prompt.

You are provided with a text. Please determine whether it discusses the risks of AI.
{∗document∗}

If the document is long, the computation will be expensive. Alternatively, we can divide the document into relatively short segments and perform the same task on each segment. These segments can be processed in parallel to further reduce the computational cost. Next, we determine the relevancy of each segment to the topic of AI risks. The final output is then generated using another prompt.

Your task is to determine whether a text discusses the risks of AI. This text has been divided into segments, and you have obtained the relevancy of each segment to the topic of AI risks. Based on this, please provide your final result.
Segment 1: {∗relevancy-to-the-topic1∗}
Segment 2: {∗relevancy-to-the-topic2∗}
Segment 3: {∗relevancy-to-the-topic3∗}
...

Now let us return to a more general discussion of problem decomposition in prompting. While problem decomposition can be applied to various NLP problems, it has recently been more extensively discussed and tested on reasoning tasks. For complex reasoning tasks, we often need a multi-step reasoning path to reach a correct conclusion. We can use LLMs to achieve this in three different ways. First, LLMs can reach the conclusion directly. In other words, they predict without explicit reasoning processes, relying on a hidden and uninterpretable reasoning mechanism. Second, LLMs are prompted to generate a multi-step reasoning path that leads to the conclusion, as in CoT. However, we run the LLM just once, and all intermediate reasoning steps are generated in a single prediction. Third, we break down the original problem into a number of sub-problems, which are either addressed in separate runs of the LLM or tackled using other systems. Here we focus our attention on the third approach, which is closely related to problem decomposition. Note, however, that a more comprehensive discussion could cover all these approaches; the first two have been discussed to some extent in this chapter.

A general framework for problem decomposition involves two elements.

• Sub-problem Generation. This involves decomposing the input problem into a number of sub-problems.
• Sub-problem Solving. This involves solving each sub-problem and deriving intermediate and final conclusions through reasoning.

These two elements can be modeled in different ways, leading to various problem decomposition methods. One approach is to treat them as separate steps in a two-step process.
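Before looking at this two-step process in more detail, the divide-and-conquer example above, in which a long document is segmented and the per-segment judgments are aggregated, can be sketched as follows. The sketch is ours: llm() is a hypothetical stand-in for a model call, the prompts paraphrase those shown above, and the segmentation by character count is only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to an LLM."""
    raise NotImplementedError

def discusses_ai_risks(document: str, segment_size: int = 2000) -> str:
    # Step 1: divide the long document into relatively short segments.
    segments = [document[i:i + segment_size]
                for i in range(0, len(document), segment_size)]

    # Step 2: judge each segment independently; the segments do not depend on
    # one another, so they can be processed in parallel.
    segment_prompt = ("You are provided with a text. Please determine whether "
                      "it discusses the risks of AI.\n{segment}")
    with ThreadPoolExecutor() as pool:
        relevancies = list(pool.map(
            lambda s: llm(segment_prompt.format(segment=s)), segments))

    # Step 3: aggregate the per-segment judgments with a second prompt.
    summary = "\n".join(f"Segment {i + 1}: {r}" for i, r in enumerate(relevancies))
    final_prompt = ("Your task is to determine whether a text discusses the risks "
                    "of AI. This text has been divided into segments, and you have "
                    "obtained the relevancy of each segment to the topic of AI "
                    "risks. Based on this, please provide your final result.\n"
                    + summary)
    return llm(final_prompt)
```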
For example, consider the blog writing task described at the beginning of this subsection. In the first step, we decompose the entire problem into sub-problems all at once (i.e., outline the blog). In the second step, we solve the sub-problems either sequentially or in another order (i.e., fill in content for each section as needed). The final output of this process combines the results from solving each sub-problem. While this method is simple and straightforward, it assumes that the problem is compositional, making it more suitable for tasks like writing and code generation.

However, many real-world problems require complex reasoning. One key characteristic of these problems is that the reasoning steps may not be fixed. The reasoning path can vary for different problems, and each step of reasoning may depend on the outcomes of prior steps. In such cases, it is undesirable to use fixed sub-problem generation in advance. Instead, sub-problems should be generated dynamically based on the input problem and, if possible, generated on the fly during the reasoning process. This makes problem decomposition more challenging than designing divide-and-conquer algorithms. Ideally, we would like to jointly design the systems for both sub-problem generation and sub-problem solving. A more practical and widely used approach, however, is to adopt separate models for these tasks. A straightforward way to achieve this is to adapt an LLM to these tasks by either prompting or tuning the model.

Here we consider a method based on the above idea, called least-to-most prompting [Zhou et al., 2023b]. The motivation for this method arises from the challenges of solving difficult reasoning problems: those that cannot be addressed by simply generalizing from a few examples. For these problems, a more effective problem-solving strategy is to follow a progressive sequence of sub-problems that systematically lead to the conclusion. More specifically, in the least-to-most prompting method, sub-problem generation is performed by prompting an LLM with instructions and/or demonstrations. For example, below is a 2-shot prompt for sub-problem generation in least-to-most prompting.

TASK
Your task is to decompose a problem into several sub-problems. You will be given a few examples to illustrate how to achieve this.

DEMO
Q: In a community, 5% of the population are infants, 15% are children, 40% are adults, and 40% are seniors. Which group makes up the largest portion of the population?
A: To answer the question “Which group makes up the largest portion of the population?”, we need to know: “How many percent are infants?”, “How many percent are children?”, “How many percent are adults?”, “How many percent are seniors?”.
Q: Alice, Bob, and Charlie brought beads for their group project in their craft class. Alice has twice as many beads as Bob, and Bob has five times as many beads as Charlie. If Charlie has 6 beads, how many beads can they use for their craft project?
A: To answer the question “How many beads can they use for their craft project?”, we need to know: “How many beads does Bob have?”, “How many beads does Alice have?”.

USER
Q: The environmental study conducted from 2015 to 2020 revealed that the average temperature in the region increased by 2.3 degrees Celsius. What was the duration of the environmental study?
A: To answer the question “What was the duration of the environmental study?”, we need to know: “When did the environmental study start?”, “When did the environmental study end?”.
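Putting the two stages together, the least-to-most procedure might be wired up as in the following sketch. This is only an illustration under our own assumptions, not the reference implementation of Zhou et al. [2023b]: llm() is a hypothetical model call, DECOMPOSE_DEMOS stands for the few-shot decomposition prompt shown above, and the sub-questions are recovered from the decomposition output by a simple pattern match on the quoted questions.

```python
import re

def llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to an LLM."""
    raise NotImplementedError

DECOMPOSE_DEMOS = "..."  # the TASK and DEMO parts of the prompt shown above

def least_to_most(problem: str, question: str) -> str:
    # Stage 1: sub-problem generation, i.e., decompose p0 into {p1, ..., pn}.
    decomposition = llm(f"{DECOMPOSE_DEMOS}\n\nQ: {problem} {question}\nA:")
    # The demonstrations list the sub-questions in quotes after "we need to know:".
    _, _, tail = decomposition.partition("we need to know:")
    sub_questions = re.findall(r'"([^"]*\?)"',
                               tail.replace("“", '"').replace("”", '"'))

    # Stage 2: solve the sub-problems sequentially. Each prompt contains the
    # original problem plus all previously generated question-answer pairs.
    context = problem
    for sub_q in sub_questions:
        answer = llm(f"{context}\nQ: {sub_q}\nA:")
        context += f"\nQ: {sub_q}\nA: {answer}"

    # Final step: answer the original question given all sub-answers.
    return llm(f"{context}\nQ: {question}\nA:")

# Example (requires a real llm() implementation):
# least_to_most(
#     "The environmental study conducted from 2015 to 2020 revealed that the "
#     "average temperature in the region increased by 2.3 degrees Celsius.",
#     "What was the duration of the environmental study?")
```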
By learning from the demonstrations in this prompt, the LLM can generate two sub-problems for answering the new problem “What was the duration of the environmental study?” (highlighted in blue and orange). Given these sub-problems, we solve them sequentially. For each sub-problem, we take all previously-generated QA pairs as context, and then produce the answer. For the example above, we first answer the first sub-problem by prompting the LLM, like this

The environmental study conducted from 2015 to 2020 revealed that the average temperature in the region increased by 2.3 degrees Celsius.
SUB-PROB1
Q: When did the environmental study start?
A: The environmental study started in 2015.

Once we have the answer to the first sub-problem, we proceed to the second one. This time, we include both the first sub-problem and its corresponding answer in the input.

The environmental study conducted from 2015 to 2020 revealed that the average temperature in the region increased by 2.3 degrees Celsius.
SUB-PROB1
Q: When did the environmental study start?
A: The environmental study started in 2015.
SUB-PROB2
Q: When did the environmental study end?
A: The environmental study ended in 2020.

Finally, we use the LLM to solve the original problem given the answers to all the sub-problems.

The environmental study conducted from 2015 to 2020 revealed that the average temperature in the region increased by 2.3 degrees Celsius.
SUB-PROB1
Q: When did the environmental study start?
A: The environmental study started in 2015.
SUB-PROB2
Q: When did the environmental study end?
A: The environmental study ended in 2020.
FINAL
Q: What was the duration of the environmental study?
A: The duration of the environmental study was 5 years.

The least-to-most method offers a basic approach to prompting LLMs to generate and solve sub-problems separately. We can improve it in several ways. One simple improvement is to apply various advanced prompting techniques, which do not require changes to the problem decomposition framework. For example, we can incorporate CoT into the prompting to enhance the reasoning performance of sub-problem generation and solving.

Another improvement is to explore methods for better decomposing problems and organizing problem-solving paths. To describe these approaches, we will use the symbol p0 to denote the input problem, and use the symbols {p1, ..., pn} to denote the sub-problems corresponding to p0. For least-to-most prompting, we decompose p0 into {p1, ..., pn}, given by

{p1, ..., pn} = G(p0)     (3.2)

where G(·) denotes the function of sub-problem generation. Then, we solve the sub-problems {p1, ..., pn} sequentially, resulting in a sequence of answers {a1, ..., an}. For answering the i-th sub-problem pi, we include both the original problem p0 and all previously-seen problem-answer pairs in the context for prediction. The answer ai is given by ai = Si (pi , {p0 , p