Published in Transactions on Machine Learning Research (11/2024)
arXiv:2406.16838v2 [cs.CL] 20 Nov 2024

From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models

Sean Welleck, Carnegie Mellon University (wellecks@cmu.edu)
Amanda Bertsch∗, Carnegie Mellon University (abertsch@cs.cmu.edu)
Matthew Finlayson∗, University of Southern California (mfinlays@usc.edu)
Hailey Schoelkopf∗, EleutherAI (hailey@eleuther.ai)
Alex Xie, Carnegie Mellon University (alexx@cs.cmu.edu)
Graham Neubig, Carnegie Mellon University (gneubig@cmu.edu)
Ilia Kulikov, Meta FAIR (kulikov@meta.com)
Zaid Harchaoui, University of Washington (zaid@uw.edu)

∗Co-second authors

Reviewed on OpenReview: https://openreview.net/forum?id=eskQMcIbMS

Abstract

One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results. However, less attention has been given to the benefits of scaling compute during inference. This survey focuses on these inference-time approaches. We explore three areas under a unified mathematical formalism: token-level generation algorithms, meta-generation algorithms, and efficient generation. Token-level generation algorithms, often called decoding algorithms, operate by sampling a single token at a time or constructing a token-level search space and then selecting an output. These methods typically assume access to a language model's logits, next-token distributions, or probability scores. Meta-generation algorithms work on partial or full sequences, incorporating domain knowledge, enabling backtracking, and integrating external information. Efficient generation methods aim to reduce token costs and improve the speed of generation. Our survey unifies perspectives from three research communities: traditional natural language processing, modern LLMs, and machine learning systems.

Contents

1 Introduction
2 Preliminaries
  2.1 The user's goal in generation
  2.2 The modeling problem
3 Token-level generation algorithms
  3.1 MAP decoding algorithms
  3.2 Sampling and adapters
  3.3 Token-level sampling adapters
  3.4 Controlled generation
  3.5 Constrained decoding
4 Meta-generation algorithms
  4.1 Chained meta-generators
  4.2 Parallel meta-generators
  4.3 Step-level search algorithms
  4.4 Refinement algorithms
5 Incorporating external information
  5.1 Multiple models
  5.2 External environment information
6 Token cost and performance analysis
  6.1 Token budget
  6.2 Scaling the token budget to improve performance
  6.3 Minimizing the token budget
  6.4 Compute optimal inference
  6.5 Dependence on the underlying generator(s)
7 Speeding up generation
  7.1 Background
  7.2 Speeding up the generator
  7.3 Speeding up meta-generation algorithms
  7.4 Libraries and tools for fast generation
8 Discussion: why use sophisticated generation algorithms?
9 Conclusion

Figure 1 overview:
1. Generation algorithms: maximization, sampling, controlled generation
2. Meta-generation: programmatic patterns; external information (multiple models, tools, environments)
3. Efficient generation: optimizing token cost, speeding up generators, speeding up meta-generators

Example meta-generator from Figure 1, which calls an underlying generator via llm.generate:

def generate_proof(llm, theorem):
    strategies = [
        "Prove by contradiction.\n",
        "Prove by induction.\n",
    ]
    candidates = [
        llm.generate(strategy + theorem)
        for strategy in strategies
        for sample in range(5)
    ]
    output = llm.generate(
        "Which of the proofs is best?\n" + "\n".join(candidates)
    )
    return output

Figure 1: Generation algorithms produce output text using a language model. Meta-generation algorithms are programs that interleave calls to generation algorithms with control flow and external information, yielding text. Our survey covers generation algorithms and their goals (§3), meta-generation patterns (§4) and sources of external information (§5), and efficiency in terms of token cost (§6) and speed (§7).

1 Introduction

One of the most striking findings in modern research on large language models (LLMs) is that, given a model and dataset of sufficient scale, scaling up the compute used at training time leads to better final results (Kaplan et al., 2020; Hoffmann et al., 2022). However, there is another, lesser-mentioned scaling phenomenon, where adopting more sophisticated methods or scaling compute at inference time (Jones, 2021) can result in substantially better outputs from LLMs. This survey focuses on these approaches by exploring three connected themes: token-level generation algorithms, meta-generation algorithms, and efficient generation.

Token-level generation algorithms, often called decoding algorithms, have a rich history in natural language processing, ranging from classical greedy decoding and beam search to modern sampling algorithms such as nucleus (Holtzman et al., 2020) and η-sampling (Hewitt et al., 2022). These methods operate by sampling one token at a time or constructing a token-level search space. They assume varying levels of access to a language model's internals, such as logits, next-token distributions, or probability scores.
Recently there has been growing interest in meta-generation algorithms—algorithms that operate on partial or full sequences, and treat the LLM as a black box that is called as part of a larger generation program (Figure 1; Khattab et al. (2022); Dohan et al. (2022); Schlag et al. (2023)). For example, a meta-generation algorithm for solving a math problem might generate multiple solution paths, evaluate the solutions with a calculator, then select the most common answer. Meta-generators can increase the compute resources devoted to generation by making multiple model calls, augmenting the model with search algorithms (Yao et al., 2023; Madaan et al., 2023), or incorporating external data sources. Doing so has seen success in improving task performance (e.g., problem solving (Lewkowycz et al., 2022)) and steering the output distribution (e.g., with human preferences (Stiennon et al., 2020)), and may offer a way to overcome limitations of standard LLMs such as error accumulation (Dziri et al., 2023) and computational capacity (Merrill & Sabharwal, 2024). Moreover, meta-generation research is widely accessible, as it often only requires black-box LLM access.

Finally, generation needs to be fast and cost-effective. Fast generation becomes increasingly challenging as models grow in size, while cost becomes a critical factor in meta-generation algorithms that call models many times. On the other hand, meta-generation algorithms open new kinds of shared computation that can be leveraged for improved efficiency. As a result, there is growing interest in efficient generation algorithms that speed up generation and reduce token costs by drawing on ideas from machine learning systems and related areas. Efficient generation in turn expands the frontier of algorithms that are feasible to experiment with and develop, leading to a virtuous cycle of algorithmic development.

Our survey provides a unified treatment of these three themes: token-level generation algorithms, meta-generation algorithms, and techniques for making generation fast and cost-effective. We integrate ideas from traditional natural language processing, modern LLMs, and machine learning systems, and present a mathematical formalism that includes both classical generation algorithms and modern meta-generators. This unified view is particularly important as the field expands. For example, practitioners working on novel meta-generation algorithms may benefit from learning about the historical context of generation algorithms or practical efficiency constraints, while researchers interested in efficiency may benefit from learning about major algorithmic patterns. More broadly, we aim to promote further research on inference-time approaches.

Comparison to existing surveys. Several prior surveys have focused on training-time methods for better text generation (Li et al., 2021; Lu et al., 2018). Wiher et al. (2022) presents a detailed analysis of a smaller set of decoding strategies, while Zarrieß et al. (2021) spotlight token-level methods, with a particular focus on considerations for encoder-decoder models. In parallel, several surveys have addressed prompting and related methods (Liu et al., 2021b; Sahoo et al., 2024), though these works do not address token-level methods. Recent surveys have also considered strategies for speeding up inference (Chitty-Venkata et al., 2023; Miao et al., 2023a; Khoshnoodi et al., 2024; Wang et al., 2024a). However, these works focus primarily on token-level generation, not meta-generation; as a result, the discussion of inference-time compute-performance tradeoffs is limited. Our survey unifies and draws connections across these three areas. Finally, Xiao et al.
(2023) focus on non-autoregressive generation, while our survey focuses on autoregressive generation.

Roadmap. This paper provides a survey of algorithms for token-level generation, meta-generation, and efficient generation, summarized in Figure 1. First, we consider why we use generation algorithms at all. Generally, a user's intent is to surface a high-quality output from the model, which we formalize and discuss in §2. Readers who would like to review terminology or follow the mathematical formulation of the survey in depth should start in this section. Next, we discuss token-level generation algorithms in detail in §3. Most algorithms referred to as "decoding algorithms" in the literature are covered in this section. We discuss these methods' theoretical motivation, practical impact, and commonalities, and provide a unified frame for discussion. These methods generally require some degree of access to the model's internals.

A growing set of methods operate over partial or full sequences rather than individual tokens. These meta-generation algorithms have emerged from several communities, including researchers interested in designing new decoding algorithms or prompting methods, as well as researchers interested in language model alignment and reasoning. Works from these communities often have different motivations and use different terminology. We present a unified picture in §4, classifying them according to their programmatic structure (e.g., parallel generation, search, or refinement), and discussing their motivations.

In addition to wanting a high-quality output, we often care about the efficiency of generation. We consider two definitions of efficient generation. In §6 we consider the token cost of generation algorithms, which is especially relevant for studying cost-performance tradeoffs as the amount of computation allocated to generation is scaled up, and for those using API-access models that charge by the token. In §7, we discuss methods for speeding up generation primarily from a systems perspective, where access to the model weights is assumed and latency and throughput are the key considerations. In this section, we draw upon work primarily from the machine learning systems (MLSys) community. The section serves as both an introduction to this area for machine learning researchers whose work does not focus on systems, and a practical exploration of tools for speeding up generation. We include a review of libraries that implement the described techniques. We conclude the survey by discussing takeaways, broader directions, and future work in §8.

2 Preliminaries

Generation algorithms are used to produce outputs from a trained language model. Language models are probabilistic models over sequences, pθ(y|x), and most generation algorithms attempt to either find highly probable sequences or sample from the model's distribution. A natural question is why sophisticated generation algorithms are needed at all.

Table 1: Summary of notation (Symbol / Name / Explanation columns; symbols include pθ, sθ, p∗, A, r, v, g, g(·|x), q∗, f, ϕ, d, Y, V, P(Y), yt, and y).

3.3 Token-level sampling adapters

With a temperature τ > 1 the next-token distribution flattens, and as τ grows it approaches uniform sampling (all tokens have the same probability). Many other token-level decoding methods can be cast as sampling adapters, including methods that reweight logits with outputs from another model (Liu et al., 2021a; Li et al., 2023a), and a variety of other transformations summarized in Table 2.
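To make the adapter view concrete, the following is a minimal sketch (an illustration under assumed interfaces, not a reference implementation from the cited works; the NumPy-based functions and their names are our own) of how a few adapters from Table 2 transform a single next-token distribution before sampling:

import numpy as np

def temperature_adapter(probs, tau=0.7):
    # Rescale the distribution: tau < 1 sharpens it, tau > 1 flattens it toward uniform.
    logits = np.log(np.clip(probs, 1e-12, 1.0)) / tau
    shifted = np.exp(logits - logits.max())
    return shifted / shifted.sum()

def top_k_adapter(probs, k=50):
    # Truncation (top-k): keep (approximately) the k most probable tokens, then renormalize.
    cutoff = np.sort(probs)[-k] if k < len(probs) else 0.0
    kept = np.where(probs >= cutoff, probs, 0.0)
    return kept / kept.sum()

def nucleus_adapter(probs, p=0.9):
    # Truncation (cumulative prob.): keep the smallest set of tokens whose total mass reaches p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    kept_ids = order[: np.searchsorted(cumulative, p) + 1]
    kept = np.zeros_like(probs)
    kept[kept_ids] = probs[kept_ids]
    return kept / kept.sum()

def sample_next_token(next_token_probs, adapters):
    # Apply adapters q(p_theta) in sequence, then draw a token id from the result.
    probs = np.asarray(next_token_probs, dtype=float)
    for adapter in adapters:
        probs = adapter(probs)
    return np.random.choice(len(probs), p=probs)

For example, sample_next_token(probs, [temperature_adapter, nucleus_adapter]) corresponds to temperature sampling followed by nucleus truncation, while an empty adapter list recovers per-step ancestral sampling from pθ.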
Method | Purpose | Adapter | Extrinsic
Ancestral sampling | y ∼ pθ | – | –
Temperature sampling [1] | y ∼ q(pθ) | Rescale | –
Greedy decoding | y ← max pθ | Argmax (temperature → 0) | –
Top-k sampling [56] | y ∼ q(pθ) | Truncation (top-k) | –
Nucleus sampling [86] | y ∼ q(pθ) | Truncation (cumulative prob.) | –
Typical sampling [154] | y ∼ q(pθ) | Truncation (entropy) | –
Epsilon sampling [82] | y ∼ q(pθ) | Truncation (probability) | –
η sampling [82] | y ∼ q(pθ) | Truncation (prob. and entropy) | –
Mirostat decoding [11] | Target perplexity | Truncation (adaptive top-k) | –
Basis-aware sampling [57] | y ∼ q(pθ) | Truncation (linear program) | LP Solver
Contrastive decoding [129] | y ∼ q(pθ) | log pθ′ − log pθ and truncation | Model pθ′
DExperts [137] | y ∼ q∗(·|x, c) | ∝ pθ · (pθ+/pθ−)^α | Models pθ+, pθ−
Inference-time adapters [146] | y ∼ q∗ ∝ r(y) | ∝ (pθ · pθ′)^α | Model pθ′
Proxy tuning [138] | y ∼ q∗(·|x, c) | ∝ pθ · (pθ+/pθ−)^α | Models pθ+, pθ−

Table 2: Survey of token-level generation. r(y) is a scalar reward function. c is a control attribute. Extrinsic refers to a model or solver separate from the underlying language model pθ.

Many of these token-level generation algorithms assume access to the language model's next-token distributions. In practice, next-token distributions are increasingly not provided by common generation APIs, both for practical reasons and for security (Finlayson et al., 2024b; Carlini et al., 2024). Instead, token-level algorithms are often implemented by the API provider, and used by setting hyperparameters (e.g., setting a temperature τ).

Adapters for statistical control. Several decoding methods use sampling adapters to control the statistical and information-theoretic properties of model outputs and align them with those of human text. These include locally typical sampling (Meister et al., 2023b), which aims to sample from the LM distribution's typical set (MacKay, 2004); and mirostat sampling (Basu et al., 2021), which attempts to match the perplexity of the generated text to the expected perplexity under Zipf's law (Zipf, 1999; Powers, 1998). Intriguingly, Shi et al. (2024a) evaluate Llama 2 models with a variety of adapters (temperature, top-k, top-p, η, Mirostat, and typical sampling), and find no definitive best method for the evaluated open-ended text generation tasks. Furthermore, temperature sampling usually outperformed the other adapters in input-output tasks such as code generation and translation. In general, which adapter to use remains an open question.

Autoregression and lookahead adapters. Token-level algorithms generate from left-to-right, meaning that they generate each token without knowing the eventual identity of tokens to the right. Several algorithms have incorporated various heuristic scores v(y≤t) that adjust the next-token distribution using information from potential future tokens. This includes explicitly generating several tokens ahead (e.g., Lu et al. (2022); Leviathan et al. (2022)), or learning a function vϕ(y≤t) that predicts a property of a full sequence (e.g., its style score or correctness) (Yang & Klein, 2021). Doing so can aid in satisfying sequence-level criteria.
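As a rough illustration of the lookahead idea (a minimal sketch under assumed interfaces, not an implementation from the works cited above), a learned value function vϕ can reweight the next-token distribution by scoring candidate one-token extensions of the prefix:

import numpy as np

def lookahead_adjusted_probs(next_token_probs, prefix_tokens, value_fn, beta=1.0, top_k=20):
    # Reweight p_theta(y_t | y_<t) by exp(beta * v(y_<=t)), where value_fn estimates a
    # sequence-level property (e.g., style or eventual correctness) from a token prefix.
    # Only the top_k most probable candidates are scored to limit the number of value calls.
    probs = np.asarray(next_token_probs, dtype=float)
    candidates = np.argsort(probs)[::-1][:top_k]
    adjusted = np.zeros_like(probs)
    for token in candidates:
        adjusted[token] = probs[token] * np.exp(beta * value_fn(prefix_tokens + [int(token)]))
    return adjusted / adjusted.sum()

Setting beta to 0 recovers (top-k truncated) sampling from pθ, while larger values weight the heuristic more heavily; value_fn, beta, and top_k here are illustrative placeholders.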
Distribution adjustment with another language model. Some algorithms adjust the next-token distribution using another language model. This can arise from several motivations, including removing abnormalities in the model's next-token distributions (Li et al., 2023a), speeding up generation (Leviathan et al., 2022), or shifting the generation distribution to one with a desired property (e.g., a style) (Liu et al., 2021a).

3.4 Controlled generation

Many scenarios can be framed as aiming to sample from a language model's distribution modulated by a sequence-level criterion c(y) (Korbak et al., 2022a;c; Hu et al., 2024; Zhao et al., 2024a):

q∗ ∝ pθ(y|x) c(y).    (15)

For example, c(y) may assign high values to sequences with a particular style, or low values to sequences with toxic content or buggy code. Another way of phrasing (15) is sampling from a particular energy-based model (LeCun et al., 2006; Khalifa et al., 2021). We discuss three examples based on the structure of c(y).

Classifier. In some cases c(y) is a classifier p(a|x, y), which predicts the probability that y contains an "attribute" a, such as a style or non-toxicity. The goal is then to sample from:

q∗ ∝ pθ(y|x) p(a|x, y)^β,    (16)

where β is a hyperparameter that places more weight on the classifier as it increases. Various generation algorithms have been developed for this purpose, such as approximations based on reweighting next-token distributions with other language models (Liu et al., 2021a), reweighting with a learned classifier that approximates the sequence-level classification pϕ(a|y