A Review of Sparse Expert Models in Deep Learning

William Fedus* (Google Brain), Jeff Dean (Google Research), Barret Zoph* (Google Brain)

* Equal contribution. Correspondence to {liam.fedus,barretzoph}@gmail.com.

arXiv:2209.01667v1 [cs.LG] 4 Sep 2022

Abstract

Sparse expert models are a thirty-year-old concept re-emerging as a popular architecture in deep learning. This class of architecture encompasses Mixture-of-Experts, Switch Transformers, Routing Networks, BASE layers, and others, all with the unifying idea that each example is acted on by a subset of the parameters. By doing so, the degree of sparsity decouples the parameter count from the compute per example, allowing for extremely large but efficient models. The resulting models have demonstrated significant improvements across diverse domains such as natural language processing, computer vision, and speech recognition. We review the concept of sparse expert models, provide a basic description of the common algorithms, contextualize the advances in the deep learning era, and conclude by highlighting areas for future work.

1 Introduction

Remarkable advances in machine learning – especially in natural language – have been achieved by increasing the computational budget, training data, and model size. Notable milestone language models include GPT-2 (Radford et al., 2018), BERT (Devlin et al., 2018), T5 (Raffel et al., 2019), GPT-3 (Brown et al., 2020), Gopher (Rae et al., 2021), Chinchilla (Hoffmann et al., 2022), and PaLM (Chowdhery et al., 2022). However, state-of-the-art models now require thousands of specialized, interconnected accelerators for weeks or months at a time. These models are therefore expensive to produce and incur high energy costs (Patterson et al., 2021). Consequently, as the scale of machine learning systems has increased, the field has sought more efficient training and serving paradigms. Sparse expert models have risen as a promising solution.

Figure 1: Comparing a dense and sparse expert Transformer. A dense model (left) sends both input tokens to the same feed-forward network parameters (FFN). A sparse expert model (right) routes each input token independently among its four experts (FFN1 · · · FFN4). In this diagram, each model uses a similar amount of computation, but the sparse model has more unique parameters. Note that while this figure showcases a specific and common approach of sparse feed-forward network layers in a Transformer (Vaswani et al., 2017), the technique is more general.

Sparse expert models, of which Mixture-of-Experts (MoE) is the most popular variant, are neural networks where a set of the parameters is partitioned into "experts", each with a unique weight. During training and inference, the models route input examples to specific expert(s) weights. As a result, each example only interacts with a subset of the network parameters, contrasting the usual approach where the entire network is used for each input. Because only a fraction of the experts are used for each example, the amount of computation may remain small relative to the total model size. Many modern sparse expert models draw inspiration from Shazeer et al.
(2017), which trained the largest model at the time and achieved state-of-the-art language modeling and translation results. Sparse expert models have further surged in popularity when combined with Transformer language models (Lepikhin et al., 2020; Fedus et al., 2021). And while most work has been in natural language processing, they have also been successfully used in a variety of domains including computer vision (Puigcerver et al., 2020), speech recognition (You et al., 2021) and multi-modal learning (Mustafa et al., 2022). Recent work by Clark et al. (2022) rigorously studied the scaling properties of sparse expert models across different model sizes and numbers of experts. Further, state-of-the-art results on many benchmarks are currently held by sparse expert models such as ST-MoE (Zoph et al., 2022). The field is evolving quickly, with research and engineering advances increasing our understanding and improving empirical results.

We narrow our survey to sparse expert models in the era of deep learning (heuristically 2012 onward), recounting recent advances and discussing promising future avenues. For a comprehensive review of the history of Mixture-of-Experts, predating the recent deep learning advances, we refer readers to the survey "Twenty Years of Mixture-of-Experts" (Yuksel et al., 2012). Further, sparse expert models may be regarded as a special class of adaptive computation models, which are surveyed in Xu and McAuley (2022). Finally, Tay et al. (2020) surveys a broader set of methods aimed at increasing the computational efficiency of Transformers, of which sparse expert models are one promising approach.

2 Sparse Expert Models

The concept of MoE in machine learning dates back at least three decades to the work of Jacobs et al. (1991); Jordan and Jacobs (1994). In early concepts, the experts defined an entire neural network and the MoE was similar to ensemble methods.

2.1 In Deep Learning

Eigen et al. (2013) proposed architectures that used stacked layers of Mixture-of-Experts on jittered MNIST (LeCun et al., 1998). This work used a continuous mixture of the experts' outputs (soft selection) rather than restricting to the top subset of experts at each layer (hard selection) – limiting its practicality.¹ This work, however, set the stage for later efficient implementations which relied on the idea of MoE as a component of a neural network.

¹ The full computational cost is incurred with soft selection even if the expert was not necessary (i.e. an exceedingly small routing weight).

The first large-scale success of this approach in deep learning came from Shazeer et al. (2017). This work inserted an MoE layer between two LSTM layers (Hochreiter and Schmidhuber, 1997), where the output from the lower layer's LSTM was sent for computation within the MoE. The resulting sparse model was state-of-the-art in machine translation, though the largest variant with 131,072 experts and 137B parameters generalized worse than smaller variants. Despite this success, however, follow-on research was relatively dormant, with greater emphasis on directly studying the Transformer (Vaswani et al., 2017). This changed with the release of GShard (Lepikhin et al., 2020) and Switch Transformers (Fedus et al., 2021) – both of which replaced the feed-forward layers in Transformers with expert layers. However, while the experts-as-a-layer approach has become the dominant paradigm, more recent works revisit the concept of experts as fully independent models (Gururangan et al., 2021; Li et al., 2022). This confers a benefit of modularity and composability; Li et al.
(2022) show that custom networks can be constructed by composing expert language models trained on specific domains.

Figure 2 illustrates the original top-k routing mechanism proposed in Shazeer et al. (2017), which was foundational to many follow-on works. New advances to the routing algorithm are described in Section 4. Choosing experts based on the input usually entails a discrete selection (i.e. which expert to use), which complicates backpropagation algorithms relying on differentiability. As a solution, Shazeer et al. (2017) proposed a top-k routing function which takes as input a token representation x and then routes it to the top-k experts out of the set {E_i}_{i=1}^{N} of N experts. The router has a trainable variable W_r which computes the logits h(x) = W_r · x, which are normalized via a softmax distribution over the N experts. The gate-value for expert i is given by,

    p_i(x) = \frac{e^{h(x)_i}}{\sum_{j=1}^{N} e^{h(x)_j}}.    (1)

We denote the set of selected top-k expert indices as T. The output computation of the layer is the linearly weighted combination of each expert's computation on the token by the gate value,

    y = \sum_{i \in T} p_i(x) E_i(x).    (2)

Figure 2: Schematic of top-k routing. We visualize an example of the top-k token routing scheme over five experts and three input tokens. Each expert and token is color-coded and the router weights (W_r) have a representation for each expert (color matched). To determine the routing, the router weight performs a dot product with each token embedding (x) to produce the router scores (h(x)). These scores are then normalized to sum to one (p(x)).

We note that in contrast to Eigen et al. (2013), this selection is only over the top-k experts and is thus more computationally efficient.
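To make Equations 1 and 2 concrete, the following is a minimal NumPy sketch of the gating computation for a single token. The function and variable names are illustrative rather than any particular library's API, and production implementations operate on batches of tokens with the experts distributed across devices (Section 2.2).

```python
import numpy as np

def top_k_moe_layer(x, router_weights, experts, k=2):
    """Top-k gating for a single token (Shazeer et al., 2017; Equations 1-2).

    x:              token representation, shape [d_model]
    router_weights: W_r, shape [d_model, num_experts]
    experts:        list of N callables, each mapping [d_model] -> [d_model]
    """
    logits = x @ router_weights                       # h(x) = W_r . x
    logits = logits - logits.max()                    # standard stabilization before the softmax
    probs = np.exp(logits) / np.exp(logits).sum()     # p(x), Equation 1
    top_k = np.argsort(-probs)[:k]                    # indices of the k largest gates (the set T)
    # Equation 2: gate-weighted sum of the selected experts' outputs.
    return sum(probs[i] * experts[i](x) for i in top_k)

# Toy usage: 4 experts, each a random linear map over an 8-dimensional token.
rng = np.random.default_rng(0)
d_model, num_experts = 8, 4
expert_mats = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]
experts = [lambda v, W=W: v @ W for W in expert_mats]
router_weights = rng.normal(size=(d_model, num_experts))
y = top_k_moe_layer(rng.normal(size=d_model), router_weights, experts, k=2)
```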
2.2 On Modern Hardware

Modern sparse expert models have been co-designed with the distributed systems used to train the largest neural networks. These are a special case of sparse neural networks (Gale et al., 2019; Dettmers and Zettlemoyer, 2019; Evci et al., 2020), which are similar in that they only use a subset of parameters, but differ because they have potentially irregular sparsity patterns. And while generic sparse neural networks (with irregular sparsity patterns) reduce overall theoretical FLOPs, these are often not efficiently supported on current hardware, which specializes in linear algebra operations on contiguous (regular) blocks of memory. Sparse expert models, on the other hand, activate entire blocks of parameters (i.e. entire matrices), and thus easily translate theoretical FLOPs savings to practical time savings on modern hardware (Fedus et al., 2021; Rajbhandari et al., 2022).

The largest neural networks (Brown et al., 2020; Rae et al., 2021; Chowdhery et al., 2022) now far exceed the memory capacity of a single accelerator and therefore tensors (e.g. weights, activations, optimizer variables) are sharded using various parallelism strategies. Three common approaches include data parallelism (model weights replicated, but data sharded), tensor model-parallelism (Shazeer et al., 2018) (data and weight tensors are split across devices), and pipeline parallelism (Harlap et al., 2018; Huang et al., 2019) (entire layers or groups of layers are split across devices). Mixture-of-Experts fits naturally with these parallelism schemes. Experts reside on different accelerators and the input data is dynamically dispatched to and fetched from them. Early architectures often employed many small experts that would fit within an individual accelerator (Lepikhin et al., 2020), but later works designed larger experts that must be split across accelerators (Fedus et al., 2021; Du et al., 2021) and required additional optimizations for communication efficiency (Shazeer et al., 2018; Roberts et al., 2022; Rajbhandari et al., 2022).

Dynamic routing on distributed systems incurs additional communication overhead beyond standard Transformer models. Dispatching the inputs to the experts is often implemented as an all2all communication primitive, where each accelerator communicates data to all other accelerators.² The capacity factor directly impacts the communication cost by modulating the expert batch size (Lepikhin et al., 2020) to be CF · (B/E), where CF is the capacity factor, B is the total number of tokens per batch and E is the number of experts. Larger capacity factor values can improve quality, but at the expense of increased communication, memory and compute costs. Efficient implementations of the all2all primitive, along with changes to the routing algorithm (e.g. a reduced capacity factor), alleviate the added communication costs from sparse expert algorithms.

² Many routing algorithms (but not all) incur two all2all communication costs in the forward pass and another two in the backward pass. An example of a routing algorithm using more is BASE layers (Lewis et al., 2021), which requires four all2all in the forward pass and another four in the backward pass.

When training standard distributed Transformers, it is known in advance what batch of data each accelerator will process. However, dynamic routing algorithms break this property because inputs are dynamically routed to experts, which can often lead to different numbers of inputs being sent to each of the experts. Therefore, routing algorithms often encourage load balance over the accelerators to encourage good utilization. Load balance has been accomplished by auxiliary losses (Shazeer et al., 2017) as well as through treating this as a linear assignment problem (Lewis et al., 2021; Clark et al., 2022). More details on advances to load balancing are provided in Section 4.

Finally, recent systems advances have further improved both the training and deployment of MoE models. Jaszczur et al. (2021) sparsify all the layers (e.g. dense and self-attention) of a Transformer model to achieve 37× inference speedups for a special case of single-example inference (unbatched). Kossmann et al. (2022) relaxes the constraints of static expert batch sizes with the RECOMPILE library. This system dynamically recompiles and optimizes the computational resources of Mixture-of-Experts models so tensor sizes are matched to the experts' computational demands, not statically-set arrays. Next, in addition to data-, model-, and expert-parallelism, the DeepSpeed-MoE library (Rajbhandari et al., 2022) supports ZeRO partitioning (Rajbhandari et al., 2019) (fully partition tensors and regather as needed) and ZeRO-Offload (offloading to CPU to reduce GPU memory usage). This system yielded 10× inference improvements (Rajbhandari et al., 2022) and state-of-the-art translation (Kim et al., 2021) – increasing the practicality of these models for production services.
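As a rough, single-device illustration of how the capacity factor bounds the per-expert batch, the sketch below computes the expert capacity as CF · (B/E) and marks tokens that overflow an expert's buffer as dropped (in practice such tokens simply pass through the layer via the residual connection). This is a simplification for exposition, not the all2all dispatch used by real implementations.

```python
import numpy as np

def dispatch_with_capacity(expert_indices, num_experts, capacity_factor=1.25):
    """Assign tokens to experts under a fixed per-expert capacity.

    expert_indices: [batch] array, the expert chosen for each token (top-1 routing)
    Returns a [batch] boolean mask of tokens that fit within capacity; the rest
    are "dropped" and receive no expert computation at this layer.
    """
    batch = expert_indices.shape[0]
    capacity = int(capacity_factor * batch / num_experts)   # CF * (B / E)
    kept = np.zeros(batch, dtype=bool)
    slots_used = np.zeros(num_experts, dtype=int)
    for t, e in enumerate(expert_indices):   # earlier tokens get priority (cf. BPR in Section 4)
        if slots_used[e] < capacity:
            kept[t] = True
            slots_used[e] += 1
    return kept

# Example: 8 tokens, 4 experts, capacity = int(1.25 * 8 / 4) = 2 slots per expert.
mask = dispatch_with_capacity(np.array([0, 0, 0, 1, 2, 2, 3, 1]), num_experts=4)
# The third token overflows expert 0 and is dropped: mask = [T, T, F, T, T, T, T, T]
```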
3 Scaling Properties of Sparse Expert Models

The cross-entropy loss of dense neural language models was shown to scale as a power law (i.e. l(x) = (c/x)^α for a variable x) with respect to the model parameter count, amount of data, and compute budget when not constrained by the other two factors (Kaplan et al., 2020). The power law coefficients were later corrected in Hoffmann et al. (2022), which demonstrated that compute-optimal models required a closer balance of data and parameter scaling. In contrast, early research in sparse expert models scaled heuristically – achieving strong empirical results – but without careful characterization of the scaling laws. Further, several works highlighted discrepancies between upstream (e.g. pre-training) and downstream (e.g. fine-tuning) behavior (Fedus et al., 2021; Artetxe et al., 2021), further complicating the understanding and explanation of sparse expert models.

3.1 Upstream Scaling

Sparse expert models have excelled when trained on large datasets. A common paradigm in natural language processing is to perform upstream training (e.g. pre-training) which is then followed by downstream training (e.g. fine-tuning) on data distributions of specific interest. Sparse expert models have consistently yielded substantial gains over dense counterparts during the upstream phase. Shazeer et al. (2017) presented scaling curves with respect to model parameters and the computational budget on the 1-Billion-Word Language-Modeling Benchmark (Chelba et al., 2013), achieving significant gains over dense versions. Lepikhin et al. (2020) presented translation improvements as a function of model scale, and obtained a 13.5 BLEU score gain on their largest 600B parameter sparse model. Switch Transformers (Fedus et al., 2021) measured 4-7× speed-ups in wall-time using the same compute resources over T5 models. The work also studied the cross entropy loss scaling as a function of parameter count, but observed that the gains diminished with 256+ experts. Furthering our understanding, Artetxe et al. (2021) distinguished upstream scaling behavior of MoE models on in-domain and out-of-domain data and found significantly better scaling for in-domain language modeling compared to dense models, corroborating the difficulties of transfer from Fedus et al. (2021).

Figure 3: Sparse scaling plots with expert count. The cross-entropy scaling plots as a function of the number of experts are shown from Fedus et al. (2021) (left) and the three sparse variants from Clark et al. (2022), S-Base, RL-R, Hash (right). The top left-most point in both plots is an approximately compute-matched dense model. As the expert count increases, the models become increasingly sparse and yield lower validation losses.

After these early empirical successes, Clark et al. (2022) conducted the first large-scale effort to mathematically characterize the scaling properties of sparse expert models.
This work considered three classes of sparse models and derived a notion of effective parameter count (EPC). The EPC estimates the dense-parameter equivalent for a sparse expert model, based on the FLOPs and the number of experts. It was derived by conjecturing that sparse expert models followed a bilinear loss, and it was shown empirically that the cross entropy loss scales as a power law in this variable. Figure 3 presents the cross entropy scaling of Switch Transformers on the left and the three sparse variants of Clark et al. (2022) on the right. One key property of the scaling curves was that the gain of sparse expert models decreased with scale, which, when extrapolated, implied that there would be no further benefit of sparsity beyond 900B parameters of FLOPs. This result, however, was dependent on the number of tokens used for training, and all models used only 130B tokens. But in light of the recent scaling results from Hoffmann et al. (2022), which recommend more tokens to train compute-optimal models (Chinchilla was a 70B parameter model trained on 1.4T tokens), future work might revisit this analysis.

3.2 Downstream Scaling

However, the reliable upstream scaling did not immediately yield consistent gains on downstream tasks. In one work highlighting the challenge of transfer, Fedus et al. (2021) observed 4× pre-training improvements with a low-compute, high-parameter encoder-decoder Transformer (1.6T parameters with 2048 experts per sparse layer), but it fine-tuned poorly on reasoning-heavy tasks such as SuperGLUE (Wang et al., 2019) compared with dense models. This finding hinted at further necessary research as well as the potential need for a balance between computation and parameters. However, strong empirical results soon followed in few-shot inference, fine-tuning, and other modalities.

Du et al. (2021) presented the scaling of sparse GLaM models ranging from 1B-64B FLOPs using 64 experts per sparse layer. GLaM achieved state-of-the-art results, outperforming the 175B parameter GPT-3 (Brown et al., 2020) model in zero and one-shot performance, while using 49% fewer FLOPs per token at inference and 65% lower power (left plot in Figure 4). In another example of sparse models performing well on few-shot inference, the BIG-Bench (Srivastava et al., 2022) collaboration measured a 2× improvement of sparse over dense models on the 161 contributed JSON tasks (right plot in Figure 4).

Figure 4: Sparse scaling for few-shot inference. Left: Du et al. (2021) measures the few-shot inference performance on TriviaQA, demonstrating consistent gains of sparse MoE models over dense models up to 137B parameters. Each label, such as 8B/64E, says how many parameters per input are used (8B) and how many experts (64E). Right: BIG-Bench (Srivastava et al., 2022) studied the few-shot scaling properties on a larger set of 161 contributed JSON tasks to confirm improvements of sparse expert models over their FLOP-matched dense counterparts.

Finally, Srivastava et al. (2022) studied the calibration of sparse models on the multiple-choice BIG-Bench tasks. Calibration measures the degree to which the probability of a prediction matches the probability of being correct. This work measured calibration by the Expected Calibration Error (Naeini et al., 2015), which is the absolute deviation between the predicted probability and average accuracy, after binning examples by their predicted probability. While calibration improves for both larger dense and sparse models (Figure 5), the sparse models were found to match the calibration of a dense model using 10× more FLOPs.
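For reference, a small NumPy sketch of the Expected Calibration Error described above: examples are binned by predicted probability and the absolute gap between mean confidence and accuracy is averaged, weighted by bin size. Binning details (number of bins, equal-width versus equal-mass bins) vary across papers, so treat this as illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE (Naeini et al., 2015): bin examples by predicted probability and
    average the absolute gap between mean confidence and accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap   # weight by fraction of examples in the bin
    return ece

# e.g. three predictions with confidences 0.9, 0.6, 0.8, of which the first two are correct.
print(expected_calibration_error([0.9, 0.6, 0.8], [1, 1, 0]))
```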
3.3 Scaling the Number, Size and Frequency of Expert Layers

Several important hyperparameters, beyond those in a dense Transformer, govern the scale of sparse expert models, including 1) the expert count, 2) the size of each expert, and 3) the frequency of expert layers. These decisions can have significant implications for upstream and downstream scaling.

Many earlier works scaled to thousands of relatively small experts per layer, which has produced excellent pre-training and translation quality (Shazeer et al., 2017; Lepikhin et al., 2020; Fedus et al., 2021). However, the quality of sparse models is disproportionately reduced under domain shift (Artetxe et al., 2021) or when fine-tuning on different task distributions (Fedus et al., 2021). The state-of-the-art sparse models for few-shot inference (GLaM (Du et al., 2021)) and for fine-tuning (ST-MoE (Zoph et al., 2022)) use only up to 64 larger experts – a better balance of computation and parameters. As a result of the increased expert dimensions, these models require specific system-level sharding strategies over accelerators to run efficiently (Du et al., 2021; Rajbhandari et al., 2022).

Figure 5: Sparse model calibration. The Expected Calibration Error improves with scale for both dense and sparse models. However, sparse models exhibit significantly better calibration and roughly match the calibration of a 10× larger dense model. Figure is reproduced from Srivastava et al. (2022).

Next, we recap the current conventions around the frequency of expert layers. Usually, sparse models are constructed by beginning with a dense model and either inserting or substituting sparse expert layers at either a fixed interval or heuristically. As an example, Rajbhandari et al. (2022) put more sparse layers near the final layers of the network. In the Transformer, the most common approach is to replace every other Feed-Forward Network (FFN) layer (Lepikhin et al., 2020; Du et al., 2021; Artetxe et al., 2021; Rajbhandari et al., 2022), that is, to substitute with a frequency of 0.5. However, other frequencies have been used, including every fourth layer (0.25) in Zoph et al. (2022) and every layer (i.e. 1.0) in Fedus et al. (2021). Finally, a frequency of 0.5-1.0 is recommended by Clark et al. (2022).

Ultimately, the answer to the question of optimal hyperparameters depends on the application and the hardware system specifications. Prior work demonstrated strong pre-training and translation results with a high number of experts (Shazeer et al., 2017; Lepikhin et al., 2020), whereas the best performing models under transfer have used fewer, larger experts (Du et al., 2021; Zoph et al., 2022; Mustafa et al., 2022). Further, these decisions are highly hardware-dependent.
Due to the added all2all communication costs to implement routing, networks with slower interconnect speeds may find that fewer experts are optimal on a time-to-quality basis. A simulation of the compute, memory, and communication properties of a distributed system would significantly aid practitioners in more quickly determining optimal settings, without costly trial-and-error launches.

We note that this analysis and these trade-offs are for the experts-as-a-layer approach (Eigen et al., 2013). In contrast, the Branch-Train-Merge (BTM) approach to experts (Li et al., 2022) is "embarrassingly parallel" in that each expert is a fully formed language model, trainable independently and asynchronously, without the expensive communication costs. Therefore, this approach and others following suit have completely different scaling characteristics with the number of experts.

4 Routing Algorithms

The routing algorithm, a key feature of all sparse expert architectures, determines where to send examples. This area has been studied extensively, including counter-intuitive methods that use fixed, non-learned routing patterns (Roller et al., 2021). Typically the naive routing decision is non-differentiable because it makes a discrete decision of which experts to select. The problem of expert selection can be recast as a Bandit problem, and several works have used reinforcement learning to learn the selection (Bengio et al., 2016; Rosenbaum et al., 2017; 2019; Clark et al., 2022). Shazeer et al. (2017) proposed a differentiable heuristic that side-stepped reinforcement learning challenges. Rather than routing the example to the chosen expert and proceeding, the output of the expert computation is weighted by the probability of choosing it (Equation 2). This produces a gradient to the router since the probability of choosing the expert is differentiable. In contrast to a Bandit approach, where only one expert might be chosen, Shazeer et al. (2017) conjectured that it was necessary to route to the top-k experts with k > 1. The intuition was that two or more experts on the same example allowed the network to compare and to optimize the relative performance. Lepikhin et al. (2020) later adapted the same routing algorithm to the Transformer architecture, yielding state-of-the-art machine translation results. However, Fedus et al. (2021) demonstrated that top-1 routing can achieve competitive results, corroborated by later work (Clark et al., 2022).

4.1 Routing Taxonomy

One way to understand many routing algorithms is to analyze the matrix of routing scores (i.e. the router scores from Figure 2). As a demonstrative example, we use a natural language sparse expert model. Figure 6 shows the un-normalized router scores computed for three tokens (columns) routed across five experts (rows). Each value is produced by the dot product of the token embedding and the expert embedding (from the router weights).

Figure 6: Three common classes of routing algorithms.
We illustrate three methods with an Experts × Tokens activation matrix obtained through the process explained in Figure 2. Left: "Choose Top-k" along the Experts axis includes the standard top-k routing algorithm (Shazeer et al., 2017; Lepikhin et al., 2020). Center: "Choose Top-k" along the Tokens axis covers routing algorithms such as Zhou et al. (2022). Right: "Globally Decide Expert Assignment" covers routing algorithms such as the BASE layer (Lewis et al., 2021; Clark et al., 2022).

Once the scores are computed, there are a variety of ways to determine which experts should get which tokens. We highlight three common categories: 1) each token chooses the top-k experts, 2) each expert chooses the top-k tokens, and 3) globally determine what tokens should go to each expert (and not use a greedy approach). This taxonomy also suggests yet-to-be-explored routing algorithms. One example is an algorithm that benefits by looking both horizontally and vertically at the router scores, but without incurring the cost of looking globally. A token could first choose what experts it wants to go to, then, based on this information, each expert could choose what tokens it wanted.

Each token chooses the top-k experts. This class of routing algorithms has each token choose the top-k experts to be sent to. This is the original top-2 routing formulation proposed in Shazeer et al. (2017) and used in Lepikhin et al. (2020), which achieves state-of-the-art machine translation results. Fedus et al. (2021); Rajbhandari et al. (2022) used top-1 routing with success. Clark et al. (2022) proposed a reinforcement learning routing algorithm that used top-1 routing. However, instead of scaling the output of the expert computation by the router probability, they use REINFORCE (Williams, 1992) with the reward being the negative cross entropy of the predicted token. Figure 7 depicts the top-1, top-2 and reinforcement learning routing algorithms. Yang et al. (2021) introduced an extension of top-1 routing by using expert prototyping to split experts into different groups and then applied k top-1 routing procedures. Nie et al. (2021) begins routing as a soft gating function where all experts are trained (i.e. a dense model) and anneals down to the standard top-1 routing algorithm. This approach (DTS-Gate) improves over Switch Transformer (Fedus et al., 2021) on OpenWebText pre-training. Dua et al. (2021) proposes a similar approach of first training a dense model where each input goes to every expert and then adapting it to be sparse. Hazimeh et al. (2021) proposes DSelect-k, which is a smooth version of the top-k routing algorithm that improves over standard top-k routing. Rajbhandari et al. (2022) designs PR-MoE, which uses top-2 routing, but each token is sent to a shared dense layer and a single expert of its choosing (instead of two experts).
Figure 7: Visualization of six different routing algorithms. Each diagram is of a Transformer sparse expert model with four experts (feed-forward networks) routing two tokens: "The" and "Dog".

Riquelme et al. (2021) introduces an improvement for top-k routing named Batch Prioritized Routing (BPR) for ViT image classification models. MoE models use fixed batch sizes per expert, which can cause tokens to overflow if there is not enough capacity. If a token overflows, then no computation will be applied to that token at that given expert. In top-k routing, priority for which tokens not to drop at an overflowed expert is given to the tokens sent earlier in a sentence/batch. BPR instead prioritizes inputs that have higher routing scores. This is relevant for ViT models as there is no autoregressive nature to the inputs, so all inputs can see each other. In language there is typically a left-to-right ordering of the inputs, which could in theory allow the model to cheat during training. Zoph et al. (2022) found BPR routing to be helpful for MoE language models. Kim et al. (2021) proposes randomizing the prioritization of the tokens in the sequences to make sure the routing is not biased towards early tokens in the sentence.

Static routing. Most routing algorithms dynamically learn the routing decisions while training, but routing can also be statically determined before training begins. Dynamic routing algorithms typically operate on the internal input representations within the network, so the routing decisions take into account the current token and previous inputs to the model (usually through the self-attention layer). Most routing algorithms are dynamic, but a notable example of a static routing algorithm is Hash Layers from Roller et al. (2021). This work shows that random fixed routing by hashing the input token led to competitive performance with learned routing. Load balancing is achieved by choosing hash functions before training that balance batches of tokens. A depiction of Hash Layers can be found in Figure 7.

Each expert chooses the top-k tokens. Instead of each token choosing which experts to be sent to, Zhou et al. (2022) flips this and has each expert choose what tokens it wants routed to it. This alleviates the need for auxiliary load balancing losses to be added during training or for linear assignment algorithms. Now each expert will always have the same number of tokens, although some tokens might not get sent to any expert or some tokens might get sent to all of them. Empirically this algorithm performs well and has an adaptive computation interpretation where the model can implicitly apply more computation to certain tokens.
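Below is a minimal NumPy sketch of this expert-choice selection, in the spirit of Zhou et al. (2022) but heavily simplified (the gating weights and capacity bookkeeping of the actual method are omitted); a token-choice variant is included to contrast the axis along which the top-k is taken.

```python
import numpy as np

def expert_choice(scores, tokens_per_expert):
    """Expert-choice selection over a [num_experts, num_tokens] router-score matrix.

    Each expert independently takes its `tokens_per_expert` highest-scoring tokens,
    so per-expert load is balanced by construction; a given token may be selected
    by several experts, or by none at all.
    """
    return np.argsort(-scores, axis=1)[:, :tokens_per_expert]   # [num_experts, tokens_per_expert]

def token_choice(scores, k):
    """Token-choice selection: each token takes its top-k experts (the standard scheme)."""
    return np.argsort(-scores, axis=0)[:k, :].T                 # [num_tokens, k]

scores = np.random.default_rng(0).normal(size=(5, 3))           # 5 experts, 3 tokens (as in Figure 2)
print(expert_choice(scores, tokens_per_expert=1))
print(token_choice(scores, k=2))
```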
Globally determine what tokens should go to each expert. BASE layers (Lewis et al., 2021) treats token routing as a linear assignment problem. It aims to route a fixed number of tokens to each expert and maximize the scores from the routing matrix. Since the tokens per processor are highly correlated as they come from the same sentences, tokens are randomly shuffled around before locally solving the linear assignment problem on each device. This shuffling introduces two additional communication primitives (all2all) in both the forward and backward pass. Clark et al. (2022) proposes their own variant of the BASE layer (S-BASE) that uses an optimal transport formulation.

Other routing algorithms. Some routing algorithms do not neatly fall into the above three categories. Zuo et al. (2021) introduced THOR, an algorithm which randomly selects two experts for each input during training and inference, and found improvements of 2 BLEU points over standard MoE models. Gururangan et al. (2021) proposes DEMix, which explicitly has different experts for different pre-training domains (e.g. law, medical, etc.). Experts can then be selected by doing domain matching on the inputs. Fan et al. (2021) uses explicit language-specific sublayers where input tokens can be deterministically routed based on their language. This avoids needing dynamic routing algorithms. Ma et al. (2018) introduces a multi-gate routing algorithm where each task gets its own unique gating function.

4.2 Load Balancing

Most routing algorithms handle load balancing by adding an auxiliary loss during training to encourage an equal number of tokens being sent to the different experts (Shazeer et al., 2017). Some routing algorithms handle load balancing through their design: BASE Layers (Lewis et al., 2021) solves a linear assignment problem that enforces an equal number of tokens going to each expert as part of the problem statement. S-BASE from Clark et al. (2022) follows a similar protocol, but solves the assignment problem using optimal transport. Nie et al. (2021) starts by training a Mixture-of-Experts model where all tokens get sent to each expert, but over time adapts the network to do top-1 routing. This algorithm doesn't need load balancing as the network naturally learns to specialize the expert representations over training.
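To make the auxiliary-loss approach concrete, here is a sketch of a Switch-Transformer-style load-balancing loss (Fedus et al., 2021): the number of experts times the dot product between the fraction of tokens dispatched to each expert and the mean router probability per expert. Exact formulations and coefficients differ across papers, so this is illustrative rather than a reference implementation.

```python
import numpy as np

def load_balancing_loss(router_probs, expert_indices, num_experts):
    """Switch-style auxiliary load-balancing loss (sketch).

    router_probs:   [num_tokens, num_experts] softmax router probabilities p(x)
    expert_indices: [num_tokens] expert chosen for each token (top-1 routing)

    Returns N * sum_i f_i * P_i, where f_i is the fraction of tokens routed to
    expert i and P_i is the mean router probability for expert i. The value is
    minimized under a uniform distribution, which encourages balanced load.
    """
    f = np.bincount(expert_indices, minlength=num_experts) / len(expert_indices)
    P = router_probs.mean(axis=0)
    return num_experts * np.dot(f, P)

# A perfectly balanced, uniform router over 4 experts gives the minimum value 1.0;
# in training, this term is added to the loss with a small coefficient.
probs = np.full((8, 4), 0.25)
print(load_balancing_loss(probs, np.array([0, 1, 2, 3, 0, 1, 2, 3]), num_experts=4))
```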
5 Sparse Expert Models Across Domains

Sparse expert and MoE models were introduced and popularized in natural language processing (NLP). This domain was a natural fit for large models, which benefit from the easy availability of trillions of tokens and the strong self-supervised algorithms of next word prediction and masked language modeling. However, the impact of these models is quickly spreading to other domains including computer vision, speech recognition and multi-modal applications. The spread of techniques from NLP to other domains has been accelerated because the Transformer has been rapidly adopted in other domains and modalities. Some examples are image classification (Dosovitskiy et al., 2020), object detection (Carion et al., 2020), recommendation systems (Chen et al., 2019), and speech recognition (Dong et al., 2018; Nakatani, 2019; Gulati et al., 2020).

Across various domains, the sparse architectures and algorithms stay roughly the same, but what is routed to the experts is different. Table 1 shows the different sparse layer inputs for a variety of different domains.

Domain     | Input Representation
NLP        | Word, subword, or sentence
Vision     | Image patch
Speech     | Spectrogram
Multimodal | Word or image patch

Table 1: Inputs to sparse models in different domains. The input is used to determine what expert to route to and is what the MoE layer will apply compute to.

5.1 Natural Language Processing

Initially, Shazeer et al. (2017) introduced the Mixture-of-Experts layer for LSTM language modeling and machine translation. The layers were inserted between the standard layers in the LSTM model. Follow-up works are now based around Transformers, and the expert layers typically replace the dense layers. Lepikhin et al. (2020) first introduced the MoE layer into Transformers and studied it in the context of machine translation. They achieved state-of-the-art translation results across 100 different languages when scaling up to 2048 experts per expert layer. Fedus et al. (2021) later created a sparse 1.6T parameter language model that achieved state-of-the-art pre-training quality. They also studied using sparse layers to produce the q/k/v activations in the self-attention layers, but found this technique to be more unstable. Lee-Thorp and Ainslie (2022) introduces the Fast Sparse Mixer, an encoder-only model that achieves 89% training and 98% inference speedups over BERT (Devlin et al., 2018).

Recently, there has been a flurry of MoE research on a variety of different topics in the NLP domain. As an example, prior MoE architectures in the NLP domain acted on the word or byte-pair level. Kudugunta et al. (2021) instead had an MoE architecture route at the task or sentence level, which allows for more efficient inference and serving. This was studied in the context of machine translation where sentences would be routed based on what language they were translating into.

New results have been able to push state-of-the-art on few-shot inference and fine-tuning benchmarks. Du et al. (2021) trained an MoE decoder-only language model and achieved state-of-the-art few-shot results, while requiring only 1/3 of the compute needed to train GPT-3. Zoph et al. (2022) introduced ST-MoE, a sparse encoder-decoder model that achieves state-of-the-art on a large set of reasoning and generation tasks including SuperGLUE, ARC Easy/Challenge, XSum, CNN-DM, Web-QA, ANLI, and Winogrande. ST-MoE outperforms PaLM-540B (Chowdhery et al., 2022) when fine-tuning on SuperGLUE, while using roughly 20× less pre-training FLOPs and 40× less inference FLOPs.

5.2 Computer Vision

Due to the universality of Transformers (e.g. ViT (Dosovitskiy et al., 2020)), applying improvements to the MoE architecture across domains has been fruitful. Riquelme et al. (2021) created a vision MoE model by adding MoE layers into the ViT architecture. Their model, V-MoE, was applied to image classification and was able to use just half the amount of inference compute while matching the performance of prior state-of-the-art architectures. Lou et al. (2021) introduces a sparse MoE MLP model for image classification based on the MLP-Mixer architecture (Tolstikhin et al., 2021). Their MoE variant achieved better image classification performance on ImageNet and CIFAR compared to its dense counterpart. Wu et al. (2022) improved the efficiency of training MoE models through their proposed Residual Mixture-of-Experts layer. This architecture achieves 30% reduced training cost and comparable quality to standard MoE models on both segmentation and object detection. Hwang et al.
(2022) implements an efficient framework and adaptive parallelism strategy for MoE layers. To benchmark their system, they add MoE layers to the Swin Transformer V2 architecture (Liu et al., 2022) for image classification and object detection. Their MoE variant achieves 1.5×-2× speedups in both training and inference over the previous MoE implementation. Aljundi et al. (2017) uses expert models in a continual learning setup where they add new experts over time and demonstrate improvements on image classification and video prediction. Caccia et al. (2021) dynamically increases the expert count over the course of training for image classification models on CIFAR and MNIST. Ramachandran and Le (2018) studies how depth and architectural diversity impact sparse expert model performance and achieves gains on image recognition. Kirsch et al. (2018) develops an end-to-end algorithm for doing conditional computation based on the input data that achieves strong image classification performance.

5.3 Speech Recognition

SpeechMoE (You et al., 2021) uses MoE Transformer models for speech recognition and achieves strong character error rate improvements across four datasets. They use novel auxiliary losses to promote sparsity and introduce a new routing architecture. SpeechMoE2 (You et al., 2022) further improves on SpeechMoE's results with a new routing algorithm that adds in new auxiliary information for making routing decisions. Kumatani et al. (2021) yields improvements for multi-lingual speech recognition by using MoE layers in two different types of Transformer speech architectures: sequence-to-sequence and transducers.

5.4 Multimodal and Multi-Task

Mustafa et al. (2022) does multimodal learning by training an MoE model (LIMoE) that takes as input both images and text and learns using a contrastive loss similar to CLIP (Radford et al., 2021). The MoE layer can route both the image patches and the word tokens to the available experts. The model outperforms CLIP when using a comparable training strategy, and when scaled further matches state-of-the-art methods.

6 When to Use a Sparse Versus Dense Model

A common question is: if you are given a fixed compute or FLOP budget (e.g. 100 GPUs for 20 hours), what type of model should you train to achieve the best performance? Many prior works show that sparsity is better than a dense model for this type of setup (Shazeer et al., 2017; Lepikhin et al., 2020; Fedus et al., 2021; Du et al., 2021; Artetxe et al., 2021; Lewis et al., 2021). Given all the strong state-of-the-art results using sparse models, why should you ever not use a sparse model over a dense model?

Sparse models are a generalization of dense models; a sparse model with a single expert is roughly a dense model. Fundamentally, sparse models allow one to vastly increase the number of parameters in a model by increasing the number of experts, while keeping the FLOPs per example approximately constant. This can be good or bad depending on the setup and how the model is going to be used later.

At a high level, sparsity is good when you have many accelerators (e.g. GPU/TPU) to host all the additional parameters that come with using sparsity. Typically models are trained using data parallelism, where different machines will get different slices of the training/inference data. The machines used for operating on the different slices of data can now be used to host many more model parameters.
Therefore, sparse models are a good fit when training with data parallelism and/or serving with high throughput: training/serving on many machines which can host all of the parameters.

Using sparsity requires careful consideration of how the model will be used downstream too. If there are lots of machines to pre-train a model, but far fewer for fine-tuning or serving, then the amount of sparsity (e.g. the number of experts) should be tailored to fit the amount of memory available in the downstream use cases. This is often a practical design consideration used in the literature.

On a per-parameter basis, sparse models will always look comparatively worse than dense models. Assuming that all parameters are kept in the accelerators' memory, this is a similar requirement to seeking the best model that can fit onto a certain hardware size (e.g. 4 GPUs), where again a sparse model will be a worse option than a dense one. As mentioned above, sparse models are a great fit when you have the ability to either train or serve on many machines in parallel in order to host the additional model parameters from the experts.

All hope is not lost for sparse models in memory-restrictive settings though. Fedus et al. (2021) shows that sparse models work well with as few as two experts, which requires limited additional memory. New research also allows for overcoming GPU/TPU memory shortages by dynamically swapping model memory between the CPU and GPU (see Section 2.2 for more details). Other approaches for reducing the memory footprint of sparse models are discussed in Section 7.3.

7 Sparse Model Training Improvements

Sparse models often have different dynamics than dense models and benefit from different training and fine-tuning methodologies.

7.1 Instability

Sparse models have frequently been reported to be more unstable, meaning the loss diverges and increases (Lepikhin et al., 2020; Fedus et al., 2021; Zoph et al., 2022; Mustafa et al., 2022). Instabilities also appear more often at larger model scales. Lepikhin et al. (2020) encountered training instability using bfloat16 activations with a 1 trillion parameter model. Fedus et al. (2021) encountered instabilities in their highest-compute Switch-XXL model. Zoph et al. (2022) encountered instabilities in their largest models, especially in the multi-lingual setting. Mustafa et al. (2022) observed increased instability when doing multi-modal training on both images and text.

Much research has been done to improve the training dynamics of sparse models. Lepikhin et al. (2020) noted that the largest model instabilities can be fixed by training the models using higher precision (float32), but this comes at the cost of more memory usage and slower training. Fedus et al. (2021) recommended using a lower weight initialization scale and casting only a specific subset of the routing network to higher precision for better model stability/training. Du et al. (2021) skips batches of data that have any NaNs or Infs in the gradients and also restarts the model from an earlier checkpoint when any training divergences occur. Artetxe et al. (2021) propose a smarter initialization of the expert layer to account for the reduced batch size of the expert weights. Since each expert will have a batch size of B/E,³ the authors propose scaling the gradients of the expert layer by 1/√E. Zoph et al. (2022) introduced the router z-loss to improve both the model instability and also quality. This auxiliary loss aims to reduce floating point roundoff errors by encouraging the logits going into the router function to remain small over the course of training. Mustafa et al. (2022) extensively studied many techniques to fix model instability, and used a combination of different approaches including the router z-loss and two novel entropy losses.

³ B is the number of tokens in the batch and E is the number of experts. This assumes top-1 routing, but similar analysis holds for the other routing variants.
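For concreteness, a sketch of the router z-loss as described here: the squared log-sum-exp of the router logits, averaged over tokens, added to the training loss with a small coefficient. The coefficient shown is illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def router_z_loss(router_logits):
    """Router z-loss (Zoph et al., 2022), sketched: mean squared log-sum-exp of the
    router logits over the batch. Keeping logits small reduces floating point
    roundoff error in the subsequent router softmax and gating computation."""
    return np.mean(logsumexp(router_logits, axis=-1) ** 2)

# router_logits: [num_tokens, num_experts]; the coefficient is a hyperparameter (e.g. 1e-3).
logits = np.random.default_rng(0).normal(size=(16, 8))
total_aux_loss = 1e-3 * router_z_loss(logits)
```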
7.2 Transfer to New Distributions

Several research papers, especially at larger scales, note that MoE models transferred to new domains (as in fine-tuning) lag their dense counterparts. Fedus et al. (2021); Narang et al. (2021) compared the pre-training perplexity versus fine-tuning performance for dense and sparse models. They noticed that for a given pre-training perplexity, sparse models fine-tuned worse on reasoning tasks, but better on knowledge-heavy tasks. In addition to worse out-of-domain language modeling performance, Artetxe et al. (2021) observed worse fine-tuning compared to dense models on multiple tasks including HellaSwag, PIQA and Winogrande.

Several different ways have been proposed to help address the fine-tuning issues. One is to scale models by adding more FLOPs instead of more sparsity (e.g. fewer experts, but make them larger). Fedus et al. (2021) trained a 1.6T parameter model with 2048 experts, but it only had as many FLOPs as a 2B dense model. Conversely, the model with the best fine-tuning performance had only 128 experts, but the FLOPs of an 11B dense model. Trading off less sparsity for more FLOPs when scaling a model is a simple way to ensure better fine-tuning performance. Zoph et al. (2022) noticed that the optimal fine-tuning hyperparameters (e.g. learning rate and batch size) can be dramatically different for dense and sparse models. Using the best hyperparameters for dense models on sparse models can mask any of the sparsity pre-training improvements – therefore, an independent hyperparameter study is beneficial.

7.3 Inference

By design, sparse expert models have many more parameters than their dense counterparts. While the computation done is still relatively low, the memory footprint can be a burden. Therefore, some research has focused on reducing the number of parameters needed at inference time to ease serving requirements. Kudugunta et al. (2021) routes at the task level instead of the word or token level for machine translation. This allows for more efficient inference because only the subset of weights for the needed tasks is required. Kim et al. (2021) prunes away experts at inference to reduce the memory footprint of the model. Two different methods are used for pruning: randomly selecting a subset of the experts and choosing the experts with the highest utilization at inference time. Fedus et al. (2021) distill large sparse models into smaller dense models for language modeling and fine-tuning. Rajbhandari et al. (2022) studies distillation of sparse models for language modeling by reducing the depth in the expert layers of the network. Rajbhandari et al. (2022) also implements an optimized version of MoE in the DeepSpeed framework that results in 7.3× faster inference latency than existing frameworks.

8 Interpretability

Sparse expert models more naturally lend themselves to interpretability studies because each input is processed by an identifiable, discrete subset of the model weights (i.e. the chosen experts).
Therefore, instead of the daunting task of interpreting possibly trillions of floating point numbers, one can instead read off a small discrete set of integers corresponding to which expert the input was sent. Shazeer et al. (2017) conducted preliminary studies into expert specialization for the encoder of their 2048-expert MoE layer on the WMT '14 EnFr machine translation task. They identified three experts: one specializing in words around innovation, a second that processed the article "a", and a third that was routed synonyms of speed. Later, more extensive analyses were conducted by Lewis et al. (2021) on Transformer-based architectures. Lewis et al. (2021) conducted a study where they tracked the most frequent prior input token when the expert was selected. This revealed specialization in quantities, numbers, possessives, subword fragments, and clusters of related verbs, nouns and adjectives, with selected results presented in Table 2.

Expert | Top-5 preceding tokens
5      | year, years, billion, millions, tonnes
9      | electronic, local, public, national, outdoor
34     | to, will, should, it, may
42     | two, 50, 1, 80, 000
62     | work, started, involved, working, launched
72     | is, was, be, been, were
74     | going, go, come, back, return
101    | B, T, W, H, k

Table 2: Expert specialization based on preceding context in BASE Layers. We reproduce a portion of a table of Lewis et al. (2021), presenting the most frequent preceding top-five tokens for the selected experts. This example shows experts specializing in punctuation, conjunctions & articles, verbs, visual descriptions, proper names, counting & numbers.

Zoph et al. (2022) trained an encoder-decoder Transformer and found similar patterns in the encoder, including experts that specialize in a shallow way, such as over articles (e.g. "a", "the"). Table 3 reproduces a portion of the observed specializations of Zoph et al. (2022). Those studies further found expert specialization in punctuation, numbers, proper names, verbs, colors and special mask tokens used for the pre-training objective.

Expert specialization | Expert position | Routed tokens
Sentinel tokens | Layer 1 | been floral to ...
Punctuation | Layers 2, 6 | , , , , , , , , , - , , , , , ). ) , , , , , : . : , & , & & ? & - , , ? , , , .
Conjunctions and articles | Layers 3, 6 | The the the the the the the the the The the the a and and and and and and and or and a and .
Verbs | Layer 1 | died falling identified fell closed left posted lost felt left said read miss place struggling falling signed died
Visual descriptions (color, spatial position) | Layer 0 | her over her know dark upper dark outer center upper blue inner yellow raw mama bright bright over open your dark blue
Proper names | Layer 1 | A Mart Gr Mart Kent Med Cor Tri Ca Mart R Mart Lorraine Colin Ken Sam Ken Gr Angel A
Counting and numbers (written and numerical forms) | Layer 1 | after 37 19. 6. 27 I I Seven 25 4, 54 I two dead we Some 2012 who we few lower each

Table 3: Encoder expert specialization in ST-MoE. We reproduce a table of Zoph et al. (2022) demonstrating expert specialization in punctuation, conjunctions & articles, verbs, visual descriptions, proper names, counting & numbers.

But a deeper analysis of the full encoder-decoder ST-MoE architecture found clearer evidence of specialization in the encoder, rather than the decoder. This warrants further study into the value and positioning of expert layers. A lack of evident specialization may signal either difficult-to-discern patterns or no useful patterns.
Interpretability of sparse expert models has not been limited to text. One example is LIMoE (Mustafa et al., 2022), a multi-modal model that was observed to learn experts that specialize in textual and visual data, including patches of textures, plants, eyes, and words (Figure 8).

Figure 8: Visual expert specialization in LIMoE. We reproduce a figure from Mustafa et al. (2022) which finds experts specializing in patches of textures (solid and striped), natural objects (plants, hands, eyes), and man-made objects (wheels, door handles, words).

As in the text-based models, the complexity of the specialization varies significantly. For instance, text-based experts were found to span simple objectives like processing the article "a" up to more complicated concepts like past-tense verbs. Similarly, in multi-modal models, the sophistication of expert specialization ranges from concepts as simple as basic textures up to high-level objects such as wheels or door handles.

Finally, we highlight one significant limitation of these interpretability approaches. They consider the tokens or patches arriving at each expert in a narrow way. Specifically, the initial embedding of a word or of a patch incorporates contextual information from surrounding data (Transformers do this through self-attention or encoder-decoder attention). Therefore, more nuanced specialization may be missed by these heuristic techniques. More careful and thorough interpretability work will be needed to better understand sparse expert models. The release of sparse expert model checkpoints by Artetxe et al. (2021) and Fedus et al. (2021) allows broader groups to analyze and explain these dynamics.

9 Future Directions and Conclusions

Even though sparse expert models and Mixture-of-Experts date back to at least the early nineties, many questions remain. We conclude our review with conjectures about promising areas of future work, specifically highlighting the intersection with two recent developments (adaptive computation and retrieval methods), and our parting thoughts.

Adaptive Computation. Adaptive computation is the idea that different inputs to a machine learning system may use differing amounts of computation (i.e. the amount or the type of compute is adapted on-the-fly). Sparse models build on the mirrored idea: each input uses the same amount of computation, but potentially with different parameters. However, these techniques are not mutually exclusive; some routing algorithms (Section 4) allow for adaptive computation by sending a token to a variable number of experts (Riquelme et al., 2021; Zhou et al., 2022). Still, future models may benefit from combining other adaptive computation techniques – as an example, in addition to choosing which expert, a network might choose the number of layers to use as well (Schuster et al., 2022). Heterogeneous expert layers are also a natural fit for adaptive computation. Most sparse models use experts of the same type and size for simplicity and efficiency on modern hardware. But by allowing experts to differ in size (e.g. in depth or width), the routing decision will then result in differing amounts of computation. New software systems, such as Pathways (Dean, 2021), will help facilitate efficient implementations of these heterogeneous architectures and algorithms on modern hardware.

Retrieval Methods.
Retrieval mechanisms effectively expand the capacity of models by allowing them to dynamically access information beyond the current context or what is stored in their parameters (Khandelwal et al., 2019; Guu et al., 2020; Borgeaud et al., 2022). Sparse expert models and retrieval models share an overlapping goal: increase the capacity of the model to better store, retrieve, and apply knowledge. Sparse expert models do this parametrically (i.e. experts contain more learnable parameters), while retrieval-based systems embed information that can be retrieved dynamically and non-parametrically (i.e. through a nearest-neighbor lookup over a corpus). Studying the trade-offs between, and combinations of, the two approaches is likely to prove a useful future direction.

Conclusions. Sparsity reduces training and inference costs, resulting in massive models with better accuracy than their dense counterparts. But many open questions remain. For instance, we still poorly understand how the optimal number and size of experts depends on the task (e.g. should one use a few large experts or many small experts for translation?). As many works have pointed out, achieving strong out-of-domain generalization is less straightforward, and better explanations are needed. Further, most sparse expert models have relatively low architectural diversity, with sparse layers interspersed at regular intervals; future models may benefit from less standardized structure and heterogeneous expert architectures. Additionally, the appropriate granularity of sparsity still must be determined: most works have focused on experts replacing components such as feed-forward network layers, but benefits of more fully modular, independent experts have also been found (Gururangan et al., 2021; Li et al., 2022). The field is still uncovering properties of sparse expert models, including much improved calibration (Srivastava et al., 2022); others remain unknown, including their dynamics under asynchronous training (Recht et al., 2011) and their memorization abilities (Carlini et al., 2020). In short, these models pose a myriad of challenging mathematical, engineering, and research problems, but the solutions found so far have yielded significant gains, and we believe more improvements lie ahead.

ACKNOWLEDGEMENTS

We'd like to thank the BIG-Bench core authors, Mike Lewis, Aidan Clark, Diego de Las Casas, Nan Du, and Carlos Riquelme for permission to reproduce figures and tables here. We would also like to thank Daniel S. Park, Nan Du, Jason Wei, James Lee-Thorp, and Yanqi Zhou for feedback and comments on our drafts.

REFERENCES

Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3366–3375, 2017.
Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Mona Diab, Zornitsa Kozareva, and Ves Stoyanov. Efficient large scale language modeling with mixtures of experts, 2021.
Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models, 2016.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, pages 2206–2240. PMLR, 2022.
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
Lucas Caccia, Jing Xu, Myle Ott, Marc'Aurelio Ranzato, and Ludovic Denoyer. On anytime learning at macroscale. arXiv preprint arXiv:2106.09563, 2021.
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models, 2020.
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.
Qiwei Chen, Huan Zhao, Wei Li, Pipei Huang, and Wenwu Ou. Behavior sequence transformer for e-commerce recommendation in alibaba. In Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data, pages 1–4, 2019.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
Aidan Clark, Diego de las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified scaling laws for routed language models. arXiv preprint arXiv:2202.01169, 2022.
Jeff Dean. Introducing pathways: A next-generation ai architecture. Google AI Blog, 2021.
Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840, 2019.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Linhao Dong, Shuang Xu, and Bo Xu. Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5884–5888. IEEE, 2018.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathy Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, and Claire Cui.
Glam: Efficient scaling of language models with mixture-of-experts, 2021.
Dheeru Dua, Shruti Bhosale, Vedanuj Goswami, James Cross, Mike Lewis, and Angela Fan. Tricks for training sparse translation models. arXiv preprint arXiv:2110.08246, 2021.
David Eigen, Marc'Aurelio Ranzato, and Ilya Sutskever. Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314, 2013.
Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, pages 2943–2952. PMLR, 2020.
Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, et al. Beyond english-centric multilingual machine translation. Journal of Machine Learning Research, 22(107):1–48, 2021.
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961, 2021.
Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.
Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020.
Suchin Gururangan, Mike Lewis, Ari Holtzman, Noah A. Smith, and Luke Zettlemoyer. Demix layers: Disentangling domains for modular language modeling, 2021.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International Conference on Machine Learning, pages 3929–3938. PMLR, 2020.
Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. Pipedream: Fast and efficient pipeline parallel dnn training. arXiv preprint arXiv:1806.03377, 2018.
Hussein Hazimeh, Zhe Zhao, Aakanksha Chowdhery, Maheswaran Sathiamoorthy, Yihua Chen, Rahul Mazumder, Lichan Hong, and Ed H. Chi. Dselect-k: Differentiable selection in the mixture of experts with applications to multi-task learning, 2021.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in neural information processing systems, pages 103–112, 2019.
Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, et al. Tutel: Adaptive mixture-of-experts at scale. arXiv preprint arXiv:2206.03382, 2022.
Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.
Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Lukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, and Jonni Kanerva. Sparse is enough in scaling transformers. Advances in Neural Information Processing Systems, 34:9895–9907, 2021.
Michael I Jordan and Robert A Jacobs.
Hierarchical mixtures of experts and the em algorithm. Neural computation, 6(2):181–214, 1994.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172, 2019.
Young Jin Kim, Ammar Ahmad Awan, Alexandre Muzio, Andres Felipe Cruz Salinas, Liyang Lu, Amr Hendy, Samyam Rajbhandari, Yuxiong He, and Hany Hassan Awadalla. Scalable and efficient moe training for multitask multilingual models, 2021.
Louis Kirsch, Julius Kunze, and David Barber. Modular networks: Learning to decompose neural computation. Advances in neural information processing systems, 31, 2018.
Ferdinand Kossmann, Zhihao Jia, and Alex Aiken. Optimizing mixture of experts using dynamic recompilations. arXiv preprint arXiv:2205.01848, 2022.
Sneha Kudugunta, Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry Lepikhin, Minh-Thang Luong, and Orhan Firat. Beyond distillation: Task-level mixture-of-experts for efficient inference. arXiv preprint arXiv:2110.03742, 2021.
Kenichi Kumatani, Robert Gmyr, Felipe Cruz Salinas, Linquan Liu, Wei Zuo, Devang Patel, Eric Sun, and Yu Shi. Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition. arXiv preprint arXiv:2112.05820, 2021.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
James Lee-Thorp and Joshua Ainslie. Sparse mixers: Combining moe and mixing to build a more efficient bert. arXiv preprint arXiv:2205.12399, 2022.
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.
Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. Base layers: Simplifying training of large, sparse models. arXiv preprint arXiv:2103.16716, 2021.
Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A. Smith, and Luke Zettlemoyer. Branch-train-merge: Embarrassingly parallel training of expert language models, 2022. URL https://arxiv.org/abs/2208.03306.
Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12009–12019, 2022.
Yuxuan Lou, Fuzhao Xue, Zangwei Zheng, and Yang You. Sparse-mlp: A fully-mlp architecture with conditional computation. arXiv preprint arXiv:2109.02008, 2021.
Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1930–1939, 2018.
Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, and Neil Houlsby. Multimodal contrastive learning with limoe: the language-image mixture of experts. arXiv preprint arXiv:2206.02770, 2022.
Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht.
Obtaining well calibrated probabilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
Tomohiro Nakatani. Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration. In Proc. Interspeech, 2019.
Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, et al. Do transformer modifications transfer across implementations and applications? arXiv preprint arXiv:2102.11972, 2021.
Xiaonan Nie, Shijie Cao, Xupeng Miao, Lingxiao Ma, Jilong Xue, Youshan Miao, Zichao Yang, Zhi Yang, and Bin Cui. Dense-to-sparse gate for mixture-of-experts. arXiv preprint arXiv:2112.14397, 2021.
David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021.
Joan Puigcerver, Carlos Riquelme, Basil Mustafa, Cedric Renggli, André Susano Pinto, Sylvain Gelly, Daniel Keysers, and Neil Houlsby. Scalable transfer learning with expert models. arXiv preprint arXiv:2009.13239, 2020.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d'Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. Scaling language models: Methods, analysis & insights from training gopher, 2021.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. arXiv preprint arXiv:1910.02054, 2019.
Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. arXiv preprint arXiv:2201.05596, 2022.
Prajit Ramachandran and Quoc V Le. Diversity and depth in per-example routing models. In International Conference on Learning Representations, 2018.
Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. Advances in neural information processing systems, 24, 2011.
Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. arXiv preprint arXiv:2106.05974, 2021.
Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulian, Xavier Garcia, Jianmo Ni, Andrew Chen, Kathleen Kenealy, Jonathan H. Clark, Stephan Lee, Dan Garrette, James Lee-Thorp, Colin Raffel, Noam Shazeer, Marvin Ritter, Maarten Bosma, Alexandre Passos, Jeremy Maitin-Shepard, Noah Fiedel, Mark Omernick, Brennan Saeta, Ryan Sepassi, Alexander Spiridonov, Joshua Newlan, and Andrea Gesmundo. Scaling up models and data with t5x and seqio. arXiv preprint arXiv:2203.17189, 2022. URL https://arxiv.org/abs/2203.17189.
Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, and Jason Weston. Hash layers for large sparse models. arXiv preprint arXiv:2106.04426, 2021.
Clemens Rosenbaum, Tim Klinger, and Matthew Riemer. Routing networks: Adaptive selection of non-linear functions for multi-task learning. arXiv preprint arXiv:1711.01239, 2017.
Clemens Rosenbaum, Ignacio Cases, Matthew Riemer, and Tim Klinger. Routing networks and the challenges of modular and compositional computation. arXiv preprint arXiv:1904.12774, 2019.
Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Q Tran, Yi Tay, and Donald Metzler. Confident adaptive language modeling. arXiv preprint arXiv:2207.07061, 2022.
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, et al. Mesh-tensorflow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems, pages 10414–10423, 2018.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM Computing Surveys (CSUR), 2020.
Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision.
Advances in Neural Information Processing Systems, 34:24261–24272, 2021.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pages 3266–3280, 2019.
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992.
Lemeng Wu, Mengchen Liu, Yinpeng Chen, Dongdong Chen, Xiyang Dai, and Lu Yuan. Residual mixture of experts. arXiv preprint arXiv:2204.09636, 2022.
Canwen Xu and Julian McAuley. A survey on dynamic neural networks for natural language processing. arXiv preprint arXiv:2202.07101, 2022.
An Yang, Junyang Lin, Rui Men, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Jiamang Wang, Yong Li, Di Zhang, Wei Lin, Lin Qu, Jingren Zhou, and Hongxia Yang. M6-t: Exploring sparse expert models and beyond, 2021.
Zhao You, Shulin Feng, Dan Su, and Dong Yu. Speechmoe: Scaling to large acoustic models with dynamic routing mixture of experts. arXiv preprint arXiv:2105.03036, 2021.
Zhao You, Shulin Feng, Dan Su, and Dong Yu. Speechmoe2: Mixture-of-experts model with improved routing. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7217–7221. IEEE, 2022.
Seniha Esen Yuksel, Joseph N Wilson, and Paul D Gader. Twenty years of mixture of experts. IEEE transactions on neural networks and learning systems, 23(8):1177–1193, 2012.
Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew Dai, Zhifeng Chen, Quoc Le, and James Laudon. Mixture-of-experts with expert choice routing. arXiv preprint arXiv:2202.09368, 2022.
Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. Designing effective sparse expert models. arXiv preprint arXiv:2202.08906, 2022.
Simiao Zuo, Xiaodong Liu, Jian Jiao, Young Jin Kim, Hany Hassan, Ruofei Zhang, Tuo Zhao, and Jianfeng Gao. Taming sparsely activated transformer with stochastic experts, 2021.