Unified Scaling Laws for Routed Language Models

Aidan Clark * 1 Diego de las Casas * 1 Aurelia Guy * 1 Arthur Mensch * 1 Michela Paganini 1 Jordan Hoffmann 1 Bogdan Damoc 1 Blake Hechtman 2 Trevor Cai 1 Sebastian Borgeaud 1 George van den Driessche 1 Eliza Rutherford 1 Tom Hennigan 1 Matthew Johnson 2 Katie Millican 1 Albin Cassirer 1 Chris Jones 1 Elena Buchatskaya 1 David Budden 1 Laurent Sifre 1 Simon Osindero 1 Oriol Vinyals 1 Jack Rae 1 Erich Elsen 1 Koray Kavukcuoglu 1 Karen Simonyan 1

* Equal contribution. 1 DeepMind 2 Google Research. Correspondence to: Aidan Clark, Diego de las Casas. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

The performance of a language model has been shown to be effectively modeled as a power-law in its parameter count. Here we study the scaling behaviors of Routing Networks: architectures that conditionally use only a subset of their parameters while processing an input. For these models, parameter count and computational requirement form two independent axes along which an increase leads to better performance. In this work we derive and justify scaling laws defined on these two variables which generalize those known for standard language models and describe the performance of a wide range of routing architectures trained via three different techniques. Afterwards we provide two applications of these laws: first deriving an Effective Parameter Count along which all models scale at the same rate, and then using the scaling coefficients to give a quantitative comparison of the three routing techniques considered. Our analysis derives from an extensive evaluation of Routing Networks across five orders of magnitude of size, including models with hundreds of experts and hundreds of billions of parameters.

1. Introduction

It is a commonly held belief that increasing the size of a neural network leads to better performance, especially when training on large and diverse real-world datasets. This vague and debated notion has become increasingly justified as large empirical studies have shown that the performance of models on many interesting classes of problems is well understood as a power law, where a multiplicative increase in model size leads to an additive reduction in the model's loss (Kaplan et al., 2020; Hernandez et al., 2021; Henighan et al., 2020; Rosenfeld et al., 2019). These relationships are not well understood, but a key implication is that a sequence of small1 models can be used both to infer the performance of models many times more powerful and to provide global information about the scalability of an architecture.

Enter Routing Networks: models with the unusual property that each input interacts with only a subset of the network's parameters, chosen independently for each datapoint (Bengio et al., 2016; 2013; Denoyer & Gallinari, 2014). For a Routing Network, the number of parameters is nearly independent of the computational cost of processing a datapoint. This bifurcates the definition of size and prevents a scaling law in parameters alone from fully describing the model class. Specific Routing Networks have been trained successfully at large scales (Fedus et al., 2021; Du et al., 2021; Artetxe et al., 2021), but the general scaling behavior is not well understood. In this work we analyze the behavior of routed language models so that we might infer the scaling laws that describe their performance.
Key contributions. We analyze three different techniques for training Routing Networks, detailed in §3: Sinkhorn-BASE (S-BASE), a sparse mixture-of-experts (SMOE) approach modifying BASE (Lewis et al., 2021); non-parametric HASH Layers (Roller et al., 2021); and routing via Reinforcement Learning (RL-R). With models up to 200 billion parameters, we observe the following:

1. Routing improves the performance of language models across all sizes and variants attempted (see Fig. 1).

2. Training a Routing Network with RL (§3.3), a technique used in early routing work (Bengio et al., 2013), is of comparable effectiveness to state-of-the-art techniques.

3. The performance of all Routing Networks is accurately described by scaling laws in the number of experts and in the underlying dense model size (§4) which generalize those from Kaplan et al. (2020).

4. These laws can be restated in terms of parameter count and inference compute, capturing an even wider set of routing architectures under a shared fit (§4.4).

5. They further imply an Effective Parameter Count: a mapping equating the performance and scaling for both dense and routed networks (§5).

The data used to derive the scaling laws is available in a GitHub repository2.

1 Measured as training or inference floating point operations, devices or time required, financial cost, carbon emissions, etc.

Figure 1. (a) The performance achieved by Routing Networks when varying the number of experts for a fixed dense model size is described by a bilinear function (Eq. 1), (b) whose level curves indicate how to trade model size with expert count to maintain a fixed performance, (c) and which can be manipulated to align dense and routed model performance under a shared power law.

2. Background

We first review the language modelling problem and existing scaling laws before discussing the process of routing a neural network and how it is applied to language models.

Language modelling. We consider the problem of autoregressively predicting natural language, a task with consistent and predictable scaling characteristics across many orders of magnitude (Henighan et al., 2020; Kaplan et al., 2020). The objective is to maximize the likelihood of a sequence of tokens $P(x_1, \ldots, x_T)$, factored auto-regressively as $p(x_1, \ldots, x_T) = \prod_{i=1}^{T} p(x_i \mid x_{j<i})$.

Ê > E for small values of E. The form (10) is selected for having nice properties clearly related to its parameters, useful in our comparative analysis in §5.1. Practically, we found that the fit is the same over a wide range of different saturation functions.

Fitting. Solving Equation (1), equal to Eq. (7) with E → Ê, is complicated by its non-convexity. We find the coefficients (a, b, c, d, Estart, Emax) as the best of repeated solutions provided by the L-BFGS-B algorithm (Byrd et al., 1995). Fig. 2 shows fitted curves from these equations; coefficients are reported in Table 3.
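The following is a minimal sketch of the fitting procedure described above, assuming base-10 logarithms and the bilinear form log L = a log N + b log Ê + c log N log Ê + d. Since the exact saturating transformation Ê(E) of Eq. (10) is not reproduced in this excerpt, the helper `e_hat` below uses a generic interpolation between Estart and Emax as a stand-in; the text notes that the fit is insensitive to the precise choice of saturation function.

```python
# A hedged sketch of fitting Eq. (1) by repeated L-BFGS-B solves (Byrd et al., 1995).
import numpy as np
from scipy.optimize import minimize

def e_hat(E, e_start, e_max):
    """Assumed saturating transform: tends to e_start as E -> 1 and to e_max as E -> inf."""
    return 1.0 / (1.0 / e_max + (1.0 / e_start - 1.0 / e_max) / E)

def predicted_log_loss(theta, logN, E):
    a, b, c, d, e_start, e_max = theta
    logE = np.log10(e_hat(E, e_start, e_max))
    return a * logN + b * logE + c * logN * logE + d   # bilinear form of Eq. (7)

def fit_scaling_law(logN, E, log_loss, n_restarts=32, seed=0):
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):                         # best of repeated solutions
        theta0 = np.concatenate([rng.normal(0.0, 0.1, size=4),
                                 [rng.uniform(1.0, 10.0), rng.uniform(100.0, 1000.0)]])
        res = minimize(
            lambda t: np.mean((predicted_log_loss(t, logN, E) - log_loss) ** 2),
            theta0, method="L-BFGS-B",
            bounds=[(None, None)] * 4 + [(1.0, None), (1.0, None)])
        if best is None or res.fun < best.fun:
            best = res
    return best.x   # (a, b, c, d, Estart, Emax)
```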
Interpretation. Relative to using the simple bilinear law (7), fitting Eq. (1) improves prediction for the lowest and highest values of E considered. Crucially, while the deviation from a power-law (and therefore the improvement in RMSLE) is relatively minor for the values of E considered, the deviation is nonetheless clear (seen best by looking at the raw losses in Fig. 22). We believe it is important to model this saturation because (as argued in §5.2) the limit behavior of model performance as N increases is substantially different when bounded, with important properties that are independent of Emax. We further hypothesize that future work, able to test still larger values of E, will see a more quantitative benefit from including these terms. This can already be observed in Fig. 21 when noting that the law (7) does not over- and under-estimate the performance for E ∈ {2, 4, 256, 512} as it does in Fig. 4.

Level curves of Eq. (1) enumerate the {(N, E)} which are predicted to achieve fixed performance, as visualized in Fig. 1(b). This demonstrates the power of routing: a model with N = 5 million and E = 128 equals the performance of a model with N = 55 million and E = 1, which requires over ten times more compute per inference.

Figure 5. Level curves for Equation (1) and Equation (2) on S-BASE for K ∈ {1, 2, 4} (left two), R ∈ {1.0, 0.5, 0.25} (right two). Scaling laws in (N, E) differ for models with different values of (K, R): indicated by non-overlapping level-curves. A change of variables to (F, P) leads to almost-overlapping functions: allowing the same fits to be reused across changes in the routing architecture.

Figure 6. Example of different parameterizations of Eq. (10).

4.4. Generalizing Across Architecture Variants

The models trained so far use fixed choices for two key details of routing: the number of experts executed per datapoint K and the frequency of routed layers across depth R (previously set at 1 and 0.5, respectively). For any selected value of K and R we may fit Eq. (1) to observed performance, but since these variables are independent of N and E, we do not expect the same coefficients to remain valid across values of K and R. To allow for a unified scaling law, we modify Eq. (1) to use terms in F, the TeraFLOPs required per forward pass, and in the ratio B ≜ P/F, where P is the total number of parameters. Specifically, F is motivated by the approximation from Kaplan et al. (2020) that F = 2N. B, the parameter utilization ratio, is an affine function of E, close to linear when most parameters lie in the routed components of the model. Using (F, B) instead of (N, E) (and setting Emin to 1/2) results in Eq. (2).
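A small illustrative helper for this change of variables is sketched below. The function and its arguments are hypothetical: F follows the Kaplan et al. (2020) approximation F ≈ 2N (expressed here in TeraFLOPs per forward pass), and the total parameter count P is a rough estimate assuming a fraction `ffw_frac` of the dense parameters sits in FFW layers, of which a fraction R are routed with E experts. Under that assumption B = P/F is indeed affine in E, as claimed above.

```python
# Hypothetical helper: map a routed model (N, E, R) to the (F, B) variables of Eq. (2).
def routed_size_variables(N, E, R=0.5, ffw_frac=2.0 / 3.0):
    F = 2.0 * N / 1e12                          # TeraFLOPs per forward pass (F = 2N approximation)
    P = N * (1.0 + ffw_frac * R * (E - 1))      # rough total parameter count of the routed model
    B = P / F                                   # parameter utilization ratio, affine in E
    return F, P, B
```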
To show the advantage of this change of variables we conduct two experiments: varying K across {1, 2, 4} and R across {0.25, 0.5, 1.0}. In both cases, we vary E ∈ {8, 64, 256} and N ∈ {15M, 370M, 870M}.

Fitting. Eq. (2) predicts the scaling behavior of models as well as Eq. (1) for a given routing architecture, as indicated in Fig. 25. The benefit of the change of variables is seen most clearly in Fig. 5, which plots contours of fixed loss value as functions of (N, E) and of (F, B). For varying (K, R), the loss surface as a function of N and E changes, meaning a joint fit would be inaccurate. Plotted as functions of (F, B), the loss surface is almost the same, suggesting a shared fit between all three methods (see Fig. 26 and Fig. 27 for joint fits for K and R respectively). We highlight that R = 0.25 deviates slightly. Plausible explanations are discussed in §D.4.

The possibility of using a shared fit indicates a singular takeaway: the architectural details K and R have little effect on the scaling behavior of a Routing Network. The loss of the network can thus be predicted based only on the inference FLOPs F and the total number of parameters P.

5. Scaling Law Applications

Next we provide three applications of these scaling laws. This analysis must be considered in its specific context: that the scaling laws were fit to a set of models which were all trained to 130 billion tokens regardless of N and E. We expect the coefficients and limits described in this section to be tightly dependent on this specific token count (but not our overall analysis, which we expect to be robust). In particular, we expect Ncutoff to increase with added data. Prior work has established the notion of an optimal relationship between N and training-token count5 (Kaplan et al., 2020; Hoffmann et al., 2022). Our analysis cannot be conducted in the optimal-token-count setting, since it is unclear how we should change the amount of data given a change in E. Future work establishing this interaction might reapply our analysis in that context (we examine this issue in more depth in App. F).

5 Given N, one selects a token-count such that performance is maximized relative to other model sizes trained on however many tokens require an equivalent amount of compute for that size.

5.1. Effective Parameter Equivalence

We leverage Eq. (1) to compute the size N̄ of a dense model giving the same performance as a Routing Network. Specifically, we solve for L(N̄, 1) = L(N, E), yielding

$$\bar{N}(N, E) \triangleq N^{\alpha(\hat{E})/\alpha(E_{\mathrm{start}})} \left(\hat{E}/E_{\mathrm{start}}\right)^{b/\alpha(E_{\mathrm{start}})}. \qquad (11)$$

Here α(E) = a + c log E. Given a model with N and E, we call N̄ that model's Effective Parameter Count (or EPC). Eq. (1) predicts that the performance of all models increases as a power law in this variable:

$$\log L(N, E) = a \log \bar{N}(N, E) + d. \qquad (12)$$

The result of plotting all models as a function of N̄ is shown in Fig. 1(c): a good fit across four orders of magnitude. Scaling in terms of N̄ results in a unifying power law: valid for dense and routed language models alike.

Table 3. Solutions to Eq. (1).
             a        b        c       d      Estart     Emax
S-BASE    -0.082   -0.108   0.009   1.104     1.847    314.478
RL-R      -0.083   -0.126   0.012   1.111     1.880    469.982
HASH      -0.087   -0.136   0.012   1.157     4.175    477.741

Figure 7. Maximum effective parameter count as a function of base model size extrapolated from our coefficients. Routing helps until a certain size Ncutoff, which varies strongly between methods (S-BASE being the best).

5.2. Routing Behavior for Large N

EPC leads to a better grasp of the behavior of routing as N increases. Of immediate interest is Ncutoff: the value of N where N̄(N, E) ⩽ N. For larger N, routing will not improve performance. This is easily found to obey log Ncutoff = −b/c. Ncutoff equals 937B, 85B and 83B for S-BASE, RL-R and HASH respectively.

Next we consider N̄max(N) ≜ maxE N̄(N, E), i.e. the maximal effective parameter count that a routing network can reach. Eq. (11) predicts that log N̄ is an affine function of log N for any fixed E, and N̄max(N) = N for N > Ncutoff. Therefore log N̄max is piecewise-affine in log N, as displayed in Fig. 7:

$$\bar{N}_{\max}(N) = \bar{N}(N, E_{\max}) \;\; \forall\, N \leqslant N_{\mathrm{cutoff}} = 10^{-b/c}, \qquad \bar{N}_{\max}(N) = N \;\; \forall\, N \geqslant N_{\mathrm{cutoff}}. \qquad (13)$$

Note that N̄max is continuous near Ncutoff, since for all E, N̄(Ncutoff, E) = Ncutoff. Moreover, the slope of N̄max(·) for N ⩽ Ncutoff is positive whenever Emax ⩽ 10^{−a/c}, which is true for our coefficients. In this setting N̄max(·) is a non-decreasing function of N. Therefore for any routing network where N < Ncutoff, N ⩽ N̄max(N) ⩽ Ncutoff, meaning routing will never let you train a model more powerful than Ncutoff. Note that despite this value not depending on Emax, its existence crucially depends on the saturating transformation: without it N̄max is unbounded.
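The sketch below applies Eq. (11) and the Ncutoff definition of §5.2 using the S-BASE row of Table 3, taking logarithms in base 10 (consistent with Ncutoff = 10^{−b/c}). The saturating Ê(E) of Eq. (10) is again approximated by the generic stand-in used in the earlier fitting sketch, so the printed values are illustrative rather than a reproduction of the paper's exact numbers.

```python
# Sketch of the Effective Parameter Count (Eq. 11) and Ncutoff using Table 3 coefficients.
import math

a, b, c, d = -0.082, -0.108, 0.009, 1.104     # S-BASE row of Table 3
e_start, e_max = 1.847, 314.478

def e_hat(E):
    """Assumed saturating transform standing in for Eq. (10)."""
    return 1.0 / (1.0 / e_max + (1.0 / e_start - 1.0 / e_max) / E)

def alpha(E):
    return a + c * math.log10(E)

def effective_param_count(N, E):
    """Eq. (11): dense size N-bar matching the loss of a routed model (N, E)."""
    Eh = e_hat(E)
    log_nbar = (alpha(Eh) / alpha(e_start)) * math.log10(N) \
               + (b / alpha(e_start)) * math.log10(Eh / e_start)
    return 10 ** log_nbar

n_cutoff = 10 ** (-b / c)   # dense size beyond which routing is predicted not to help
print(f"EPC(5M, 128E) ~ {effective_param_count(5e6, 128):.3g}, Ncutoff ~ {n_cutoff:.3g}")
```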
5.3. Comparative Analysis

Kaplan et al. (2020) use scaling laws to encapsulate and contrast the behavior of entire model classes. Here we mirror this analysis by using the scaling laws we have proposed to summarize the relative behavior of the three routing techniques considered. We make four concrete observations:

• S-BASE consistently outperforms RL-R and HASH, though RL-R is very competitive at smaller N.

• All routing techniques suffer from reduced efficacy as N increases. Amongst the three techniques, S-BASE scales best: the fitted parameter c is lowest.

• For small N, RL-R and S-BASE scale similarly with expert count and better than HASH (as indicated by computing the effective expert slope b(N) = b + c log N).

• HASH and RL-R maintain power-law behavior for longer than S-BASE (larger Emax). However they suffer from more interference (c), leading to worse performance for most model sizes.

• HASH has a large initial overhead (bigger Estart), clearly visible as a more obvious curvature at small E.

For a practitioner interested in applying routing techniques, we conclude with some recommendations:

1. Use routing when training any model with N ⩽ 1.3B.

2. S-BASE is a good default routing algorithm. RL-R will sometimes match S-BASE in performance but is less robust and scalable (§D.1).

3. Target using E ∈ {64, 128} experts. Larger values will continue to improve, but with diminishing returns.

4. Use K = 1 experts. Route layers at frequency 0.5 ⩽ R ⩽ 1; lower frequency reduces performance.

5. Future routing research should focus on the terms c and Emax, indicative of limits to arbitrary scaling.

6. New routing techniques must be validated at multiple values of N and E when comparing with prior work. Results on single sizes cannot be extrapolated.

6. Related Work

In studying the empirical aspects of scaling, this work follows Kaplan et al. (2020), which triggered much research including Henighan et al. (2020), Hernandez et al. (2021) and Ghorbani et al. (2021). The underlying theory is less understood, but there is some exploration of this space including Hutter (2021) and Bahri et al. (2021).
These studies, and ours, are mutually reliant on a large corpus of work improving the scalability of Transformers. This includes models like GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), Jurassic-1 (Lieber et al., 2021) and Gopher (Rae et al., 2021), as well as work improving the ability of these models to be efficiently parallelized across multiple devices, including Shoeybi et al. (2019), Narayanan et al. (2019), Kim et al. (2021) and Xu et al. (2021).

Parallel to all this has been a long study of Routing Networks; a term introduced by Rosenbaum et al. (2018) but developed extensively in the literature as Conditional Computation (Bengio et al., 2013; 2016; Bengio, 2017; Denoyer & Gallinari, 2014) and Mixture of Experts (Jacobs et al., 1991; Collobert et al., 2003; Eigen et al., 2014). The framework is sometimes further generalized, seen as per-example architecture search in Ramachandran & Le (2018) or as a graph problem in Denoyer & Gallinari (2014). Routing was popularized for large scale training by Shazeer et al. (2017), and furthered by work including GShard (Lepikhin et al., 2020), Switch Transformer (Fedus et al., 2021) and GLaM (Du et al., 2021). In this vein, Artetxe et al. (2021) undertake a comparative analysis of dense networks and SMOEs with E = 512 that aligns with our results.

Finally, the core routing architecture is still being improved. Nie et al. (2021) adapt K through training whereas Hazimeh et al. (2021) learn it via a differentiable loss. Ramachandran & Le (2018) increase K through depth and encourage architectural diversity across experts. Caccia et al. (2021) grow E throughout training and Rajbhandari et al. (2022) propose networks where E changes with depth.

7. Conclusion

Using conditional computation to scale neural networks has long been a research goal, and methods based on Routing Networks have been increasing in popularity. Here we have introduced a scaling law that models the behavior of these networks (Eq. (1)). This scaling law predicts that, for all models considered, introducing routing into a language model improves performance. That improvement follows a power-law in the number of experts E that diminishes with model size N, and can be further generalized across routing architectures with Eq. (2). These scaling laws quantify the differences between three different routing techniques and lead to a single scalar (Eq. (11)) that simultaneously describes the performance of routed and dense models alike.

Our analysis hints towards a number of interesting research directions. Future work might investigate whether larger models continue to obey the performance our laws have predicted, or see whether a wider set of routing architectures or tasks display similar scaling behavior. Alternatively, investigations into the relationship explained in §5.3 between optimal token count and E would be extremely informative, helping future scaling work more precisely identify the correct routed model size for a given computational budget.

This work provides an empirical framework with which to analyze future innovations in routing. We hope the overwhelming evidence we provide towards the benefits of routing encourages it to be more rapidly adopted as a powerful tool for model improvement, whose scaling characteristics align with traditional methods of scaling (in depth and width) and which will remain beneficial up to models with base model size greater than 900 billion parameters.

Acknowledgments

We would like to thank Marc'Aurelio Ranzato, Nando de Freitas, Jacob Menick and Andy Brock for useful comments and feedback on early drafts of this paper. The infrastructure needed to train these models wouldn't have been possible without the dedicated work of the JAX and XLA teams, especially Peter Hawkins, Roy Frostig and James Bradbury, who all were crucial in the development of the routing software.

References

Artetxe, M., Bhosale, S., Goyal, N., Mihaylov, T., Ott, M., Shleifer, S., Lin, X.
V., Du, J., Iyer, S., Pasunuru, R., Anantharaman, G., Li, X., Chen, S., Akin, H., Baines, M., Martin, L., Zhou, X., Koura, P. S., O’Horo, B., Wang, J., Zettlemoyer, L., Diab, M., Kozareva, Z., and Stoyanov, V. Efficient Large Scale Language Modeling with Mixtures of Experts. arXiv:2112.10684, 2021. Bahri, Y., Dyer, E., Kaplan, J., Lee, J., and Sharma, U. Explaining neural scaling laws. arXiv:2102.06701, 2021. Bengio, E. On Reinforcement Learning for Deep Neural Architectures: Conditional Computation with Stochastic Computation Policies. McGill University (Canada), 2017. Bengio, E., Bacon, P.-L., Pineau, J., and Precup, D. Conditional computation in neural networks for faster models. In International Conference on Learning Representations, 2016. Bengio, Y., Léonard, N., and Courville, A. C. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013. Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., and Wanderman-Milne, S. Jax: composable transformations of Python + NumPy programs, 2018. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. arXiv:2005.14165, 2020. Byrd, R. H., Lu, P., Nocedal, J., and Zhu, C. A limited memory algorithm for bound constrained optimization. SIAM Journal on scientific computing, 16(5):1190–1208, 1995. Caccia, L., Xu, J., Ott, M., Ranzato, M., and Denoyer, L. On anytime learning at macroscale. arXiv:2106.09563, 2021. Collobert, R., Bengio, Y., and Bengio, S. Scaling large learning problems with hard parallel mixtures. International Journal of Pattern Recognition and Artificial Intelligence, 17(03):349–365, 2003. Curation. Curation corpus base, 2020. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, 2013. Dai, Z., Yang, Z., Yang, Y., Carbonell, J. G., Le, Q., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. In Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, 2019. Denoyer, L. and Gallinari, P. Deep sequential neural networks. In NIPS Deep Learning Workshop, 2014. Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A. W., Firat, O., Zoph, B., Fedus, L., Bosma, M., Zhou, Z., Wang, T., Wang, Y. E., Webster, K., Pellat, M., Robinson, K., Meier-Hellstern, K., Duke, T., Dixon, L., Zhang, K., Le, Q. V., Wu, Y., Chen, Z., and Cui, C. Glam: Efficient scaling of language models with mixture-of-experts. arXiv 2112.06905, 2021. Eigen, D., Ranzato, M., and Sutskever, I. Learning factored representations in a deep mixture of experts. ICLR Workshop, 2014. Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv:2101.03961, 2021. Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. The pile: An 800GB dataset of diverse text for language modeling. arXiv:2101.00027, 2020. Ghorbani, B., Firat, O., Freitag, M., Bapna, A., Krikun, M., Garcia, X., Chelba, C., and Cherry, C. Scaling laws for neural machine translation. arXiv:2109.07740, 2021. Hazimeh, H., Zhao, Z., Chowdhery, A., Sathiamoorthy, M., Chen, Y., Mazumder, R., Hong, L., and Chi, E. H. 
Dselectk: Differentiable selection in the mixture of experts with applications to multi-task learning. Advances in Neural Information Processing Systems, 2021. Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T. B., Dhariwal, P., Gray, S., et al. Scaling laws for autoregressive generative modeling. arXiv:2010.14701, 2020. Hernandez, D., Kaplan, J., Henighan, T., and McCandlish, S. Scaling laws for transfer. arXiv:2102.01293, 2021. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022. Holtzman, A., Buys, J., Forbes, M., and Choi, Y. The curious case of neural text degeneration. CoRR, 2019. Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., et al. Gpipe: Efficient training of giant neural networks using pipeline Unified Scaling Laws for Routed Language Models parallelism. Advances in Neural Information Processing Systems, 2019. Hutter, M. Learning curve theory. arXiv:2102.04074, 2021. Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991. Kantorovitch, L. On the Translocation of Masses. Management Science, 5(1):1–4, 1958. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv:2001.08361, 2020. Kim, Y. J., Awan, A. A., Muzio, A., Salinas, A. F. C., Lu, L., Hendy, A., Rajbhandari, S., He, Y., and Awadalla, H. H. Scalable and efficient MoE training for multitask multilingual models. CoRR, abs/2109.10465, 2021. Knopp, P. and Sinkhorn, R. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343–348, 1967. Kool, W., Maddison, C. J., and Mnih, A. Unbiased gradient estimation with balanced assignments for mixtures of experts. CoRR, abs/2109.11817, 2021. Kudo, T. and Richardson, J. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018. Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. Gshard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, 2020. Lewis, M., Bhosale, S., Dettmers, T., Goyal, N., and Zettlemoyer, L. BASE layers: Simplifying training of large, sparse models. In International Conference on Machine Learning, 2021. Lieber, O., Sharir, O., Lenz, B., and Shoham, Y. Jurassic-1: Technical details and evaluation. White Paper. AI21 Labs, 2021. Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018. McCandlish, S., Kaplan, J., Amodei, D., and Team, O. D. An empirical model of large-batch training. arXiv:1812.06162, 2018. Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. International Conference on Learning Representations, 2016. Narayanan, D., Harlap, A., Phanishayee, A., Seshadri, V., Devanur, N. R., Ganger, G. R., Gibbons, P. B., and Zaharia, M. Pipedream: Generalized pipeline parallelism for dnn training. In Proceedings of the ACM Symposium on Operating Systems Principles, 2019. 
Nie, X., Cao, S., Miao, X., Ma, L., Xue, J., Miao, Y., Yang, Z., Yang, Z., and Cui, B. Dense-to-sparse gate for mixture-of-experts. arXiv:2112.14397, 2021. Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N.-Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernández, R. The lambada dataset: Word prediction requiring a broad discourse context. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016. Peyré, G. and Cuturi, M. Computational Optimal Transport. Foundations and Trends in Machine Learning, 2019. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 2019. Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., Rutherford, E., Hennigan, T., Menick, J., Cassirer, A., Powell, R., Driessche, G. v. d., Hendricks, L. A., Rauh, M., Huang, P.-S., Glaese, A., Welbl, J., Dathathri, S., Huang, S., Uesato, J., Mellor, J., Higgins, I., Creswell, A., McAleese, N., Wu, A., Elsen, E., Jayakumar, S., Buchatskaya, E., Budden, D., Sutherland, E., Simonyan, K., Paganini, M., Sifre, L., Martens, L., Li, X. L., Kuncoro, A., Nematzadeh, A., Gribovskaya, E., Donato, D., Lazaridou, A., Mensch, A., Lespiau, J.-B., Tsimpoukelli, M., Grigorev, N., Fritz, D., Sottiaux, T., Pajarskas, M., Pohlen, T., Gong, Z., Toyama, D., d’Autume, C. d. M., Li, Y., Terzi, T., Mikulik, V., Babuschkin, I., Clark, A., Casas, D. d. L., Guy, A., Jones, C., Bradbury, J., Johnson, M., Hechtman, B., Weidinger, L., Gabriel, I., Isaac, W., Lockhart, E., Osindero, S., Rimell, L., Dyer, C., Vinyals, O., Ayoub, K., Stanway, J., Bennett, L., Hassabis, D., Kavukcuoglu, K., and Irving, G. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. arXiv:2112.11446, 2021. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21: 1–67, 2020. Unified Scaling Laws for Routed Language Models Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16, 2020. Rajbhandari, S., Li, C., Yao, Z., Zhang, M., Aminabadi, R. Y., Awan, A. A., Rasley, J., and He, Y. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation AI scale. arXiv 2201.05596, 2022. Ramachandran, P. and Le, Q. V. Diversity and depth in per-example routing models. In International Conference on Learning Representations, 2018. Roller, S., Sukhbaatar, S., Szlam, A., and Weston, J. Hash layers for large sparse models. arXiv:2106.04426, 2021. Rosenbaum, C., Klinger, T., and Riemer, M. Routing networks: Adaptive selection of non-linear functions for multi-task learning. In International Conference on Learning Representations, 2018. Rosenbaum, C., Cases, I., Riemer, M., and Klinger, T. Routing networks and the challenges of modular and compositional computation. arXiv:1904.12774, 2019. Rosenfeld, J. S., Rosenfeld, A., Belinkov, Y., and Shavit, N. A constructive prediction of the generalization error across scales. In International Conference on Learning Representations, 2019. Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. 
Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017.

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv:1909.08053, 2019.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 2000.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992.

Xu, Y., Lee, H., Chen, D., Hechtman, B., Huang, Y., Joshi, R., Krikun, M., Lepikhin, D., Ly, A., Maggioni, M., et al. Gspmd: General and scalable parallelization for ml computation graphs. arXiv:2105.04663, 2021.

Figure 8. Training curves color-coded by random seed for 1, 8 and 256 experts and the step-wise maximum disagreement between runs.

A. Architecture

Our Transformer (Vaswani et al., 2017) is based on the architecture in (Radford et al., 2019) with relative positional encodings (Dai et al., 2019). Text is tokenized via SentencePiece (Kudo & Richardson, 2018) with 32,000 tokens and a byte-level backoff. We use Megatron-style FFW sharding (Shoeybi et al., 2019) where useful. Parameters are stored in bfloat16 but all optimizer statistics are kept in float32. As a result, the activations of the language models are calculated in bfloat16 (though we explicitly upcast to perform all operations involving a softmax, including the Attention Block and Router, in full float32 precision). This is crucial to maintain stability on larger models (Fedus et al., 2021; Rae et al., 2021). The learning rate starts at 1e-7, is ramped up to 2e-4 over an initial warmup phase of 1,500 steps, and then decays to 2e-5 with a cosine schedule over the entire 250,000 steps.

We use seven different model sizes, with names and architectures specified in the following table. The width of the hidden layer dffw is fixed at four times the width of the activations dmodel, and we use the same dimension for keys and values.

Table 4. Model definitions used throughout this work.
Name    dmodel   nlayers   nheads   K/V size   Actual # Params
15M       512       6         8        32          16,527,360
25M       512       8         8        64          27,279,360
55M       640      10        12        64          57,369,600
130M      896      12        16        64         132,163,584
370M     1536      12        12       128         368,123,904
870M     2048      16        16       128         872,546,304
1.3B     2048      24        16       128       1,308,819,456

The number of models we trained was too large to practically include multiple runs of each model with different seeds. To give an idea of the potential error introduced by random chance, we trained all three routing techniques with 3 different seeds on a 130M model for 100,000 steps with 8 and 256 experts (along with a dense baseline). Results are shown in Fig. 8. Different seeds (which influence not only parameter initialization but Expert Parallelism – see Appendix C) lead to extremely minimal model divergence after an initial transitory period, with different seeds diverging by no more than 0.01 before 100,000 steps. This is a close match to the 0.02 error mentioned in (Kaplan et al., 2020)6.

6 Anecdotally, throughout the development of this work we used 0.02 as the cutoff to denote statistical significance.
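The schedule described in Appendix A above can be written as a small function, sketched below. The text does not specify whether the cosine phase is measured from step 0 or from the end of warmup; this sketch assumes it starts after warmup.

```python
# A minimal sketch of the learning-rate schedule described in Appendix A:
# linear warmup from 1e-7 to 2e-4 over 1,500 steps, then cosine decay to 2e-5 by step 250,000.
import math

def learning_rate(step, warmup_steps=1_500, total_steps=250_000,
                  lr_init=1e-7, lr_peak=2e-4, lr_final=2e-5):
    if step < warmup_steps:
        frac = step / warmup_steps
        return lr_init + frac * (lr_peak - lr_init)
    frac = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return lr_final + 0.5 * (lr_peak - lr_final) * (1.0 + math.cos(math.pi * frac))
```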
B. Detailed Routing Techniques

Here we detail aspects of the routing techniques crucial to their implementation and provide comparisons to key alternatives.

B.1. Balancing Losses

We encourage uniform routing in both our SMOE and RL-R methods with the differentiable load balancing loss adapted from the mean square auxiliary loss in Shazeer et al. (2017) and introduced in Lepikhin et al. (2020); Fedus et al. (2021):

$$L_B = E \cdot \sum_{e=1}^{E} m_e \cdot \frac{g_e}{N}, \qquad (14)$$

where $m_e$ is the mean gate per expert,

$$m_e = \frac{1}{N} \sum_{x \in B} p_e(x), \qquad (15)$$

and $g_e$ is the gating decision per expert,

$$g_e = \sum_{x \in B} \mathbb{1}\{\operatorname{argmax} p(x) = e\}, \qquad (16)$$

for x in a batch B of size N and policy p(x) = softmax(Wp x + bp).

There are two cases where the selected experts may not be the ones used: in S-BASE after the Sinkhorn redistribution step (see §B.2.1) and when experts are skipped due to load-balancing (see §C.2). In both cases, the balancing loss is applied to the original gating decisions made by the policy. We found that the auxiliary loss is less effective if post-balancing experts were considered.
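The following is a short numpy sketch of Eqs. (14)–(16) under the assumption that `p` holds the router probabilities (softmax of the router logits) for one batch; shapes and names are illustrative rather than the paper's code.

```python
# Sketch of the load balancing loss of Eqs. (14)-(16).
import numpy as np

def balancing_loss(p):
    """p: array of shape [num_tokens, num_experts] of router probabilities."""
    num_tokens, num_experts = p.shape
    m = p.mean(axis=0)                                   # Eq. (15): mean gate per expert
    top = np.argmax(p, axis=1)                           # original gating decisions of the policy
    g = np.bincount(top, minlength=num_experts)          # Eq. (16): tokens assigned per expert
    return num_experts * np.sum(m * g / num_tokens)      # Eq. (14)
```

As noted above, the loss is computed from the original gating decisions, before any Sinkhorn redistribution or capacity-based skipping.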
B.2. SMoE with Sinkhorn redistribution (S-BASE)

Our implementation of S-BASE differs from that proposed in Lewis et al. (2021) in two ways. First, we replace the auction algorithm for re-assigning expert selections with a continuous rebalancing process implemented via a Sinkhorn algorithm (Cuturi, 2013; Peyré & Cuturi, 2019). Second, we add a shuffling step, similar to Lewis et al. (2021), before computing the optimal assignment via Sinkhorn per-device (as opposed to across all devices as done in Lewis et al. (2021)). In addition, we did not use any input jitter on the activations sent to ρ as we did not see a noticeable effect. This is in line with BASE but differs from recommendations in other SMOE papers (Lepikhin et al., 2020; Fedus et al., 2021).

B.2.1. Sinkhorn Redistribution

We rebalance expert selections using a Sinkhorn layer applied on top of the router logits, an idea that was explored independently in parallel by Kool et al. (2021). This is substantially more efficient on our accelerator cluster than a hard matching algorithm.

We consider $H \in \mathbb{R}^{T \times d}$ the intermediary embeddings of the network before the application of a routed layer (folded on the batch and time axes of respective sizes b and t, with T ≜ bt). Those are fed to the linear router, which outputs a logits matrix $L \in \mathbb{R}^{T \times E}$ with rows $L_i = H_i W + b$. Here E is the number of experts, and $W \in \mathbb{R}^{d \times E}$ and $b \in \mathbb{R}^{E}$ are the router parameters. From these logits, SMOE and RL-R compute expert selection probabilities Π by applying a softmax operation along the expert axis. In doing this, we compute selection probabilities for each input separately, without taking into consideration any capacity constraints on experts, forcing us to introduce load-balancing later (§C.2). We seek a proper way to integrate constraints in a mathematically grounded framework.

Mathematically, Π is obtained by solving a simple problem with constraints: each input must, on average, prefer exactly one expert. This is made clear by the variational formulation of the softmax:

$$[\operatorname{softmax}(L_i)]_{i \in [1,T]} = \operatorname*{argmax}_{\Pi \geq 0, \;\; \forall i \in [T]:\, \sum_{j \in [E]} p_{ij} = 1} \langle \Pi, L \rangle - H(\Pi), \qquad \Pi \in \mathbb{R}^{T \times E}, \qquad (17)$$

where H is the Shannon entropy of the matrix Π, i.e. $H(\Pi) \triangleq \sum_{i=1}^{T} \sum_{j=1}^{E} p_{ij} \log p_{ij}$, and [·] denotes horizontal stacking.

This variational formulation offers a natural alternative to incorporate extra constraints. For ideal performance, each expert should be assigned the same number of tokens on average, B = T/E. We therefore add E additional constraints:

$$\forall\, j \in [E], \quad \sum_{i=1}^{T} p_{ij} = B, \qquad (18)$$

which yields the doubly constrained regularized linear problem

$$\operatorname*{argmax}_{\Pi \in \mathbb{R}^{T \times E}} \; \langle \Pi, L \rangle - H(\Pi) \quad \text{under the constraints} \quad \Pi \geq 0, \quad \forall\, i \in [T]:\, \sum_{j=1}^{E} p_{ij} = \tfrac{1}{T}, \quad \forall\, j \in [E]:\, \sum_{i=1}^{T} p_{ij} = \tfrac{1}{E}, \qquad (19)$$

that we recognize as the regularized Kantorovich problem of optimal transport (Kantorovitch, 1958; Cuturi, 2013). We solve this problem using the Sinkhorn algorithm (Knopp & Sinkhorn, 1967), which takes the logit matrix $L \in \mathbb{R}^{T \times E}$ and returns a soft-assignment matrix $\Pi \in \mathbb{R}^{T \times E}$.

The Sinkhorn algorithm solves Eq. (19) by alternated ascent in the dual (see Peyré & Cuturi (2019) for details). Starting from $f_0 = 0 \in \mathbb{R}^T$ and $g_0 = 0 \in \mathbb{R}^E$, we set

$$\forall\, i \in [T]: \;\; (f_{t+1})_i = -\log \frac{1}{E} \sum_{j=1}^{E} \exp\big(L_{ij} + (g_t)_j\big), \qquad \forall\, j \in [E]: \;\; (g_{t+1})_j = -\log \frac{1}{T} \sum_{i=1}^{T} \exp\big(L_{ij} + (f_{t+1})_i\big). \qquad (20)$$

These updates converge towards an optimal couple (f, g), such that

$$\Pi = \frac{1}{TE} \exp(L + f \oplus g) \qquad (21)$$

is the solution to Eq. (19), where $(f \oplus g)_{ij} \triangleq f_i + g_j$ for all $(i, j) \in [T] \times [E]$. As detailed below, we early-stop the iterations (20) by measuring the primal violation of the constraints in L1 norm, i.e. stopping when

$$\sum_{j=1}^{E} \Big| \sum_{i=1}^{T} (\Pi_t)_{ij} - \frac{1}{E} \Big| + \sum_{i=1}^{T} \Big| \sum_{j=1}^{E} (\Pi_t)_{ij} - \frac{1}{T} \Big| \leqslant e_{tol}. \qquad (22)$$

Once the plan is computed, we greedily select, for each token, the device with highest device-selection probability, effectively applying an argmax operation on top of the Sinkhorn logits to form a transportation plan projection.

Comparison to BASE and performance. Compared to using an exact (early-stopped) auction algorithm as in Lewis et al. (2021), the complexity of the Sinkhorn algorithm is in O(N × E) versus O((N × E)^{3/2}), and its updates are well adapted to batch computations on TPU/GPU. In contrast, the auction algorithm must be run on CPU as it is a greedy per-coordinate algorithm; it becomes a computational bottleneck applied to models with many routed layers. Replacing the softmax output by a regularized optimal transport plan is very naturally interpreted as adding a balancing distribution constraint to the softmax operator. Using an auction algorithm on top of the softmax assignment does not have this property. Moreover, the Sinkhorn algorithm can be halted before it has fully converged with a proper tolerance parameter (22), where Lewis et al. (2021) uses a hard number of iterations. We find an error tolerance of etol = 10^{-2} gives consistently good performance.

In practice we observe an end-to-end model overhead of 1% to 3% compared to Switch (the same routing technique without this reassignment). This computational offset is negligible compared to the per-step performance gain. Without the rebalancing step, Switch is very sensitive to balancing loss hyperparameters (as noted in Lewis et al. (2021)) whereas S-BASE maintains uniform routing decisions with improved performance and robustness while varying E and N.
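A compact numpy sketch of the redistribution step follows, implementing the log-domain updates of Eq. (20) with the L1 early-stopping rule of Eq. (22). The sign conventions follow the entropic formulation of Eq. (19); the production implementation runs on TPU and may differ in details such as batching and precision.

```python
# Sketch of Sinkhorn redistribution over router logits of shape [T, E] (tokens x experts).
import numpy as np
from scipy.special import logsumexp

def sinkhorn_plan(logits, etol=1e-2, max_iters=100):
    T, E = logits.shape
    f = np.zeros(T)
    g = np.zeros(E)
    plan = np.full((T, E), 1.0 / (T * E))
    for _ in range(max_iters):
        f = -(logsumexp(logits + g[None, :], axis=1) - np.log(E))   # Eq. (20), token potentials
        g = -(logsumexp(logits + f[:, None], axis=0) - np.log(T))   # Eq. (20), expert potentials
        plan = np.exp(logits + f[:, None] + g[None, :]) / (T * E)   # Eq. (21)
        err = (np.abs(plan.sum(axis=0) - 1.0 / E).sum()
               + np.abs(plan.sum(axis=1) - 1.0 / T).sum())          # Eq. (22), primal violation
        if err <= etol:
            break
    return plan

# Greedy projection: each token goes to its highest-probability column of the plan.
def assign_experts(logits, etol=1e-2):
    return sinkhorn_plan(logits, etol).argmax(axis=1)
```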
B.2.2. Shuffling Tokens

Similar to Lewis et al. (2021), we shuffle router inputs across workers by first computing a random permutation of the inputs and sending the t-th row of the batch to the ⌊tE/T⌋-th worker. We found that this shuffling stage was necessary to prevent training from becoming unstable at larger scales. Our hypothesis is that the re-assignment provides a subtle side channel through which information can be propagated backwards in time, and this can be abused by larger models, resulting in the validation loss diverging during training. Adding a shuffling stage ameliorates this issue by introducing a large number of irrelevant elements to the rebalancing process, making it harder to infer the behavior of future inputs. Further work is needed to confirm this theory, but the introduction of the shuffling step does eliminate this performance degradation.

Figure 9. The RLR-B method consistently outperforms RLR-G and RLR-S across scales. We found that Nucleus Sampling gives a significant improvement over Greedy Reinforce. However, performance is slightly improved by adding a learned baseline.

B.3. Routing with Reinforcement Learning (RL-R)

We will first describe a naive REINFORCE (Williams, 1992) implementation of routing, then describe possible extensions and improvements which lead to the form used in the main text as RL-R. Our implementation of REINFORCE uses the balancing loss in Equation (14) and a policy gradient loss:

$$L = \frac{1}{N} \sum_{i=1}^{N} \log \pi_i \cdot R_i, \qquad (23)$$

where $R_i$ is the reward for each sequence in the batch of size N and π is the normalized expert preferences output by a linear transformation as in SMOE. The proper thing is for ρ, the selected experts, to be samples from the distribution π, but we found that this substantially degraded performance at larger scales. This phenomenon can be attributed to unwanted interference, where exploratory steps for ρ which turn out to be unnecessary lead to bad gradient updates to the rest of the network (Rosenbaum et al., 2019). We therefore consider a greedy selection method, where router outputs are selected as ρ(x) = TopK(softmax(Wp x + bp)).

While sampling (even when tuning softmax temperature) decreased the performance of the model, we would nevertheless like to regain some of its exploratory power. To ameliorate this, we can use Nucleus Sampling (Holtzman et al., 2019), which samples from the top-p set of experts E^(p):

$$P'(e) = \begin{cases} P(e)/p' & \text{if } e \in E^{(p)}, \\ 0 & \text{otherwise,} \end{cases} \qquad (24)$$

where E^(p) is the smallest set of experts such that

$$\sum_{e \in E^{(p)}} P(e) \geqslant p. \qquad (25)$$

This eliminates the possibility of selecting experts with very low likelihood, while still introducing some randomness. It is important to emphasize that this introduces a distributional shift to the samples, which can be corrected with off-policy correction methods such as Importance Sampling.

An alternative improvement is to learn an additional baseline function for each router. This method has an additional entropy regularization loss and computes advantages $A_i = R_i - b_i$ for the learned baseline $b_i$:

$$L = \frac{1}{N} \sum_{i=1}^{N} \log p_i \cdot A_i \;-\; \frac{1}{N} \sum_{i=1}^{N} \log p_i \cdot p_i \;+\; \frac{1}{N} \sum_{i=1}^{N} v_i, \qquad (26)$$

where we use the Huber loss to calculate the value loss $v_i$:

$$v_i = \begin{cases} \frac{1}{2}(R_i - b_i)^2 & \text{if } |R_i - b_i| \leqslant \delta, \\ \delta\big(|R_i - b_i| - \frac{1}{2}\delta\big) & \text{otherwise.} \end{cases} \qquad (27)$$

We enumerate three RL-R variants below:

• Greedy REINFORCE (RLR-G). REINFORCE selecting the top-k experts and no additional auxiliary losses.

• Nucleus-sampled REINFORCE (RLR-S). REINFORCE using nucleus sampling to eliminate less reliable expert selections and reduce noise in the policy gradient update. In this method we sample from the top-p truncated distribution. Nucleus sampling at a fixed top-p scales well with increasing the number of experts.

• REINFORCE with baseline (RLR-B). Our RL method which stabilizes training with a learned baseline and a policy entropy regularization loss. We learn a baseline with a value function that has a single hidden layer of size dmodel/8.

Table 5 details the hyperparameters chosen for each RL-R variant and Fig. 9 contains validation losses across a number of models. Note that the entropy loss is negative to encourage a more concentrated policy, and the weight must be tuned jointly with the load balancing loss to keep routing balanced. This is in line with Bengio et al. (2016), who also use two loss terms to both encourage early specialization and expert diversity. Additionally, since the policy entropy loss has a similar effect to nucleus sampling, we did not see an improvement from including both regularization methods. RLR-B consistently performed the best, especially with regards to scalability in E and N. For that reason we selected it as our prime example, and refer to it as RL-R elsewhere.

Table 5. Selected hyperparameters for RL-R variants.
Hyperparameter            RLR-G    RLR-S    RLR-B
Policy entropy weight      0.       0.      -5e-4
Load balancing weight      1.       1.       1.
Policy gradient weight     1e-1     1e-1     1e-2
Nucleus top-p              -        0.9      1.
Value weight               -        -        1e-2
Value hidden layers        -        -        1
Value loss type            -        -        Huber
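Two pieces of the RL-R machinery above lend themselves to a short sketch: nucleus (top-p) sampling over the expert distribution (Eqs. 24–25) and the Huber value loss of Eq. (27). Names and shapes below are illustrative and not the paper's implementation.

```python
# Sketch of nucleus sampling over experts and the Huber value loss used by RLR-B.
import numpy as np

def nucleus_sample_expert(probs, p=0.9, rng=np.random.default_rng()):
    """probs: 1-D array of expert probabilities for a single routing decision."""
    order = np.argsort(probs)[::-1]                  # experts by decreasing probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1      # smallest set with mass >= p (Eq. 25)
    keep = order[:cutoff]
    renorm = probs[keep] / probs[keep].sum()         # Eq. (24): renormalize within E^(p)
    return rng.choice(keep, p=renorm)

def huber_value_loss(reward, baseline, delta=1.0):   # Eq. (27)
    err = np.abs(reward - baseline)
    return np.where(err <= delta,
                    0.5 * (reward - baseline) ** 2,
                    delta * (err - 0.5 * delta))
```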
B.4. Hash layers (HASH)

HASH is simple compared to RL-R or S-BASE, but is highly reliant on the particular choice of hashing function. Many functions rely on knowing the integer ID which the tokenizer assigns to each unique token (characters, bytes, subwords, etc.). Roller et al. (2021) describe multiple alternative functions, including pre-computing expert assignments for each token using a greedy assignment based on the frequency counts of the token on the training set. They do not observe any improvement in terms of perplexity relative to simpler random assignments of token to expert, but argue that balanced hashing has better properties for distributed training.

Our implementation uses a simple modular hashing function, namely the token index modulo the number of experts. Tokens are indexed by our tokenizer in an order that is roughly ordered by their underlying frequencies in the training dataset, which means this strategy will be more balanced than an arbitrarily random assignment, while simpler to implement than fully balanced hashing. We note that poor balancing with increasing expert count is to some extent inevitable for any routing technique that defines one-to-one mappings between tokens and experts, assuming a bounded Expert Capacity (see Section C.2), as it becomes progressively harder to assign high-frequency tokens into a bigger number of smaller buckets due to the tokens' heavy-tailed distribution. This can be seen in Fig. 10.

C. Distributed Routing Details

Here we describe the key aspects of Routing relevant to training on large clusters. We note there are several libraries available for supporting large-scale Routing, including DeepSpeed (Kim et al., 2021; Rajbhandari et al., 2022) and GSPMD (Xu et al., 2021). Unfortunately these were incompatible with our preexisting infrastructure.
Figure 10. HASH becomes less balanced as E increases. Here we compare three hash routing strategies using the token frequency in our validation set. The lines represent the amount of tokens sent to each expert, ordered from most subscribed to least subscribed. The dotted line represents the point where tokens are likely to overflow under our bounded Expert Capacity setup (C = 2). greedy implements Balanced assignment as described in Roller et al. (2021), where the per-token frequency tables are pre-computed and tokens are assigned to the most empty expert ordered by frequency; random assigns each token to a random expert; and modulo uses the technique described in this paper. Note that (a) the token distribution is different from the one used by the tokenizer and (b) this simulation is based on marginal token frequencies, not batches of sequences. The greedy strategy does improve the workload for the mid range (E = 64), but not significantly for low (E = 8) or high (E = 512) numbers of experts. modulo provides a modest improvement over random.

C.1. Expert Parallelism

We briefly review parallelism techniques, building up to Expert Parallelism, a technique for efficiently distributing parameters over an accelerator cluster. For a more in-depth exposition we recommend Lewis et al. (2021), Lepikhin et al. (2020) or Rajbhandari et al. (2022).

In a fully data-parallel world, every device has an identical copy of all parameters Θ and a different input batch X. Each device executes a forward and backward pass on X and (usually) does a synchronous all-reduce across all devices on the gradients to Θ. This is effective, but requires one copy of Θ for each device, wasteful when |Θ| is large. The general class of techniques known as Model Parallelism reduces this duplication by having any individual device store only a subset of the entire model parameters. This reduction in memory comes with a cost: no longer can a single device take an input and produce the model's output; that device no longer contains all of Θ. Most techniques therefore require some additional synchronization or data exchange.

Sharding Parallelism (Shoeybi et al., 2019) takes advantage of a mathematical property present both in 2-layer MLPs and a Transformer's attention blocks: namely, that the output can be represented as the sum of N components, where each component applies the same functional form with independent weights on the same input. Shoeybi et al. (2019) contains more details, but a simplified example can be given for a matrix multiplication, where we observe the effect of splitting a matrix into columnwise sub-matrices: $Wx = [W_1, \ldots, W_N]\,x = \sum_{i=1}^{N} W_i x_i$, with $x_i$ the matching slice of $x$. The effect of applying this technique such that each device has a separate sub-column is to prevent the duplication of the weight matrices (which constitute the vast majority of Θ). The disadvantage is that all devices must see the same input, meaning the total throughput of data on the cluster has been reduced N-fold. In addition, the sum described above is actually now a sum across devices, which introduces additional communication overhead.
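A toy numpy illustration of this decomposition is given below, under the simplifying assumption of a ReLU two-layer MLP: each simulated "device" sees the same input, applies its own slice of the weights, and the partial outputs are summed, mirroring the cross-device sum described above.

```python
# Toy demonstration that a sharded 2-layer MLP equals the sum of per-shard outputs.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ffw, n_shards = 8, 32, 4
x = rng.normal(size=(d_model,))
W1 = rng.normal(size=(d_ffw, d_model))   # first FFW matrix, sharded by output rows
W2 = rng.normal(size=(d_model, d_ffw))   # second FFW matrix, sharded by input columns
relu = lambda h: np.maximum(h, 0.0)

full = W2 @ relu(W1 @ x)                 # unsharded computation
shards = [
    W2[:, s] @ relu(W1[s, :] @ x)        # each "device": same input x, its own weight slices
    for s in np.split(np.arange(d_ffw), n_shards)
]
assert np.allclose(sum(shards), full)    # partial outputs summed across devices (all-reduce)
```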
Expert Parallelism takes further advantage of the structure of a routed layer to similarly reduce the necessity of parameter duplication while avoiding the need to duplicate data between devices. In particular, rather than duplicating experts across all devices, each device contains only a subset of the experts which are not replicated anywhere else. Different devices still see different inputs. The key motivation is that a given input x never needs to interact with the parameters corresponding to experts which the router did not send x to. Therefore, a single input x need only be present on a single device (the one which contains the experts which the router selected for x) to produce the correct output. In order to produce an output, the router selects an expert for all inputs and an additional data-exchange is introduced which sends all inputs to the device which contains the requested experts. Each device then processes the inputs it was sent, then returns all inputs to their original devices. Crucially, a roughly uniform router distribution leads to an evenly balanced computation across devices. This allows routed layers to be stored across a cluster with no duplicated data and without a reduction in data throughput. The downside is that this data exchange required across devices is generally more costly than the cross-device sum required by sharding. More details are given in Lewis et al. (2021).

Previous work (Fedus et al., 2021) suggests using one expert per device. We believe this to be an implementation detail dependent on many aspects of the infrastructure in use. For us, typically using 4 or 8 local experts per device gave good performance. All of Data, Sharding and Expert parallelism can be applied simultaneously. We use all three methods at will, selecting the combination which works fastest for a given cluster structure and model size. There are still more variations of model parallelism, notably Pipeline Parallelism (Narayanan et al., 2019; Huang et al., 2019), which we do not use.

C.2. Load Balancing

This at-will changing of parallelism techniques is dependent on the parallelism not affecting the output of the model. This is generally true, but expert parallelism brings in one complicating factor: load balancing. In the description above, we emphasized that a roughly-uniform router (averaged over a minibatch) will send the same number of inputs to each device (we will call the expected value BSavg). However, in the worst case all inputs on all devices might select the same expert, and therefore need to be sent to a single device. If memory is pre-allocated to accommodate this worst case, then each device must have enough free memory to potentially store the entire global batch size: prohibitive for large clusters.

The most common solution is to specify a capacity factor C, and only allocate space for BSavg × C tokens. When an expert is oversubscribed, tokens are dropped at random until no experts are exceeding capacity. Having C > 1 is useful during training to prevent unnecessarily large numbers of tokens from being dropped. We set C = 2 for all experiments (though during evaluation we always allow all tokens to be routed to the desired expert). This strategy works well for the Transformer architecture due to its residual connections – dropping a token means skipping that transformer block. As long as the amount of dropped tokens is kept at a reasonable bound, it does not impact learning.
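The capacity logic described above can be sketched as follows; the function and its arguments are illustrative, and the real system applies this per device rather than globally.

```python
# Sketch of capacity-factor token dropping (C = 2 in the experiments above).
import numpy as np

def apply_capacity(expert_choice, num_experts, capacity_factor=2.0,
                   rng=np.random.default_rng()):
    """expert_choice: array of shape [num_tokens] with the expert chosen per token.
    Returns a boolean mask of tokens that are kept (dropped tokens skip the routed block)."""
    num_tokens = expert_choice.shape[0]
    capacity = int(capacity_factor * num_tokens / num_experts)   # C x BS_avg slots per expert
    keep = np.ones(num_tokens, dtype=bool)
    for e in range(num_experts):
        tokens = np.flatnonzero(expert_choice == e)
        if len(tokens) > capacity:
            dropped = rng.choice(tokens, size=len(tokens) - capacity, replace=False)
            keep[dropped] = False
    return keep
```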
An optimization we support is allowing an oversubscribed expert to use the memory allocated by an undersubscribed expert on the same device. This reduces the average number of tokens which are skipped, but does so at the minor cost of introducing an interaction between tokens being skipped and the specific co-habitation of experts on devices. In practice we do not find this to have a large effect.

We note that the rebalancing used in S-BASE substantially ameliorates the load balancing problem by attempting to force all experts to be assigned the same number of tokens. However, because we use the approximate Sinkhorn algorithm, not a hard matching algorithm, over-subscription still happens (though at a much reduced rate) and so these steps are still taken.

D. Architectural Variations

Throughout this work we have focused on a narrow subset of possible Routing Network architectures, which we believe are representative of recent work on large scale Routing Networks (Roller et al., 2021; Fedus et al., 2021; Lewis et al., 2021; Shazeer et al., 2017; Artetxe et al., 2021; Lepikhin et al., 2020). However, we also experimented with many variations of these architectures, some of which we highlight now in more depth.

D.1. Robustness to hyper-parameter changes

We evaluated the robustness of S-BASE and RL-R to changes in hyperparameters in Fig. 11. We focus on E = 512 due to anecdotal experience that the largest performance variance occurred at this scale. RL-R is found to be highly sensitive to the hyperparameters in Table 5, especially the choice of balancing weight. In addition, changes to the policy entropy weight can lead to unbalanced routers when the balancing weight is not tuned jointly. Unlike Switch, which has been shown to be sensitive to the choice of balancing loss (Roller et al., 2021), S-BASE is robust to changes in balancing weight for values of 1e-3 to 1. S-BASE also has competitive performance without a balancing loss, but training is less stable. Additionally, Switch has higher expert oversubscription rates even when tuning the balancing weight.

Figure 11. Hyperparameter sensitivity at 512E 55M. For RL-R, hyperparameter selection has the largest impact on model performance of the three methods. The top performing RL-R models outperform HASH and are comparable with S-BASE. However, non-optimal RL-R configurations perform worse than the other two methods.

D.2. Varying Routing Frequencies

All of our models thus far have been routed every other layer with experts which are single FFWs (Lepikhin et al., 2020; Fedus et al., 2021). However, Lewis et al. (2021); Roller et al. (2021) explored stacking FFWs in the experts and placing a smaller number $N_L$ of routed layers evenly across the network's depth. We consider the performance impact of alternative routing frequencies, varying the frequency $R = N_L / L$ with routed layers placed evenly across depth. We compare routing every layer to routing at frequencies R ∈ {1/2, 1/4, 1/L}.

Figure 12. The model performance improves with increasing routing frequency for S-BASE, while HASH flattens at higher frequencies, for 8E (left), 64E (middle) and 256E (right).
For routing a single layer we chose the second-to-last layer (Roller et al., 2021), but consider routing at L/2 in subsection D.4. S-BASE scales well with routing frequency, but HASH degrades in performance, as shown in Fig. 12. With a single routed layer, HASH has the lowest validation loss across model sizes.

D.3. Varying the Routing Policy

Motivated by the improved scaling results for S-BASE, we investigate whether learning a routing policy becomes more beneficial as the frequency of routers increases.

Shared routing decisions. In Fig. 13, the routing decisions are made at the first routed layer and shared across layers, which keeps the number of routers constant as R increases. As HASH selects experts based on the token index at the input layer, its routing function is unchanged for this variant. S-BASE and HASH have similar losses for shared routing decisions, whereas S-BASE improves when learning to route at each expert layer.

Permuting the hash function. Conversely, we tested a variant of HASH where the hash function at each router uses a static permutation of the input tokens to select the experts. This allows tokens to be routed to the same expert at some layers without having the same hash. We found that performance was unchanged for this variant, suggesting that increasing the number of possible routing paths does not necessarily improve performance for static policies.

These router variants suggest that methods which can adapt to each expert layer will outperform static policies. Further work is needed in analyzing how policies can more effectively learn to route across layers.

Figure 13. Sharing expert selections across layers has a large effect on performance for S-BASE (in grey) at 25M. S-BASE scales similarly to HASH in the single-router case.

Figure 14. (a) S-BASE and HASH scale similarly when routing a single layer at L − 1. (b) We see similar performance for S-BASE and HASH at 32E 1.3B when routing at L/2 with three FFWs per expert. However, S-BASE performance is improved by interleaving three routed layers.

D.4. Routing a Single Layer

We analyzed the scaling behavior of HASH and S-BASE when routing only a single layer. We observed that the routing gains for R = 1/L deviated from those at higher frequencies, which also impacted R = 1/4 to a lesser degree. We attribute this performance regression to the suboptimal behavior of the first routed layer: in both cases the total number of routers is low, and the first layer has a larger impact on overall performance than at higher routing frequencies. For R = 1/L the complexity of routing is reduced and a simpler routing method can reach competitive performance. HASH and S-BASE have similar performance across expert counts in this case, as shown in Fig. 14. We also compared routing a single layer at L/2 with three FFWs per expert to three evenly spaced routed layers in Fig. 14. Similar to the results shown in Roller et al. (2021), three evenly spaced routed layers have slightly better performance than three stacked FFWs for a 32E 1.3B model. We also found that S-BASE benefits more from interleaving the routed and dense layers, which is consistent with our routing frequency results.
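To make the placement conventions of D.2 and D.4 concrete, here is a minimal sketch of how routed layers could be positioned for a given frequency R; the helper name, the 0-based indexing and the exact spacing rule are our own assumptions rather than the configuration used in these experiments.

```python
def routed_layer_indices(num_layers: int, frequency: float) -> list[int]:
    """Indices (0-based) of the routed FFW layers for routing frequency
    R = N_L / L. Routed layers are spaced evenly through the stack; for the
    single-layer case (R = 1/L) we follow the convention above and route the
    second-to-last layer."""
    num_routed = max(1, round(frequency * num_layers))
    if num_routed == 1:
        return [num_layers - 2]          # second-to-last layer
    stride = num_layers // num_routed
    return [i * stride + stride - 1 for i in range(num_routed)]

print(routed_layer_indices(24, 0.5))     # R = 1/2: every other layer
print(routed_layer_indices(24, 0.25))    # R = 1/4
print(routed_layer_indices(24, 1 / 24))  # a single routed layer
```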
D.5. Varying the Number of Experts per Datapoint

In this work we have focused on routing each datapoint to a single expert at all routing layers, i.e. the case K = 1. However, SMoE models have historically routed datapoints to more than one expert (Shazeer et al., 2017; Lepikhin et al., 2020; Ramachandran & Le, 2018). Increasing K incurs extra computation in the experts, but this additional computation may help the end result, translating into a better loss. Moreover, routing a datapoint through more experts means each expert sees more data on each forward pass, which may speed up training. For these reasons, it is not obvious that K = 1 is the best setup. Section 4.4 investigated this and argued both that the generalized formula Equation (2) can accommodate such cases, and that the resulting fits show no substantial difference in performance across K. However, we explore this variation further in Fig. 15, plotting scaling curves for varying values of K as well as the loss in terms of F. Higher values of K invariably yield better performance per step, but they are not necessarily more flop efficient. In fact, K = 1 is always on the Pareto front, and we verified that this holds for varying numbers of experts. Note that this difference in flop efficiency is not only theoretical: it is accompanied by increased communication costs when using expert parallelism. In practice, we observed that halving K gave close to a 2x speedup in both inference and training.

Figure 15. Example of scaling curves for a dense model and an S-BASE (64E) model, as a function of parameter count and of total FLOPs. When looking at performance per parameter, higher values of K are always better, but lower values of K are generally more flop efficient and achieve a better loss for a given FLOP budget.

E. Effects of Scaling Strategy on Zero-shot Transfer

There is a strong relationship between the validation loss we have been discussing and the downstream performance of models on specific tasks (Kaplan et al., 2020). However, recent work has shown that this relationship is not as straightforward for large Routing Networks, and individual tasks can benefit more or less from expert scaling. For example, Artetxe et al. (2021) show a narrowing performance gap between an SMOE Routing Network with E = 512 and its dense equivalent, with more marked improvement from routing in some tasks like HellaSwag and PIQA than in tasks like Winogrande and ReCoRD. Likewise, Fedus et al. (2021) show that Switch benefits more from scale in TriviaQA than in SuperGLUE. A detailed analysis of the scaling properties of Routing Networks and how they transfer to downstream tasks merits dedicated work. Here we start the conversation by looking at zero-shot transfer on a set of well-known downstream tasks: LAMBADA (Paperno et al., 2016), The Pile (Gao et al., 2020), Curation Corpus (Curation, 2020), WikiText-103 (Merity et al., 2016) and C4 (Raffel et al., 2020). We estimate the scaling coefficients individually for each task and routing technique. For simplicity of interpretation we ignore the bounded scaling term and focus on the bilinear fit of Eq. (7). The coefficients are reported in Table 9. We expect that scaling in both N and E will improve downstream performance. The key question revolves around understanding changes in the relative magnitudes of a, b and c as we move from task to task.
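To make this per-task fitting step concrete, here is a minimal sketch that estimates (a, b, c, d) for one task by ordinary least squares over a sweep of models; it deliberately simplifies the procedure of §4 (no Ê saturation term, no repeated L-BFGS-B restarts), assumes base-10 logarithms, and all names and the toy data are our own.

```python
import numpy as np

def fit_bilinear_law(losses, dense_sizes, expert_counts):
    """Least-squares fit of the bilinear law (our reading of Eq. (7)):
        log10 L = a*log10 N + b*log10 E + c*(log10 N * log10 E) + d.
    All arguments are 1-D arrays with one entry per trained model."""
    logN = np.log10(dense_sizes)
    logE = np.log10(expert_counts)
    X = np.stack([logN, logE, logN * logE, np.ones_like(logN)], axis=1)
    coeffs, *_ = np.linalg.lstsq(X, np.log10(losses), rcond=None)
    a, b, c, d = coeffs
    return a, b, c, d

# Toy example with made-up losses for a handful of (N, E) configurations.
N = np.array([15e6, 15e6, 130e6, 130e6, 1.3e9, 1.3e9])
E = np.array([1, 128, 1, 128, 1, 128])
L = np.array([3.30, 2.90, 2.85, 2.60, 2.40, 2.30])
print(fit_bilinear_law(L, N, E))
```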
Viewing Table 9 it is immediately clear that the individual scaling coefficients vary greatly across tasks, i.e. different tasks see different relative gains in zero-shot performance as we move to larger scales. This is shown more clearly in Fig. 16, where all coefficients are displayed in a single plot. The variation across tasks is not the same for a and b: for example, WikiText-103 has higher values of b and lower values of a when compared to the validation set. This means that even though every task sees monotonic improvement in performance from scaling, whether by adding more experts or by increasing the base model size, some tasks benefit more and some less depending on which method is used.

Figure 16. Individual scaling coefficients, (a) size scaling a and (b) expert scaling b, for different tasks and routing techniques, compared to the coefficients estimated on the validation set. Different techniques also scale differently depending on the task, but this also depends on the interaction term (see Fig. 17).

For a more complete picture, we can account for the N and E interaction coefficient c by incorporating it into one of the scaling coefficients (holding the other quantity fixed), which leads to a(E) and b(N) (see Section 4.2). This is shown in Fig. 17 for varying values of N and E. We see that S-BASE tends to dominate, with lower coefficients at higher values of E and N (due to its smaller interaction term relative to its scaling terms), but this varies across tasks. For example, RL-R shows better b(N) than S-BASE for most values of N on LAMBADA, until it is overtaken by S-BASE at N = 410M, whereas S-BASE is always superior on C4. Moreover, the ordering of HASH and RL-R is not consistent across tasks, even though their curves often do not cross. All of this means it is difficult to establish the superiority of a routing technique without looking at a variety of tasks and scales.

We often want to compare a Routing Network against a dense baseline with the same performance on the validation set, and see how this comparison changes on downstream tasks. We can use these coefficients in a simplified version of the Effective Parameter Count (EPC, Equation 11), by assuming E_start = 1 and E_max = ∞, such that Ê = E. First, we note that since the coefficients vary greatly across tasks, each task will have a different EPC for the same network configuration. Moreover, the effect of scaling by varying E and N will also vary across tasks. Say we have a routing network of size N with E experts and we want to increase its base model size by a factor of x while keeping the same number of experts. The effect on N̄ is a multiplication by x^{a(E)/a(1)} = x^{1 + (c/a) log E}. Since c/a varies per task, the improvement achieved by increasing the base model size is also task dependent.
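As a small numerical illustration of this size-scaling factor, the sketch below evaluates x^{1 + (c/a) log E} for two hypothetical tasks; the coefficient values are not taken from our fits, only chosen to be of the same order as those in Table 9, and base-10 logarithms are our assumption.

```python
import math

def size_scaling_factor(x: float, E: int, a: float, c: float) -> float:
    """Multiplier applied to the effective parameter count when the base model
    size grows by a factor of x at a fixed expert count E:
        x ** (1 + (c / a) * log10(E)).
    The coefficients passed in below are hypothetical, merely of the same
    order as the per-task values in Table 9."""
    return x ** (1.0 + (c / a) * math.log10(E))

# Doubling the base model of a 32-expert network under two hypothetical tasks:
print(size_scaling_factor(2.0, E=32, a=-0.08, c=0.009))  # ~1.78x effective size
print(size_scaling_factor(2.0, E=32, a=-0.20, c=0.015))  # ~1.85x effective size
```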
For example, the validation-set EPC for N = 110M, E = 32 is 370M, but the LAMBADA EPC for the same model is 284M, while The Pile EPC is 535M. The key implication is that not only do the values change, but their slopes differ as well. This means that downstream tasks must be analyzed carefully: a practitioner could scale a model via routing expecting some overarching improvement, but see a much diminished (or enhanced!) improvement on specific downstream tasks, depending on their specific values of b and c.

F. On Convergence, or Lack Thereof

Here we digress on two important details, both focusing on token count. First we argue that discussing the converged performance of large transformers on modern, massive text datasets is probably a misnomer; scaling analyses should focus on optimal performance at a fixed number of tokens. Second, we provide evidence arguing against a proposed equation in Kaplan et al. (2020) (their Eq. (1.6)).

F.1. Convergence on Large Datasets

There are two cases where the converged performance of a model can be clearly defined. The first is when continued training of the model produces no further improvement (even when analyzed at logarithmic scale in the number of tokens); the second is when continued training leads to reduced validation performance: overfitting. Our models exhibit neither behavior. No overfitting is seen even for our largest models, likely due to the complexity and size of the dataset used. Furthermore, despite being trained for 130 billion tokens, not even our smallest models have saturated. We push this envelope even further, training two additional sets of 15M models with 1, 64, 128 and 512 experts. The first set is trained for just 75,000 steps, and the second for 1,000,000 steps: four times more data (half a trillion tokens).

Figure 17. Estimated scaling coefficients for zero-shot performance across different datasets. Top half: the coefficient for increasing E while keeping N fixed, for varying values of N, at different downstream tasks. Middle: the coefficient for increasing N, for varying values of E.
Figure 18. Left: validation performance over time for 15M models trained with different expert counts (1, 64, 128 and 512) and over three different lengths of training (75,000, 250,000 and 1,000,000 steps). Right: the coefficient b from fitting Eq. (4), representing scaling from E, across intermediate token counts.

We highlight that this involves corresponding changes to the cosine learning rate decay cycle. We exclusively train HASH models, both due to limits on the number of extra models we were able to train and because HASH has the largest value of E_max. Results from these models are plotted in Fig. 18 (left). The 15M model with no routing, the smallest model we train as part of this work, is still far from having saturated its performance: training for 4x longer further reduces the validation loss by 0.05. This pattern continues, and is exacerbated, when increasing the expert count: the same model with E = 512 gets a 0.07 reduction in loss from 4x more tokens.

It is clear, then, that the very smallest models considered have yet to converge. The same is certainly true for larger ones, and probably more so. If 500 billion tokens is a lower bound on the convergence point of the 15M model, the analysis in Kaplan et al. (2020) would predict needing trillions of tokens to converge 1.3B: much more than was used to train some of the largest language models yet created (Brown et al., 2020). For large, complex text datasets of the scale used to train large language models, convergence is not a proper criterion.

Figure 19. (a) Values of a found for dense models across training. (b) RMSE for these same fits. (c) Three attempts to fit Eq. (28): in black the standard fit; in orange and grey, fits using only and ignoring the final 150,000 steps respectively.

F.2. Performance Qualified on Token Count

Rather than claiming analysis at an unobserved point of convergence, we emphasize that the scaling behavior we have described in this work is valid only as a function of a particular number of steps (or tokens). At each point, we can define instantaneous values of the scaling coefficients, with the values from all models taken at S steps. (This sidesteps the issue of critical batch size (McCandlish et al., 2018; Kaplan et al., 2020), consideration of which requires a substantially larger sweep of models; future work estimating the critical batch size will likely lead to better model fits.)

In fact, the situation is more complicated than simply conditioning our scaling coefficients on token count. We can see this by plotting b, the scaling coefficient for changes in expert count, in Fig. 18 (right). An immediate observation is that the values of b are non-constant, supporting the need to qualify scaling on token count. A second, more substantial point is that these values are not uniquely defined by token count: for a given number of tokens, the scaling behavior of the three sets of models is completely different, depending on how far into the learning rate schedule those sets of models were. We note that this behavior is suggested by experiments in Kaplan et al. (2020) (App. D.6). Attempting to find the full set of variables on which these scaling terms depend is beyond the scope of this work; we highlight only the importance of ensuring that all possible variables are matched when comparing values used to calculate scaling coefficients.

F.3. Performance Modeled as L(N, S)

We conclude by highlighting one implication of the fact that scaling coefficients are dependent on token count.
We analyze only the dense models trained as part of this work, and calculate the values of a in Equation (3) for all dense models in the primary sweep at every step count; these are plotted in Fig. 19(a), with the corresponding RMSE values in Fig. 19(b). First, it is important to emphasize that the fits remain good throughout S (after an initial period of transience): though the slope differs, the validation losses at a given intermediate S follow a power law about as well as they do later in training (if anything, more so). Second, the estimated coefficients a are clearly monotonically increasing with S.

Kaplan et al. (2020) propose (their Eq. 1.6) a unified prediction of the loss achieved by a model of size N trained for S steps:

L(N, S) = (N_c / N)^{α_N} + (S_c / S)^{α_S}    (28)

This comes with the subtlety that S must be defined as the number of steps when training at the averaged critical batch size, whereas our models are trained with a fixed batch size B. A proper analysis must therefore use S_min = S (1 + B_c L^{−α_B} / B)^{−1} for constants B_c and α_B. It is important to highlight, however, that S_min, as described in Kaplan et al. (2020), should be independent of N. This implies that ∂L(N, S)/∂N is independent of S, or, in log-log space:

∂ log L(N*, S) / ∂N* = −α_N,  where N* = log_10 N    (29)

This prediction of a constant scaling exponent is in direct opposition to the increasing values of a seen in Fig. 19(a). We can furthermore check that this functional form cannot obviously be fit to our learning curves, with examples shown in Fig. 19(c). There are subtle differences between training setups, and we do not want to claim our experiments wholly disprove the conjecture in Kaplan et al. (2020). However, the results in Fig. 19 motivate us to assume that Eq. (28) cannot be used to model our specific training curves. A consequence is that we can also no longer conclude Equation B.5 of Kaplan et al. (2020), namely that:

L(N_eff(C), C) = (1 + α_N / α_S) L(N_eff, ∞)    (30)

With this equation, we might have been able to lower-bound true converged performance (which we have not seen in our models) by inference from compute-efficient performance, which has been achieved by the majority of our models.

Figure 20. RL-R performance for 64E continues to scale well compared to dense up to a 7B base model size.

G. Large Scale Routing Behavior, Coefficient Sensitivity, and Future Work

Our analysis predicts that larger values of E will continue to improve performance, especially for small models, though at a diminishing rate. §5.2 also predicts that routing will continue to help as N increases, for at least one, if not two, orders of magnitude larger base model size. Practical compute limitations prevented our sweep from exploring these regimes, and there are interesting unanswered questions in the limit of these two variables. In particular, exact predictions of N_cutoff are highly dependent on the precise value of b, where an error in the second decimal place shifts predicted values by orders of magnitude (not surprising, as b is the slope of a line in log-log space).
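To illustrate this sensitivity, the sketch below uses the simplifying assumption that routing stops helping roughly where b(N) = b + c·log10(N) crosses zero; this is only a rough stand-in for the precise definition of N_cutoff in §5.2, and the value of c is merely of the same order as those in Table 7.

```python
def log10_shift_in_cutoff(delta_b: float, c: float) -> float:
    """Orders of magnitude by which the predicted N_cutoff moves when b changes
    by delta_b, under the simplifying assumption that routing stops helping
    where b(N) = b + c * log10(N) crosses zero (so log10 N_cutoff = -b / c)."""
    return delta_b / c

# With c of the order fitted in Table 7, a 0.01 error in b moves the predicted
# cutoff by roughly an order of magnitude.
print(log10_shift_in_cutoff(delta_b=0.01, c=0.009))  # ~1.1 decades
```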
We believe exploring the limit behavior of N and E, and especially arriving at a more precise value of b, is crucial. Anecdotally, we can report the results of one experiment: a large RL-R model with N > 7B, providing a rough upper bound on the error in b for RL-R. In particular, we trained a model with d_model = 4096, n_layers = 32, n_heads = 32, E = 64 and a key/value size of 128. There are some important eccentricities of this model which affect its match to the fits described in this work: it was trained with a batch size of 1024 for 100k steps, with a policy gradient weight of 1e-1 and a balancing weight of 1e-1. Other training details are consistent with Section 2.1. The performance of this model, relative to a dense model of the same size and to a number of smaller models, is plotted in Fig. 20, evaluated at 100B tokens. The changes described above prevent the analysis in this work from accurately predicting this model's performance, but one key feature remains: the routed 7B model substantially outperforms the dense baseline. This is of particular interest since just a 0.01 decrease in b would predict an N_cutoff at 9B, meaning we would already be close to the regime where routing ceases to help. Nevertheless, at this size routing is clearly still a major improvement, and our estimate of b is unlikely to be a substantial overshoot. While the differences between this model and those analyzed in the paper make concrete extrapolation impossible, it shows that routing techniques still provide competitive improvements at almost an order of magnitude larger value of N than analyzed, and it is unlikely that the scaling coefficients measured in this work substantially overestimate the scalability of routing. We encourage future work probing the limits of Routing Networks, both in N and E, to better understand their properties and provide more accurate predictions of their scaling coefficients.

H. Extra Plots and Tables

This section contains some helpful visualizations and data which are not included in the main text.

Table 6. Values of b(N) with hold-out RMSEs in parentheses.

        S-BASE          RL-R            HASH
15M     -0.035 (0.035)  -0.033 (0.016)  -0.031 (0.039)
25M     -0.031 (0.019)  -0.031 (0.013)  -0.029 (0.029)
130M    -0.029 (0.017)  -0.027 (0.013)  -0.025 (0.023)
370M    -0.024 (0.014)  -0.022 (0.014)  -0.021 (0.016)
1.3B    -0.019 (0.012)  -0.016 (0.009)  -0.015 (0.011)

Table 7. Fits to Equation (7).

          a      b      c      d
S-BASE    0.079  0.088  0.007  1.072
RL-R      0.080  0.105  0.010  1.076
HASH      0.081  0.097  0.009  1.086

Table 8. Values of a(E) for different values of E.

              4      8      32     64     128    256    512
S-BASE        0.077  0.073  0.070  0.066  0.064  0.058  0.060
RL-R          0.075  0.073  0.067  0.063  0.060  0.056  0.053
HashLayers    0.077  0.075  0.069  0.066  0.063  0.059  0.056

Figure 21. Affine fits for HASH with a shared slope in grey.

Figure 22. The validation loss for all S-BASE models plotted as a function of expert count (left), the total number of parameters (center) and the ratio of parameters to TeraFLOPs per inference (right).
Figure 23. The validation loss for all RL-R models plotted as a function of expert count (left), the total number of parameters (center) and the ratio of parameters to TeraFLOPs per inference (right).

Figure 24. The validation loss for all HashLayer models plotted as a function of expert count (left), the total number of parameters (center) and the ratio of parameters to TeraFLOPs per inference (right).

Figure 25. Fitting S-BASE and RL-R with Eq. (1) and Eq. (2).

Figure 26. Joint fits to Equation (2) for K ∈ {1, 2, 4}.

Figure 27. Joint fits to Equation (2) for R ∈ {0.25, 0.5, 1.0}.

Table 9. Scaling coefficients for different downstream tasks.

policy   dataset          a        b        c       d       RMSE
Dense    Validation Set   -0.078                    1.063   0.014
Dense    LAMBADA          -0.203                    1.952   0.039
Dense    The Pile         -0.102                    1.239   0.020
Dense    CC               -0.097                    1.133   0.041
Dense    WikiText-103     -0.090                    1.172   0.015
Dense    C4               -0.066                    1.009   0.014
Hash     Validation Set   -0.082   -0.102   0.009   1.102   0.022
Hash     LAMBADA          -0.213   -0.167   0.015   2.049   0.051
Hash     The Pile         -0.111   -0.161   0.014   1.325   0.023
Hash     CC               -0.101   -0.101   0.010   1.177   0.045
Hash     WikiText-103     -0.093   -0.086   0.007   1.208   0.027
Hash     C4               -0.070   -0.088   0.008   1.045   0.021
S-Base   Validation Set   -0.081   -0.092   0.008   1.086   0.025
S-Base   LAMBADA          -0.211   -0.152   0.012   2.020   0.048
S-Base   The Pile         -0.110   -0.117   0.008   1.309   0.028
S-Base   CC               -0.100   -0.101   0.010   1.154   0.050
S-Base   WikiText-103     -0.092   -0.074   0.005   1.194   0.025
S-Base   C4               -0.068   -0.081   0.007   1.031   0.024
RL-R     Validation Set   -0.081   -0.107   0.010   1.090   0.022
RL-R     LAMBADA          -0.212   -0.190   0.016   2.030   0.051
RL-R     The Pile         -0.110   -0.149   0.012   1.320   0.030
RL-R     CC               -0.100   -0.113   0.011   1.156   0.045
RL-R     WikiText-103     -0.092   -0.091   0.008   1.195   0.023
RL-R     C4               -0.069   -0.092   0.009   1.033   0.022