Outrageously Fast LLMs: Faster Inference and Fine-Tuning with MoEfication and LoRA

Stanford CS224N Custom Project
Mentor: Tony Wang

Chi Tsai∗
Department of Computer Science
Stanford University
chiyotsai@stanford.edu

Jay Martin†
Department of Computer Science
Stanford University
jaydm@stanford.edu

Abstract

In recent years, transformer large language models (LLMs) have achieved wide success in a variety of downstream tasks, in large part by dramatically scaling up parameter counts. This, however, greatly increases the computational burden of training and running models. In this project, we propose a method that balances performance, inference complexity, and training complexity through a two-stage process. First, we reduce the inference complexity by converting the feed-forward networks in a pretrained LLM into a Mixture of Experts (MoE) through a process called MoEfication. Because this reduces performance on downstream tasks, we then use Low-Rank Adaptation (LoRA) matrices to efficiently fine-tune the MoEfied model. Our results demonstrate that we can recover nearly all of the performance loss, resulting in a pipeline for reducing inference time with minimal loss in model quality.

1 Introduction

The remarkable recent successes of language models come in no small part from exploiting scale in both training data and model size (Han et al., 2021). Model performance on downstream tasks is highly correlated with the number of model parameters and the size of the dataset used for training (Kaplan et al., 2020; Aghajanyan et al., 2023). As a result, the principal driver of recent accomplishments in language modeling has been the continuous release of ever-larger models, and the path forward seen by a number of noted researchers is not merely large models but outrageously large models (Shazeer et al., 2017).

This trend, however, greatly increases the computational resources needed to train and run language models. Consequently, a great deal of effort has gone into making transformer models more efficient, utilizing techniques like model pruning and attention approximation (Tay et al., 2022; Pope et al., 2023). In particular, one approach that has seen a recent resurgence since it was first proposed in the 1990s is Mixture of Experts (MoE) (Jacobs et al., 1991), in which a model consists of distinct “experts”, only a subset of which are activated on a given input. This enables training very large models, and benefiting from the enhanced performance brought by scale, while reducing the computational burden of running them, since only the parameters most relevant to the input are activated (Masoudnia and Ebrahimpour, 2014). But a problem remains: training large MoE models can still be prohibitively expensive and resource intensive. This is particularly the case for researchers with limited GPU resources, putting state-of-the-art models out of reach for many.

∗ Responsible for implementing the training and inference frameworks and hyperparameter sweeping.
† Responsible for project scoping, initial implementation of LoRA, and benchmarking.

To address this, we develop a pipeline for converting outrageously large language models into outrageously fast ones with minimal reduction in performance. We start from the view that the feed-forward networks (FFNs) of transformers act as stores of knowledge (Dai et al., 2021). In particular, the FFNs can be viewed as mixtures-of-experts, i.e.
particular clusters of neurons hold information on particular topics or knowledge categories, and each cluster is largely independent of the others (Suau et al., 2020). This enables a process of MoEfication, proposed by Zhang et al. (2022), that splits the FFNs of a model into a mixture-of-experts. This process is imperfect, however, and results in a reduction in performance on downstream tasks. We combat this by utilizing low-rank adaptation (LoRA) matrices (Hu et al., 2021) to fine-tune the MoEfied models. We show that we can recover nearly all of the performance loss in a parameter-efficient way.

2 Related Work

2.1 Model Acceleration

Model acceleration is a lively area of research that seeks to reduce the time and/or space complexity of models (Choudhary et al., 2020). Important approaches include knowledge distillation, where knowledge is transferred from a larger, more complex model (the “teacher”) to a smaller, simpler model (the “student”) (Sanh et al., 2019; Jiao et al., 2020); model pruning, a compression technique where parameters are systematically removed from a model using strategies that minimize the loss in capability (Zhu and Gupta, 2017; Voita et al., 2019; Lin et al., 2020; Zhang et al., 2021); attention approximation, where the attention computation is approximated to reduce its computational overhead (Wang et al., 2020; Kitaev et al., 2020; Ham et al., 2020; Choromanski et al., 2020); model quantization, where floating-point precision is reduced by mapping continuous floating-point values to a finite set of discrete values (Rokh et al., 2023; Polino et al., 2018; Bai et al., 2021); and dynamic inference, where the computational graph of a model adjusts dynamically based on the input at inference time (Xia et al., 2022; Xin et al., 2020; Hou et al., 2020).

2.2 Mixture of Experts

One approach to model acceleration that this project is based around is Mixture of Experts (MoE), part of an avenue of research known as “conditional computation” that seeks to take advantage of the greater capabilities of very large models while reducing training and inference cost by only activating a subset of a model’s parameters on any given input (Bengio, 2013; Bengio et al., 2015). MoE precedes modern deep learning, having first been proposed in 1991 by Jacobs et al. (1991) to speed up supervised training and inference. It has since become an active area of research in deep learning and has shown success at reducing the computational resources required by very large models (Masoudnia and Ebrahimpour, 2014). Notable recent MoE models include the Switch Transformer (Fedus et al., 2021), BASE Layers (Lewis et al., 2021), Hash Layers (Roller et al., 2021), and GLaM (Du et al., 2022). There is ongoing research into architectural enhancements of MoE models, like improved expert routing (Zhou et al., 2022), differentiable gating (Hazimeh et al., 2021), improved token-expert allocation (Lewis et al., 2021), and hardware optimization (He et al., 2021).

2.3 Efficient Fine-Tuning

A related area of research that has become particularly active in this era of scale is efficient fine-tuning (Chen et al., 2023). A number of approaches have been proposed, e.g. adapter tuning (Houlsby et al., 2019), prefix and prompt tuning (Li and Liang, 2021; Lester et al., 2021), and BitFit (Zaken et al., 2021).
One approach that has taken on particular significance, and which is utilized by this project, is low-rank adaptation (LoRA), which reduces fine-tuning to training low-rank matrices (Hu et al., 2021). The introduction of LoRA has spawned its own subfield of research, with new variants proposed often, e.g. MoLE (Wu et al., 2024).

2.4 Interpretation of Transformer Feed-Forward Networks

As a result of the success of transformer models, substantial research has gone towards developing a theoretical understanding of their operation (Wallace et al., 2019; Kovaleva et al., 2019; Wang and Tu, 2020) and their linguistic understanding (Ramnath et al., 2020; Manning et al., 2020). Particular focus has been placed on understanding the attention mechanism (Voita et al., 2019; Vig and Belinkov, 2019; Clark et al., 2019), but recent work has also examined the feed-forward layers (Geva et al., 2022). A general understanding has developed that the FFNs act as stores of memory (Dai et al., 2021); Geva et al. (2021), for example, conceptualize the FFNs as “key-value memories”. Of particular relevance to this project is the view that these stores of knowledge can be seen as distinct experts, i.e. that particular clusters of neurons hold information on particular topics or knowledge categories (Suau et al., 2020). Evidencing this, Zhang et al. (2022) find that only small portions of a transformer’s FFNs are activated on any given input and that these portions function as coherent units largely independent of the other parameters within the FFNs. This understanding has enabled a variety of interesting research directions, like direct editing of factual knowledge through clever manipulation of the FFN parameters (Cao et al., 2021; Meng et al., 2022).

2.5 MoEfication

This project makes particular use of an approach based on this understanding of FFNs called MoEfication (Zhang et al., 2022). With MoEfication, the authors show that the FFNs of a pre-trained model can be partitioned into distinct, sparsely activated experts with only modest reductions in performance on downstream tasks. This enables researchers to take existing state-of-the-art models that have already been trained and increase their computational efficiency by reducing the number of FLOPs required at inference time.

3 Approach

Our project is divided into two phases. First, we use MoEfication to convert the FFNs of a pretrained model into a mixture-of-experts, which increases inference speed but results in a modest loss in performance on downstream tasks. Second, we utilize LoRA fine-tuning to recover model quality with limited computational overhead.

MoEfication

We follow the approach of Zhang et al. (2022) to MoEfy a baseline model. Additionally, we utilize the FastMoE PyTorch framework of He et al. (2021) to increase MoE efficiency.

In a transformer model, the output of the attention layer is commonly followed by a DenseActDense layer that processes the attention output as y = W_o σ(W_i x), where W_o ∈ R^{d×h}, W_i ∈ R^{h×d}, and σ is an elementwise non-linear activation. In our formulation, we reorder and partition W_o and W_i into k parts:

\[
\tilde{W}_o = \begin{bmatrix} W_o^{(1)} & \cdots & W_o^{(k)} \end{bmatrix},
\qquad
\tilde{W}_i = \begin{bmatrix} W_i^{(1)\top} & \cdots & W_i^{(k)\top} \end{bmatrix}^{\top},
\]

so that

\[
y = W_o\,\sigma(W_i x) = \sum_{j=1}^{k} W_o^{(j)}\,\sigma\big(W_i^{(j)} x\big).
\]

We refer to each pair (W_i^{(j)}, W_o^{(j)}) as an expert. We use k-means clustering to create the k partitions, each of the same dimension.
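As a concrete illustration of this splitting step, the sketch below groups the hidden neurons of a dense FFN into experts and checks that summing all experts reproduces the dense output. It is a minimal example with illustrative shapes and names: it uses ordinary (unconstrained) k-means from scikit-learn, whereas the MoEfication recipe we follow uses a balanced clustering so that every expert receives the same number of neurons.

import numpy as np
from sklearn.cluster import KMeans

def split_ffn_into_experts(w_i: np.ndarray, w_o: np.ndarray, k: int):
    """Group the h hidden neurons of an FFN (y = W_o @ sigma(W_i @ x)) into k experts.

    w_i: (h, d) input projection, w_o: (d, h) output projection.
    Returns a list of (w_i_j, w_o_j) pairs whose summed outputs equal the dense FFN.
    """
    # Cluster the hidden neurons by the similarity of their input-projection rows,
    # so neurons that tend to co-activate land in the same expert.
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(w_i)

    experts = []
    for j in range(k):
        idx = np.where(labels == j)[0]
        # Expert j keeps its rows of W_i and the matching columns of W_o.
        experts.append((w_i[idx, :], w_o[:, idx]))
    return experts

# Sanity check on random weights: the experts are an exact partition of the FFN.
d, h, k = 16, 64, 8
rng = np.random.default_rng(0)
w_i, w_o, x = rng.normal(size=(h, d)), rng.normal(size=(d, h)), rng.normal(size=d)
relu = lambda z: np.maximum(z, 0.0)
dense = w_o @ relu(w_i @ x)
moe = sum(w_o_j @ relu(w_i_j @ x) for w_i_j, w_o_j in split_ffn_into_experts(w_i, w_o, k))
assert np.allclose(dense, moe)

Because the activation is elementwise and the neurons are simply partitioned, the decomposition is exact; the savings only appear once most experts are skipped at inference, which is what the routing function below enables.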
Next, the experts need to be sparsely activated to reduce inference complexity, so we create a routing function g : R^d → R^k that scores the experts and selects the top n of them. Our g is a 2-layer fully-connected network that takes x as input and tries to predict 1^⊤ σ(W_i^{(j)} x) for each j, i.e. the total activation of expert j. g is trained offline by collecting the values of the outputs from the attention layer on a training set. Finally, our MoEfied model sums the outputs of the top-n experts selected from the output of g(x). Mathematically, the MoEfied output is

\[
\tilde{y} = \sum_{j=1}^{k} \text{hard-top-}n\big(g(x)\big)_j\, W_o^{(j)}\,\sigma\big(W_i^{(j)} x\big),
\qquad
\text{hard-top-}n(z)_j =
\begin{cases}
1 & \text{if } |\{z_l : z_l > z_j\}| < n, \\
0 & \text{otherwise.}
\end{cases}
\]

LoRA Finetuning

Since the process of MoEfication is lossy, we attempt to recover the performance loss with low-rank adaptors. Due to the sparsity of fine-tuning data, we adapt the idea of Low-Rank Adaptation (LoRA) matrices (Hu et al., 2021), and make some additional changes to preserve the sparse mixture-of-experts structure of our model. A naive application of LoRA to our model would introduce two sets of matrices A_i ∈ R^{h×r}, B_i ∈ R^{r×d}, A_o ∈ R^{d×r}, B_o ∈ R^{r×h}, and modify the feed-forward structure of the transformer as

\[
\hat{y} = (W_o + A_o B_o)\,\sigma\big((W_i + A_i B_i)\,x\big).
\]

Unfortunately, this is not compatible with a sparse mixture of experts: the input-side update A_i B_i perturbs the pre-activations of every expert, and the output-side update A_o B_o mixes the activations of every expert, so the effect of the LoRA matrices spills across expert boundaries. Our main innovation is to instead introduce a single pair of matrices A ∈ R^{d×(h/k)} and B ∈ R^{(h/k)×d}, and modify the feed-forward layer as

\[
\tilde{y} = \sum_{j=1}^{k} \text{hard-top-}n\big(g(x)\big)_j\, W_o^{(j)}\,\sigma\big(W_i^{(j)} x\big) + A\,\sigma(Bx).
\]

The key idea is to introduce an always-active learned expert A σ(Bx) that is specialized for a particular task.

To establish a baseline, we forked the MoEfication codebase for the MoEfication script. We also added a pipeline to support arbitrary T5 models, MoEfication on the SST2, MNLI, and RACE datasets, and various fine-tuning features. The code is publicly available on GitHub: https://github.com/chiyotsai/MoEfication-with-LoRA.
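The sketch below illustrates this layer in PyTorch. It is a simplified reference under assumed names, shapes, and initializations: the routed experts are computed densely and then masked for clarity (our actual pipeline relies on FastMoE to dispatch tokens only to the selected experts), and the router here is an arbitrary small MLP rather than the one trained offline as described above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEfiedFFNWithLoRAExpert(nn.Module):
    """MoEfied feed-forward layer with hard top-n routing plus one always-active
    low-rank "expert" A @ sigma(B @ x), as described in Section 3."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_n: int):
        super().__init__()
        assert d_ff % num_experts == 0
        self.k, self.n, e = num_experts, top_n, d_ff // num_experts
        # Expert weights: slices of the original W_i (d_ff x d) and W_o (d x d_ff).
        self.w_i = nn.Parameter(torch.randn(num_experts, e, d_model) * 0.02)
        self.w_o = nn.Parameter(torch.randn(num_experts, d_model, e) * 0.02)
        # Router: a small 2-layer MLP that scores each expert for a given token.
        self.router = nn.Sequential(nn.Linear(d_model, 4 * num_experts), nn.ReLU(),
                                    nn.Linear(4 * num_experts, num_experts))
        # Always-active LoRA expert, sized like one regular expert. A starts at zero
        # so the layer initially matches the plain MoEfied output.
        self.lora_b = nn.Parameter(torch.randn(e, d_model) * 0.02)   # B: (h/k) x d
        self.lora_a = nn.Parameter(torch.zeros(d_model, e))          # A: d x (h/k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, d_model)
        scores = self.router(x)                                   # (batch, k)
        top = scores.topk(self.n, dim=-1).indices                 # (batch, n)
        mask = torch.zeros_like(scores).scatter_(-1, top, 1.0)    # hard top-n gate

        # Dense reference computation, masked per expert; a real implementation
        # would skip the un-routed experts entirely to save FLOPs.
        h = F.relu(torch.einsum("ked,bd->bke", self.w_i, x))      # (batch, k, h/k)
        y = torch.einsum("kde,bke->bkd", self.w_o, h)             # (batch, k, d)
        y = (mask.unsqueeze(-1) * y).sum(dim=1)

        # Always-active learned expert A @ sigma(B @ x).
        return y + F.relu(x @ self.lora_b.T) @ self.lora_a.T

# Example with T5-Base-like sizes (d_model=768, d_ff=3072, k=96, n=20).
layer = MoEfiedFFNWithLoRAExpert(d_model=768, d_ff=3072, num_experts=96, top_n=20)
out = layer(torch.randn(4, 768))

In our setting, the expert weights would come from MoEfication of a pretrained FFN and the router from the offline training described above; during LoRA fine-tuning, only the low-rank expert (lora_a, lora_b) would typically remain trainable.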
4 Experiments

4.1 Data

To measure the quality improvement, we fine-tune and measure the accuracy of our method on three different classification tasks: sentiment classification on the Stanford Sentiment Treebank 2 (SST2) dataset (Socher et al., 2013), natural language inference on the Multi-Genre Natural Language Inference (MNLI) dataset (Williams et al., 2018), and multiple-choice question answering on the ReAding Comprehension from Examinations (RACE) dataset (Lai et al., 2017). The number of examples in each dataset is presented in Table 1.

                  SST2     MNLI     RACE
Training Set      67.3k    393k     87.9k
Validation Set    872      9.82k    4.89k
Test Set          1.82k    9.83k    4.93k

Table 1: Number of examples in each dataset.

The T5 models were pretrained on a dataset mixture that contains SST2 and MNLI but not RACE (Raffel et al., 2019). Consistent with this, we performed full fine-tuning on RACE to establish a benchmark, while fine-tuning was not necessary for SST2 and MNLI. T5 is a text-to-text model (meaning every task, whether autoregressive text generation, classification, translation, etc., is presented by feeding the model input text and generating output text), so the datasets require preprocessing; see Appendix A.1.

4.2 Evaluation Method

4.2.1 Quality Evaluation

As the three tasks of interest are classification tasks, we use total accuracy as the principal metric for evaluation. The output classes for each task make this an easy measurement to obtain: SST2’s output classes are 0 (negative) and 1 (positive); MNLI’s are "entailment", "neutral", and "contradiction"; and RACE’s are "A", "B", "C", and "D". Although other metrics such as F1-score and AUROC are also commonly used to evaluate classification performance, we do not consider them here because the output classes are roughly uniformly distributed.

To obtain a prediction from our language model, we first run the encoder on the input prompt and cache the encoder output state. Next, we use the decoder and the cached encoder state to obtain the probability of each possible output class. These probabilities are then normalized across the output classes. Finally, we pick the class with the highest normalized probability as the predicted output. These predictions are compared with the correct labels to compute the accuracy.

4.2.2 Efficiency Evaluation

We benchmark the inference time of our models on an Nvidia RTX 3090. Wall time is used as the measure of complexity, since how best to parallelize Mixture-of-Experts layers is still an actively researched topic.

Algorithm 1: Pseudocode for benchmarking MoE model performance
Require: Model f, dataset D, number of iterations N
  times = []
  for i in range(N) do
      Sample and tokenize x from D
      Call torch.cuda.synchronize()
      start_time ← time.time()
      Call f(x)
      Call torch.cuda.synchronize()
      end_time ← time.time()
      times.append(end_time − start_time)
  end for
  return times[10:N].mean()    ▷ Discard the first 10 samples to allow the GPU to warm up.

4.3 Experimental Details

We use pre-trained T5 models from Hugging Face as the baseline. All quality experiments are performed with T5-Base, which contains h = 3072 hidden neurons in each of its DenseActDense layers. For MoEfication, we use constrained k-means clustering to partition the 3072 hidden neurons into k = 96 experts, each containing 32 neurons. Then, we run inference on the datasets to collect snapshots of the neurons in the feed-forward layers. These snapshots are used to train a 2-layer feed-forward network that acts as the routing function, which selects n = 20 of the experts.

As T5 has not been fine-tuned on the RACE dataset, we performed an additional full fine-tuning on RACE to serve as the baseline for our experiment. This fine-tuning is done with a constant learning rate of 1e-4 over 3 epochs. The checkpoint with the best accuracy is chosen as the baseline model.

For LoRA fine-tuning, we apply rank-32 LoRA matrices to the DenseActDense layers. Rank 32 is chosen because it matches the rank of a single expert in our configuration. We fine-tune the model on all three datasets for 3 epochs, with a learning rate of 1e-3. A constant schedule is used, as applying learning rate decay did not appear to improve performance in our experiments. Furthermore, we pick the largest batch size we can fit on our GPU, with the input length chosen to cover at least 99% of the training data; all tokens beyond the input length are truncated. The values are shown in Table 2: on SST2, we clip the input to 512 tokens with a batch size of 64; on MNLI, to 1024 tokens with a batch size of 8; and on RACE, to 1536 tokens with a batch size of 4.

         Batch Size    Input Length
SST2     64            512
MNLI     8             1024
RACE     4             1536

Table 2: Dataset parameters used to fine-tune T5-Base.

To benchmark the complexity of our MoE models, we keep the ratio of activated experts constant (20%) and benchmark our models by following Algorithm 1.
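For reference, a direct Python translation of Algorithm 1 might look like the following; `model_fn` and `batches` are placeholders for one forward pass of the model under test and a stream of pre-tokenized, GPU-resident batches.

import time
import torch

def benchmark_wall_time(model_fn, batches, num_iters=110, warmup=10):
    """Wall-time benchmark following Algorithm 1.

    `model_fn` is assumed to run one forward pass on an already-tokenized batch
    that is resident on the GPU; `batches` yields such batches.
    """
    times = []
    with torch.no_grad():
        for _, batch in zip(range(num_iters), batches):
            torch.cuda.synchronize()            # flush any pending GPU work
            start = time.time()
            model_fn(batch)
            torch.cuda.synchronize()            # wait for this pass to complete
            times.append(time.time() - start)
    # Discard the first few iterations so the GPU has warmed up.
    steady = times[warmup:]
    return sum(steady) / len(steady)

The explicit torch.cuda.synchronize() calls matter: CUDA kernels launch asynchronously, so timing without synchronization would mostly measure kernel-launch overhead rather than actual inference time.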
Model           SST2     MNLI     RACE
T5-Base         93.9%    85.6%    71.6%
+MoEfied GT     93.5%    85.0%    70.1%
+MLP Routing    92.3%    83.5%    67.5%
+LoRA           93.1%    85.3%    69.7%
+FFT            93.2%    84.5%    67.9%

Table 3: Accuracy of T5-Base on the SST2, MNLI, and RACE datasets, with MoEfication and further fine-tuning.

Model       Baseline    +MoE MLP    +LoRA
T5-Small    100.0%      134.8%      135.1%
T5-Base     100.0%      98.8%       101.4%
T5-Large    100.0%      85.8%       88.9%
T5-3B       100.0%      71.8%       71.1%

Table 4: Inference time of modified T5 models on an RTX 3090, normalized by the baseline inference time.

4.4 Results

4.4.1 Quality Results

Table 3 compares our approach’s classification accuracy with that of the baseline model. The row "T5-Base" shows the baseline model’s performance. The row "+MoEfied GT" shows a genie-assisted bound on the maximum accuracy achievable by this class of MoE model; the bound is obtained by first running a full forward pass to record the feed-forward layer’s output, then selecting the subset of experts that best approximates the full forward pass’s output as measured by L2 distance. The row "+MLP Routing" shows the performance of a routing function implemented as a multi-layer perceptron with a single hidden layer. The row "+LoRA" shows the performance of applying our expert-style LoRA to the MoEfied model with the MLP routing function. Finally, to compare our expert-style LoRA with full fine-tuning, the row "+FFT" shows the performance of fully fine-tuning the MoEfied model with the MLP routing function.

Our numbers show that by applying only a single LoRA adapter, equivalent to one additional expert, we can significantly improve the performance of the MoEfied model and recover most of the performance lost in the routing-function training process. Our classification accuracy is competitive with the genie-assisted bound, and even surpasses it on the MNLI dataset. The comparison with the fully fine-tuned model also shows that simply increasing the number of tunable parameters in a MoEfied model cannot recover the performance lost to the MoEfication process. Rather, we need to increase the amount of information that passes through the MoEfied feed-forward layers.

4.4.2 Complexity Results

Table 4 shows the inference-time ratio of our MoEfied models against the baseline models. Somewhat surprisingly, our experiments show that there is no efficiency gain from MoEfying T5-Base; in the case of MoE with LoRA, we even incur a small slowdown. Motivated by this, we reran the performance benchmark across different sizes of T5 models and plotted the results in Figures 1a and 1b. The figures show that the overhead of Mixture-of-Experts increases inference time for small models, but this overhead becomes negligible as model size increases. The relationship between inference time and model size is well modeled by a function of the form ax^b, with Pearson’s correlation greater than or equal to 0.97 in all cases. Furthermore, the break-even point for MoEfication happens to be around the size of T5-Base, the model on which we performed the quality experiments.

Figure 1: Inference time of MoEfied models. (a) Ratio of the inference time of MoEfied models to the baseline as a function of model size; both the MoE and MoE + LoRA models are well approximated by O(size^-0.15). (b) Inference time of MoEfied models as a function of model size; the baseline model scales as O(size^0.88), while the MoE and MoE + LoRA models both scale as O(size^0.725).
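The power-law fits reported above can be reproduced with a simple linear regression in log-log space. The sketch below shows the procedure; the sizes and timings are placeholders, not our measured values.

import numpy as np

# Placeholder (parameter count, wall time in ms) pairs; substitute the benchmark numbers.
sizes = np.array([60e6, 220e6, 770e6, 3e9])   # roughly T5-Small ... T5-3B
times = np.array([5.1, 12.3, 30.2, 85.0])     # illustrative timings only

# Fit time ≈ a * size^b by regressing log(time) on log(size).
b, log_a = np.polyfit(np.log(sizes), np.log(times), deg=1)
a = np.exp(log_a)
r = np.corrcoef(np.log(sizes), np.log(times))[0, 1]   # Pearson correlation of the log data
print(f"time ≈ {a:.3g} * size^{b:.3f}, correlation = {r:.3f}")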
5 Analysis

5.1 Complexity Analysis

We performed further benchmarking based on the results in Table 4 and found that most of the overhead comes from the need to scatter and gather the feed-forward layers’ inputs to and from the different experts on the GPU. The additional time spent computing the routing function is relatively small compared to the increased memory traffic on the GPU. In the case of T5-Small, we even observed a 35% increase in inference time. With a better, more memory-efficient MoE implementation, we might improve the performance of MoEs, but as of now, MoE models remain competitive only in the large-model regime.

5.2 Qualitative Analysis

To perform a subjective evaluation of our MoEfied model, we look at a few examples from the SST2 dataset where our MoEfied model makes a mistake but the baseline model succeeds. There are 23 such examples: 15 of them are positive sentiments misclassified as negative, and 8 are negative sentiments misclassified as positive. We list a few examples below:

sst2 sentence: the primitive force of this film seems to bubble up from the vast collective memory of the combatants .
Label: positive
Prediction: negative

sst2 sentence: there is nothing outstanding about this film , but it is good enough and will likely be appreciated most by sailors and folks who know their way around a submarine .
Label: positive
Prediction: negative

sst2 sentence: sticky sweet sentimentality , clumsy plotting and a rosily myopic view of life in the wwii-era mississippi delta undermine this adaptation .
Label: negative
Prediction: positive

Notice that many of the examples where positive sentiment is misidentified as negative have some degree of ambiguity. In particular, the sentence "there is nothing outstanding about this film , but it is good enough ..." can subjectively be judged as either positive or negative. We believe these are instances where the model needs to learn a nuanced understanding of the dataset curators’ classification criteria from the training set.

Figure 2: Distribution of expert activations. Figure from Zhang et al. (2022).

The negative examples, on the other hand, showcase another type of failure. These examples consist of more difficult vocabulary and more artistic, poetic prose. They represent difficult problems that require more reasoning capacity and a more thorough understanding from the model, which is unfortunately reduced by sparse expert activation. We hypothesize that these examples demonstrate a deficiency of our MoE-based system in tackling problems in the long tail. In Figure 2, we see that our MoEfied models tend to favor a few commonly used experts, with a steep drop-off after roughly 20 experts. The less commonly used experts may represent deeper knowledge that is not needed for simple inputs but is required for the more complex ones in our examples.

6 Conclusion

In this project, we develop a pipeline for converting a large pretrained model into a more resource-efficient MoE model and efficiently recover most of the performance degradation introduced by the MoEfication process. We show that MoEfication can achieve an approximately 30% inference speed-up on GPU for large models, while performance on several classification benchmarks is nearly unchanged. Additionally, we identify several scaling laws for MoEfication, showing that the benefits of MoEfication increase with model size (but become negative for small models), and that the addition of LoRA has a negligible effect on this pattern.
Some limitations of our study are that we have only demonstrated accuracy results on T5-Base, a 220M-parameter encoder-decoder model. Our inference-speed tests on T5-3B, however, show that the efficiency gains grow with model size, so in future work we would like to check whether our approach scales well to billion-parameter models, as well as to decoder-only architectures. Furthermore, while our approach can recover the performance degradation caused by suboptimal routing, we still observe a small gap between the baseline model and the best MoEfied model. Whether this performance difference is fundamental to an MoE system, or whether it can be further reduced with a better expert partitioning method, remains an open question.

References

Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. 2023. Scaling laws for generative mixed-modal language models.

Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jin Jin, Xin Jiang, Qun Liu, Michael Lyu, and Irwin King. 2021. BinaryBERT: Pushing the limit of BERT quantization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4334–4348, Online. Association for Computational Linguistics.

Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. 2015. Conditional computation in neural networks for faster models. arXiv.

Yoshua Bengio. 2013. Deep learning of representations: Looking forward. In Proceedings of SLSP, pages 1–37.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021. Editing factual knowledge in language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6491–6506, Online. Association for Computational Linguistics.

Jiaao Chen, Aston Zhang, Xingjian Shi, Mu Li, Alex Smola, and Diyi Yang. 2023. Parameter-efficient fine-tuning design spaces. arXiv preprint arXiv:2301.01821.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamás Sarlós, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy J. Colwell, and Adrian Weller. 2020. Rethinking attention with performers. CoRR, abs/2009.14794.

Tejalal Choudhary, Vipul Mishra, Anurag Goswami, and Jagannathan Sarangapani. 2020. A comprehensive survey on model compression and acceleration. Artificial Intelligence Review, 53:5113–5155.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT’s attention. arXiv preprint arXiv:1906.04341.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. Google Research.

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2021. Knowledge neurons in pretrained transformers. arXiv.

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Wu, Zhifeng Chen, and Claire Cui. 2022. GLaM: Efficient scaling of language models with mixture-of-experts.
In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 5547–5569. PMLR.

William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. CoRR, abs/2101.03961.

Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8440–8451, Online.

Tae Jun Ham, Sung Jun Jung, Seonghak Kim, Young H. Oh, Yeonhong Park, Yoonho Song, Jung-Hun Park, Sanghee Lee, Kyoung Park, Jae W. Lee, and Deog-Kyoon Jeong. 2020. Accelerating attention mechanisms in neural networks with approximation. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 328–341.

Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Yuan Yao, Ao Zhang, Liang Zhang, Wentao Han, Minlie Huang, Qin Jin, Yanyan Lan, Yang Liu, Zhiyuan Liu, Zhiwu Lu, Xipeng Qiu, Ruihua Song, Jie Tang, Ji-Rong Wen, Jinhui Yuan, Wayne Xin Zhao, and Jun Zhu. 2021. Pre-trained models: Past, present and future.

Hussein Hazimeh, Zhe Zhao, Aakanksha Chowdhery, Maheswaran Sathiamoorthy, Yihua Chen, Rahul Mazumder, Lichan Hong, and Ed Chi. 2021. DSelect-k: Differentiable selection in the mixture of experts with applications to multi-task learning. In Advances in Neural Information Processing Systems, volume 34, pages 29335–29347. Curran Associates, Inc.

Jiaao He, Jiezhong Qiu, Aohan Zeng, Zhilin Yang, Jidong Zhai, and Jie Tang. 2021. FastMoE: A fast mixture-of-expert training system. arXiv preprint arXiv:2103.13262.

Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. 2020. DynaBERT: Dynamic BERT with adaptive width and depth. In Advances in Neural Information Processing Systems, volume 33, pages 9782–9793. Curran Associates, Inc.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural Computation, 3:79–87.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174, Online. Association for Computational Linguistics.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models.

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In International Conference on Learning Representations.
Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China. Association for Computational Linguistics.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. 2021. BASE layers: Simplifying training of large, sparse models. CoRR, abs/2103.16716.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

Tao Lin, Sebastian U. Stich, Luis Barba, Daniil Dmitriev, and Martin Jaggi. 2020. Dynamic model pruning with feedback.

Christopher D Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. 2020. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences, 117(48):30046–30054.

Saeed Masoudnia and Reza Ebrahimpour. 2014. Mixture of experts: a literature survey. Artificial Intelligence Review, 42:275–293.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, volume 35, pages 17359–17372.

Antonio Polino, Razvan Pascanu, and Dan Alistarh. 2018. Model compression via distillation and quantization. CoRR, abs/1802.05668.

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5.

Sahana Ramnath, Preksha Nema, Deep Sahni, and Mitesh M. Khapra. 2020. Towards interpreting BERT for reading comprehension based QA. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3236–3242, Online. Association for Computational Linguistics.

Babak Rokh, Ali Azarpeyvand, and Alireza Khanteymoori. 2023. A comprehensive survey on model quantization for deep neural networks in image classification. ACM Transactions on Intelligent Systems and Technology, 14(6):1–50.

Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, and Jason Weston. 2021. Hash layers for large sparse models. CoRR, abs/2106.04426.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR, abs/1910.01108.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR, abs/1701.06538.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013.
Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

Xavier Suau, Luca Zappella, and Nicholas Apostoloff. 2020. Finding experts in transformer models. arXiv.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2022. Efficient transformers: A survey. ACM Computing Surveys, 55(6).

Jesse Vig and Yonatan Belinkov. 2019. Analyzing the structure of attention in a transformer language model. CoRR, abs/1906.04284.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Eric Wallace, Jens Tuyls, Junlin Wang, Sanjay Subramanian, Matt Gardner, and Sameer Singh. 2019. AllenNLP Interpret: A framework for explaining predictions of NLP models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 7–12, Hong Kong, China. Association for Computational Linguistics.

Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. CoRR, abs/2006.04768.

Wenxuan Wang and Zhaopeng Tu. 2020. Rethinking the value of transformer components. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6019–6029, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Xun Wu, Shaohan Huang, and Furu Wei. 2024. MoLE: Mixture of LoRA experts. In The Twelfth International Conference on Learning Representations.

Wenhan Xia, Hongxu Yin, Xiaoliang Dai, and Niraj K. Jha. 2022. Fully dynamic inference with deep neural networks. IEEE Transactions on Emerging Topics in Computing, 10(2):962–972.

Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. 2020. DeeBERT: Dynamic early exiting for accelerating BERT inference. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2246–2251, Online. Association for Computational Linguistics.

Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. 2021. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. CoRR, abs/2106.10199.

Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2022. MoEfication: Transformer feed-forward layers are mixtures of experts. arXiv.

Zhengyan Zhang, Fanchao Qi, Zhiyuan Liu, Qun Liu, and Maosong Sun. 2021. Know what you don’t need: Single-shot meta-pruning for attention heads. AI Open, 2:36–42.

Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Zhifeng Chen, Quoc V Le, and James Laudon. 2022. Mixture-of-experts with expert choice routing. In
Advances in Neural Information Processing Systems, volume 35, pages 7103–7114.

Michael Zhu and Suyog Gupta. 2017. To prune, or not to prune: exploring the efficacy of pruning for model compression.

A Appendix

A.1 Input Preprocessing and Targets

A.1.1 SST2

SST2 is already very close to a text-input, text-output structure, so the only input preprocessing needed is to prompt T5 for the desired output type by prepending “sst2 sentence:”. T5 has already been fine-tuned on SST2 using this format, so it did not require additional fine-tuning. Below is an example of the preprocessing:

Original:
Example Sentence: “that loves its characters and communicates something rather beautiful about human nature”
Example Label: 1

Preprocessed:
Model Input: “sst2 sentence: that loves its characters and communicates something rather beautiful about human nature”
Model Target: “positive”

A.1.2 MNLI

T5 was also pretrained on MNLI, so we follow T5’s preprocessing steps:

Original:
Example Premise: “Conceptually cream skimming has two basic dimensions - product and geography.”
Example Hypothesis: “Product and geography are what make cream skimming work.”
Example Label: “neutral”

Preprocessed:
Model Input: “mnli hypothesis: Product and geography are what make cream skimming work. premise: Conceptually cream skimming has two basic dimensions - product and geography.”
Model Target: “neutral”

A.1.3 RACE

RACE was not included in T5’s training mixture, so we performed custom preprocessing and fine-tuning:

Original:
Example Article: “There is not enough oil in the world now. As time goes by, it becomes less and less, so what are we going to do when it runs ou...”
Example Question: “According to the passage, which of the following statements is TRUE?”
Example Options: ["There is more petroleum than we can use now.", "Trees are needed for some other things besides making gas.", "We got electricity from ocean tides in the old days.", "Gas wasn’t used to run cars in the Second World War."]
Example Answer: “B”

Preprocessed:
Model Input: “question: According to the passage, which of the following statements is TRUE? options: A: There is more petroleum than we can use now. B: Trees are needed for some other things besides making gas. C: We got electricity from ocean tides in the old days. D: Gas wasn’t used to run cars in the Second World War. article: There is not enough oil in the world now. As time goes by, it becomes less and less, so what are we going to do when it runs ou...”
Model Target: “B”

We put the question and options before the article to avoid information loss due to truncation (which affects <1% of examples) during tokenization.
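The RACE prompt construction above is straightforward to express in code. The helper below is a small sketch of that formatting step; the field names ('question', 'options', 'article', 'answer') are assumed to follow the Hugging Face RACE schema, and the function name is ours.

def preprocess_race_example(example: dict) -> tuple[str, str]:
    """Build the T5 input/target strings in the format shown in A.1.3."""
    option_str = " ".join(
        f"{letter}: {text}" for letter, text in zip("ABCD", example["options"])
    )
    # Question and options come before the article so that truncation drops
    # article text first rather than the answer choices.
    model_input = (
        f"question: {example['question']} options: {option_str} "
        f"article: {example['article']}"
    )
    return model_input, example["answer"]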