Brainformers: Trading Simplicity for Efficiency

Yanqi Zhou 1  Nan Du 1  Yanping Huang 1  Daiyi Peng 1  Chang Lan 1  Da Huang 1  Siamak Shakeri 1  David So 1  Andrew Dai 1  Yifeng Lu 1  Zhifeng Chen 1  Quoc Le 1  Claire Cui 1  James Laudon 1  Jeff Dean 1

Abstract

Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network. Here we investigate this design choice and find that more complex blocks with different permutations of layer primitives can be more efficient. Using this insight, we develop a complex block, named Brainformer, that consists of a diverse set of layers such as sparsely gated feed-forward layers, dense feed-forward layers, attention layers, and various forms of layer normalization and activation functions. Brainformer consistently outperforms state-of-the-art dense and sparse Transformers in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2× faster training convergence and 5× faster step time compared to its GLaM counterpart. In downstream task evaluation, Brainformer also demonstrates a 3% higher SuperGLUE score with fine-tuning compared to GLaM with a similar number of activated parameters. Finally, Brainformer largely outperforms a Primer dense model derived with NAS, with similar computation per token, on few-shot evaluations.

Figure 1: Brainformer vs. GLaM in scaling. [Plot omitted: log perplexity and training steps per second vs. activated parameters (millions, log scale) for Brainformer and GLaM.] Brainformer improves model quality with a much faster training step time.
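The core idea above, that a block is a permutation of layer primitives rather than a fixed alternation, can be sketched in a few lines. This is an illustrative sketch only: the layer stubs and the ordering of the complex block below are hypothetical placeholders, not the actual architecture found by the search in this paper.

```python
def attention(x):
    return x  # placeholder for a self-attention layer (hypothetical stub)

def dense_ffn(x):
    return x  # placeholder for a dense feed-forward layer (hypothetical stub)

def sparse_ffn(x):
    return x  # placeholder for a sparsely gated (MoE) feed-forward layer (hypothetical stub)

def make_block(primitives):
    """Compose a block that applies the given layer primitives in order."""
    def block(x):
        for layer in primitives:
            x = layer(x)
        return x
    return block

# Uniform Transformer backbone: strictly alternate attention and FFN.
uniform_block = make_block([attention, dense_ffn])

# Brainformer-style block: a non-alternating permutation mixing sparse FFN,
# dense FFN, and attention (this particular ordering is illustrative).
complex_block = make_block(
    [sparse_ffn, dense_ffn, attention, sparse_ffn, attention, dense_ffn]
)
```

Because both backbones are expressed as ordered lists of primitives, a search procedure can explore permutations of a complex block in the same representation used for the standard Transformer block.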
1 Google DeepMind. Correspondence to: Yanqi Zhou <qiz@google.com>.

1. Introduction

In recent years, large neural networks derived from the Transformer architecture (Vaswani et al., 2017) have demonstrated superior results on language understanding and generative tasks. Many improvements on Transformer variants have come from scaling the size of models (Raffel et al., 2020; Brown et al., 2020a; Shoeybi et al., 2019; Chowdhery et al., 2022), scaling the number of training tokens (Hoffmann et al., 2022; Shoeybi et al., 2019), better training data quality (Du et al., 2022), and sparsely activated model architectures (Du et al., 2022; Lepikhin et al., 2021; Roller et al., 2021; Lewis et al., 2021).

Among the efficient transformer language models (Wang et al., 2020; Choromanski et al., 2020; Tay et al., 2021; Hua et al., 2022), there is a focus on improving attention-layer efficiency using low-rank approaches or approximations. However, recent work has also identified that dense feed-forward layers constitute most of the computational cost for common sequence lengths (≤2048), particularly when the model is large (Du et al., 2022; Zhou et al., 2022). To further improve compute efficiency, such as the total FLOPs used during training to reach convergence, sparsely gated Mixture-of-Experts (MoE) models (Lepikhin et al., 2021; Fedus et al., 2021; Du et al., 2022; Zhou et al., 2022; Roller et al., 2021; Lewis et al., 2021; Jaszczur et al., 2021) have become prevalent, giving the model a larger overall capacity to improve quality while holding computational cost fixed. Sparsely activated models not only reduce computational cost but also achieve better specialization by training different experts on different data distributions through a routing function, without reducing the effective training time for each expert. The MoE architectures in this line of work are based on uniform transformer blocks or on interleaving dense and sparse layers (Du et al., 2022), with a fixed top-k routing.
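The fixed top-k routing mentioned above can be sketched as follows. This is a minimal numpy sketch of standard top-k gating as used in the sparsely gated MoE literature; the function and parameter names are my own, and the exact router used by the models cited here is not specified in this excerpt.

```python
import numpy as np

def top_k_gating(x, w_gate, k=2):
    """Sparsely gated routing: send each token to its top-k experts.

    x:      [tokens, d_model] token activations
    w_gate: [d_model, n_experts] router weights (hypothetical name)
    Returns (weights, indices): per-token combine weights over k experts.
    """
    logits = x @ w_gate                               # [tokens, n_experts]
    top_idx = np.argsort(-logits, axis=-1)[:, :k]     # ids of top-k experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # Softmax over the selected k logits only (standard top-k gating).
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    e = np.exp(top_logits)
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights, top_idx

def moe_layer(x, w_gate, experts, k=2):
    """Combine each token's top-k expert outputs, weighted by the gate."""
    weights, idx = top_k_gating(x, w_gate, k)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):
            out[t] += weights[t, j] * experts[idx[t, j]](x[t])
    return out
```

Because each token activates only k of the n experts, total capacity grows with the number of experts while the per-token computation stays roughly fixed, which is the efficiency argument made above.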
arXiv:2306.00008v2 [cs.LG] 25 Apr 2024