TSPipe: Learn from Teacher Faster with Pipelines

Hwijoon Lim 1, Yechan Kim 2, Sukmin Yun 1, Jinwoo Shin 1 2, Dongsu Han 1 2

1 School of Electrical Engineering, KAIST, Daejeon, Republic of Korea. 2 Kim Jaechul Graduate School of AI, KAIST, Daejeon, Republic of Korea. Correspondence to: Dongsu Han.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

The teacher-student (TS) framework, which trains a (student) network by utilizing an auxiliary superior (teacher) network, has been adopted as a popular training paradigm in many machine learning schemes since the seminal work on knowledge distillation (KD) for model compression and transfer learning. Many recent self-supervised learning (SSL) schemes also adopt the TS framework, where teacher networks are maintained as the moving average of student networks, called momentum networks. This paper presents TSPipe, a pipelined approach to accelerate the training process of any TS framework, including KD and SSL. Under the observation that the teacher network does not need a backward pass, our main idea is to schedule the computation of the teacher and the student network separately, and to fully utilize the GPUs during training by interleaving the computations of the two networks and relaxing their dependencies. In case the teacher network requires a momentum update, we apply delayed parameter updates only to the teacher network to attain high model accuracy. Compared to existing pipeline parallelism schemes, which sacrifice either training throughput or model accuracy, TSPipe provides better performance trade-offs, achieving up to 12.15x higher throughput. The source code is available at https://github.com/kaist-ina/TSPipe.

Figure 1. TSPipe achieves high training throughput by eliminating pipeline bubbles. Top: With GPipe, GPUs are idle, exhibiting pipeline bubbles. Gray blocks indicate the forward pass, and orange blocks indicate the backward pass. Bottom: Timeline for TSPipe. Green, blue, and orange blocks indicate the forward pass of the teacher network, the forward pass of the student network, and the backward pass of the student network, respectively. Note that the two figures share the same time axis, which is normalized.

1. Introduction

Knowledge distillation (KD) (Hinton et al., 2015) has shown remarkable success with the teacher-student (TS) framework in transferring knowledge from a teacher network to a student network. Motivated by this, the TS framework has been used in a broader range of applications—vision (Pham et al., 2021), natural language processing (Sanh et al., 2019), and deep reinforcement learning (Yin & Pan, 2017). In particular, many recent studies in self-supervised learning (SSL) for vision (He et al., 2020; Chen et al., 2021; Grill et al., 2020; Li et al., 2021a; Zhou et al., 2021) have successfully learned visual representations from large amounts of unlabeled data using the TS framework. However, both KD and SSL often suffer from extensive resource requirements (e.g., GPU memory and computation) for training.
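Before detailing these resource costs, a minimal sketch of one TS training step may help make the structure concrete (assuming PyTorch; the networks, loss, and optimizer below are placeholders, not any configuration used in the paper): the teacher runs forward-only, the student runs forward and backward, and, in the momentum-network variant, the teacher is advanced as a moving average of the student.

```python
# Minimal sketch of one TS training step; modules, loss, and optimizer are placeholders.
import copy
import torch
import torch.nn as nn

student = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))
teacher = copy.deepcopy(student).requires_grad_(False)   # the teacher needs no gradients
optimizer = torch.optim.SGD(student.parameters(), lr=0.05)
tau = 0.996                                              # momentum coefficient

def train_step(x):
    with torch.no_grad():
        target = teacher(x)                              # teacher: forward pass only
    loss = nn.functional.mse_loss(student(x), target)    # placeholder distillation-style loss
    optimizer.zero_grad()
    loss.backward()                                      # student: forward and backward
    optimizer.step()
    with torch.no_grad():                                # momentum (EMA) update of the teacher
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(tau).add_(ps, alpha=1.0 - tau)       # teacher <- tau*teacher + (1-tau)*student
    return loss.item()

train_step(torch.randn(32, 128))
```

Because only the student's backward pass produces gradients, the teacher's forward passes can be interleaved wherever GPU time would otherwise sit idle, which is the scheduling freedom TSPipe exploits.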
For example, KD often leverages large teacher networks—in natural language processing (NLP), state-of-the-art pre-trained language models have up to 175B parameters (Brown et al., 2020; Zhang et al., 2022), which requires 700 GB of GPU memory for the model parameters alone. Many recent SSL methods also employ large-scale architectures for better representation learning; e.g., MoCo-v3 (Chen et al., 2021), which adopts ViT (Dosovitskiy et al., 2020), a transformer-based model, takes 128 GPU-days with ViT-B (86M parameters) to converge. In addition, SSL methods often require extensive training epochs for model convergence—BYOL (Grill et al., 2020) needs an order of magnitude more training epochs to achieve accuracy proximal to its supervised learning counterpart.

In some cases, even a cutting-edge GPU cannot accommodate such large models, which has led to the adoption of model parallelism (MP), which splits a model across multiple GPUs. However, MP suffers from either serious underutilization due to the dependency among layers (inter-layer MP) (Huang et al., 2019) or extreme slowdown in multi-node environments due to the inter-node communication overhead (intra-layer MP) (Narayanan et al., 2021). Thus, increasing the number of GPUs hardly contributes to training speedup with MP. To overcome this issue, many recent works introduce pipeline parallelism (Huang et al., 2019; Narayanan et al., 2019; 2021; Park et al., 2020; Li et al., 2021b). However, existing solutions still fail to achieve high training throughput while maintaining scalability and model accuracy. For example, in Figure 1, GPipe (Huang et al., 2019), one of the well-known pipeline parallelism schemes, only utilizes 57% of the GPU time and fails to fully schedule training tasks. This is due to the fundamental challenge that pipeline parallelism faces—dependencies between layers cannot be eliminated without changing training semantics at the risk of accuracy degradation.

This paper presents TSPipe, a novel approach to accelerate the training of any TS framework, including KD and SSL, by pipelining multiple GPUs. TSPipe is the only training scheme that achieves three highs—high training throughput, high model accuracy, and high scalability. We achieve these by leveraging the following unique properties of the TS framework: 1) the teacher network does not need a backward pass; 2) the parameters of the teacher network are either never updated or updated in a steady and stable manner (i.e., with momentum coefficient τ = 0.996). Unlike other schemes that do not distinguish the teacher from the student, TSPipe schedules them separately to take advantage of these properties of the TS framework. TSPipe interleaves the computations of the teacher network between the computations of the student, which enables us to eliminate all pipeline bubbles without an additional memory footprint for activation stashing. This allows TSPipe to train larger models, since activation memory accounts for the majority of total memory usage. TSPipe further applies delayed parameter updates, as in other schemes (Narayanan et al., 2021; Xu et al., 2020; Park et al., 2020), to fully schedule the pipeline. However, unlike the existing schemes, TSPipe considers asymmetric parameter updates. To be specific, we apply delayed parameter updates only to the teacher network, as the slow update of the teacher network parameters allows TSPipe to mitigate the performance drop of the student network.
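To illustrate the idea of delaying only the teacher's parameter update, the sketch below uses a fixed one-iteration delay on a single GPU. It is a minimal sketch: the helper names, toy modules, and the explicit snapshot queue are illustrative assumptions, and TSPipe realizes the delay through its pipeline schedule across GPUs rather than through a sequential loop like this.

```python
# Illustrative-only sketch: the teacher's momentum update is driven by a stale
# snapshot of the student, while the student itself is updated without delay.
import collections
import copy
import torch
import torch.nn as nn

def snapshot(model):
    # Detached copies of the current student parameters.
    return [p.detach().clone() for p in model.parameters()]

def momentum_update(teacher, student_params, tau=0.996):
    # teacher <- tau * teacher + (1 - tau) * (a snapshot of the student)
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student_params):
            pt.mul_(tau).add_(ps, alpha=1.0 - tau)

student = nn.Linear(16, 16)
teacher = copy.deepcopy(student).requires_grad_(False)
optimizer = torch.optim.SGD(student.parameters(), lr=0.05)

delay = 1                                            # hypothetical fixed staleness of one iteration
stale = collections.deque([snapshot(student)], maxlen=delay + 1)

for step in range(4):                                # toy loop standing in for the training loop
    x = torch.randn(8, 16)
    # Advance the teacher from a snapshot taken `delay` iterations ago, so its
    # forward pass does not have to wait for the student's most recent update.
    momentum_update(teacher, stale[0])
    with torch.no_grad():
        target = teacher(x)
    loss = nn.functional.mse_loss(student(x), target)  # the student is updated without delay
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    stale.append(snapshot(student))
```

Only the teacher ever sees stale parameters here, which is the asymmetry described above; in TSPipe the exact staleness is determined by the pipeline schedule rather than by a fixed queue.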
As a result, TSPipe provides high training throughput and a better trade-off between GPU memory footprint and utilization, without loss of model performance. We demonstrate the efficiency of TSPipe by training various KD and SSL schemes. For example, when we train MoCo-v3 under ViT architectures of multiple sizes with 16 GPUs, TSPipe achieves up to 12.15x higher training throughput compared to inter-layer MP (Shoeybi et al., 2019). When we perform KD from ViT networks to ResNet with 8 GPUs, TSPipe achieves up to 4.68x higher training throughput over inter-layer MP. We also evaluate the learned representation quality for SSL, where we adopt asymmetric parameter updates: TSPipe preserves the same accuracy as inter-layer MP under ResNet-18 with respect to the linear evaluation protocol (Chen et al., 2020). To the best of our knowledge, TSPipe is the first framework for training parallelism that targets the TS framework.

2. Background and Related Work

This paper focuses on the general teacher-student framework of knowledge distillation (Hinton et al., 2015), which is an effective learning scheme to transfer the knowledge from a powerful teacher network to a student. We remark that many recent SSL frameworks (He et al., 2020; Chen et al., 2021; Grill et al., 2020; Roh et al., 2021) also take the form of a TS framework with slight variations, leveraging two encoder networks in training: the online (student) network θ and the target (teacher) network ξ. The former is the primary network for encoding the final representations and is directly updated by the loss gradients, while the parameters ξ of the latter target (momentum) network (Tarvainen & Valpola, 2017) are updated by an exponential moving average of the parameters θ of the former as:

ξ ← τ ξ + (1 − τ) θ,    (1)

where τ ∈ [0, 1] is a momentum coefficient. One key idea of our work is that such TS frameworks do not require backpropagating gradients through the target (or teacher) network during training.

Many KD and SSL models feature 100M+ parameters (e.g., ViT (Dosovitskiy et al., 2020)), which cannot be trained with a single GPU due to the memory constraint. Pure data parallelism, which does not split a model across GPUs, cannot be used to train large models that do not fit in a single GPU’s memory. Mechanisms for distributed training are discussed below.

Model parallelism (MP) (Shoeybi et al., 2019; Shazeer et al., 2018; Chilimbi et al., 2014) splits a model into multiple partitions and places each partition on a single GPU. This enables training larger models. Specifically, MP can be further classified into inter-layer MP and intra-layer MP. Inter-layer MP partitions a model layer-wise, and each partition is placed on a different GPU.

Comparison of parallelization schemes (total memory and ideal pipeline utilization):

                               DP (Data Parallelism)   Inter-layer MP   GPipe         TSPipe
Ideal pipeline utilization     1                       1/N              u/(u+N-1)     1
SSL: MoCo-v3 (ViT-Large)
  Total memory                 39.70 GiB               24.78 GiB        24.78 GiB     24.78 GiB
  Ideal utilization            Out of Memory           12.5%            53%           100%
KD: DistilBERT (BERT-xxlarge)
  Total memory                 34.24 GiB               28.02 GiB        28.02 GiB     28.02 GiB
  Ideal utilization            Out of Memory           12.5%            53%           100%
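As a back-of-the-envelope check of the ideal-utilization row above: taking the formulas 1/N and u/(u+N−1) from the table, and assuming N = 8 partitions and u = 8 micro-batches (values chosen here only because they reproduce the reported 12.5% and 53%), a short calculation gives:

```python
# Ideal pipeline utilization under assumed N = 8 partitions and u = 8 micro-batches.
N, u = 8, 8
inter_layer_mp = 1 / N          # only one partition is active at a time
gpipe = u / (u + N - 1)         # pipeline bubble of (N - 1) / (u + N - 1)
tspipe = 1.0                    # bubbles eliminated by interleaving teacher computation
print(f"inter-layer MP: {inter_layer_mp:.1%}")   # 12.5%
print(f"GPipe:          {gpipe:.1%}")            # 53.3%
print(f"TSPipe:         {tspipe:.1%}")           # 100.0%
```

The formula also shows that GPipe's ideal utilization approaches 1 only as the number of micro-batches u grows, which requires stashing activations for every in-flight micro-batch; this is the memory-versus-utilization trade-off discussed above.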