Auxiliary Modality Learning with Generalized Curriculum Distillation Yu Shen 1 Xijun Wang 1 Peng Gao 1 Ming C. Lin 1 Abstract cars, but it’s reasonable to equip a few developer’s cars with Lidar for training. However, this specific type of learning task, i.e., ”test with fewer modalities than during training”, is not standardized yet. For example, there is no formal term or definition. There have been concepts, such as ”learning with side information” (Hoffman et al., 2016), ”learning with privileged information” (Garcia et al., 2019), ”learning with auxiliary modality” (Piasco et al., 2021), ”learning with partial-modalities” (Wang et al., 2018), and ”modality distillation” (Garcia et al., 2018), etc. We therefore formalize these learning tasks as Auxiliary Modality Learning (AML) in Sec. 3. Driven by the need from real-world applications, Auxiliary Modality Learning (AML) offers the possibility to utilize more information from auxiliary data in training, while only requiring data from one or fewer modalities in testing, to save the overall computational cost and reduce the amount of input data for inferencing. In this work, we formally define “Auxiliary Modality Learning” (AML), systematically classify types of auxiliary modality (in visual computing) and architectures for AML, and analyze their performance. We also analyze the conditions under which AML works well from the optimization and data distribution perspectives. To guide various choices to achieve optimal performance using AML, we propose a novel method to assist in choosing the best auxiliary modality and estimating an upper bound performance before executing AML. In addition, we propose a new AML method using generalized curriculum distillation to enable more effective curriculum learning. Our method achieves the best performance compared to other SOTA methods. To apply AML to real-world tasks, there are some key issues: “what types of auxiliary modalities can be used, and how to add the auxiliary modalities into the network and make them most effective?” We systematically list and classify auxiliary modalities in visual computing and network architectures for AML, and then conduct experiments to address these questions. Specifically, we classify the auxiliary modalities into 3 types: low-level sensing data (Type 1), middle-level equivalent representation (Type 2), and high-level conceptual information (Type 3) in Sec. 4.1.1, according to the types of information. We also classify the network architectures into four types, according to the mechanism that introduces the auxiliary modality. They are auxiliary modalities in the input (Type A), in the middle (Type B), in the end (Type C), and in the teacher (Type D), as defined in Sec. 4.1.2. In addition, we design experiments to see which architecture and which auxiliary modality perform best within each task and across tasks (Sec. 4.1.3) that provides experimental guidelines and theoretical foundation to our method in Sec. 5. 1. Introduction Learning from images and videos is among some of the most popular research focuses (Esteva et al., 2021; Guo et al., 2022; Chai et al., 2021), as RGB images are informative and easy to acquire. In addition, RGB camera is cheap and can be easily deployed. There are also works considering multiple modalities, i.e., multi-modal learning (Wang, 2021; Jiang et al., 2021; Joshi et al., 2021). Furthermore, some works consider to use multiple modalities in training but use fewer modalities during test since in certain applications it’s difficult to use all modalities during inference. For example, it is expensive to deploy Lidar on commodity self-driving Given this formal framing, we can apply AML to real-world tasks. There remains the question of explainability: “Why AML can work without auxiliary modality in the test?” It’s not obvious that adding auxiliary modality only in training can always help improve test performance with only the main modality. In Sec. 4.2, we explore this line of inquiries from optimization and data perspectives. Specifically, we introduce a new concept of “supermodel” to support our claim, which also offers insights and inspiration to design the new AML method presented in Sec. 5.2. * Equal contribution 1 Department of Computer Science, University of Maryland, College Park, Maryland, USA. Correspondence to: Yu Shen . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). Based on the detailed analysis in Sec. 4, we propose a simple 1 Auxiliary Modality Learning with Generalized Curriculum Distillation yet effective method, Smart Auxiliary Modality Distillation (SAMD), that can smartly choose the best auxiliary modality and perform a special auxiliary modality distillation with generalized curriculum distillation. Firstly, in Sec. 4.1.3, we show that different auxiliary modalities can contribute to different tasks at a different level, thus we propose a method to choose the best auxiliary modality and estimate upper-bound performance for a given task before actually executing AML. Inspired by Squeeze-and-Excitation Network (SENet) (Hu et al., 2018), we use channel-level attention in the SE block to estimate the AML performance for each modality, and show the consistently positive correlation between them through experiments on different tasks and auxiliary modalities. See details in Sec. 5.1. Also, in Sec. 4.1.3, we show the knowledge distillation based architecture (Type D) is better than other forms. However, when analyzing the reason for the effectiveness of AML in Sec. 4.2, we find the “supermodel condition”, which helps AML to perform better, is not fully utilized in the general knowledge distillation based architecture. We thus introduce a new method that uses supermodel in a more effective way that allows the teacher network to be aware of the student’s status in a curriculum way, leading to a better distillation. Our method achieves better performance compared to other SOTA methods (Sec. 5.2). of AML from both optimization perspective and data perspective (Sec. 4.2), providing theoretical support to the SAMD method. 2. Related Work Auxiliary Modality Learning aims to use auxiliary modality in training to boost the test performance without the auxiliary modality during inference. Cross-modality Learning and Knowledge Distillation are comparatively promising solutions and we discuss related works in each here. More related works are discussed in Appendix A.8. 2.1. Cross-modality Learning To utilize the prior knowledge between different modalities, Gupta et al. (Gupta et al., 2016) learned the representation of one modality with a pretrained network on another modality. Hoffman et al. (Hoffman et al., 2016) presented early work on modality hallucination, which used a hallucination network with RGB image as input but tried to mimic a depth network, by combining with RGB network to achieve multimodal learning. Some (Garcia et al., 2018; 2019) train the hallucination network with a different process to achieve better performance, while others (Wang et al., 2018; Piasco et al., 2021) use GAN or U-Net to generate another paired modality data with one modality. MSD (Jin et al., 2021) transfers knowledge from a teacher on multimodal tasks by learning the teacher’s behavior within each modality. A recent work (Garcia et al., 2021) trains the different modality data in different pipelines and distills the best modality pipeline knowledge to other modality pipelines. In addition to action recognition, AML has also been applied in medical image processing (Gao et al., 2019; Li et al., 2020). Specifically, Zheng et al. (Zheng, 2015) investigated the effectiveness of shape priors learned from a different modality (e.g., CT) to improve the segmentation accuracy on the target modality (e.g., MRI). Valindria et al. (Valindria et al., 2018) proposed dual-stream encoder-decoder framework, which assigns each modality with a specific branch and extracts cross-modality features with carefully designed parameter sharing strategies. Li et al. (Li et al., 2020) exploited the priors of assisted modality to promote the performance on another modality by enhancing model generalization ability, where only target-modality data is required in the test. Our analysis provides experimental understanding and theoretical underpinning for the simple yet effective method design. To the best of our knowledge, this is the first detailed analysis to guide the design and choices of AML methods for visual computing based on tasks, datasets, and network architectures. In summary, our contributions are: • Systematically list and classify different types of auxiliary modalities and architectures (Sec. 4.1.2) for AML, and analyze the performance behavior of different types of auxiliary modalities and architectures for AML across different datasets, backbones and tasks (Sec. 4.1.3). We find (1) architecture effectiveness is relatively consistent across different tasks, datasets and backbones; (2) auxiliary modality effectiveness is consistent within one task with different datasets and backbones, but not consistent across tasks. • Propose a novel AML method, “Smart Auxiliary Modality Distillation (SAMD)”, that automatically (1) chooses the best auxiliary modality for the main distillation process, and (2) performs knowledge distillation under a special “supermodel condition” to enable the teacher network to be aware of the student’s status. SAMD achieves SOTA results on variant tasks, with improvement up to 10% on end-to-end steering task, 5% on multi-view handwriting classification task, and up to 15.6% across tasks, etc. (Sec. 5). 2.2. Knowledge Distillation Knowledge Distillation can be classified as one-way or mutual-learning knowledge distillation. One-way knowledge distillation mainly distills the knowledge of a fixed teacher model (usually large) to a student model (usually small). In the early days, Hinton et al. (Hinton et al., 2015) proposed compressing the knowledge in an ensemble of multiple models into a single model that is much easier • Analyze and explain the reasons for the effectiveness 2 Auxiliary Modality Learning with Generalized Curriculum Distillation to deploy by mimicking the class distribution via softened softmax from the ensemble teacher. Some studies (Ding et al., 2019; Wen et al., 2019) went further to explore the trade-off between the supervision of soft logits and hard task label. Furthermore, there are also methods exploiting the intermediate feature (Romero et al., 2015; Kim et al., 2018a; Jin et al., 2019) as transferred knowledge, which can improve the middle layer’s representational ability in a student network. Other than the one-way distillation from teacher to student, some focus on mutual knowledge distillation among models trained from scratch. This line of research is especially notable for scenarios without an available pretrained teacher model. A significant work is deep mutual learning (DML) (Zhang et al., 2018). During the training phase, DML uses a pool of randomly initialized models as the student pool, and every student is guided by the output of other students and the task label. (Wang et al., 2023) go a further step and introduce an anchor model to delimit a subspace within the full solution space of the target problem, which can help to ease the distillation difficulty. The goal of AML is to achieve better performance with the help of auxiliary modalities IA than only train on main modalities IM . AML can be found in the real world, e.g., when you cannot solve a problem in class, the teacher gives you some hints so the students can understand the relationship between the problem and the answer better. Then, after class, the student can solve similar types of new problems without hints. In this paper, for visual computing, we fix the main modality as RGB images, but the auxiliary modality can be others, like point cloud, depth map, or other customized formats. AML is useful in the following scenarios: (1) Getting the extra modality data during test is not feasible. For example, the extra modality can be the human-labeled attention map, which is achievable during training, but we cannot ask the user to label the attention map in real time. (2) Getting the extra modality data during test is feasible but expensive. For example, in autonomous driving, we need to use Lidar to get point cloud data. Using point clouds during training only requires several Lidar sensors on the cars for development, but using point clouds during test means every car needs to install Lidar, which is costly. AML can reduce the cost dramatically compared with the solutions that require the Lidar+camera, and can perform better than the solutions that only use camera. Similarly for robot navigation. We share a similar philosophy with distillation, but aim to design a cross-modality learning framework to utilize the hidden information from auxiliary modalities, resulting in a different methodology. 3. Auxiliary Modality Learning 4. Analysis Auxiliary modality learning offers promising potential, but has not been fully examined. Previous works studied auxiliary modality learning in certain applications without a formal definition or a unified terminology. (Hoffman et al., 2016) named this process “learning with side information”, (Garcia et al., 2019) called it “learning with privileged information”, (Piasco et al., 2021) referred to it as “learning with auxiliary modality”, (Wang et al., 2018) suggested “learning with partial-modalities”, (Garcia et al., 2018) introduced the term “modality distillation”, etc. In this paper, we formally define the Auxiliary Modality Learning (AML) as follows: In this section, we aim to do analysis on two key problems of AML when applying on real-world tasks: (1) What kinds of auxiliary modalities can we use, and how can we add them into the network to make them effective? (2) Why AML can work without auxiliary modality in test? 4.1. Auxiliary Modality and Architecture In this section, we first systematically list and classify the types of auxiliary modalities and the types of auxiliary modality learning architecture, then analyze how different auxiliary modalities and architectures can affect auxiliary modality learning through experiments. Definition 3.1 Given data with one set of modalities IM and data with another set of modalities IA , if a model M can take both IM and IA as input during training, but only use IM during test, then we call model M an auxiliary modality model, IM as the main modality data and IA as the auxiliary modality data. Furthermore, we call the training process of an auxiliary modality model as auxiliary modality learning. 4.1.1. T YPES OF AUXILIARY M ODALITY Previous auxiliary modality learning works usually only consider one or several given types of auxiliary modality without systematic analysis (Hoffman et al., 2016; Garcia et al., 2019; Piasco et al., 2021; Wang et al., 2018; Garcia et al., 2018; Gupta et al., 2016; Jin et al., 2021). This is usually because of the limitation of data sources, e.g., limited sensor types. However, there is actually a wide range of auxiliary modality options that can be used. Except for the sensing data directly from the sensors (like depth map or infra-red image), other data generated from the original Formally, the training process is minθ L(Mθ , (IM , IA ), GT ), where θ is the weights of the model, L is the loss function, and GT is the ground truth, while the test process is Mθ (IM ). 3 Auxiliary Modality Learning with Generalized Curriculum Distillation the existing architecture designs for the auxiliary modality learning systematically. We classify the possible auxiliary modality learning architectures into four types: Type A: Auxiliary modality in the input, same architecture as multi-modality learning during training, but only use the main modality branch for test, as shown in Fig. 1(a). General multi-modality architecture is already been studied (Xiao et al., 2020), but multi-modality based AML still needs to be explored. Figure 1. Architectures for auxiliary modality learning. Type A: Auxiliary modality in the input. Type B: Auxiliary modality in the middle. Type C: Auxiliary modality in the end. Type D: Auxiliary modality in the teacher network. The dashed area in each type is the test pipeline that only use main modality. See Sec. 4.1.2. Type B: Auxiliary modality in the middle as supervision. The basic idea is to generate auxiliary modality with the main modality first, and then use the multi-modality architecture, as shown in Fig. 1(b). Existing works like (Wang et al., 2018; Li et al., 2020; Wei et al., 2016) show the effectiveness of this type of solution. image (like segmentation image or frequency image) or even annotated by a human expert (like attention map) can also be used as an auxiliary modality. In our work, we classify the potentially useful auxiliary modalities that are commonly seen in daily life and show their effectiveness through experiments. Type C: Auxiliary modality in the end as supervision, same architecture as multi-task learning (Ruder, 2017) or indirect supervision (Chang et al., 2010), but only need to use the original task pipeline for test, as shown in Fig. 1(c). The basic idea is the original task and the auxiliary modality generation task share certain common features, thus the auxiliary modality can help the learning of the original task. Formally, we suggest the following three types of data, which can be used as auxiliary modality in visual computing, according to the information contained in them: Type 1: Low-level sensing data with additional information. For example, given the main modality is RGB image, depth map or infra-red image can be used as an auxiliary modality, which is already commonly used (Wang et al., 2018; Xiao et al., 2020). The additional depth or infra-red information can be used when the RGB image is not able to capture enough information like at night. Type D: Auxiliary modality in a teacher network and teach a student network without auxiliary modality, refer to cross-modality knowledge distillation, as shown in Fig. 1(d). Existing works like (Garcia et al., 2021; 2018) show the effectiveness of this type of solution. Type 2: Middle-level representation with equivalent information but in different spaces. For example, RGB image can be transferred to/from frequency space with 2D FFT (a one-to-one mapping). Although they contain the same information, one presentation in one space may have a closer relation to the goal, helping the network to learn better. Notice existing works mostly focus on Type B and D, but few discuss or conduct experiments with Type A and C, which are also potential solutions. 4.1.3. E XPERIMENTS Type 3: High-level conceptual data with compacted information. For example, expert annotated image with emphasized key features (like attention image). This kind of auxiliary modality helps reduce the redundant noises and helps the network focus on key elements quickly. We conduct experiments to see how different auxiliary modalities and architectures of AML perform within single task and across tasks. Single Task In this experiment, we consider four factors, auxiliary modality, architectures, backbones and datasets. Our goal is to explore whether there exist general rules under different settings for broader applicability. Different datasets have different properties, e.g. data distribution and data size, while different backbones consist of varying model types and model complexity. We design experiments to answer: (1) Given fixed dataset and backbone, do all the architectures help auxiliary modality learning? What’s the order among them w.r.t. performance improvement? (2) Given fixed datasets and backbones, do all the auxiliary modalities help auxiliary modality learning? What’s the order among them w.r.t. performance improvement? (3) Are the previous two answers consistent across different Type 1 is the most common type of auxiliary modality, but Type 2 and 3 are also auxiliary modalities that can potentially contribute to the task. See example tasks for different modalities in Appendix A.1. 4.1.2. A RCHITECTURES FOR AML Existing works explore variant ways to achieve the goal of auxiliary modality learning. However, to the best of our knowledge, no one compares architectures in a systematic way (Hoffman et al., 2016; Garcia et al., 2019; Piasco et al., 2021; Wang et al., 2018; Garcia et al., 2018; Gupta et al., 2016; Jin et al., 2021). In this section, we list and compare 4 Auxiliary Modality Learning with Generalized Curriculum Distillation datasets and backbones? training can do better than only using the main modality. This is also supported by other works (Hoffman et al., 2016; Garcia et al., 2019; Piasco et al., 2021; Wang et al., 2018; Garcia et al., 2018). However, most of them use experimental results to illustrate their effectiveness, there is no detailed analysis to demonstrate why AML can work. In this section, we explain why AML works from two perspectives. The experiment setting is described in Appendix A.1. In Table 5 (Appendix A.2), we show performance comparison (Mean Accuracy %) with a combination of three auxiliary modalities, four architectures, two backbones, and two datasets. Within this task, we observe: (1) Knowledge distillation based architecture (Type D) perform best, followed by generation based architecture (Type B). Multi-task based architecture does slightly better than baseline (Type C), while Multi-modality based architecture sometimes hurt the performance (Type A). 4.2.1. O PTIMIZATION P ERSPECTIVE Here we explain why AML can work from the optimization perspective. (1) The optimal solution of AML is no worse than learning with the main modality. Inspired by “superset”, we first introduce a new concept “supermodel”. (2) All types of auxiliary modality used can help improve performance. Attention image (Type 3) achieves the highest performance improvement, followed by depth map (Type 1). Frequency image (Type 2) only performs slightly better than baseline (the order is proven to be not consistent across tasks in Finding (5) below). (A) Definition 4.1 Given a model MθA (IA ) (weights θA and (B) input IA ), and a model MθB (IB ) (weights θB and in(A) put IB ), if for any θA , there is a θB , s.t. MθA (IA ) = (3) Within this task, the effectiveness order for different auxiliary modalities or architectures are consistent across different datasets or backbones. (B) MθB (IB ) for any arbitrary valid input data IA and its superset IB . Model MB is called a “supermodel” of MA . Multiple Tasks. However, the rules observed in a single task are not necessarily TRUE across tasks. In this experiment, we consider four factors: auxiliary modality, architectures, backbones and datasets. We design experiments to answer: (4) Is the effectiveness of different architectures consistent across different tasks? (5) Is the effectiveness of different auxiliary modalities consistent across different tasks? See an example of the supermodel in Fig. 5. We then introduce a lemma based on the supermodel: Lemma 4.1 Given a model M and its supermodel M(s) , the optimal training loss of M(s) (which (s) is arg minθ(s) L(Mθ(s) (I (s) ), GT )) is less than or equal to the optimal training loss of M (which is arg minθ L(Mθ (I), GT )). where L is the loss function and GT is the ground truth. The experiment setting is presented in Appendix A.1. In Table 6 (Appendix A.2), we show performance comparison with a combination of two auxiliary modalities and four architectures across three tasks. Notice when comparing across tasks, we only focus on the relative accuracy order of one task, since the metrics of different tasks are different. We find: See the proof in Appendix A.4. Now we consider the single network architectures (Type A, B, C in Sec. 4.1.2) and the teacher network of Type D, all of them are supermodels of their related main modality pipeline network. Specifically, we can black out the auxiliary modality related branch (e.g., for Type A and C, use the pipeline in the dashed box, for Type B, blackout the auxiliary modality generation branch, for teacher network in Type D, it’s the same as Type 1) by setting the weights of connection layers to specific values (e.g., zeros, depends on the specific type of layer), then the model takes both main and auxiliary modalities will have exactly the same results of the model with only main modality, thus meeting the supermodel definition A.1. According to Lemma. 4.1, the AML model is no worse than the original model with only the main modality. (4) The effectiveness order of different architecture types is consistent for different tasks. (5) The effectiveness order of different auxiliary modalities may be different for different tasks, but consistent within one task. Finding (4) supports our choice to design based on architecture Type D (Sec. 5.2). Finding (5) motivates the need to select the best auxiliary modality at the beginning of a task, but no need to re-select it for using another architecture type, backbone, or dataset (Sec. 5.1). (2) In the case of the same performance, AML allows the optimizer to search in a higher dimension with a higher possibility to find a path learning with the main modality. 4.2. Why AML Works? We provide experimental results in Sec. 4.1.3 to show auxiliary modality learning can work, i.e., although only test with the main modality, using auxiliary modality during Suppose an optimizer g takes model M and its initial weights θ0 , loss function L, training data IM as input, and 5 Auxiliary Modality Learning with Generalized Curriculum Distillation output a path of model weights: g(M, θ0 , L, IM ) = {θ0 , θ1 , ..., θp1 } = P1 where p1 is the step number, θp1 = θ∗ is the optimal solution, and P1 is the path. Then the AML process on its supermodel M(s) is g(M(s) , θ0 ⊕ δ0 , L, (IM , IA )) ′ ′ = {θ0 ⊕ δ0 , θ1 ⊕ δ1 , ..., θp2 ⊕ δp2 } = P 2 where ⊕ is the dimension-level connection, δ as the weights ′ for the auxiliary dimension, δ0 = δp2 = 0, θp2 = θ∗ . This means that only the start and end positions are on the same dimension as the main modality, while in-between it can explore on a higher dimension (main+auxiliary modality). For any path P1∗ , there is a path P2∗ that represents the ′ same path (by setting p2 = p1 , δi = 0 and use θi = θi ∗ ∗ for i = 0, 1, ..., p1 ). But for a P2 , there’s no P1 that can represents the same path when there is a δi ̸= 0 in P2∗ . It shows even with the same start and end points, the AML can have more path options, which may be easier to be found by a given optimizer, e.g., the blue path in Fig. 2 is a gradient descent path in a higher dimension, while the red path in low dimension needs to go uphill in the middle, which is more difficult for the gradient-based optimizer to find solutions. Figure 2. “Why AML works” from optimizer perspective. The blue path (AML) is easier to be found by a gradient-based optimizer, since there is no uphill as with the main modality (red). This observation explains why middle-level and high-level conceptual data (Type 2 and 3 in Sec. 4.1.1) can help AML. 5. Smart Auxiliary Modality Distillation (SAMD) Based on the detailed analysis in Sec. 4, we propose a simple yet effective method “Smart Auxiliary Modality Distillation” to choose the best auxiliary modality and do an auxiliary modality distillation. 4.2.2. DATA P ERSPECTIVE 5.1. Auxiliary Modality Choice for a Given Task Next, we explain why AML works from data perspective. As discussed in Sec. 4.1.1, there are three types of auxiliary modality that are potentially useful, and each type can have multiple kinds of modalities. Given the conclusion in Sec. 4.1.3, there is no consistent best auxiliary modality that can be used for all the tasks, we need to choose the best auxiliary modality that can boost the performance most for a given task. Suppose there are n types of auxiliary modalities, do we need to train n times to find out the best one? The answer is no. In this section, we propose a method that can assist in deciding the importance order for a set of auxiliary modalities within one training process. (1) The auxiliary modality can help the main modality training better when main modality data is imbalanced or in shortage. For example, the main modality data has few examples that are the ‘hard cases’, which lead to a wrong decision boundary. This is common in real-world datasets, e.g., the autonomous driving dataset usually has fewer night data, even worse, has few accident data. After adding the auxiliary modality that provides more information on the hard cases, it would be easier to learn a correct decision boundary, then use this information to guide the training process with the main modality. For example, the infra-red image or depth map contains more information than RGB image when captured at night. This observation explains why the low-level sensing data (Type 1 in Sec. 4.1.1) can help AML. See figures in Appendix A.5. Inspired by Squeeze-and-Excitation Network (SENet) (Hu et al., 2018), we use channel-level attention to represent the importance of each modality. Suppose we already have a network f that can take the main modality Im as input and perform prediction for a given task. Now we have n types of auxiliary modalities that potentially can help. We first pack the different modality data in the channel level, and feed them into the Squeeze-and-Excitation (SE) block (Hu et al., 2018), followed by a 1x1 convolutional layer to make the channel number to be the same as the main modality Im , so that the original network f can take that as input and perform prediction. If different modality data have different image sizes then they should be resized (or add a shallow network to pre-process the data, if necessary) before being packed in the channel level. In our experiments, all the image data (2) The auxiliary modality data reveal a simpler mapping function from input to output. As we know, the network is used to learn a mapping function from input to output, e.g., f (IM ) = y. However, the function f may be complex and difficult to learn. Then, one solution is to split the complex function f into two parts, i.e., f (IM ) = f2 (f1 (IM )) = f2 ((IM , IA )) = y, where f1 is “data reformating function” that contains as much as inductive bias (according to the domain expert experiences) for the given task, thus the f2 will be simpler than the original f and easier to be learned. 6 Auxiliary Modality Learning with Generalized Curriculum Distillation Figure 4. Different types of auxiliary modalities used in studies. Mean Accuracy (%) Figure 3. SAMD architecture. In each round, a new curriculum learning is started by resetting the teacher weights. Then we train the model with our online distillation, until the student converges. The teacher network should be a supermodel of the student network to enable reset operation, which helps the teacher be aware of the student’s status and perform more effectively. Accuracy (%) on various angle threshold τ (degree) Method τ = 1.5 τ = 3.0 τ = 7.5 τ = 15 mAcc (Hoffman et al., 2016) (Garcia et al., 2018) (Xiao et al., 2020) (Garcia et al., 2021) Ours (SAMD) 51.7 26.1 28.6 40.2 54.3 70.6 54.1 51.2 67.8 72.2 89.6 81.8 80.0 88.7 90.1 94.7 91.0 92.0 94.3 94.6 83.6 74.6 74.4 81.0 84.4 Method w/o ours with ours Improvement kd (Hinton et al., 2015) hint (Romero et al., 2015) similarity (Tung & Mori, 2019) correlation (Peng et al., 2019) rkd (Park et al., 2019) pkt (Passalis et al., 2020) vid (Ahn et al., 2019) abound (Heo et al., 2019) factor (Kim et al., 2018b) fsp (Yim et al., 2017) 71.5 67.6 75.6 77.0 75.6 75.7 83.4 74.3 76.9 72.0 83.4 83.2 83.9 74.3 84.4 76.4 83.2 72.0 83.4 70.1 11.9 15.6 8.3 -2.7 8.8 0.7 -0.2 -2.3 6.5 -1.9 Table 2. Performance comparison with vs. without our training paradigm (containing reset operation). By applying our training paradigm on other knowledge distillation methods, we can achieve better performance in most cases (up to +15.6%) in either fully paired or merely a small amount of additional modality data. Table 1. Performance comparison on Audi dataset with Nvidia PilotNet (Bojarski et al., 2016). All the methods are trained on RGB+segmentation, and tested on RGB only. Our method outperforms others by up to 10% improvement in accuracy. To apply our training paradigm with reset operation, the framework should meet the supermodel condition (Sec. 4.2), i.e., the teacher network should be a supermodel of the student network. This condition is what differentiates our learning framework from other existing methods. with different modalities have the same size, so they can be packed directly. After training the modified network, the channel weights in the SE block can be used to determine the relative importance for the auxiliary modalities, i.e., the modality that has the largest channel weight is the one that can lead to the best AML performance. See Appendix. A.6. 5.3. Experiments In this section, we conduct experiments for autonomous steering task (Appendix A.1) and 5 other tasks (Appendix A.8). See details on the experiment settings in Appendix A.8. We also use different types of data modalities in our experiments, as shown in Fig. 4. 5.2. Auxiliary Modality Distillation Sec. 4.1.3 shows the knowledge-distillation (KD) based architecture performs best in most cases. However, when analyzing reasons for the effectiveness of AML in Sec. 4.2, we find the “supermodel condition”, which helps AML to perform better, is not fully utilized in the general KD-based architecture. We thereby introduce a new method that uses “supermodel condition” that allows the teacher network to be aware of the student’s status and leads to a better distillation. Comparison with other AML methods. We compare our SAMD with other AML methods, Hoffman et al. (Hoffman et al., 2016), Garcia et al. (Garcia et al., 2018), Xiao et al. (Xiao et al., 2020), and DMCL (Garcia et al., 2021), using Audi dataset (Geyer et al., 2020) and Nvidia PilotNet (Bojarski et al., 2016). For Xiao et al. (Xiao et al., 2020), we adopt the single-sensor version and make it suitable for the Audi dataset by removing the high-level route navigation command and measurement, and using Tao et al. (Tao et al., 2020) as the segmentation generator. In Table 1, ours outperforms others by up to 10%. We update the teacher-student in an online-like paradigm. See framework illustration in Fig. 3. The training paradigm contains t rounds. In each round, we first reset the teacher with the student, then train the teacher independently while training the student with both the general label loss and knowledge distillation loss for k epochs. k should not be too large to avoid the teacher being far away from the student. The training process stops when the student converges between different rounds or until finishing t rounds. See loss function, “reset” definition, and algorithm in Appendix A.7. Effectiveness when combining with different knowledge distillation methods. Since our training paradigm can be applied to existing knowledge distillation methods, we do experiments by combining ours with kd (Hinton et al., 2015), hint (Romero et al., 2015), similarity (Tung & Mori, 7 Auxiliary Modality Learning with Generalized Curriculum Distillation Dataset Train Mod Test Mod Method mAcc Task Train Mod Test Mod Ours Best Others Multi-features Single Feature 70.3 65.2 Audi RSDE RSDE Teacher 83.7 Handwritten Clas (Han et al., 2021) Audi RSDE RSDE RGB RGB Best Others Ours 72.9 74.3 Waypoint Pred (Prakash et al., 2021) Image Point Cloud Image 79.5 71.4 SullyChen RDE RDE Teacher 81.0 Materials Clas (Wilson et al., 2022) Image Audio Image 83.2 76.8 SullyChen RDE RDE RGB RGB Best Others Ours 88.9 89.7 Bird-eye-view Seg (Li et al., 2022a) Multi-view Point Cloud Single-view Point Cloud 45.30 44.91 Honda RSDE RSDE Teacher 79.8 Honda RSDE RSDE RGB RGB Best Others Ours 77.4 78.1 Table 4. Performance comparison on different tasks with different auxiliary modalities. Our method outperforms other methods on all tasks. See details in Appendix A.8. Table 3. Comparison on different datasets and different modalities. “RSDE” refers to RGB + segmentation + depth map + edge map, and “RDE” for RGB + depth map + edge map. Our method outperforms others on different datasets and different additional modalities by up to +11% accuracy improvement. “Best others” stands for the best performance among 4 methods in Table 1. only one vehicle is used during test. We get 0.78% accuracy improvement. See Table 4 for a simplified comparison, and more details in Appendix A.8. Relation of Channel-level Importance and AML Performance. To show the channel-level attention for different auxiliary modalities is positively correlated to the final performance of AML with different auxiliary modalities, we conduct experiments on three tasks with the same setting stated in Sec. 4.1.3, then use the same three auxiliary modalities and an additional random noise modality (whose importance should be the lowest). As shown in Table 11, in Task 1, the importance order from the channel-level attention is attention image > depth map > frequency image, the performance order from AML is exactly the same. The same phenomenon can be observed in Task 2 and 3. This confirms that we only need to perform one-time training to select the best modality for a given task. See Appendix A.8. 2019), correlation (Peng et al., 2019), rkd (Park et al., 2019), pkt (Passalis et al., 2020), abound (Heo et al., 2019), factor (Kim et al., 2018b), fsp (Yim et al., 2017), using Audi dataset (Geyer et al., 2020) and ResNet (He et al., 2016). From Table 2, our method achieves up to 15.6% improvement in both settings, showing the effectiveness of our training paradigm (with reset operation). See Appendix A.8. Comparison on different datasets and modalities. We also perform comparison with other knowledge distillation methods on different datasets (Audi (Geyer et al., 2020), Honda (Ramanishka et al., 2018), and SullyChen (Chen, 2018)) and different modalities (RGB, segmentation, depth map, and edge map). Specifically, Audi dataset contains ground truth segmentation, and other segmentation is generated by Tao et al. (Tao et al., 2020), while the depth map is generated by (Bian et al., 2019) and the edge map is generated by DexiNet (Poma et al., 2020). In Table 3, Our method outperforms others in nearly all cases by up to +11% accuracy improvement. See more details in Appendix A.8. 6. Conclusion This paper introduces ’Auxiliary Modality Learning (AML)’. We first formalize the concept of AML in terms of types of auxiliary modality and architectures for AML. We analyze how types of auxiliary modality and architectures can affect AML performance on a single task and , across tasks: best architecture is consistent within a task or across tasks, while best auxiliary modality is consistent within one task but not consistent across tasks. We also analyze the effectiveness of AML in optimization and data perspectives to provide theory support for AML. Given these findings, we propose a novel method, SAMD, to first determine the best auxiliary modality, and then do a special auxiliary modality distillation to enable the teacher network to be aware of the student’s status, leading to a better distillation that achieves the SOTA performance. Comparison on other tasks and modalities. We perform comparison on multi-feature handwritten classification task (Han et al., 2021). We regard the six feature sets as six modalities, and treat each of them as a target modality in each experiment. Our method outperforms others with 5.1% on average. We also conducted experiments on another end-to-end autonomous driving task, “way-point prediction” task (Prakash et al., 2021). We use RGB image as main modality, and point cloud as auxiliary modality, and achieve 19% improvement on average route completion, compared to RGB image baseline. In the materials classification task (Wilson et al., 2022), we use RGB image as main modality, while using sound wave as auxiliary modality, achieving 6.4% performance gain. For the bird-eye-view segmentation task (Li et al., 2022a), point-cloud from multiple vehicles are used during training, and point cloud from Limitations and Future Work: In modality distillation, we reasonably assume that the teacher network is a supermodel of the student’s, as this task focuses on the reduction of modality, instead of model size, like general knowledge distillation. A possible future direction for AML is to further examine the impact of auxiliary modality data size, e.g., can we use only a small amount of auxiliary modality data 8 Auxiliary Modality Learning with Generalized Curriculum Distillation to achieve better performance? What if data is not paired with the main modality? Are there better architectures? Architectures that can take unpaired input data instead of paired data would be a future direction. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 7. Acknowledgment Ding, Q., Wu, S., Sun, H., Guo, J., and Xia, S.-T. Adaptive regularization of labels. arXiv preprint arXiv:1908.05474, 2019. This research is partially supported in part by ARO DURIP Grant, ARL Cooperate Agreement, Barry Mersky and Capital One Endowed Professorships. Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., and Koltun, V. Carla: An open urban driving simulator. In Conference on robot learning, pp. 1–16. PMLR, 2017. References Ahn, S., Hu, S. X., Damianou, A., Lawrence, N. D., and Dai, Z. Variational information distillation for knowledge transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9163– 9171, 2019. Esteva, A., Chou, K., Yeung, S., Naik, N., Madani, A., Mottaghi, A., Liu, Y., Topol, E., Dean, J., and Socher, R. Deep learning-enabled medical computer vision. NPJ digital medicine, 4(1):1–9, 2021. Gao, Z., Chung, J., Abdelrazek, M., Leung, S., Hau, W. K., Xian, Z., Zhang, H., and Li, S. Privileged modality distillation for vessel border detection in intracoronary imaging. IEEE transactions on medical imaging, 39(5):1524– 1534, 2019. Bian, J., Li, Z., Wang, N., Zhan, H., Shen, C., Cheng, M.M., and Reid, I. Unsupervised scale-consistent depth and ego-motion learning from monocular video. Advances in neural information processing systems, 32:35–45, 2019. Garcia, N. C., Morerio, P., and Murino, V. Modality distillation with multiple stream networks for action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 103–118, 2018. Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016. Garcia, N. C., Morerio, P., and Murino, V. Learning with privileged information via adversarial discriminative modality distillation. IEEE transactions on pattern analysis and machine intelligence, 42(10):2581–2593, 2019. Chai, J., Zeng, H., Li, A., and Ngai, E. W. Deep learning in computer vision: A critical review of emerging techniques and application scenarios. Machine Learning with Applications, 6:100134, 2021. Garcia, N. C., Bargal, S. A., Ablavsky, V., Morerio, P., Murino, V., and Sclaroff, S. Distillation multiple choice learning for multimodal action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2755–2764, 2021. Chai, W. and Wang, G. Deep vision multimodal learning: Methodology, benchmark, and trend. Applied Sciences, 12(13):6588, 2022. Chang, M.-W., Srikumar, V., Goldwasser, D., and Roth, D. Structured output learning with indirect supervision. In ICML, pp. 199–206, 2010. Geyer, J., Kassahun, Y., Mahmudi, M., Ricou, X., Durgesh, R., Chung, A. S., Hauswald, L., Pham, V. H., Mühlegg, M., Dorn, S., Fernandez, T., Jänicke, M., Mirashi, S., Savani, C., Sturm, M., Vorobiov, O., Oelker, M., Garreis, S., and Schuberth, P. A2D2: Audi Autonomous Driving Dataset. 2020. URL https://www.a2d2.audi. Chen, H., Wang, X., Guan, C., Liu, Y., and Zhu, W. Auxiliary learning with joint task and data scheduling. In International Conference on Machine Learning, pp. 3634– 3647. PMLR, 2022. Guo, M.-H., Xu, T.-X., Liu, J.-J., Liu, Z.-N., Jiang, P.-T., Mu, T.-J., Zhang, S.-H., Martin, R. R., Cheng, M.-M., and Hu, S.-M. Attention mechanisms in computer vision: A survey. Computational Visual Media, pp. 1–38, 2022. Chen, S. A collection of labeled car driving datasets, https://github.com/sullychen/driving-datasets, 2018. Cochran, W. T., Cooley, J. W., Favin, D. L., Helms, H. D., Kaenel, R. A., Lang, W. W., Maling, G. C., Nelson, D. E., Rader, C. M., and Welch, P. D. What is the fast fourier transform? Proceedings of the IEEE, 55(10):1664–1674, 1967. Gupta, S., Hoffman, J., and Malik, J. Cross modal distillation for supervision transfer. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2827–2836, 2016. 9 Auxiliary Modality Learning with Generalized Curriculum Distillation Han, Z., Zhang, C., Fu, H., and Zhou, J. T. Trusted multiview classification. arXiv preprint arXiv:2102.02051, 2021. Li, K., Yu, L., Wang, S., and Heng, P.-A. Towards crossmodality medical image segmentation with online mutual knowledge distillation. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 775–783, 2020. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016. Li, Y., Ren, S., Wu, P., Chen, S., Feng, C., and Zhang, W. Learning distilled collaboration graph for multi-agent perception. Advances in Neural Information Processing Systems, 34:29541–29552, 2021. Heo, B., Lee, M., Yun, S., and Choi, J. Y. Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3779–3787, 2019. Li, Y., Ma, D., An, Z., Wang, Z., Zhong, Y., Chen, S., and Feng, C. V2x-sim: Multi-agent collaborative perception dataset and benchmark for autonomous driving. IEEE Robotics and Automation Letters, 7(4):10914–10921, 2022a. Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. Advances in Neural Information Processing Systems (NIPS), 2015. Li, Z., Li, X., Yang, L., Zhao, B., Song, R., Luo, L., Li, J., and Yang, J. Curriculum temperature for knowledge distillation. arXiv preprint arXiv:2211.16231, 2022b. Hoffman, J., Gupta, S., and Darrell, T. Learning with side information through modality hallucination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 826–834, 2016. Liebel, L. and Körner, M. Auxiliary tasks in multi-task learning. arXiv preprint arXiv:1805.06334, 2018. Hou, M., Tang, J., Zhang, J., Kong, W., and Zhao, Q. Deep multimodal multilinear fusion with high-order polynomial pooling. Advances in Neural Information Processing Systems, 32, 2019. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In European conference on computer vision, pp. 740–755. Springer, 2014. Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141, 2018. Liu, Y.-C., Tian, J., Glaser, N., and Kira, Z. When2com: Multi-agent perception via communication graph grouping. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp. 4106–4115, 2020a. Jiang, X., Ma, J., Xiao, G., Shao, Z., and Guo, X. A review of multimodal image matching: Methods and applications. Information Fusion, 73:22–71, 2021. Jin, W., Sanjabi, M., Nie, S., Tan, L., Ren, X., and Firooz, H. Modality-specific distillation. arXiv preprint arXiv:2101.01881, 2021. Liu, Y.-C., Tian, J., Ma, C.-Y., Glaser, N., Kuo, C.-W., and Kira, Z. Who2com: Collaborative perception via learnable handshake communication. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 6876–6883. IEEE, 2020b. Jin, X., Peng, B., Wu, Y., Liu, Y., Liu, J., Liang, D., Yan, J., and Hu, X. Knowledge distillation via route constrained optimization. In IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1345–1354, 2019. Park, W., Kim, D., Lu, Y., and Cho, M. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3967–3976, 2019. Joshi, G., Walambe, R., and Kotecha, K. A review on explainability in multimodal deep neural nets. IEEE Access, 2021. Passalis, N., Tzelepi, M., and Tefas, A. Probabilistic knowledge transfer for lightweight deep representation learning. IEEE Transactions on Neural Networks and Learning Systems, 32(5):2030–2039, 2020. Kim, J., Park, S., and Kwak, N. Paraphrasing complex network: Network compression via factor transfer. Advances in Neural Information Processing Systems (NIPS), pp. 2765–2774, 2018a. Peng, B., Jin, X., Liu, J., Li, D., Wu, Y., Liu, Y., Zhou, S., and Zhang, Z. Correlation congruence for knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5007–5016, 2019. Kim, J., Park, S., and Kwak, N. Paraphrasing complex network: Network compression via factor transfer. arXiv preprint arXiv:1802.04977, 2018b. Piasco, N., Sidibé, D., Gouet-Brunet, V., and Demonceaux, C. Improving image description with auxiliary modality 10 Auxiliary Modality Learning with Generalized Curriculum Distillation for visual localization in challenging conditions. International Journal of Computer Vision, 129(1):185–202, 2021. mri. In 2018 IEEE winter conference on applications of computer vision (WACV), pp. 547–556. IEEE, 2018. Wang, L., Gao, C., Yang, L., Zhao, Y., Zuo, W., and Meng, D. Pm-gans: Discriminative representation learning for action recognition using partial-modalities. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 384–401, 2018. Poma, X. S., Riba, E., and Sappa, A. Dense extreme inception network: Towards a robust cnn model for edge detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1923– 1932, 2020. Wang, X., Liu, D., Kan, M., Han, C., Wu, Z., and Shan, S. Triplet knowledge distillation. arXiv preprint arXiv:2305.15975, 2023. Prakash, A., Chitta, K., and Geiger, A. Multi-modal fusion transformer for end-to-end autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7077–7087, 2021. Wang, Y. Survey on deep multi-modal data analytics: Collaboration, rivalry, and fusion. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 17(1s):1–25, 2021. Ramanishka, V., Chen, Y.-T., Misu, T., and Saenko, K. Toward driving scene understanding: A dataset for learning driver behavior and causal reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7699–7707, 2018. Wei, S.-E., Ramakrishna, V., Kanade, T., and Sheikh, Y. Convolutional pose machines. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 4724–4732, 2016. Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., and Bengio, Y. Fitnets: Hints for thin deep nets. International Conference on Learning Representations (ICLR), 2015. Wen, T., Lai, S., and Qian, X. Preparing lessons: Improve knowledge distillation with better supervision. arXiv preprint arXiv:1911.07471, 2019. Ruder, S. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017. Wilson, J., Rewkowski, N., and Lin, M. C. Audio-visual depth and material estimation for robot navigation. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 9239–9246. IEEE, 2022. Shen, Y., Zheng, L., Shu, M., Li, W., Goldstein, T., and Lin, M. C. Gradient-free adversarial training against image corruption for learning-based steering. In Neural Information Processing Systems (NIPS), 2021. Xiang, L., Ding, G., and Han, J. Learning from multiple experts: Self-paced knowledge distillation for long-tailed classification. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, pp. 247–263. Springer, 2020. Tao, A., Sapra, K., and Catanzaro, B. Hierarchical multiscale attention for semantic segmentation. arXiv preprint arXiv:2005.10821, 2020. Xiao, Y., Codevilla, F., Gurram, A., Urfalioglu, O., and López, A. M. Multimodal end-to-end autonomous driving. IEEE Transactions on Intelligent Transportation Systems, 2020. Tian, Y., Krishnan, D., and Isola, P. Contrastive representation distillation. In International Conference on Learning Representations, 2020. Tung, F. and Mori, G. Similarity-preserving knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1365–1374, 2019. Xu, N., Mao, W., and Chen, G. Multi-interactive memory network for aspect based multimodal sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 371–378, 2019. UCI. Multiple Features Data Set, https://archive.ics.uci.edu/ml/datasets/multiple+features, 0. Xu, R., Xiong, C., Chen, W., and Corso, J. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In Proceedings of the AAAI conference on artificial intelligence, volume 29, 2015. Valada, A., Radwan, N., and Burgard, W. Deep auxiliary learning for visual localization and odometry. In 2018 IEEE international conference on robotics and automation (ICRA), pp. 6939–6946. IEEE, 2018. Yim, J., Joo, D., Bae, J., and Kim, J. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4133– 4141, 2017. Valindria, V. V., Pawlowski, N., Rajchl, M., Lavdas, I., Aboagye, E. O., Rockall, A. G., Rueckert, D., and Glocker, B. Multi-modal learning from unpaired images: Application to multi-organ segmentation in ct and 11 Auxiliary Modality Learning with Generalized Curriculum Distillation Zadeh, A., Chen, M., Poria, S., Cambria, E., and Morency, L.-P. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250, 2017. Zadeh, A., Liang, P. P., Mazumder, N., Poria, S., Cambria, E., and Morency, L.-P. Memory fusion network for multiview sequential learning. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018. Zhang, Y., Xiang, T., Hospedales, T. M., and Lu, H. Deep mutual learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4320–4328, 2018. Zheng, Y. Cross-modality medical image detection and segmentation by transfer learning of shapel priors. In 2015 IEEE 12th International Symposium on Biomedical Imaging (ISBI), pp. 424–427. IEEE, 2015. 12 Auxiliary Modality Learning with Generalized Curriculum Distillation A. Appendix A.1. Details on Experimental Settings Single task. We use autonomous driving task since there are datasets for this task that contain all types of auxiliary modality in Sec. 4.1.1. Specifically, the input is one RGB image and the output is one float value which represents the steering angle. Classical computer vision tasks, like object classification or detection, mostly do not use datasets with low-level sensing data other than RGB image (like depth map is not available in ImageNet or COCO). We use Audi dataset (Geyer et al., 2020) and Honda dataset (Ramanishka et al., 2018) in this experiment. Also, we use depth map (Type 1), frequency image (Type 2), and attention image (Type 3) as Auxiliary modalities. We generate depth map with (Bian et al., 2019), frequency image with standard 2D fast Fourier transform (Cochran et al., 1967), and attention image with segmentation map provided by Audi dataset. We implement all four types of auxiliary modality learning architectures introduced in Sec. 4.1.2, and choose the Nvidia PilotNet (Bojarski et al., 2016) and ResNet (He et al., 2016) as the main backbones. Mean accuracy defined in (Shen et al., 2021) is used as the evaluation metric. Figure 5. A simple example of supermodel. N et1 contains two blocks f1 and f2 . N et2 contains the same block f1 and f2 , and another block h which is possible to be set as an identical function. (A) IB ), if for any θA , there is a θB , such that MθA (IA ) = (B) MθB (IB ) for any arbitrary valid input data IA and its superset IB . We call model MB as a “supermodel” of MA . We show a simple example of supermodel in Fig. 5. N et1 contains two blocks f1 and f2 . N et2 contains the same block f1 and f2 , and another block h. If there is a set of specific weights θ0 for h that can meet hθ0 (x) = x for any valid x, then N et2 is a supermodel of N et1 , according to Definition. A.1. In this case, for any specific weights of N et1 , we can always construct a set of weights for N et2 that has exactly the same performance of N et1 , which means the optimal solution for training N et2 will be no worse than N et1 . Furthermore, if these two models are trained in parallel, the supermodel can be “repositioned” to the same status of the base model at any time by the construction method above. This property can be used in knowledge distillation to let the teacher get back to the student’s position and help find a better way at any time the student is stuck. Another example is for the same architecture with different numbers of layers, e.g., ResNet152 is a supermodel of ResNet50. Multiple tasks. We use Audi dataset (Geyer et al., 2020) for end-to-end steering task, COCO dataset (Lin et al., 2014) for real-world classification task, and a customized dataset for customized classification task. We use semantic segmentation label contained in Audi and COCO to generate related attention images. We use blur-level estimation task as the customized task, following (Shen et al., 2021) to add blur perturbation onto the Audi dataset, and use the level ID as the ground truth, see Fig. 6. Also, we use attention image and frequency image as auxiliary modalities, and implement all four types of auxiliary modality learning architectures introduced in Sec. 4.1.2. We choose Nvidia PilotNet (Bojarski et al., 2016) for steering task, ResNet (He et al., 2016) for the classification task, and modified PilotNet for the customized classification task (change the header of the network to general classification header). We use mean accuracy (Shen et al., 2021) for steering task, accuracy for real-world classification and customized classification. A.4. Prove of Lemma. 4.1 Lemma A.1 Given a model M and its supermodel M(s) , the optimal training loss of M(s) (which (s) is arg minθ(s) L(Mθ(s) (I (s) ), GT )) is less than or equal to the optimal training loss of M (which is arg minθ L(Mθ (I), GT )). where L is the loss function and GT is the ground truth. A.2. Experiment Results for Auxiliary Modality Types and Architectures We show experimental results for auxiliary modality in Table 5 and architectures in Table 6. See analysis in Sec. 4.1. Prove: Let θ∗ = arg minθ L(Mθ (I), GT ) represent the weights that lead to the best training performance for model M, then according to the definition of supermodel, there (s) is a θ(s)∗ that meet Mθ∗ (I) = Mθ(s)∗ (I (s) ), equivalent to A.3. Supermodel Example (s) We first introduce the “supermodel” definition: L(Mθ∗ (I), GT ) = L(Mθ(s)∗ (I (s) ), GT ). That is, there’s at least one solution for training M(s) can get the same performance as training M. Furthermore, if θ∗ is the optimal solution that achieves the minimal training loss of M(s) , (A) Definition A.1 Given a model MθA (IA ) (weights θA and (B) input IA ), and a model MθB (IB ) (weights θB and input 13 Auxiliary Modality Learning with Generalized Curriculum Distillation Audi (Geyer et al., 2020) PilotNet (Bojarski et al., 2016) ResNet (He et al., 2016) Honda (Ramanishka et al., 2018) Attention Frequency Depth Attention Frequency Depth Archi Type A Archi Type B Archi Type C Archi Type D 66.3 71.6 70.1 73.4 64.9 66.5 65.7 67.9 65.8 68.4 68.8 70.8 74.5 75.9 74.3 77.4 72.9 73.2 73.7 74.8 73.4 75.1 74.2 76.7 Archi Type A Archi Type B Archi Type C Archi Type D 78.5 80.5 79.6 82.4 77.9 79.1 78.5 79 78.2 80.1 79.3 81.8 82.1 84.7 83.6 85.2 81.1 82.4 82.1 83 81.9 83.9 83.1 84.3 Table 5. Performance improvement comparison (Mean Accuracy %) with different auxiliary modalities, architectures, backbones and datasets. The relative effectiveness for different architectures is consistent under different datasets, backbones, and auxiliary modalities within one task. Similarly, The relative effectiveness for different auxiliary modalities is consistent under different datasets, backbones, and architectures within one task. task 1 Archi Type A Archi Type B Archi Type C Archi Type D task 2 task 3 Attention Frequency Attention Frequency Attention Frequency 66.3 71.6 70.1 73.4 64.9 66.5 65.7 67.9 70.1 82.1 80.7 84.3 69.3 73.6 71.1 75.6 64.3 68.4 65.3 70.3 65.2 72.5 70.8 74.9 Table 6. Performance comparison (Mean Accuracy %) across tasks. The effectiveness order of different architectures is consistent across tasks, but not for auxiliary modalities. A.6. Modality Choice then the equal condition in Lemma 4.1 holds, if not, the less condition holds. We show a modified network to extract channel-level importance and estimate modality effectiveness with SE block in Fig. 8. Suppose we already have a network f that can take the main modality Im as input and perform prediction for a given task. Now we have n types of auxiliary modalities that potentially can help. We first pack the different modality data in the channel level, and feed them into the Squeezeand-Excitation (SE) block (Hu et al., 2018), followed by a 1x1 convolutional layer to ensure the channel number is the same as the main modality Im , so that the original network f can take it as input and perform prediction. Notice those discussions are all on the training space, and we assume that better training performance will lead to better test performance in general. Otherwise, given the test set is unknown during training, model A is guaranteed no worse than B in test if and only if model A is no worse than B for every possible data points in test domain (or there will be at least one test set that contains data points that model A is worse than B ), upon which no existing work can provide any theoretical guarantee. A.5. More Explanation on Why AML Can Work In practice, when we start to solve an AML task, we may have multiple auxiliary modality available, but collecting a full dataset for all of them may be time-consuming. We can first collect a small set of data with all modalities, and use our method to decide which or which sets of auxiliary modality is needed. After that, we can collect all the useful modality data on a larger scale, try different backbones, tune hyper-parameters, etc. Finding (5) in Sec. 4.1.3 motivates the need to select the best auxiliary modality at the beginning of a task, but no need to re-select even when using another architecture type, backbone, or dataset. For estimating the upper-bound, we need a full set of all modalities. The model used in this step is the “supermodel” of the teacher model in the next step, and thus it can help estimate the upper-bound performance, given Lemma 4.1. In Fig. 7, the main modality data has few examples in the hard and challenging case area, which leads to a wrong decision boundary. This is common in real-world datasets, e.g., autonomous driving datasets usually have fewer datasets for night-time driving, and even fewer on accidents. After adding the auxiliary modality that provides more information in the hard case area, it would be much easier to learn a correct decision boundary, then use this information to guide the training process with the main modality. For example, the infra-red image or depth map contains more information than RGB image when captured at night. This explains why the low-level sensing data (Type 1 in Sec. 4.1.1) can help AML. 14 Auxiliary Modality Learning with Generalized Curriculum Distillation Figure 6. Tasks for our experiments. LEFT: end-to-end steering task, input image, output steering angle. MIDDLE: classification task, input image, output object category. RIGHT: classification task, input image, output blur level. Figure 7. Auxiliary modality helps construct the decision boundary around the difficult cases (e.g. lack of data coverage). Circles are main modality data, and squares are auxiliary modality data. where θstu is the parameter of the student network, LM is the loss function, and η is the learning rate. Meanwhile, we design a teacher that takes {IM , IA } as input, and update via an independent feature network Ftea (F1 , F2 , F3 in Fig. 3) and a predictor D that share weights with that of the student network. The teacher network is updated via θtea ← θtea − η∇LA (D(Ftea ({IM , IA })), GT ) . (2) The teacher and student learn different representations related to the same task by being exposed to different modalities. The teacher has access to the auxiliary modality IA , the knowledge of the teacher is distilled to assist the student through a consistency loss Lcon that measures the pairwise distance between Fstu (IM ) and Ftea (IM , IA ) as part of the student’s objective LM , specifically, Figure 8. Modified network to extract channel-level importance and estimate modality effectiveness with SE block. (Hu et al., 2018) See more descriptions in Sec. 5.1. LM = Lsup (D(Fstu (IM )), GT ) + A.7. AML in SAMD βLcon (Fstu (IM ), Ftea ({IM , IA })) Formally, given a task, we denote a learner composed of a feature network F and a predictor of fully-connected layers D. We design a student that takes IM as input, and update via iterations of mini-batches, (3) where Lsup is a term that supervises the learning on the main modality. (A) M θstu ← θstu − η∇L Definition A.2 Given a model MθA (IA ) (weights θA and (1) (B) input IA ), and its supermodel MθB (IB ) (weights θB and 15 Auxiliary Modality Learning with Generalized Curriculum Distillation input IB ), we define “reset B with A” to be the process of (A) (B) constructing a new θB that meet MθA (IA ) = MθB (IB ) for given θA and any arbitrary valid input data IA and its superset IB . for different knowledge distillation methods following (Tian et al., 2020). We pick epoch number in each round k = 5 from ablation study of k = 1, 2, 5, 20. We set the round number n = 400 for Audi dataset and n = 40 for Honda dataset. In the experiments, each training process is finished within 24 hours. The main task is the steering task introduced in the single task setting in Appendix A.1. A simple example is, suppose B is a supermodel of A (e.g., B = A + A′ ), reset B with A is constructing θB = [θA , 0], where θA is the weights of A and 0 is the weights of A′ . In Fig. 3, the teacher network is a supermodel of the student network, because for any weights of student network, we can construct a teacher network that meet D(Ftea ({IM , IA })) = D(Fstu ({IM })) by resetting the F1 weights with F4 weights, F2 weights with F5 weights, and set F3 weights to 0. Indeed the reset operation in our method requires that the teacher model is a supermodel of the student model. Comparison on other tasks. To show the generalizability of our method, we perform comparison on multi-feature handwritten classification task (Han et al., 2021) in Table 7. The dataset (UCI, 0) consists of six features of handwritten numerals (‘0’–‘9’) with 2,000 samples in total. We regard the six feature sets as six modalities, and treat each of them as target modality in each experiment. Our method outperforms others by 5.1% higher accuracy on average. As shown in Algorithm 1, the training paradigm contains t rounds. In each round, we first reset the teacher with the student, then train the teacher independently while training student with both the general label loss and knowledge distillation loss for k epochs. k should not be too large to avoid the teacher being far away from the student. The training process stops when the student converges between different rounds or until finishing t rounds. Accuracy (%) on different modalities (ID:1∼6) Method 1 2 3 4 5 6 mean Best Others Ours 84.92 89.40 62.98 65.20 68.75 72.80 61.10 69.50 70.35 73.15 43.17 51.75 65.2 70.3 Table 7. Performance comparison on handwritten classification task. Our method outperforms other KD methods listed in Table 1 by 5.1% higher accuracy on average. Algorithm 1 SAMD Training Paradigm Input: Training data from main modality IM , training data from auxiliary modality IA (chosen by method in Sec. 5.1) Output: student network weights θstu Initialisation: Training Round number t, epoch number in each round k, loss correlation β, network weights θstu and θtea . for r = 1 to t do Reset teacher weights with student weights for e = 1 to k do Feed IM and IA into teacher, update teacher weights θtea with Eq. 2 end for for e = 1 to k do Feed IM and IA into teacher, and feed IM into student, update student weights θstu with Eq. 1 and loss 3 end for end for We also conducted experiments on another end-to-end autonomous driving task, “way-point prediction” task. Following the setting of (Prakash et al., 2021), we consider the task of navigation along a set of predefined routes in different areas, such as motorways, urban regions, and residential districts. A sequence of sparse goal locations in GPS coordinates, provided by a global planner and the related discrete navigational commands (e.g. “follow lane”, “turn left/right”, and “change lane”), constitute the routes. Only the sparse GPS locations are used in our method. Each route consists of several scenarios, which are initialized at predefined locations and test the agent’s ability to handle various adversarial situations, such as obstacle avoidance, unprotected turns at intersections, vehicles running red lights, and pedestrians emerging from occluded regions crossing the road at random locations. The agent needs to complete the route within a certain amount of time, while following traffic regulations and dealing with large numbers of dynamic agents. For dataset, we use the CARLA (Dosovitskiy et al., 2017) simulator for training and testing, specifically CARLA 0.9.10 which includes 8 publicly available towns. We use 7 towns for training and hold out Town05 for evaluation, as in (Prakash et al., 2021). We use both RGB and LiDAR for training in AML, but only RGB data for testing. The results are shown in Table 8. Our method benefits from the auxillary LiDAR modality in training using AML, with only RGB data during query. This set of experimental results demonstrates the effectiveness of AML. A.8. Additional SAMD Results Setting. All experiments are conducted using one Intel(R) Xeon(TM) W-2123 CPU, two Nvidia GTX 1080 GPUs, and 32G RAM. We use the SGD optimizer with learning rate 0.001 and batch size 128 for training. The number of epochs is 2,000. The loss correlation β is set with different values 16 Auxiliary Modality Learning with Generalized Curriculum Distillation Model RGB RGB+PC Ours(new) DS↑ 21.0 11.2 22.1 RC↑ 60.5 52.9 79.5 IP↓ 0.49 0.37 0.37 CP↓ 0.01 0.02 0.01 CV↓ 0.15 0.22 0.07 CL↓ 0.08 0.01 0.04 RLI↓ 0.14 0.38 0.26 SSI↓ 0.04 0.02 0.04 is generated by (Bian et al., 2019) and the edge map is generated by DexiNet (Poma et al., 2020). In Table 12, our method outperform others in practically all cases by up to +11% accuracy improvement. Table 8. Performance comparison on long-route waypoints prediction between base (train and test on RGB), multi-modality (train and test on RGB + point cloud), and ours (train on RBG + point cloud, test using only RGB). DS: Avg. driving score, RC: Avg. route completion, IP: Avg. infraction penalty, CP: Collisions with pedestrians, CV: Collisions with vehicles, CL: Collisions with layout, RLI: Red lights infractions, SSI: Stop sign infractions. Effectiveness when combining with different knowledge distillation methods. Since our training paradigm can be applied on existing knowledge distillation methods, we do experiments by combining ours with kd (Hinton et al., 2015), hint (Romero et al., 2015), similarity (Tung & Mori, 2019), correlation (Peng et al., 2019), rkd (Park et al., 2019), pkt (Passalis et al., 2020), abound (Heo et al., 2019), factor (Kim et al., 2018b), fsp (Yim et al., 2017). From Table. 13, our method achieves up to 15.6% improvement in both settings, showing the effectiveness of our training paradigm (containing reset operation). Accuracy (%) on various angle threshold τ (degree) Method τ = 1.5 τ = 3.0 τ = 7.5 τ = 15 τ = 75 mAcc Seg GT Seg Infer 50.6 48.3 70.9 69.5 85.4 85.3 96.1 95.7 99.2 98.6 80.44 79.48 Relation of Channel-level Importance and AML Performance. To show the channel-level attention for different auxiliary modalities is positively correlated to the final performance of AML with different auxiliary modalities, we conduct experiments on three tasks with the same setting stated in Sec. 4.1.3, then use the same three auxiliary modalities and an additional random noise modality (whose importance should be the lowest). We use knowledge distillation based architecture (Type D), since it’s consistently better than other architectures (see Sec. 4.1.3). Table 9. Performance comparison between ground truth and generated segmentation. The results show that the inferred segmentation can do nearly as well as ground truth segmentation, when serving as auxiliary modality (within 1% of difference). Therefore, we can use pre-trained models to generate auxiliary modality conveniently. In addition, we apply our method on audio modality based on an audio-visual depth and material estimation work (Wilson et al., 2022). We use RGB image as the main modality, and audio wave as the auxiliary modality. The task is material and depth classification. We use the same dataset in the original audio-visual work, which contains about 16,000 pairs of RGB image and audio wave. Since there’s no open-source code, we reimplement the original work, then apply our method to it. Our method outperforms other KD methods listed in Table 1 by 6.4%. As shown in Table 11, in Task 1, the importance order from the channel-level attention is attention image > depth map > frequency image, and the performance order from AML is also attention image > depth map > frequency image. The same phenomenon can be observed in Task 2 and 3. This shows we only need to perform one-time training to select the best modality for a given task. Comparison of ground truth and generated auxiliary modality. We conduct experiment with ground truth segmentation and generated segmentation (Tao et al., 2020) to see how much it will influence the performance. The model used to generate segmentation for Audi dataset (Geyer et al., 2020) is trained on Cityscapes dataset (Cordts et al., 2016). Table 9 shows that the generated segmentation can do nearly as well as ground truth segmentation, when serving as auxiliary modality (i.e. within 1% of difference), thus we can use pre-trained models to generate auxiliary modality conveniently. Finally, we apply our method on a bird-eye-view segmentation task (Li et al., 2022a). During training, a mixed point cloud from multiple viewpoints is used as input, while a point cloud from one viewpoint is used during test. We use the same virtual autonomous driving dataset (Li et al., 2022a), which contains 48,000 datapoints for training, 6,000 datapoints for test, and 6,000 datapoints. We apply our method based on the DiscoNet (Li et al., 2021). In Table 10, we show our method achieves the best performance compared to other methods. Comparison on different datasets and modalities. We also perform comparison with other knowledge distillation methods on different datasets (Audi (Geyer et al., 2020), Honda (Ramanishka et al., 2018), and SullyChen (Chen, 2018)) and different modalities (RGB, segmentation, depth map, and edge map). Specifically, Audi dataset contains ground truth segmentation, and other segmentation is generated by Tao et al. (Tao et al., 2020), while the depth map A.9. Tasks, Datasets, Backbones Tasks. We use autonomous driving tasks and 5 additional tasks in other domains. These include: object classification in the multi-task experiment (Sec. 4.1.3), handwritten classification, waypoint prediction, materials classification, and bird-eye-view segmentation experiments (in Table 4 from 17 Auxiliary Modality Learning with Generalized Curriculum Distillation Method Vehicle Sidewalk Terrain Road Building Pedestrian Vegetation mIoU Lower-bound Co-lower-bound 45.93 47.67 42.39 48.79 47.03 50.92 65.76 70 25.38 25.26 20.59 10.78 35.83 39.46 40.42 41.84 When2com (Liu et al., 2020a) Who2com (Liu et al., 2020b) DiscoNet (Li et al., 2021) Ours 48.43 48.4 56.66 56.52 33.06 32.76 46.98 47.43 36.89 36.04 50.22 49.72 57.74 57.51 68.62 67.72 29.2 29.17 27.36 30.59 20.37 20.36 22.02 22.23 39.17 39.08 42.5 42.86 37.84 37.62 44.91 45.30 Upper-bound 64.09 41.34 48.2 67.05 29.07 31.54 45.04 46.62 Table 10. Performance comparison on bird-eye-view segmentation task. Our method achieves the best performance compared to three other methods, with only 1.32% performance difference from the upper-bound. We follow the same setting of (Li et al., 2021) for the lower-bound, co-lower-bound and upper-bound. Channel-level Importance Task 1 Task 2 Task 3 AML Performance Attention Frequency Depth Noise Attention Frequency Depth Noise 0.32 0.65 0.09 0.08 0.12 0.26 0.12 - 6.9e-6 2.7e-6 3.2e-6 73.4 84.3 70.3 67.9 75.6 74.9 70.8 - 65.2 70.1 60.8 Table 11. Relation between relative orders of channel-level importance and AML performance for different auxiliary modalities. The relative modality orders are consistent between channel-level importance and AML performance within each task, therefore we can use channel-level importance to choose the best auxiliary modality before AML. Sec. 5.3, and Appendix A.8). Datasets. We use 10 datasets in total. They are: Honda, Audi, COCO, a customized dataset (Sec. 4.1.3), SullyChen Driving data, CityScapes, 4 datasets for handwritten classification (described in Appendix A.1), waypoint prediction, materials classification, and bird-eye-view segmentation, used in Sec. 5.3 and in Appendix A.8. Audi dataset (Geyer et al., 2020), or Audi Autonomous Driving Dataset (A2D2), is a dataset that features 2D semantic segmentation, 3D point clouds, 3D bounding boxes, and vehicle bus data. It includes more than 40,000 frames with semantic segmentation image and point cloud labels, of which more than 12,000 frames also have annotations for 3D bounding boxes. In addition, the authors provide unlabelled sensor data (approx. 390,000 frames) for sequences with several loops, recorded in three cities. In our experiment, we use the ”Gaimersheim” package which contains about 15,000 images with about 30 FPS. For efficiency, we adopt a similar approach as in (Bojarski et al., 2016) by further downsampling the dataset to 15 FPS to reduce similarities between adjacent frames, keep about 7,500 images and align them with steering labels. Backbones. We conduct experiments and analysis on 8 backbones in total. They are: (1) PiloNet and (2) ResNet for steering and object classification (Sec. 4.1.3), (3) TMC for handwritten classification, (4) Multi-Modal Fusion Transformer for waypoint prediction, (5) EchoCNN-AV for materials classification, (6-8) When2com, Who2com, and DiscoNet for bird-eye-view segmentation (Sec. 5.3). A.10. Dataset Description SullyChen dataset (Chen, 2018) is designed for the steering task with the longest continuous driving image sequence without road branching. Images are sampled from videos at 30 frames per second (FPS). We downsample the dataset to 5 FPS. The resulting dataset contains ≈10,000 images. Honda dataset (Ramanishka et al., 2018), or HRI Driving Dataset (HDD), is a challenging dataset to enable research on learning driver behavior in real-life environments. The dataset includes 100+ long-time driving videos with 104 hours of real human driving in the San Francisco Bay Area collected using an instrumented vehicle equipped with different sensors. We first select 30 videos that are most suitable for learning to steer task, then we extract 110,000 images from them at 1 FPS, and align them with the steering labels. COCO (Lin et al., 2014) is a large-scale object detection, segmentation, and captioning dataset. COCO has several features: Object segmentation, Recognition in context, Superpixel stuff segmentation, 330K images (¿200K labeled), 1.5 million object instances, 80 object categories, 91 stuff 18 Auxiliary Modality Learning with Generalized Curriculum Distillation Accuracy (%) on different angle threshold τ (degree) Method τ = 1.5 τ = 3.0 τ = 7.5 τ = 15 τ = 30 τ = 75 mAcc RGB+seg Teacher 42.7 68.0 88.0 94.4 96.6 98.6 81.4 RGB RGB best others ours 30.3 52.6 51.0 72.7 78.2 91.3 88.4 95.0 94.4 97.0 98.2 98.3 73.4 84.5 RSDE RSDE Teacher 49.9 72.1 89.5 94.9 97.1 98.6 83.7 Audi RSDE RSDE RGB RGB best others ours 27.7 30.2 47.8 50.3 77.4 79.7 90.8 91.0 95.6 96.2 98.3 98.6 72.9 74.3 SullyChen RDE RDE Teacher 41.1 63.7 88.6 95.9 97.9 99.1 81.0 SullyChen RDE RDE RGB RGB best others ours 59.5 63.4 82.1 83.0 93.9 94.3 98.2 98.2 99.5 99.5 100.0 100.0 88.9 89.7 Honda RSDE RSDE Teacher 41.3 61.1 83.9 94.0 98.3 99.9 79.8 Honda RSDE RSDE RGB RGB best others ours 38.9 37.9 57.7 57.7 79.7 81.7 91.7 93.5 97.5 98.2 99.3 99.6 77.4 78.1 Dataset Train Mod Test Mod Audi RGB+seg Audi RGB+seg RGB+seg Audi Table 12. Comparison on different datasets and different modalities. “RSDE” refers to results from RGB + segmentation + depth map + edge map, and “RDE” for RGB + depth map + edge map. Our method outperforms others on different datasets and different additional modalities by up to +11% accuracy improvement. categories, 5 captions per image, 250,000 people with keypoints. that the student model passed by. Xiang et al. (Xiang et al., 2020) do curriculum on *instance* level with *multiple* teachers, Li et al. (Li et al., 2022b) do curriculum on *hyperparameter* level (which is the temperature for knowledge distillation) with one teacher, while ours do curriculum on *parameter* level with one teacher. Other datasets used in Table 4. Handwritten classification dataset (UCI, 0) consists of six features of handwritten numerals (‘0’–‘9’) with 2,000 samples in total. In the endto-end autonomous driving task, we use the CARLA (Dosovitskiy et al., 2017) simulator for training and testing, specifically CARLA 0.9.10 which includes 8 publicly available towns. We use 7 towns for training and hold out Town05 for evaluation, as in (Prakash et al., 2021). In the audio-visual depth and material estimation work (Wilson et al., 2022), we use the same dataset in the original audio-visual work, which contains about 16,000 pairs of RGB images and audio waves. In the bird-eye-view segmentation task (Li et al., 2022a), we also use the same virtual autonomous driving dataset (Li et al., 2022a), which contains 48,000 datapoints for training, 6,000 datapoints for test, and 6,000 datapoints. Multimodal learning works (Chai & Wang, 2022) use the same types of modality during training and test, but ours focus on modality reduction. Some of them (Zadeh et al., 2017; Hou et al., 2019) use matrix-based fusion, some (Xu et al., 2015) use MLP-based fusion, and some (Zadeh et al., 2018; Xu et al., 2019) use attention-based fusion. AML improves the ability of a primary task to generalize to *unseen* data, by training on additional auxiliary tasks alongside this primary task, while ours don’t have multiple tasks. For example, Liebel et al. (Liebel & Körner, 2018) propose a method that using auxiliary task to boost the performance of the ultimately desired main tasks, Valada et al. (Valada et al., 2018) propose VLocNet, a new convolutional neural network architecture for 6-DoF global pose regression and odometry estimation from consecutive monocular images, and recently Chen et al. (Chen et al., 2022) propose to learn a joint task and data schedule for auxiliary learning, which captures the importance of different data samples in each auxiliary task to the target task. A.11. More Related Works Except for the cross-modality learning and knowledge distillation works introduced in Sec. 2, there are other related works from curriculum distillation, multimodal learning and auxiliary learning. Curriculum distillation aims to do knowledge distillation in a curriculum way. Jin et al. (Jin et al., 2019) proposes RCO that supervises the student model with some anchor points selected from the parameter space route that the teacher model passed by, while ours is using *online* distillation with start points selected from the parameter space route 19 Auxiliary Modality Learning with Generalized Curriculum Distillation Accuracy on different threshold τ (%) Method τ = 1.5 Teacher (img+seg) Student (img) 42.7 27.3 τ = 15 τ = 30 τ = 75 Mean 94.4 90.2 96.6 95.4 98.6 98.1 81.4 72.9 Existing Distillation Methods 28.4 47.7 73.2 87.2 31.7 50.2 69.5 77.0 33.0 55.9 80.8 90.5 36.2 59.1 81.5 91.7 32.9 53.6 80.3 91.8 34.2 55.4 80.8 90.4 49.7 71.2 89.9 94.8 32.8 53.9 77.8 88.9 36.8 59.2 82.0 90.6 30.8 51.6 74.9 85.8 94.3 83.7 95.1 95.3 96.2 94.9 96.7 94.6 94.7 91.6 98.4 93.8 98.3 98.2 98.5 98.5 98.3 98.0 97.9 97.4 71.5 67.6 75.6 77.0 75.6 75.7 83.4 74.3 76.9 72.0 Existing Distillation Methods with Our Training Paradigm kd (Hinton et al., 2015) 49.7 71.2 89.9 94.8 96.7 hint (Romero et al., 2015) 48.6 71.0 90.1 94.8 96.7 similarity (Tung & Mori, 2019) 52.1 71.8 90.0 94.8 96.6 correlation (Peng et al., 2019) 31.8 52.7 78.1 89.7 95.2 rkd (Park et al., 2019) 54.3 72.2 90.1 94.7 96.6 pkt (Passalis et al., 2020) 34.5 56.9 82.9 90.3 95.5 vid (Ahn et al., 2019) 48.6 71.0 90.1 94.8 96.7 abound (Heo et al., 2019) 29.6 49.5 74.4 87.3 93.5 factor (Kim et al., 2018b) 49.7 71.2 89.9 94.8 96.7 fsp (Yim et al., 2017) 28.8 48.2 71.5 83.9 91.2 98.3 98.3 98.3 98.3 98.3 98.4 98.3 97.8 98.3 97.4 83.4 83.2 83.9 74.3 84.4 76.4 83.2 72.0 83.4 70.1 kd (Hinton et al., 2015) hint (Romero et al., 2015) similarity (Tung & Mori, 2019) correlation (Peng et al., 2019) rkd (Park et al., 2019) pkt (Passalis et al., 2020) vid (Ahn et al., 2019) abound (Heo et al., 2019) factor (Kim et al., 2018b) fsp (Yim et al., 2017) τ = 3.0 τ = 7.5 Train Vanilla 68.0 88.0 49.0 77.4 Improvement 11.9 15.6 8.3 -2.7 8.8 0.7 -0.2 -2.3 6.5 -1.9 Table 13. Performance comparison with vs. without our training paradigm (containing reset operation). By applying our training paradigm on other knowledge distillation methods, we can achieve better performance in most cases (up to +15.6%) in either fully paired or merely a small amount of additional modality data. 20