A Closer Look at Self-Supervised Lightweight Vision Transformers

Shaoru Wang 1 2 Jin Gao 1 2 Zeming Li 3 Xiaoqin Zhang 4 Weiming Hu 1 2 5

Abstract

sive labeled data. SSL focuses on various pretext tasks for
pre-training. Among them, several works (He et al., 2020;
Chen et al., 2020; Grill et al., 2020; Caron et al., 2020;
Chen et al., 2021a; Caron et al., 2021) based on contrastive
learning (CL) have achieved comparable or even better accuracy than supervised pre-training when transferring the
learned representations to downstream tasks. Recently, another trend focuses on masked image modeling (MIM) (Bao
et al., 2021; He et al., 2021; Zhou et al., 2022), which perfectly fits Vision Transformers (ViTs) (Dosovitskiy et al.,
2020) for vision tasks, and achieves improved generalization performance. Most of these works, however, involve
large networks with little attention paid to smaller ones.
Some works (Fang et al., 2020; Abbasi Koohpayegani et al.,
2020; Choi et al., 2021) focus on CL on small convolutional
networks (ConvNets) and improve the performance by distillation. However, the pre-training of lightweight ViTs is
considerably less studied.

Self-supervised learning on large-scale Vision
Transformers (ViTs) as pre-training methods has
achieved promising downstream performance.
Yet, how much these pre-training paradigms promote lightweight ViTs’ performance is considerably less studied. In this work, we develop and
benchmark several self-supervised pre-training
methods on image classification tasks and some
downstream dense prediction tasks. We surprisingly find that if proper pre-training is adopted,
even vanilla lightweight ViTs show comparable performance to previous SOTA networks
with delicate architecture design. It breaks the
recently popular conception that vanilla ViTs
are not suitable for vision tasks in lightweight
regimes. We also point out some defects of
such pre-training, e.g., failing to benefit from
large-scale pre-training data and showing inferior performance on data-insufficient downstream
tasks. Furthermore, we analyze and clearly show
the effect of such pre-training by analyzing the
properties of the layer representation and attention maps for related models. Finally, based on
the above analyses, a distillation strategy during pre-training is developed, which leads to further downstream performance improvement for
MAE-based pre-training. Code is available at
https://github.com/wangsr126/mae-lite.

Efficient neural networks are essential for modern on-device computer vision. Recent studies on achieving top-performing lightweight models mainly focus on designing
network architectures (Sandler et al., 2018; Howard et al.,
2019; Graham et al., 2021; Ali et al., 2021; Heo et al., 2021;
Touvron et al., 2021b; Mehta & Rastegari, 2022; Chen et al.,
2021b; Pan et al., 2022), while little attention is paid to how
to optimize the training strategies for these models. We believe the latter is also of vital importance, and that
pre-training is one of the most promising approaches in
this direction, since it has achieved great progress on large models. To this end, we develop and benchmark recently popular self-supervised pre-training methods, e.g., CL-based
MoCo-v3 (Chen et al., 2021a) and MIM-based MAE (He
et al., 2021), along with fully-supervised pre-training for
lightweight ViTs as the baselines on ImageNet and other
classification tasks, as well as some dense prediction tasks,
e.g., object detection and segmentation. We surprisingly
find that if proper pre-training is adopted, even vanilla
lightweight ViTs show comparable performance to previous SOTA networks with delicate design, e.g., we achieve
79.0% top-1 accuracy on ImageNet with vanilla ViT-Tiny
(5.7M). The finding is intriguing since the result indicates
that proper pre-training could bridge the performance gap
between naive network architectures and delicately designed
ones to a great extent, while naive architectures usually have

1. Introduction
Self-supervised learning (SSL) has shown great progress
in representation learning without heavy reliance on expen1
State Key Laboratory of Multimodal Artificial Intelligence
Systems, Institute of Automation, Chinese Academy of Sciences
2
School of Artificial Intelligence, University of Chinese Academy
of Sciences 3 Megvii Research 4 Key Laboratory of Intelligent Informatics for Safety & Emergency of Zhejiang Province, Wenzhou University 5 School of Information Science and Technology, ShanghaiTech University. Correspondence to: Jin Gao
<jin.gao@nlpr.ia.ac.cn>.

Proceedings of the 40th International Conference on Machine
Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright
2023 by the author(s).


Compared Methods. Baseline: We train a ViT-Tiny from scratch
with full supervision for 300 epochs on the training set of
ImageNet-1k (dubbed IN1K). It achieves 74.5% top-1 accuracy on the validation set of ImageNet-1k, surpassing the
original architecture (72.2% (Touvron et al., 2021a))
by increasing the number of heads from 3 to 12, and
further reaches 75.8% with our improved training
recipe (see Appendix A.1). The latter serves as our strong
baseline for examining the pre-training. We denote this
supervised model as DeiT-Tiny.

faster inference speed, since they avoid some complicated
operators. We also point out some defects of such pre-training, e.g., failing to benefit from large-scale pre-training
data and showing inferior performance on data-insufficient
downstream tasks.
These findings motivate us to dive deep into the working
mechanism of these pre-training methods for lightweight
ViTs. More specifically, we introduce a variety of model
analysis methods to study the pattern of layer behaviors
during pre-training and fine-tuning, and investigate what
really matters for downstream performance. First, we find
that lower layers of the pre-trained models matter more than
higher ones if sufficient downstream data is provided, while
higher layers matter in data-insufficient downstream tasks.
Second, we observe that the pre-training with MAE makes
the attention of the downstream models more local and concentrated, i.e., introduces locality inductive bias, which may
be the key to the performance gain. Based on the above analyses, we also develop a distillation strategy for MAE-based
pre-training, which significantly improves the pre-training
of lightweight ViTs. Better downstream performance is
achieved especially on data-insufficient classification tasks
and detection tasks.

MAE: MAE (He et al., 2021) is selected as a representative
of MIM-based pre-training methods, as it has a simple
framework with low training cost. We largely follow the
design of MAE except that the encoder is changed to ViT-Tiny. Several basic factors and components are adjusted
to fit the smaller encoder (see Appendix A.2). By default,
we pre-train on IN1K for 400 epochs, and denote the
pre-trained model as MAE-Tiny.
MoCov3: We also implement a contrastive SSL pre-training
counterpart, MoCo-v3 (Chen et al., 2021a), which is selected for its simplicity. We also do 400-epoch pre-training
and denote the pre-trained model as MoCov3-Tiny. Details
are provided in Appendix A.3.
Some other methods, e.g., MIM-based SimMIM (Xie et al.,
2022) and CL-based DINO (Caron et al., 2021) are also
involved, but are moved to Appendix B.3 due to space limitation.

2. Preliminaries and Experimental Setup
ViTs. We use ViT-Tiny (Touvron et al., 2021a), which contains
5.7M parameters, to examine the effect of the pre-training on downstream performance. We adopt the
vanilla architecture, consisting of a patch embedding layer
and 12 Transformer blocks with an embedding dimension of
192, except that the number of heads is increased to 12 as we
find it can improve the model’s expressive power. ViT-Tiny
is chosen because it is an ideal experimental object,
to which almost all existing pre-training methods can be
directly applied. It also has a rather naive architecture: non-hierarchical, and with low human inductive bias in design.
Thus the influence of the model architecture design on our
analyses can be eliminated to a great extent.
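The 5.7M figure quoted above can be checked with a short back-of-the-envelope count. The sketch below is only illustrative: the patch size (16), input resolution (224), MLP ratio (4), learned position embedding, class token, and 1000-class linear head are assumptions taken from the standard DeiT-Tiny configuration and are not stated in this section. Note that the number of attention heads does not enter the count, since heads only partition the 192-dimensional projections.

```python
def vit_param_count(embed_dim=192, depth=12, mlp_ratio=4, patch_size=16,
                    img_size=224, in_chans=3, num_classes=1000):
    """Count the parameters of a vanilla ViT (assumed DeiT-style layout)."""
    num_patches = (img_size // patch_size) ** 2

    # Patch embedding: one linear projection per flattened patch, plus bias.
    patch_embed = in_chans * patch_size * patch_size * embed_dim + embed_dim
    cls_token = embed_dim
    pos_embed = (num_patches + 1) * embed_dim  # +1 for the class token

    # One Transformer block: attention (qkv + output projection), MLP, 2 LayerNorms.
    qkv = embed_dim * 3 * embed_dim + 3 * embed_dim
    proj = embed_dim * embed_dim + embed_dim
    hidden = mlp_ratio * embed_dim
    mlp = (embed_dim * hidden + hidden) + (hidden * embed_dim + embed_dim)
    norms = 2 * (2 * embed_dim)  # weight and bias for each LayerNorm
    block = qkv + proj + mlp + norms

    final_norm = 2 * embed_dim
    head = embed_dim * num_classes + num_classes
    return patch_embed + cls_token + pos_embed + depth * block + final_norm + head


total = vit_param_count()
print(total, round(total / 1e6, 1))  # 5717416 5.7
```

Under these assumptions the count rounds to 5.7M, and changing the number of heads from 3 to 12 leaves it unchanged.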

3. How Well Does Pre-Training Work on
Lightweight ViTs?
In this section, we first benchmark the aforementioned pre-trained models on ImageNet, and then further evaluate their
transferability to other datasets and tasks.
3.1. Benchmarks on ImageNet Classification Tasks
Which pre-training method performs best? We first develop and benchmark the pre-training methods on ImageNet,
involving the baseline that does not adopt any pre-training,
supervised pre-training on the training set of ImageNet-21k (a bigger and more diverse dataset, roughly ten
times the size of IN1K, dubbed IN21K) and the aforementioned self-supervised pre-training with MoCo-v3 and MAE.
As reported in Tab. 1, most of these supervised and self-supervised pre-training methods improve the downstream
performance, while MAE outperforms the others at a
moderate training cost. The results indicate that the vanilla
ViTs have great potential, which can be unleashed via proper
pre-training. This encourages us to further explore how the enhanced ViTs perform compared to recent SOTA ConvNets
and ViT derivatives.

Evaluation Metrics. We adopt fine-tuning as the default
evaluation protocol considering that it is highly correlated
with utility (Newell & Deng, 2020), in which all the layers
are tuned by initializing them with the pre-trained models. By default, we do the evaluation on ImageNet (Deng
et al., 2009) by fine-tuning on the training set and evaluating
on the validation set. Several other downstream classification datasets (e.g., Flowers (Nilsback & Zisserman, 2008),
Aircraft (Maji et al., 2013), CIFAR100 (Krizhevsky et al.,
2009), etc.) and object detection and segmentation tasks on
COCO (Lin et al., 2014) are also exploited for comparison.
For a more thorough study, analyses based on linear probing
evaluation are presented in Appendix B.2.
Table 1. Comparisons on pre-training methods. We report top-1 accuracy on the validation set of ImageNet-1k. IN1K and IN21K
indicate the training set of ImageNet-1k and ImageNet-21k. The pre-training time is measured on 8×V100 GPU machine. ‘ori.’ represents
the supervised training recipe from Touvron et al. (2021a) and ‘impr.’ represents our improved recipe (see Appendix A.1).
| Pre-training method | Pre-training data | Pre-training epochs | Pre-training time (hour) | Fine-tuning recipe | Top-1 Acc. (%) |
|---|---|---|---|---|---|
| - | - | - | - | ori. | 74.5 |
| - | - | - | - | impr. | 75.8 |
| Supervised (Steiner et al., 2021) | IN21K w/ labels | 30 | 20 | impr. | 76.9 |
| Supervised (Steiner et al., 2021) | IN21K w/ labels | 300 | 200 | impr. | 77.8 |
| MoCo-v3 (Chen et al., 2021a) | IN1K w/o labels | 400 | 52 | impr. | 76.8† |
| MAE (He et al., 2021) | IN1K w/o labels | 400 | 23 | impr. | 78.0 |

† Global average pooling is used instead of the default configuration based on the class token during the fine-tuning. See Appendix A.1 for details.

How do the enhanced ViTs with pre-training rank
among SOTA lightweight networks? To answer this
question, we further compare the enhanced ViT-Tiny with
MAE pre-training to previous lightweight ConvNets and
ViT derivatives. We report top-1 accuracy along with the
model parameter count and the throughput in Tab. 3. We
denote the fine-tuned model based on MAE-Tiny as MAE-Tiny-FT. Specifically, we extend the fine-tuning epochs to
1000 following Touvron et al. (2021a) and adopt relative
position embedding. Under this strong fine-tuning recipe,
the pre-training still contributes a 1.2% performance gain, ultimately reaching 79.0% top-1 accuracy. It sets a new record
for lightweight vanilla ViTs, even without distillation during
the supervised training phase on IN1K. The pre-training
also accelerates downstream convergence: with only 300 epochs
of fine-tuning, the pre-trained model (78.5%) surpasses the
model trained from scratch for 1000 epochs (77.8%).

Table 2. Effect of pre-training data. Top-1 accuracy is reported.
| Datasets | MoCo-v3 | MAE |
|---|---|---|
| IN1K | 76.8 | 78.0 |
| 1% IN1K | 76.2 (-0.6) | 77.9 (-0.1) |
| 10% IN1K | 76.5 (-0.3) | 78.0 (+0.0) |
| IN1K-LT | 76.1 (-0.7) | 77.9 (-0.1) |
| IN21K | 76.9 (+0.1) | 78.0 (+0.0) |

is no for the examined pre-training methods. We consider
IN21K, a much larger dataset. The number of pre-training iterations is kept constant for a fair comparison. However, little
improvement is observed for either MoCo-v3 or MAE, as
shown in Tab. 2. We further consider two subsets of IN1K
containing 1% and 10% of the total examples (1% IN1K
and 10% IN1K), balanced in terms of classes (Assran et al.,
2021), and one subset with a long-tailed class distribution (Liu
et al., 2019) (IN1K-LT). Surprisingly, only marginal performance
declines are observed for MAE when pre-training on these
subsets, showing more robustness than MoCo-v3 to
the pre-training data scale and class distribution.

We conclude that the enhanced ViT-Tiny is on par with or
even outperforms most previous ConvNets and ViT derivatives with comparable parameters or throughput. This
demonstrates that SOTA performance can also be achieved
with a naive network architecture by adopting proper
pre-training, rather than by designing complex architectures. Notably, a naive architecture usually has faster inference speed
and is friendly to deployment.

3.2. Benchmarks on Transfer Performance
We further examine the transferability of these models pre-trained on IN1K, involving their transfer performance on
some other classification tasks and dense prediction tasks.
In addition to the self-supervised MAE-Tiny and MoCov3-Tiny, DeiT-Tiny is also involved, as a fully-supervised counterpart which is trained on IN1K for 300 epochs.

We also notice that some works apply supervised pre-training (Ridnik et al., 2021), CL-based self-supervised pre-training (Fang et al., 2020) and MIM-based self-supervised pre-training (Woo et al., 2023) to
lightweight ConvNets. However, we find that ViT-Tiny benefits more from the pre-training (e.g., +1.2 vs. +0.5 for
ConvNeXt V2-F). We attribute this to the plain architecture of ViT-Tiny, which has less artificial design and may therefore possess more
model capacity.

Can the pre-trained models transfer well on data-insufficient tasks? We introduce several classification
tasks (Nilsback & Zisserman, 2008; Parkhi et al., 2012;
Maji et al., 2013; Krause et al., 2013; Krizhevsky et al.,
2009; Van Horn et al., 2018) to investigate their transferability. We conduct the transfer evaluation by fine-tuning these
pre-trained models on these datasets (see Appendix A.4 for
more details). As shown in Tab. 4, all pre-training
methods outperform random initialization, but the relative ranking of these pre-training methods exhibits distinct
characteristics from those on ImageNet. We find that down-

Can the pre-training benefit from more data? One may
be curious about whether it is possible to achieve better
downstream performance by involving more pre-training
data, as it does on large models. Unfortunately, the answer
Table 3. Comparisons with previous SOTA networks on ImageNet-1k. We report top-1 accuracy along with throughput and parameter
count. The throughput is borrowed from timm (Wightman, 2019), which is measured on a single RTX 3090 GPU with a batch size fixed
to 1024 and mixed precision. ‘†’ indicates that distillation is adopted during the supervised training (or fine-tuning). ‘⋆ ’ indicates the
original architecture of ViT-Tiny (the number of attention heads is 3).
| Methods | Pre-train data | Fine-tuning epochs | #Param. | Throughput (image/s) | Top-1 Acc. (%) |
|---|---|---|---|---|---|
| ConvNets | | | | | |
| ResNet-18 (He et al., 2016) | - | 100 | 11.7M | 8951 | 69.7 |
| ResNet-50 (He et al., 2016; Wightman et al., 2021) | - | 600 | 25.6M | 2696 | 80.4 |
| EfficientNet-B0 (Tan & Le, 2019) | - | 450 | 5.3M | 5369 | 77.7 |
| EfficientNet-B0 (Fang et al., 2020) | IN1K w/o labels | 450 | 5.3M | 5369 | 77.2 (-0.5) |
| EfficientNet-B1 (Tan & Le, 2019) | - | 450 | 7.8M | 2953 | 78.8 |
| MobileNet-v2 (Sandler et al., 2018) | - | 480 | 3.5M | 7909 | 72.0 |
| MobileNet-v3 (Howard et al., 2019) | - | 600 | 5.5M | 9113 | 75.2 |
| MobileNet-v3† (Ridnik et al., 2021) | IN21K | 600 | 5.5M | 9113 | 78.0 |
| ConvNeXt V1-F (Liu et al., 2022) | - | 600 | 5.2M | | 77.5 |
| ConvNeXt V2-F (Woo et al., 2023) | - | 600 | 5.2M | 1816 | 78.0 |
| ConvNeXt V2-F (Woo et al., 2023) | IN1K w/o labels | 600 | 5.2M | 1816 | 78.5 (+0.5) |
| Vision Transformer Derivatives | | | | | |
| LeViT-128 (Graham et al., 2021) | - | 1000 | 9.2M | 13276 | 78.6 |
| LeViT-192 (Graham et al., 2021) | - | 1000 | 11.0M | 11389 | 80.0 |
| XCiT-T12/16† (Ali et al., 2021) | - | 400 | 6.7M | 3157 | 78.6 |
| PiT-Ti† (Heo et al., 2021) | - | 1000 | 5.1M | 4547 | 76.4 |
| CaiT-XXS-24† (Touvron et al., 2021b) | - | 400 | 12.0M | 1351 | 78.4 |
| Swin-1G (Liu et al., 2021; Chen et al., 2021b) | - | 450 | 7.3M | - | 77.3 |
| Mobile-Former-294M (Chen et al., 2021b) | - | 450 | 11.4M | - | 77.9 |
| MobileViT-S (Mehta & Rastegari, 2022) | - | 300 | 5.6M | 1900 | 78.3 |
| EdgeViT-XS (Pan et al., 2022) | - | 300 | 6.7M | - | 77.5 |
| Vanilla Vision Transformers | | | | | |
| DeiT-Tiny⋆ (Touvron et al., 2021a) | - | 300 | 5.7M | 4844 | 72.2 |
| DeiT-Tiny⋆† (Touvron et al., 2021a) | - | 1000 | 5.7M | 4764 | 76.6 |
| DeiT-Tiny | - | 300 | 5.7M | 4020 | 76.2 |
| MAE-Tiny-FT | IN1K w/o labels | 300 | 5.7M | 4020 | 78.5 (+2.3) |
| DeiT-Tiny | - | 1000 | 5.7M | 4020 | 77.8 |
| MAE-Tiny-FT | IN1K w/o labels | 1000 | 5.7M | 4020 | 79.0 (+1.2) |

Table 4. Transfer evaluation on classification tasks and dense-prediction tasks. Self-supervised pre-training approaches generally show
inferior performance to the fully-supervised counterpart. Top-1 accuracy is reported for classification tasks and AP is reported for object
detection (det.) and instance segmentation (seg.) tasks. The description of each dataset is represented as (train-size/test-size/#classes).

| Init. | Flowers (2k/6k/102) | Pets (4k/4k/37) | Aircraft (7k/3k/100) | Cars (8k/8k/196) | CIFAR100 (50k/10k/100) | iNat18 (438k/24k/8142) | COCO(det.) (118k/50k/80) | COCO(seg.) (118k/50k/80) |
|---|---|---|---|---|---|---|---|---|
| Random | 30.2 | 26.1 | 9.4 | 6.8 | 42.7 | 58.7 | 32.7 | 28.9 |
| supervised | | | | | | | | |
| DeiT-Tiny | 96.4 | 93.1 | 73.5 | 85.6 | 85.8 | 63.6 | 40.4 | 35.5 |
| self-supervised | | | | | | | | |
| MoCov3-Tiny | 94.8 | 87.8 | 73.7 | 83.9 | 83.9 | 54.5 | 39.7 | 35.1 |
| MAE-Tiny | 85.8 | 76.5 | 64.6 | 78.8 | 78.9 | 60.6 | 39.9 | 35.4 |

stream data scale matters. The self-supervised pre-training
approaches achieve downstream performance far behind the
fully-supervised counterpart, while the performance gap is
narrowed more or less as the data scale of the downstream

task increases. Moreover, MAE even shows inferior results
to MoCo-v3. We conjecture that it is due to their different
layer behaviors during pre-training and fine-tuning, which
will be discussed in detail in the following section.

Can the pre-trained models transfer well on dense prediction tasks? For a more thorough study, we further
conduct evaluations on downstream object detection and
segmentation tasks on COCO (Lin et al., 2014), based on Li
et al. (2021) (see Appendix A.5 for details) with different
pre-trained models as initialization of the backbone. The
results are shown in Tab. 4. The self-supervised pre-training
also lags behind the fully-supervised counterpart.

only reserving several leading blocks of pre-trained models
and randomly initializing the others, and then fine-tuning
them on IN1K (for the sake of simplicity, we only fine-tune
these models for 100 epochs). Fig. 2 shows that reserving
only a certain number of leading blocks achieves a significant performance gain over randomly initializing all the
blocks (i.e., training entirely from scratch) for both MAE-Tiny and MoCov3-Tiny. In contrast, further reserving higher
layers leads to only marginal gains for MAE-Tiny and MoCov3-Tiny, which supports our hypothesis.
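The "reserve the leading blocks, randomly initialize the rest" experiment can be sketched as a filter over a checkpoint's parameter names. This is a minimal sketch and not the authors' code: the key layout (`patch_embed.*`, `cls_token`, `pos_embed`, `blocks.<i>.*`) is an assumed timm-style naming, and keeping the embedding-side parameters whenever at least one block is reserved is also an assumption, chosen so that reserving zero blocks corresponds to training entirely from scratch.

```python
import re


def keep_leading_blocks(pretrained_state, num_reserved):
    """Select the pre-trained entries to load when reserving the first
    `num_reserved` Transformer blocks (hypothetical helper).

    Everything that is not returned stays at its random initialization.
    """
    if num_reserved == 0:
        return {}  # assumed to mean: train entirely from scratch

    kept = {}
    for name, value in pretrained_state.items():
        match = re.match(r"blocks\.(\d+)\.", name)
        if match is not None:
            # Block parameter: keep it only if the block index is reserved.
            if int(match.group(1)) < num_reserved:
                kept[name] = value
        elif name.startswith(("patch_embed.", "cls_token", "pos_embed")):
            # Embedding-side parameters sit below block 0 (assumed kept).
            kept[name] = value
        # Final norm and classification head are left randomly initialized.
    return kept


# Toy checkpoint with placeholder values instead of tensors.
toy_state = {
    "patch_embed.proj.weight": 0, "cls_token": 1, "pos_embed": 2,
    "blocks.0.attn.qkv.weight": 3, "blocks.5.mlp.fc1.weight": 4,
    "blocks.11.norm1.weight": 5, "norm.weight": 6, "head.weight": 7,
}
print(sorted(keep_leading_blocks(toy_state, 6)))
```

With a PyTorch model, the returned dictionary would typically be passed to `load_state_dict(..., strict=False)` so that the missing entries keep their random initialization.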

4. Revealing the Secrets of the Pre-Training
In this section, we introduce some model analysis methods
to study the pattern of layer behaviors during pre-training
and fine-tuning, and investigate what matters for downstream performances.

Higher layers matter in data-insufficient downstream
tasks. Previous works (Touvron et al., 2021a; Raghu et al.,
2021) demonstrate the importance of a relatively large
dataset scale for fully-supervised high-performance ViTs
with large model sizes. We also observe a similar phenomenon on lightweight ViTs even when the self-supervised
pre-training is adopted as discussed in Sec. 3.2. It motivates
us to study the key factor of downstream performance on
data-insufficient tasks.

4.1. Layer Representation Analyses
We first adopt the Centered Kernel Alignment (CKA) method¹
(Cortes et al., 2012; Nguyen et al., 2020) to analyze the layer
representation similarity across and within networks. Specifically, CKA computes the normalized similarity in terms of
the Hilbert-Schmidt Independence Criterion (HSIC; Song
et al., 2012) between two feature maps or representations,
which is invariant to orthogonal transformation and isotropic scaling of the representations (detailed in Appendix A.6).
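To make the two invariances concrete, the following is a minimal sketch of CKA with a linear kernel, written in plain Python for small inputs. It uses the simple (biased) formulation, CKA(X, Y) = ||Xc^T Yc||_F^2 / (||Xc^T Xc||_F ||Yc^T Yc||_F) with column-centered features; the estimator actually used in the paper is described in its Appendix A.6 and may differ (for example, a minibatch HSIC estimator), so this is an illustration rather than a reproduction.

```python
import math


def center_columns(X):
    """Subtract the per-feature mean; X is a list of n rows (examples)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def cross_cov_sq_norm(X, Y):
    """Squared Frobenius norm of X^T Y for row-aligned X and Y."""
    total = 0.0
    for x_col in zip(*X):
        for y_col in zip(*Y):
            dot = sum(a * b for a, b in zip(x_col, y_col))
            total += dot * dot
    return total


def linear_cka(X, Y):
    """Linear-kernel CKA between two representations of the same n examples.

    Undefined (division by zero) if either representation is constant.
    """
    Xc, Yc = center_columns(X), center_columns(Y)
    hsic_xy = cross_cov_sq_norm(Xc, Yc)
    hsic_xx = cross_cov_sq_norm(Xc, Xc)
    hsic_yy = cross_cov_sq_norm(Yc, Yc)
    return hsic_xy / math.sqrt(hsic_xx * hsic_yy)


X = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
rotated = [[-b, a] for a, b in X]          # orthogonal transformation (90-degree rotation)
scaled = [[2.0 * a, 2.0 * b] for a, b in X]  # isotropic scaling
print(linear_cka(X, rotated), linear_cka(X, scaled))
```

Both printed values equal 1 up to floating-point error, which is the invariance property stated above.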

We conduct experiments similar to those in Fig. 2 on small-scale downstream datasets. The results are shown in Fig. 3.
We observe consistent performance improvement as the
number of reserved blocks of the pre-trained models increases.
Moreover, the smaller the dataset scale, the more the performance
benefits from the higher layers. This demonstrates that higher
layers are still valuable and matter in data-insufficient downstream tasks. Furthermore, we observe comparable
transfer performance for MAE-Tiny and
MoCov3-Tiny when only a certain number of lower layers
are reserved, while MoCov3-Tiny surpasses MAE-Tiny when higher
layers are further reserved. This indicates that the higher layers of MoCov3-Tiny work better than those of MAE-Tiny on data-insufficient downstream tasks, which is also consistent with
our CKA-based analyses shown in Fig. 1: MoCov3-Tiny learns more semantics at an abstract level relevant to
recognition in higher layers (high similarity to reference
recognition models in higher layers) than MAE-Tiny.

Lower layers matter more than higher ones if sufficient
downstream data is provided. We visualize the layer
representation similarity between several pre-trained models
and DeiT-Tiny as heatmaps in Fig. 1. We choose DeiT-Tiny,
a classification model trained on IN1K with full supervision,
as the reference because we consider that a higher similarity between
a layer of the examined model and the layers of DeiT-Tiny indicates that
the layer is more relevant to recognition. Although the similarity does
not directly indicate whether the downstream performance
is good, it does reflect the pattern of layer representation
to a certain extent. The similarity within DeiT-Tiny is also
presented (the left column).

4.2. Attention Map Analyses

First, we observe a relatively high similarity between MAE-Tiny and DeiT-Tiny for lower layers, but low similarity
for higher layers. In Appendix B.1, we observe similar phenomena with several additional ViTs trained with supervision as
the reference models. This indicates that fewer semantics are
extracted by MAE-Tiny at a more abstract level in higher
layers. In contrast, MoCov3-Tiny aligns well with DeiT-Tiny
across almost all layers. However, the fine-tuning evaluation
in Tab. 1 shows that adopting MAE-Tiny as initialization
improves the performance more significantly than MoCov3-Tiny. Thus, we hypothesize that lower layers matter much
more than higher ones for the pre-trained models. In order
to verify the hypothesis, we design another experiment by

The attention maps reveal the behaviors for aggregating
information in the attention mechanism of ViTs, which are
computed from the compatibility of queries and keys by
dot-product operation. We introduce two metrics for further
analyses on the pre-trained models, i.e., attention distance
and attention entropy. The attention distance for the j-th
token of the h-th head is calculated as:

$$D_{h,j} = \sum_{i} \mathrm{softmax}(A_h)_{i,j} \, G_{i,j}, \tag{1}$$

where $A_h \in \mathbb{R}^{l \times l}$ is the attention map for the h-th attention
head, and $G_{i,j}$ is the Euclidean distance between the spatial
locations of the i-th and j-th tokens. $l$ is the number of

¹ https://github.com/AntixK/PyTorch-Model-Compare

Figure 1. Layer representation similarity within and across models as heatmaps, with x and y axes indexing the layers (the 0 index indicates the patch embedding layer), and higher values indicate higher similarity.

Figure 2. Lower layers of pre-trained models contribute to most gains on downstream ImageNet dataset. (Plot: Acc (%) on ImageNet vs. Number of Reserved Blocks, 0 to 12, for MAE-Tiny and MoCov3-Tiny.)
Figure 3. The contributions on performance gain from higher layers of pre-trained models increase as the downstream dataset scale
shrinks, which indicates that higher layers matter in data-insufficient downstream tasks. (Panels: CIFAR100, Cars, Pets; each plots Acc (%) vs. Number of Reserved Blocks, 0 to 12, for MAE-Tiny and MoCov3-Tiny.)

tokens. And the attention entropy is calculated as:

$$E_{h,j} = -\sum_{i} \mathrm{softmax}(A_h)_{i,j} \, \log\big(\mathrm{softmax}(A_h)_{i,j}\big), \tag{2}$$

tokens in all attention heads attend to solely a few nearby tokens. The attention distance and entropy for different heads
are still distributed in a wide range (except for several last
layers), which indicates that the heads have diverse specializations, making the models aggregate both local and global
tokens with both concentrated and broad focuses.

Specifically, the attention distance reveals how much local
vs. global information is aggregated, and a lower distance
indicates that each token focuses more on neighbor tokens.
The attention entropy reveals the concentration of the attention distribution, and lower entropy indicates that each
token attends to fewer tokens. We analyze the distributions
of the average attention distance and entropy across all the
tokens in different attention heads, as shown in Fig. 4.
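As an illustration of the two metrics, the sketch below computes the attention distance and attention entropy of every token for a single head, using only the Python standard library. Several details are assumptions not fixed by the text: the softmax is normalized over the attended index i, tokens lie on a square grid with distances measured in units of patches, and the class token is excluded.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def attention_distance_and_entropy(logits, side):
    """Per-token attention distance and entropy for one head.

    logits[j][i] is the pre-softmax score of token j attending to token i,
    for l = side * side tokens laid out row by row on a square grid.
    """
    pos = [(r, c) for r in range(side) for c in range(side)]
    distances, entropies = [], []
    for j, row in enumerate(logits):
        p = softmax(row)
        # Expected Euclidean distance between token j and the tokens it attends to.
        distances.append(sum(p_i * math.dist(pos[i], pos[j])
                             for i, p_i in enumerate(p)))
        # Shannon entropy of the attention distribution (0 * log 0 treated as 0).
        entropies.append(-sum(p_i * math.log(p_i) for p_i in p if p_i > 0.0))
    return distances, entropies


# Uniform attention on a 2x2 grid: broad and global.
uniform = [[0.0] * 4 for _ in range(4)]
d, e = attention_distance_and_entropy(uniform, side=2)
print(d[0], e[0])

# Each token attends only to itself: fully local and concentrated.
peaked = [[1000.0 if i == j else 0.0 for i in range(4)] for j in range(4)]
print(attention_distance_and_entropy(peaked, side=2))
```

Averaging these per-token values over all tokens gives one number per head, which is the quantity whose distribution across heads is summarized per layer in Fig. 4.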

Then, we focus on the comparison between MAE-Tiny and
MoCov3-Tiny, trying to give some explanations for their
differing downstream performance observed in Sec. 3. As
shown in Fig. 4, we observe that MoCov3-Tiny (the green
box-whisker) generally has more global and broad attention
than MAE-Tiny (the orange box-whisker). Even several
leading blocks have a narrower range of attention distance
and entropy than MAE-Tiny. We think this characteristic of
MoCov3-Tiny makes the downstream fine-tuning with it as
initialization take “shortcuts”, i.e., directly paying attention
to global features and overlooking local patterns, which
may be unfavorable for fine-grained recognition. This leads to
inferior downstream performance on ImageNet, but fair performance on
Flowers, CIFAR100, etc., for which the “shortcuts” may be
barely adequate. As for MAE-Tiny, its distinct behaviors in
higher layers with rather low attention distance and entropy
may make it hard to transfer to data-insufficient downstream
tasks, thus resulting in inferior performance on these tasks.

The pre-training with MAE makes the attention of the
downstream models more local and concentrated. First,
we compare MAE-Tiny-FT with DeiT-Tiny. The former
adopts MAE-Tiny as initialization and is then fine-tuned on
IN1K, and the latter is trained with supervision from scratch
(Random Init.) on IN1K. As shown in Fig. 4, we observe
very similar attention behaviors between them, except that
the attention of MAE-Tiny-FT (the purple box-whisker) is
more local (with lower attention distance) and concentrated
(with lower attention entropy) in middle layers compared
with DeiT-Tiny (the red box-whisker). We attribute this to
the introduction of MAE-Tiny as pre-training (the orange box-whisker), which has lower attention distance and
entropy, and may bring locality inductive bias compared
with random initialization (the blue box-whisker). It is noteworthy that the locality inductive bias does not mean that

5. Distillation Improves Pre-Trained Models
In the previous section, we have conjectured that it is hard
for MAE to learn good representation relevant to recognition

Figure 4. Attention distance and entropy analyses. We visualize the distributions of the average attention distance and entropy across all tokens in different attention heads w.r.t. the layer number with box-whisker plots.

Figure 5. Distillation compresses the good representation of the teacher (MAE-Base) to the student (D-MAE-Tiny).

and student’s layers. It is formulated as:

$$\mathcal{L}_{attn} = \mathrm{MSE}(A_T, M A_S), \tag{3}$$

where $A_T \in \mathbb{R}^{h \times l \times l}$ and $A_S \in \mathbb{R}^{h' \times l \times l}$ refer to the attention maps of the corresponding teacher’s and student’s
layers, with $h$ and $h'$ attention heads respectively. $l$ is the
number of tokens. A learnable mapping matrix $M \in \mathbb{R}^{h \times h'}$
is introduced to align the number of heads. MSE denotes
mean squared error.

Figure 6. Distillation on attention maps of higher layers improves
performance most. (Plot: Acc (%) vs. Distilled Layer Index, 3, 6, 9, 12, with a reference line for w/o distill.)

During the pre-training, the teacher processes the same unmasked image patches as the student encoder. The parameters of the student network are updated based on the joint
backward gradients from the distillation loss and the original
MAE’s reconstruction loss, while the teacher’s parameters
remain frozen throughout the pre-training process.
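The attention-based distillation loss of Eq. (3) can be written out explicitly for small inputs. The sketch below uses nested Python lists in place of tensors and a fixed mapping matrix; in actual training, M would be a learnable parameter updated by automatic differentiation together with the student, and the loss would be added to MAE's reconstruction loss (the relative weighting is not specified in this section, so none is assumed here). The function names are hypothetical.

```python
def map_heads(M, A_S):
    """Compute M A_S: mix the student's h' heads into h teacher-aligned heads.

    M is h x h'; A_S is h' x l x l. The result is h x l x l with
    result[a][i][j] = sum_b M[a][b] * A_S[b][i][j].
    """
    num_student_heads = len(A_S)
    l = len(A_S[0])
    return [[[sum(M[a][b] * A_S[b][i][j] for b in range(num_student_heads))
              for j in range(l)]
             for i in range(l)]
            for a in range(len(M))]


def attn_distill_loss(A_T, A_S, M):
    """Mean squared error between teacher maps A_T and mapped student maps M A_S."""
    mapped = map_heads(M, A_S)
    sq_sum, count = 0.0, 0
    for t_head, s_head in zip(A_T, mapped):
        for t_row, s_row in zip(t_head, s_head):
            for t, s in zip(t_row, s_row):
                sq_sum += (t - s) ** 2
                count += 1
    return sq_sum / count


# Toy example: student with h' = 2 heads, teacher with h = 3 heads, l = 2 tokens.
A_S = [[[0.9, 0.1], [0.2, 0.8]],
       [[0.5, 0.5], [0.5, 0.5]]]
M = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
A_T = map_heads(M, A_S)  # a teacher the student matches exactly under M
print(attn_distill_loss(A_T, A_S, M))  # 0.0
```

Because the teacher is frozen, only the student's attention maps and M would receive gradients from this term.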

in higher layers, which results in unsatisfactory performance
on data-insufficient downstream tasks. A natural question
is whether it can gain more semantic information by scaling up
the model. We further examine a larger pre-trained model,
MAE-Base (He et al., 2021), and find that it achieves a better
alignment with the reference model, as shown in the top
subfigure of Fig. 5. This indicates that it is possible to extract
features relevant to recognition in higher layers for the
scaled-up encoder in MAE pre-training. These observations
motivate us to compress the knowledge of large pre-trained
models into tiny ones with knowledge distillation under the
MIM framework.

Distill on lower or higher layers? We first examine which pair of
teacher and student layers contributes the most performance gain when the above layer-wise distillation is applied. We conduct experiments by constructing
the above attention-based distillation loss between the pair of
layers at 1/4, 2/4, 3/4, or 4/4 depth of the teacher and
student respectively, i.e., the 3rd, 6th, 9th, or 12th layer
for both the teacher (MAE-Base) and the student (MAE-Tiny). As shown in Fig. 6, distilling on the attention maps
of the last transformer blocks improves the performance
most, surpassing distillation on lower layers (for the
sake of simplicity, we only fine-tune the pre-trained models
on IN1K for 100 epochs). It is consistent with the analyses

Distillation methods. Specifically, a pre-trained MAE-Base (He et al., 2021) is introduced as the teacher network.
The distillation loss is constructed based on the similarity
between the attention maps of the corresponding teacher’s
Table 5. Distillation improves downstream performance on classification tasks and object detection and segmentation tasks. Top-1
accuracy is reported for classification tasks and AP is reported for object detection (det.) and instance segmentation (seg.) tasks.
| Init. | Flowers | Pets | Aircraft | Cars | CIFAR100 | iNat18 | ImageNet | COCO(det.) | COCO(seg.) |
|---|---|---|---|---|---|---|---|---|---|
| supervised | | | | | | | | | |
| DeiT-Tiny | 96.4 | 93.1 | 73.5 | 85.6 | 85.8 | 63.6 | - | 40.4 | 35.5 |
| self-supervised | | | | | | | | | |
| MAE-Tiny | 85.8 | 76.5 | 64.6 | 78.8 | 78.9 | 60.6 | 78.0 | 39.9 | 35.4 |
| D-MAE-Tiny | 95.2 (+9.4) | 89.1 (+12.6) | 79.2 (+14.6) | 87.5 (+8.7) | 85.0 (+6.1) | 63.6 (+3.0) | 78.4 (+0.4) | 42.3 (+2.4) | 37.4 (+2.0) |

in Sec. 4. Specifically, the lower layers learn good representation themselves during the pre-training with MAE, and
thus distilling on these layers contributes to marginal improvement, while the higher layers rely on a good teacher
to guide them to capture rich semantic features.

Distillation improves downstream performance. We further evaluate the distilled pre-trained model on several downstream tasks. For simplicity, we only apply distillation on the last layers. The resulting model is denoted as D-MAE-Tiny. The visualization result at the bottom of Fig. 5 shows that the good representation relevant to recognition in the teacher is compressed into the student. In particular, the quality of higher layers is improved. The distillation contributes to better downstream performance as shown in Tab. 5, especially on data-insufficient classification tasks and dense prediction tasks. In Appendix C.3, we also show that our distillation technique can help other ViT students beyond ViT-Tiny to achieve better downstream performance.

6. Related Works

Self-supervised learning (SSL) focuses on different pretext tasks (Gidaris et al., 2018; Zhang et al., 2016; Noroozi & Favaro, 2016; Dosovitskiy et al., 2014) for pre-training without using manually labeled data. Among them, contrastive learning (CL) has been popular and shows promising results on various convolutional networks (ConvNets) (He et al., 2020; Chen et al., 2020; Grill et al., 2020; Caron et al., 2020) and ViTs (Chen et al., 2021a; Caron et al., 2021). Recently, methods based on masked image modeling (MIM) achieve state-of-the-art results on ViTs (He et al., 2021; Bao et al., 2021; Zhou et al., 2022). It has been demonstrated that these methods scale up well to larger models, while their performance on lightweight ViTs is seldom investigated.

Vision Transformers (ViTs) (Dosovitskiy et al., 2020) apply a Transformer architecture (a stack of attention modules (Vaswani et al., 2017)) on image patches and show very competitive results in various visual tasks (Touvron et al., 2021a; Liu et al., 2021; Li et al., 2022). The performance of ViTs has been largely improved thanks to better training recipes (Touvron et al., 2021a; Steiner et al., 2021; Touvron et al., 2022). As for lightweight ViTs, most works focus on integrating ViTs and ConvNets (Graham et al., 2021; Heo et al., 2021; Mehta & Rastegari, 2022; Chen et al., 2021b), while few works focus on how to optimize the networks.

Knowledge Distillation is a mainstream approach for model compression (Buciluǎ et al., 2006), in which a large teacher network is trained first and then a more compact student network is optimized to approximate the teacher (Hinton et al., 2015; Romero et al., 2014). Touvron et al. (2021a) achieve better accuracy on ViTs by adopting a ConvNet as the teacher. With regard to the compression of pre-trained networks, some works (Sanh et al., 2019; Jiao et al., 2020; Wang et al., 2021; Sun et al., 2020) distill large-scale pre-trained language models. In the area of computer vision, a series of works (Fang et al., 2020; Abbasi Koohpayegani et al., 2020; Choi et al., 2021) focus on transferring knowledge of large pre-trained networks based on CL to lightweight ConvNets. Few works thus far focus on improving the quality of lightweight pre-trained ViTs based on MIM by distillation.

7. Discussions

Limitations. Our study is restricted to classification tasks and some dense-prediction tasks. We leave the exploration of more tasks for further work.

Conclusions. We investigate the self-supervised pre-training of lightweight ViTs, and demonstrate the usefulness of the advanced lightweight ViT pre-training strategy in improving the performance of downstream tasks, which even becomes comparable to most delicately-designed SOTA networks on ImageNet. Some properties of the pre-training are revealed, e.g., these methods fail to benefit from large-scale pre-training data, and show more dependency on the downstream dataset scale. We also present some insights on what matters for downstream performance. They may indicate potential future directions in improving pre-training on lightweight models, the value of which has also been demonstrated as it guides the design of our proposed distillation strategy and helps to achieve much better downstream performance. We expect that our research may provide useful experience and advance the study of self-supervised learning on lightweight ViTs.

Acknowledgment. The authors would like to thank the
anonymous reviewers for their valuable comments and suggestions. This work was supported in part by the National
Key R&D Program of China (Grant No. 2020AAA0105802,
2020AAA0105800), the Natural Science Foundation of
China (Grant No. U22B2056, 61972394, U2033210,
62036011, 62192782, 61721004, 62172413), the Beijing
Natural Science Foundation (Grant No. L223003, JQ22014),
the Major Projects of Guangdong Education Department
for Foundation Research and Applied Research (Grant
No. 2017KZDXM081, 2018KZDXM066), the Guangdong
Provincial University Innovation Team Project (Grant No.
2020KCXTD045), the Zhejiang Provincial Natural Science
Foundation (Grant No. LDT23F02024F02). Jin Gao was
also supported in part by the Youth Innovation Promotion
Association, CAS.

References

Abbasi Koohpayegani, S., Tejankar, A., and Pirsiavash, H. Compress: Self-supervised learning by compressing representations. Adv. Neural Inform. Process. Syst., 33:12980–12992, 2020.

Ali, A., Touvron, H., Caron, M., Bojanowski, P., Douze, M., Joulin, A., Laptev, I., Neverova, N., Synnaeve, G., Verbeek, J., et al. Xcit: Cross-covariance image transformers. Adv. Neural Inform. Process. Syst., 34, 2021.

Assran, M., Caron, M., Misra, I., Bojanowski, P., Joulin, A., Ballas, N., and Rabbat, M. Semi-supervised learning of visual features by non-parametrically predicting view assignments with support samples. In Int. Conf. Comput. Vis., pp. 8443–8452, 2021.

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. ArXiv, abs/1607.06450, 2016.

Bao, H., Dong, L., and Wei, F. Beit: Bert pre-training of image transformers. ArXiv, abs/2106.08254, 2021.

Buciluǎ, C., Caruana, R., and Niculescu-Mizil, A. Model compression. In ACM Int. Conf. on Knowledge Discovery and Data Mining, pp. 535–541, 2006.

Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. In Adv. Neural Inform. Process. Syst., 2020.

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers. In Int. Conf. Comput. Vis., pp. 9650–9660, 2021.

Chen, X., Fan, H., Girshick, R., and He, K. Improved baselines with momentum contrastive learning. ArXiv, abs/2003.04297, 2020.

Chen, X., Xie, S., and He, K. An empirical study of training self-supervised vision transformers. In Int. Conf. Comput. Vis., pp. 9640–9649, 2021a.

Chen, Y., Dai, X., Chen, D., Liu, M., Dong, X., Yuan, L., and Liu, Z. Mobile-former: Bridging mobilenet and transformer. ArXiv, abs/2108.05895, 2021b.

Cho, J. H. and Hariharan, B. On the efficacy of knowledge distillation. In Int. Conf. Comput. Vis., pp. 4794–4802, 2019.

Choi, H. M., Kang, H., and Oh, D. Unsupervised representation transfer for small networks: I believe i can distill on-the-fly. In Adv. Neural Inform. Process. Syst., 2021.

Cortes, C., Mohri, M., and Rostamizadeh, A. Algorithms for learning kernels based on centered alignment. The Journal of Machine Learning Research, 13:795–828, 2012.

Cubuk, E. D., Zoph, B., Shlens, J., and Le, Q. Randaugment: Practical automated data augmentation with a reduced search space. In Adv. Neural Inform. Process. Syst., volume 33, pp. 18613–18624, 2020.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 248–255, 2009.

Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., and Brox, T. Discriminative unsupervised feature learning with convolutional neural networks. Adv. Neural Inform. Process. Syst., 27:766–774, 2014.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Int. Conf. Learn. Represent., 2020.

Fang, Z., Wang, J., Wang, L., Zhang, L., Yang, Y., and Liu, Z. Seed: Self-supervised distillation for visual representation. In Int. Conf. Learn. Represent., 2020.

Gidaris, S., Singh, P., and Komodakis, N. Unsupervised representation learning by predicting image rotations. In Int. Conf. Learn. Represent., 2018.

Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch sgd: Training imagenet in 1 hour. ArXiv, abs/1706.02677, 2017.

Graham, B., El-Nouby, A., Touvron, H., Stock, P., Joulin, A., Jégou, H., and Douze, M. Levit: A vision transformer in convnet’s clothing for faster inference. In Int. Conf. Comput. Vis., pp. 12259–12269, 2021.

Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Pires, B., Guo, Z., Azar, M., et al. Bootstrap your own latent: A new approach to self-supervised learning. In Adv. Neural Inform. Process. Syst., 2020.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 770–778, 2016.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 9729–9738, 2020.

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked autoencoders are scalable vision learners. ArXiv, abs/2111.06377, 2021.

Heo, B., Yun, S., Han, D., Chun, S., Choe, J., and Oh, S. J. Rethinking spatial dimensions of vision transformers. In Int. Conf. Comput. Vis., pp. 11936–11945, 2021.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. ArXiv, abs/1503.02531, 2015.

Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., Le, Q. V., and Adam, H. Searching for mobilenetv3. In Int. Conf. Comput. Vis., 2019.

Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Q. Deep networks with stochastic depth. In Eur. Conf. Comput. Vis., pp. 646–661, 2016.

Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., Wang, F., and Liu, Q. Tinybert: Distilling bert for natural language understanding. In Findings of Empirical Methods in Natural Language Process., pp. 4163–4174, 2020.

Jin, X., Peng, B., Wu, Y., Liu, Y., Liu, J., Liang, D., Yan, J., and Hu, X. Knowledge distillation via route constrained optimization. In Int. Conf. Comput. Vis., pp. 1345–1354, 2019.

Krause, J., Stark, M., Deng, J., and Fei-Fei, L. 3d object representations for fine-grained categorization. In Int. Conf. Comput. Vis. Worksh., pp. 554–561, 2013.

Krizhevsky, A. et al. Learning multiple layers of features from tiny images. Technical Report, 2009.

Li, Y., Xie, S., Chen, X., Dollar, P., He, K., and Girshick, R. Benchmarking detection transfer learning with vision transformers. ArXiv, abs/2111.11429, 2021.

Li, Y., Mao, H., Girshick, R., and He, K. Exploring plain vision transformer backbones for object detection. ArXiv, abs/2203.16527, 2022.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In Eur. Conf. Comput. Vis., pp. 740–755, 2014.

Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., and Yu, S. X. Large-scale long-tailed recognition in an open world. In IEEE Conf. Comput. Vis. Pattern Recog., 2019.

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Int. Conf. Comput. Vis., pp. 10012–10022, 2021.

Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., and Xie, S. A convnet for the 2020s. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 11966–11976, 2022.

Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. ArXiv, abs/1608.03983, 2016.

Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. ArXiv, abs/1306.5151, 2013.

Mehta, S. and Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. In Int. Conf. Learn. Represent., 2022.

Mirzadeh, S. I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., and Ghasemzadeh, H. Improved knowledge distillation via teacher assistant. In AAAI Conf. on Artificial Intelligence, volume 34, pp. 5191–5198, 2020.

Newell, A. and Deng, J. How useful is self-supervised pretraining for visual tasks? In IEEE Conf. Comput. Vis. Pattern Recog., pp. 7345–7354, 2020.

Nguyen, T., Raghu, M., and Kornblith, S. Do wide and deep networks learn the same things? uncovering how neural network representations vary with width and depth. In Int. Conf. Learn. Represent., 2020.

Nilsback, M.-E. and Zisserman, A. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729, 2008.

Noroozi, M. and Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In Eur. Conf. Comput. Vis., pp. 69–84, 2016.

Pan, J., Bulat, A., Tan, F., Zhu, X., Dudziak, L., Li, H., Tzimiropoulos, G., and Martinez, B. Edgevits: Competing light-weight cnns on mobile devices with vision transformers. ArXiv, abs/2205.0343, 2022.

Parkhi, O. M., Vedaldi, A., Zisserman, A., and Jawahar, C. V. Cats and dogs. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 3498–3505, 2012.

Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., and Dosovitskiy, A. Do vision transformers see like convolutional neural networks? Adv. Neural Inform. Process. Syst., 34, 2021.

Ridnik, T., Ben-Baruch, E., Noy, A., and Zelnik-Manor, L. Imagenet-21k pretraining for the masses. ArXiv, abs/2104.10972, 2021.

Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., and Bengio, Y. Fitnets: Hints for thin deep nets. ArXiv, abs/1412.6550, 2014.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 4510–4520, 2018.

Sanh, V., Debut, L., Chaumond, J., and Wolf, T. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108, 2019.

Song, L., Smola, A., Gretton, A., Bedo, J., and Borgwardt, K. Feature selection via dependence maximization. The Journal of Machine Learning Research, 13(5), 2012.

Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., and Beyer, L. How to train your vit? data, augmentation, and regularization in vision transformers. ArXiv, abs/2106.10270, 2021.

Sun, Z., Yu, H., Song, X., Liu, R., Yang, Y., and Zhou, D. Mobilebert: a compact task-agnostic bert for resource-limited devices. In Association for Computational Linguistics, pp. 2158–2170, 2020.

Tan, M. and Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Int. Conf. Machine Learning., pp. 6105–6114, 2019.

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jegou, H. Training data-efficient image transformers & distillation through attention. In Int. Conf. Machine Learning., volume 139, pp. 10347–10357, 2021a.

Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., and Jégou, H. Going deeper with image transformers. In Int. Conf. Comput. Vis., pp. 32–42, 2021b.

Touvron, H., Cord, M., and Jégou, H. Deit iii: Revenge of the vit. ArXiv, abs/2204.07118, 2022.

Van Horn, G., Mac Aodha, O., Song, Y., Cui, Y., Sun, C., Shepard, A., Adam, H., Perona, P., and Belongie, S. The inaturalist species classification and detection dataset. In IEEE Conf. Comput. Vis. Pattern Recog., 2018.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Adv. Neural Inform. Process. Syst., 30, 2017.

Wang, W., Bao, H., Huang, S., Dong, L., and Wei, F. Minilmv2: Multi-head self-attention relation distillation for compressing pretrained transformers. In Findings of Int. Joint Conf. on Natural Language Process., pp. 2140–2151, 2021.

Wightman, R. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.

Wightman, R., Touvron, H., and Jégou, H. Resnet strikes back: An improved training procedure in timm. ArXiv, abs/2110.00476, 2021.

Woo, S., Debnath, S., Hu, R., Chen, X., Liu, Z., Kweon, I.-S., and Xie, S. Convnext v2: Co-designing and scaling convnets with masked autoencoders. ArXiv, abs/2301.00808, 2023.

Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., and Hu, H. Simmim: A simple framework for masked image modeling. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.

Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Int. Conf. Comput. Vis., pp. 6023–6032, 2019.

Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. In Int. Conf. Learn. Represent., 2018.

Zhang, R., Isola, P., and Efros, A. A. Colorful image colorization. In Eur. Conf. Comput. Vis., pp. 649–666, 2016.

Zhou, J., Wei, C., Wang, H., Shen, W., Xie, C., Yuille, A., and Kong, T. ibot: Image bert pre-training with online tokenizer. Int. Conf. Learn. Represent., 2022.

A. Experimental Details
A.1. Evaluation Details for MAE and MoCo-v3 on ImageNet
We follow the common practice of supervised ViT training (Touvron et al., 2021a) for fine-tuning evaluation except for some hyper-parameters of augmentation. The default setting is in Tab. A1. We use the linear lr scaling rule (Goyal et al., 2017): lr = base_lr × batch_size / 256. We use layer-wise lr decay following Bao et al. (2021) and He et al. (2021), and the decay rate is tuned separately for MAE and MoCo-v3.
Besides, we use global average pooling (GAP) after the final block during the fine-tuning of both the MAE and MoCo-v3-based pre-trained models, which is, however, not the common practice for MoCo-v3 (Chen et al., 2021a). We adopt it because it significantly outperforms the original class-token-based configuration (76.8% vs. 73.7% top-1 accuracy) for the lightweight ViT-Tiny.
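The two learning-rate rules above can be illustrated with a short sketch. It is not the authors' code: the function names are hypothetical, and the indexing of the layer-wise decay (patch embedding as layer 0, classification head as the last layer) is an assumption modeled on common open-source fine-tuning recipes. The numeric values are the fine-tuning defaults of Tab. A1.

```python
def scaled_lr(base_lr: float, batch_size: int) -> float:
    """Linear lr scaling rule: lr = base_lr * batch_size / 256."""
    return base_lr * batch_size / 256


def layerwise_lr_scales(num_blocks: int, decay: float) -> list[float]:
    """Per-layer lr multipliers under layer-wise lr decay.

    Assumed indexing: 0 is the patch embedding, 1..num_blocks are the
    Transformer blocks, and the last index is the classification head.
    Deeper layers get larger multipliers; the head gets 1.0.
    """
    num_layers = num_blocks + 1
    return [decay ** (num_layers - i) for i in range(num_layers + 1)]


lr = scaled_lr(base_lr=1e-3, batch_size=1024)            # base lr and batch size from Tab. A1
scales = layerwise_lr_scales(num_blocks=12, decay=0.85)  # 12 blocks, MAE decay rate
per_layer_lr = [lr * s for s in scales]                  # lr used for each layer group
```

With these settings the effective learning rate is 4e-3 for the head and shrinks geometrically towards the patch embedding, so the lower layers are updated more conservatively during fine-tuning.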
Table A1. Fine-tuning evaluation settings.

| config | value |
|---|---|
| optimizer | AdamW |
| base learning rate | 1e-3 |
| weight decay | 0.05 |
| optimizer momentum | β1, β2 = 0.9, 0.999 |
| layer-wise lr decay (Bao et al., 2021) | 0.85 (MAE), 0.75 (MoCo-v3) |
| batch size | 1024 |
| learning rate schedule | cosine decay (Loshchilov & Hutter, 2016) |
| warmup epochs | 5 |
| training epochs | {100, 300, 1000} |
| augmentation | RandAug(10, 0.5) (Cubuk et al., 2020) |
| colorjitter | 0.3 |
| label smoothing | 0 |
| mixup (Zhang et al., 2018) | 0.2 |
| cutmix (Yun et al., 2019) | 0 |
| drop path (Huang et al., 2016) | 0 |

Table A2. Pre-training setting for MoCo-v3.

| config | value |
|---|---|
| optimizer | AdamW |
| base learning rate | 1.5e-4 |
| weight decay | 0.1 |
| optimizer momentum | β1, β2 = 0.9, 0.999 |
| batch size | 1024 |
| learning rate schedule | cosine decay |
| warmup epochs | 40 |
| training epochs | 400 |
| momentum coefficient | 0.99 |
| temperature | 0.2 |

A.2. Pre-Training Details of MAE
Our experimental setup on MAE largely follows that of MAE (He et al., 2021), including the optimizer, learning rate, batch size, augmentation, etc. However, several basic factors and components are adjusted to fit the smaller encoder. We find that MAE prefers a much more lightweight decoder when the encoder is small, so a decoder with only one Transformer block and a width of 192 is adopted by default. We sweep over 5 masking ratios {0.45, 0.55, 0.65, 0.75, 0.85} and find that 0.75 achieves the best performance.
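To make the masking ratio concrete, the following is a minimal NumPy sketch of per-image random patch masking. It is an illustration only, not the authors' implementation: the function name is hypothetical, and the 196-patch grid assumes a 224 × 224 input with 16 × 16 patches.

```python
import numpy as np


def random_masking(num_patches: int, mask_ratio: float, rng: np.random.Generator):
    """Return (visible_idx, mask) for one image.

    visible_idx holds the sorted indices of patches fed to the encoder;
    mask[i] is True when patch i is hidden and must be reconstructed.
    """
    num_keep = int(num_patches * (1 - mask_ratio))
    order = rng.permutation(num_patches)      # random patch order
    visible_idx = np.sort(order[:num_keep])   # keep the first num_keep patches
    mask = np.ones(num_patches, dtype=bool)
    mask[visible_idx] = False
    return visible_idx, mask


rng = np.random.default_rng(0)
# Assumed setting: 14 x 14 = 196 patches, default masking ratio 0.75.
visible_idx, mask = random_masking(num_patches=196, mask_ratio=0.75, rng=rng)
```

At a ratio of 0.75 the encoder sees only 49 of the 196 patches, which is one reason a very light decoder is enough to keep pre-training cheap for a small encoder.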
A.3. Pre-Training Details of MoCo-v3
We reimplement MoCo-v3 (Chen et al., 2021a) with ViT-Tiny as encoder and largely follow the original setups. The default
setting is in Tab. A2.
Chen et al. (2021a) observe that instability is a major issue that impacts self-supervised ViT training and causes mild degradation in accuracy, and propose a simple trick, a fixed random patch projection (the first layer of a ViT model), to improve stability in practice. However, we find that stability is not the main issue for small networks, and higher performance is achieved with a learned patch projection layer. Thus, this technique is not used by default.
A.4. Transfer Evaluation Details on Classification Tasks
We evaluate several pre-trained models with transfer learning in order to measure the generalization ability of these models. We use 6 popular vision datasets: Flowers-102 (Flowers for short) (Nilsback & Zisserman, 2008), Oxford-IIIT Pets (Pets) (Parkhi et al., 2012), FGVC-Aircraft (Aircraft) (Maji et al., 2013), Stanford Cars (Cars) (Krause et al., 2013), CIFAR100 (Krizhevsky et al., 2009), and iNaturalist 2018 (iNat18) (Van Horn et al., 2018). For all these datasets except iNat18, we fine-tune with SGD (momentum=0.9), and the batch size is set to 512. The learning rates are swept over 3 candidates and the training epochs are swept over 2 candidates per dataset as detailed in Tab. A3. We adopt a cosine decay learning rate schedule (Loshchilov & Hutter, 2016) with a linear warm-up. We resize images to 224 × 224. We adopt random resized crop and random horizontal flipping as augmentations and do not use any regularization (e.g., weight decay, dropout, or the stochastic depth regularization technique (Huang et al., 2016)). For iNat18, we follow the same training configurations as those on ImageNet.

Table A3. Transfer evaluation details.

| Dataset | Learning rate | Total epochs and warm-up epochs | Layer-wise lr decay |
|---|---|---|---|
| Flowers | {0.01, 0.03, 0.1} | {(150,30),(250,50)} | {1.0, 0.75} |
| Pets | {0.01, 0.03, 0.1} | {(70,14),(150,30)} | {1.0, 0.75} |
| Aircraft | {0.01, 0.03, 0.1} | {(50,10),(100,20)} | {1.0, 0.75} |
| Cars | {0.01, 0.03, 0.1} | {(50,10),(100,20)} | {1.0, 0.75} |
| CIFAR100 | {0.03, 0.1, 0.3} | {(25,5),(50,10)} | {1.0, 0.75} |
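The cosine decay schedule with linear warm-up used for transfer can be sketched as follows. This is a minimal illustration with a hypothetical function name; warming up from zero and decaying to zero are assumptions, and the numbers are one of the Flowers candidates from Tab. A3 (150 total epochs, 30 warm-up epochs, learning rate 0.1).

```python
import math


def lr_at_epoch(epoch: float, base_lr: float, total_epochs: int, warmup_epochs: int) -> float:
    """Learning rate at a given epoch: linear warm-up, then cosine decay to zero."""
    if epoch < warmup_epochs:
        # Linear warm-up from 0 to base_lr (assumed to start at zero).
        return base_lr * epoch / warmup_epochs
    # Cosine decay over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# One candidate configuration for Flowers in Tab. A3.
schedule = [lr_at_epoch(e, base_lr=0.1, total_epochs=150, warmup_epochs=30) for e in range(151)]
```

The schedule peaks at the end of warm-up (epoch 30 here) and then decreases monotonically until the final epoch.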
A.5. Transfer Evaluation Details on Dense Prediction Tasks
We reproduce the setup in Li et al. (2021), except for replacing the backbone with ViT-Tiny and decreasing the input image size from 1024 to 640 to make it trainable on a single machine with 8 NVIDIA V100 GPUs. We fine-tune for up to 100 epochs on COCO (Lin et al., 2014), with different pre-trained models as initialization of the backbone. We do not use layer-wise lr decay since we find that it does not help the tiny backbone on the detection tasks. The weight decay is 0.05 and the stochastic depth regularization (Huang et al., 2016) is not used.
A.6. Analysis Methods
We adopt the Centered Kernel Alignment (CKA) metric to analyze the representation similarity ($S_{rep}$) within and across networks. Specifically, CKA takes two feature maps (or representations) $X$ and $Y$ as input and computes their normalized similarity in terms of the Hilbert-Schmidt Independence Criterion (HSIC) as

$$
S_{rep}(X, Y) = \mathrm{CKA}(K, L) = \frac{\mathrm{HSIC}(K, L)}{\sqrt{\mathrm{HSIC}(K, K)\,\mathrm{HSIC}(L, L)}}, \tag{A1}
$$

where $K = XX^{T}$ and $L = YY^{T}$ denote the Gram matrices for the two feature maps. A minibatch version is adopted by using an unbiased estimator of HSIC (Nguyen et al., 2020) to work at scale with our networks. We select the normalized version of the output representation of each Transformer block (consisting of a multi-head self-attention (MHA) block and an MLP block). Specifically, we select the feature map after the first LayerNorm (LN) (Ba et al., 2016) of the next block as the representation of this Transformer block as depicted in Fig. A1.

Figure A1. Transformer block.
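The CKA computation of Eq. (A1) can be sketched in NumPy as below. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the unbiased HSIC estimator is written in its standard form (Song et al., 2012), and accumulating the three HSIC terms over minibatches before normalizing is an assumption about how the minibatch version is aggregated.

```python
import numpy as np


def unbiased_hsic(K: np.ndarray, L: np.ndarray) -> float:
    """Unbiased HSIC estimate from two n x n Gram matrices (requires n > 3)."""
    n = K.shape[0]
    K = K.copy()
    L = L.copy()
    np.fill_diagonal(K, 0.0)   # the estimator uses Gram matrices with zeroed diagonals
    np.fill_diagonal(L, 0.0)
    ones = np.ones(n)
    term1 = np.trace(K @ L)
    term2 = (ones @ K @ ones) * (ones @ L @ ones) / ((n - 1) * (n - 2))
    term3 = 2.0 / (n - 2) * (ones @ K @ L @ ones)
    return float((term1 + term2 - term3) / (n * (n - 3)))


def minibatch_cka(xs, ys) -> float:
    """CKA between two representations given as lists of (n, d) minibatches."""
    hsic_xy = hsic_xx = hsic_yy = 0.0
    for X, Y in zip(xs, ys):
        K, L = X @ X.T, Y @ Y.T            # linear-kernel Gram matrices
        hsic_xy += unbiased_hsic(K, L)
        hsic_xx += unbiased_hsic(K, K)
        hsic_yy += unbiased_hsic(L, L)
    return float(hsic_xy / np.sqrt(hsic_xx * hsic_yy))


rng = np.random.default_rng(0)
xs = [rng.normal(size=(64, 16)) for _ in range(4)]   # toy "layer outputs"
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))       # random orthogonal matrix
ys = [3.0 * X @ Q for X in xs]                       # rotated and rescaled copy
zs = [rng.normal(size=(64, 16)) for _ in range(4)]   # unrelated features

cka_same = minibatch_cka(xs, ys)
cka_unrelated = minibatch_cka(xs, zs)
```

The toy inputs show the two properties that make CKA suitable for comparing layers across networks: it is invariant to orthogonal transformations and isotropic scaling of the features (so `cka_same` is 1 up to rounding), while unrelated features score close to 0.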

B. More Analyses on the Pre-Training
B.1. Analyses with More Models as Reference
In Sec. 4, the analyses are mainly conducted by adopting the supervised DeiT-Tiny as the reference model. Here, we additionally introduce stronger recognition models as references to demonstrate the generalizability of our analyses. Specifically, we use ViT-Base models trained with various recipes as references: DeiT-Base (trained with supervision on IN1K following Touvron et al. (2021a), achieving 82.0% top-1 accuracy on ImageNet), ViT-Base-21k (trained with supervision on IN21K following Steiner et al. (2021)), and ViT-Base-21k-1k (first pre-trained on IN21K and then fine-tuned on IN1K following Steiner et al. (2021), achieving 84.5% top-1 accuracy on ImageNet). The layer representation similarity is presented in Fig. A2.
First, we observe that our default reference model, DeiT-Tiny, is aligned well with these larger models (as shown in the left column of Fig. A2). We conjecture that ViTs trained with supervision generally have similar layer representation structures. Based on these stronger reference models, we observe similar phenomena for MAE-Tiny and MoCov3-Tiny as discussed in Sec. 4, which demonstrates the robustness of our analyses and conclusions w.r.t. different reference models.
Then, we analyze the larger MAE-Base with these newly introduced models as references, as shown in the last column of Fig. A2. We observe that MAE-Base still aligns relatively well with these much stronger recognition models, which supports our claim in Sec. 5 that it is possible to extract features relevant to recognition in higher layers for the scaled-up encoder in MAE pre-training. This is the prerequisite for the improvement of the pre-trained models from the proposed distillation.

Figure A2. Layer representation analyses with DeiT-Base (trained with supervision on IN1K, the top row), ViT-Base-21k (trained with supervision on IN21K, the middle row), and ViT-Base-21k-1k (pre-trained with supervision on IN21K and fine-tuned on IN1K, the bottom row) as the reference models.
B.2. Analyses Based on Linear Probing Evaluation
Our analyses are mainly based on the fine-tuning evaluation. In this section, we present some experimental results based on linear probing evaluation, in which only a classifier is tuned during the downstream training while the pre-trained representations are kept frozen. It reflects how linearly separable the representations obtained by the pre-trained models are w.r.t. semantic categories.
As shown in Tab. A4, the linear probing performance is consistently lower than the fine-tuning performance. Given also that linear probing does not save much training time when evaluating lightweight models, it is a less suitable way to utilize the pre-trained models than fine-tuning.
Furthermore, the linear probing results do not reflect the fine-tuned performance according to Tab. A4 and Tab. 4, especially for downstream tasks with relatively sufficient labeled data, e.g., iNat18 and ImageNet, and thus may lead to an underestimation of the practical value of some pre-trained models on downstream tasks. We attribute this to the fact that linear probing only evaluates the final representation of the pre-trained models, which overlooks the value of providing good initialization for lower layers. For instance, MAE-Tiny is better at it than MoCov3-Tiny.
Additionally, the inferior linear probing results of MAE-Tiny compared with MoCov3-Tiny also support our analyses in Sec. 4.1 that MoCov3-Tiny learns more abstract semantics relevant to recognition in higher layers than MAE-Tiny. Our proposed distillation technique improves the results to a certain extent.
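The linear probing protocol described above (frozen representation, only a linear classifier is fit) can be illustrated with a toy NumPy sketch. Everything here is a stand-in chosen for illustration: the "backbone" is a fixed random projection rather than a pre-trained ViT, the data are two synthetic Gaussian blobs, and the classifier is fit by closed-form ridge regression instead of the SGD training used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a frozen pre-trained backbone: a fixed random projection.
W_frozen = rng.normal(size=(2, 8))


def extract_features(x: np.ndarray) -> np.ndarray:
    """Frozen representation; W_frozen is never updated."""
    return np.tanh(x @ W_frozen)


def make_blobs(n_per_class: int):
    """Toy two-class data: Gaussian blobs centred at (+3, +3) and (-3, -3)."""
    x = np.concatenate([rng.normal(+3.0, 1.0, size=(n_per_class, 2)),
                        rng.normal(-3.0, 1.0, size=(n_per_class, 2))])
    y = np.concatenate([np.zeros(n_per_class, dtype=int),
                        np.ones(n_per_class, dtype=int)])
    return x, y


def fit_linear_probe(feats, labels, num_classes, ridge=1e-3):
    """Fit only the linear classifier: ridge regression onto one-hot labels."""
    feats = np.hstack([feats, np.ones((len(feats), 1))])   # append a bias column
    targets = np.eye(num_classes)[labels]
    gram = feats.T @ feats + ridge * np.eye(feats.shape[1])
    return np.linalg.solve(gram, feats.T @ targets)


def predict(feats, weights):
    feats = np.hstack([feats, np.ones((len(feats), 1))])
    return (feats @ weights).argmax(axis=1)


x_train, y_train = make_blobs(200)
x_test, y_test = make_blobs(200)
weights = fit_linear_probe(extract_features(x_train), y_train, num_classes=2)
accuracy = (predict(extract_features(x_test), weights) == y_test).mean()
```

Because the backbone weights are never touched, the resulting accuracy depends only on how linearly separable the final frozen features are, which is why this protocol cannot credit a pre-trained model for providing a good initialization of its lower layers.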

Table A4. Linear probing evaluation of pre-trained models on downstream classification tasks. Top-1 accuracy is reported.

| Init. | Flowers | Pets | Aircraft | Cars | CIFAR100 | iNat18 | ImageNet |
|---|---|---|---|---|---|---|---|
| DeiT-Tiny (supervised) | 91.0 | 92.0 | 41.2 | 47.9 | 73.6 | 39.8 | - |
| MoCov3-Tiny (self-supervised) | 93.2 | 83.5 | 44.8 | 44.5 | 73.4 | 36.2 | 62.1 |
| MAE-Tiny (self-supervised) | 48.9 | 25.0 | 12.8 | 8.8 | 31.0 | 1.4 | 23.3 |
| D-MAE-Tiny (self-supervised) | 77.1 | 55.5 | 20.1 | 16.4 | 58.4 | 10.7 | 42.0 |

Table A5. Comparisons on more pre-training methods. It is a revised version of Tab. 1 in the main paper with more self-supervised pre-training methods.

| Methods | Pre-training data | Pre-training epochs | Pre-training time (hour) | Fine-tuning recipe | Fine-tuning Top-1 Acc. (%) |
|---|---|---|---|---|---|
| from scratch | - | - | - | ori. | 74.5 |
| from scratch | - | - | - | impr. | 75.8 |
| Supervised (Steiner et al., 2021) | IN21K w/ labels | 30 | 20 | impr. | 76.9 |
| Supervised (Steiner et al., 2021) | IN21K w/ labels | 300 | 200 | impr. | 77.8 |
| MoCo-v3 (Chen et al., 2021a) | IN1K w/o labels | 400 | 52 | impr. | 76.8 |
| MAE (He et al., 2021) | IN1K w/o labels | 400 | 23 | impr. | 78.0 |
| DINO (Caron et al., 2021) | IN1K w/o labels | 400 | 83 | impr. | 77.2 |
| SimMIM (Xie et al., 2022) | IN1K w/o labels | 400 | 40 | impr. | 77.9 |
| D-MAE-Tiny (ours) | IN1K w/o labels | 400 | 26 | impr. | 78.4 |

Table A6. Transfer evaluation on classification tasks and dense-prediction tasks for more pre-training methods. It is a revised version of Tab. 4 in the main paper with more self-supervised pre-training methods.

| Init. | Flowers (2k/6k/102) | Pets (4k/4k/37) | Aircraft (7k/3k/100) | Cars (8k/8k/196) | CIFAR100 (50k/10k/100) | iNat18 (438k/24k/8142) | COCO(det.) (118k/50k/80) | COCO(seg.) (118k/50k/80) |
|---|---|---|---|---|---|---|---|---|
| DeiT-Tiny (supervised) | 96.4 | 93.1 | 73.5 | 85.6 | 85.8 | 63.6 | 40.4 | 35.5 |
| MoCov3-Tiny (self-supervised) | 94.8 | 87.8 | 73.7 | 83.9 | 83.9 | 54.5 | 39.7 | 35.1 |
| MAE-Tiny (self-supervised) | 85.8 | 76.5 | 64.6 | 78.8 | 78.9 | 60.6 | 39.9 | 35.4 |
| DINO-Tiny (self-supervised) | 95.6 | 89.3 | 73.6 | 84.5 | 84.7 | 58.7 | 41.4 | 36.7 |
| SimMIM-Tiny (self-supervised) | 77.2 | 68.9 | 55.9 | 70.4 | 77.7 | 60.8 | 39.3 | 34.8 |
| D-MAE-Tiny (ours, self-supervised) | 95.2 | 89.1 | 79.2 | 87.5 | 85.0 | 63.6 | 42.3 | 37.4 |

B.3. Analyses for More Self-Supervised Pre-Training Methods
In the main paper, our analyses mainly focus on MAE (He et al., 2021) and MoCov3 (Chen et al., 2021a). In this section,
more self-supervised pre-training methods are involved. Specifically, another MIM-based method, SimMIM (Xie et al.,
2022), and another CL-based method, DINO (Caron et al., 2021), are evaluated based on the lightweight ViT-Tiny. The
400-epoch pre-trained models are denoted as SimMIM-Tiny and DINO-Tiny respectively.
We first evaluate their downstream performance on ImageNet and other classification tasks, and object detection and
segmentation tasks, as shown in Tab. A5 and Tab. A6. They are also revised versions of Tab. 1 and Tab. 4 in the main paper.
According to the results, we find that MIM-based methods are generally superior to CL-based methods on data-sufficient tasks, e.g., ImageNet and iNat18, while inferior on data-insufficient tasks. The downstream data scale matters for all these methods, and none of them achieves consistent superiority on all downstream tasks.
Then we explore the layer representation of these models by CKA-based similarity analyses, as shown in Fig. A3. We observe that methods within the same family (MIM or CL) share similar layer representation structures. For instance, SimMIM-Tiny also learns poor semantics on higher layers.
Finally, we carry out the attention analyses for these models, as shown in Fig. A4. We again observe consistent properties within the MIM family and within the CL family. SimMIM-Tiny also tends to focus on local patterns with concentrated attention in higher layers like MAE-Tiny, while DINO-Tiny behaves like MoCov3-Tiny and has broad and global attention in higher layers.

Figure A3. Layer representation analyses for more self-supervised pre-trained models.

Figure A4. Attention analyses for more self-supervised pre-trained models.

C. More Analyses on Distillation
C.1. Illustration of the Distillation Process
We illustrate our distillation process in Fig. A5.
Building on the masked autoencoder, we introduce a teacher ViT that is pre-trained with MAE. During pre-training, the
teacher processes the same visible image patches as the student encoder, and the attention-based distillation loss is calculated
between the attention maps of corresponding teacher and student layers. The student’s parameters are updated
with the joint gradients from the distillation loss and the original MAE reconstruction loss, while the
teacher’s parameters remain frozen throughout pre-training.
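The joint objective described above can be sketched as follows. This is only an illustration under stated assumptions: the distance between attention maps is taken to be a mean squared error, the two losses are combined with a hypothetical scalar `weight`, and random arrays stand in for real queries and keys. None of these choices is confirmed by the text. The head count (12 for both ViT-Tiny and ViT-Base) follows Tab. A7, which is why the student and teacher attention maps have matching shapes despite different channel dimensions.

```python
import numpy as np


def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_map(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention weights.

    q, k: (heads, tokens, head_dim) -> (heads, tokens, tokens)
    """
    scale = q.shape[-1] ** -0.5
    return softmax((q @ np.swapaxes(k, -1, -2)) * scale, axis=-1)


def attention_distill_loss(attn_student: np.ndarray, attn_teacher: np.ndarray) -> float:
    """ASSUMPTION: mean squared error between the two attention maps."""
    return float(np.mean((attn_student - attn_teacher) ** 2))


def joint_loss(recon_loss: float, attn_student: np.ndarray,
               attn_teacher: np.ndarray, weight: float = 1.0) -> float:
    """Reconstruction loss plus weighted distillation loss (weight is hypothetical)."""
    return recon_loss + weight * attention_distill_loss(attn_student, attn_teacher)


rng = np.random.default_rng(0)
heads, tokens = 12, 49  # 49 visible tokens is an illustrative choice
# Student (ViT-Tiny, dim 192) and teacher (ViT-Base, dim 768) see the same tokens.
q_s, k_s = rng.normal(size=(2, heads, tokens, 192 // heads))
q_t, k_t = rng.normal(size=(2, heads, tokens, 768 // heads))

attn_s = attention_map(q_s, k_s)
attn_t = attention_map(q_t, k_t)  # the teacher is frozen: treated as a constant target

recon_loss = 0.5  # placeholder for the MAE reconstruction loss value
loss = joint_loss(recon_loss, attn_s, attn_t, weight=1.0)
print(f"joint loss: {loss:.4f}")
```

In an automatic-differentiation framework, the teacher branch would be excluded from gradient computation so that only the student receives updates from both loss terms.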
C.2. Attention Map Analyses for the Distilled Pre-trained Models
We analyze the attention distance and entropy of the distilled MAE-Tiny introduced in Sec. 5 (D-MAE-Tiny), for which
distillation is applied only to the attention map of the last layer during pre-training with MAE. As shown in Fig. A6, we observe
more global and broad attention in the higher layers of D-MAE-Tiny than in those of MAE-Tiny, so that it behaves more like
the teacher, MAE-Base. We attribute this to the fact that distillation on the final layer (i.e., the 12th layer) forces the distilled
layer of the student to imitate the teacher’s attention, which in turn requires the several preceding layers to adapt.
We conjecture that this helps capture semantic features and improves downstream performance.
We also find that the attention distance of the last layer shows more diversity: some attention heads are rather global while
others are local, and all of them are concentrated. We conjecture that this odd behavior arises because, given the limited model
size, the layer cannot satisfy the training targets of both the reconstruction task and distillation. Nevertheless, the richer
supervision does improve the quality of the preceding layers and thus yields better downstream performance.
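To make the two attention statistics discussed above concrete, the sketch below computes a per-head attention distance and attention entropy from an attention map over a patch grid. The definitions are assumptions chosen for illustration (attention-weighted mean Euclidean distance in pixels, and Shannon entropy of each query's attention distribution, both averaged over queries); the exact formulation used for the figures, including how the class token is handled, may differ.

```python
import numpy as np


def patch_distances(grid: int, patch_size: int = 16) -> np.ndarray:
    """Pairwise Euclidean distances (in pixels) between patch positions on a grid x grid layout."""
    ys, xs = np.divmod(np.arange(grid * grid), grid)
    coords = np.stack([ys, xs], axis=1).astype(float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1) * patch_size  # (N, N)


def attention_distance(attn: np.ndarray, dist: np.ndarray) -> np.ndarray:
    """Attention-weighted mean distance per head. attn: (heads, N, N), rows sum to 1."""
    return (attn * dist).sum(axis=-1).mean(axis=-1)


def attention_entropy(attn: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Mean Shannon entropy of each query's attention distribution, per head."""
    return -(attn * np.log(attn + eps)).sum(axis=-1).mean(axis=-1)


grid = 14                      # 14 x 14 patches (illustrative choice)
n = grid * grid
dist = patch_distances(grid)

uniform = np.full((1, n, n), 1.0 / n)   # global and broad
local = np.eye(n)[None]                 # local and concentrated (attends to itself)
far = np.eye(n)[::-1][None]             # global but concentrated (one distant patch)

for name, attn in [("uniform", uniform), ("local", local), ("far", far)]:
    d = attention_distance(attn, dist)[0]
    h = attention_entropy(attn)[0]
    print(f"{name:8s} distance={d:7.2f}  entropy={h:.3f}")
```

The `far` case shows that a head can be global in terms of distance while still being concentrated in terms of entropy, which is the combination described for some heads in the last layer.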

Figure A5. Illustration of the distillation process.

Figure A6. Attention distance and entropy analyses for the distilled MAE-Tiny.

C.3. Applying Distillation on More Networks
To further evaluate our proposed distillation method, we additionally apply it to the pre-training of ViT-Small, again with
MAE-Base as the teacher. The configurations of these models are presented in Tab. A7, and the transfer evaluation results are
presented in Tab. A8. The distilled MAE-Small (D-MAE-Small) surpasses the baseline model, MAE-Small, by +3.7 to +14.9
points on the transfer classification datasets and by +0.4 on ImageNet, which shows the efficacy of the distillation.
C.4. Distilling with Larger Teachers
We conduct additional experiments with various models as teachers and compare their performance on various
downstream tasks (see Tab. A9). The configurations of the student model (ViT-Tiny) and the teacher models are presented in
Tab. A7. The results indicate that an appropriately sized teacher provides the largest gains in distillation,
which is a common finding in the area of knowledge distillation (Cho & Hariharan, 2019; Jin et al., 2019; Mirzadeh et al.,
2020). To further investigate the impact of teacher size, we conduct CKA-based layer representation analyses of these
teachers, as shown in Fig. A7. A teacher that is too small (MAE-Small) also suffers from degraded
representations in higher layers and cannot provide sufficient knowledge. A teacher that is too large (MAE-Large)
has a capacity mismatch with the tiny student: it has over 50 times as many parameters as
ViT-Tiny, with a different depth and number of attention heads, which leads to learned patterns that differ slightly from those
of the reference tiny model and may not suit the student.
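The capacity gap mentioned above can be checked with simple arithmetic on the configurations in Tab. A7. The sketch below compares the reported parameter counts and also a rough estimate that counts only the transformer blocks. The estimate assumes an MLP expansion ratio of 4 and ignores biases, normalization layers, and embeddings; these are simplifying assumptions for illustration, not a statement of the exact architectures.

```python
# Configurations copied from Tab. A7 (params_m is the reported size in millions).
configs = {
    "ViT-Tiny": {"dim": 192, "heads": 12, "layers": 12, "params_m": 6},
    "ViT-Small": {"dim": 384, "heads": 12, "layers": 12, "params_m": 22},
    "ViT-Base": {"dim": 768, "heads": 12, "layers": 12, "params_m": 86},
    "ViT-Large": {"dim": 1024, "heads": 16, "layers": 24, "params_m": 304},
}


def approx_block_params(dim: int, layers: int, mlp_ratio: int = 4) -> int:
    """Rough weight count of the transformer blocks only.

    ASSUMPTION: per block, 4 * dim^2 weights for attention (QKV and output
    projection) plus 2 * mlp_ratio * dim^2 for the MLP, with mlp_ratio = 4.
    """
    return layers * (4 + 2 * mlp_ratio) * dim ** 2


estimates_m = {
    name: approx_block_params(cfg["dim"], cfg["layers"]) / 1e6
    for name, cfg in configs.items()
}

for name, cfg in configs.items():
    print(f"{name:10s} reported={cfg['params_m']:4d}M  estimated={estimates_m[name]:6.1f}M")

# Ratio of the largest teacher to the student, from the reported sizes.
ratio_reported = configs["ViT-Large"]["params_m"] / configs["ViT-Tiny"]["params_m"]
ratio_estimated = estimates_m["ViT-Large"] / estimates_m["ViT-Tiny"]
print(f"ViT-Large / ViT-Tiny: reported ratio {ratio_reported:.1f}, estimated ratio {ratio_estimated:.1f}")
```

Both the reported sizes and the rough estimate place ViT-Large at more than 50 times the size of ViT-Tiny, whereas ViT-Base is roughly 14 times larger by the reported sizes.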

Table A7. Configurations of ViTs.

Model        channel dimension   #heads   #layers   #params
ViT-Tiny     192                 12       12        6M
ViT-Small    384                 12‡      12        22M
ViT-Base     768                 12       12        86M
ViT-Large    1024                16       24        304M

‡ Our ViT-Small is with heads=12 following Chen et al. (2021a).

Table A8. Distillation on MAE-Small. Top-1 accuracy for the transfer performance on downstream classification tasks of pre-trained
models w. or w/o. distillation is reported.

                  Datasets
Init.             Flowers       Pets          Aircraft       Cars          CIFAR100      iNat18        ImageNet
supervised
DeiT-Small        97.4          94.2          77.6           88.2          89.2          66.5          80.2
self-supervised
MAE-Small         91.2          82.0          65.8           79.2          80.8          63.2          82.1
D-MAE-Small       95.8 (+4.6)   91.4 (+9.4)   80.7 (+14.9)   88.3 (+9.1)   87.8 (+7.0)   66.9 (+3.7)   82.5 (+0.4)

Figure A7. Layer representation analyses of the teachers for distillation.

Table A9. Distillation with different sized teachers. Top-1 accuracy for the transfer performance on downstream classification tasks of
the distilled pre-trained models is reported.

Pre-training              Fine-tuning
Student     Teacher       Flowers   Pets   Aircraft   Cars   CIFAR100   iNat18   ImageNet
MAE-Tiny                  85.8      76.5   64.6       78.8   78.9       60.6     78.0
MAE-Tiny    MAE-Small     89.4      78.6   65.2       78.9   79.6       61.5     78.1
MAE-Tiny    MAE-Base      95.2      89.1   79.2       87.5   85.0       63.6     78.4
MAE-Tiny    MAE-Large     94.0      87.3   77.1       85.2   84.2       63.1     78.3