Similarity of Neural Network Representations Revisited

Simon Kornblith, Mohammad Norouzi, Honglak Lee, Geoffrey Hinton (Google Brain)

Abstract

Recent work has sought to understand the behavior of neural networks by comparing representations between layers and between different trained models. We examine methods for comparing neural network representations based on canonical correlation analysis (CCA). We show that CCA belongs to a family of statistics for measuring multivariate similarity, but that neither CCA nor any other statistic that is invariant to invertible linear transformation can measure meaningful similarities between representations of higher dimension than the number of data points. We introduce a similarity index that measures the relationship between representational similarity matrices and does not suffer from this limitation. This similarity index is equivalent to centered kernel alignment (CKA) and is also closely connected to CCA. Unlike CCA, CKA can reliably identify correspondences between representations in networks trained from different initializations.

1. Introduction

Across a wide range of machine learning tasks, deep neural networks enable learning powerful feature representations automatically from data. Despite impressive empirical advances of deep neural networks in solving various tasks, the problem of understanding and characterizing the neural network representations learned from data remains relatively under-explored. Previous work (e.g. Advani & Saxe (2017); Amari et al. (2018); Saxe et al. (2014)) has made progress in understanding the theoretical dynamics of the neural network training process. These studies are insightful, but fundamentally limited, because they ignore the complex interaction between the training dynamics and structured data. A window into the network's representation can provide more information about the interaction between machine learning algorithms and data than the value of the loss function alone.

This paper investigates the problem of measuring similarities between deep neural network representations. An effective method for measuring representational similarity could help answer many interesting questions, including: (1) Do deep neural networks with the same architecture trained from different random initializations learn similar representations? (2) Can we establish correspondences between layers of different network architectures? (3) How similar are the representations learned using the same network architecture from different datasets?

We build upon previous studies investigating similarity between the representations of neural networks (Laakso & Cottrell, 2000; Li et al., 2015; Raghu et al., 2017; Morcos et al., 2018; Wang et al., 2018). We are also inspired by the extensive neuroscience literature that uses representational similarity analysis (Kriegeskorte et al., 2008a; Edelman, 1998) to compare representations across brain areas (Haxby et al., 2001; Freiwald & Tsao, 2010), individuals (Connolly et al., 2012), species (Kriegeskorte et al., 2008b), and behaviors (Elsayed et al., 2016), as well as between brains and neural networks (Yamins et al., 2014; Khaligh-Razavi & Kriegeskorte, 2014; Sussillo et al., 2015).
Our key contributions are summarized as follows:

• We discuss the invariance properties of similarity indexes and their implications for measuring similarity of neural network representations.

• We motivate and introduce centered kernel alignment (CKA) as a similarity index and analyze the relationship between CKA, linear regression, canonical correlation analysis (CCA), and related methods (Raghu et al., 2017; Morcos et al., 2018).

• We show that CKA is able to determine the correspondence between the hidden layers of neural networks trained from different random initializations and with different widths, scenarios where previously proposed similarity indexes fail.

• We verify that wider networks learn more similar representations, and show that the similarity of early layers saturates at fewer channels than later layers. We demonstrate that early layers, but not later layers, learn similar representations on different datasets.

Problem Statement. Let X ∈ R^{n×p_1} denote a matrix of activations of p_1 neurons for n examples, and Y ∈ R^{n×p_2} denote a matrix of activations of p_2 neurons for the same n examples. We assume that these matrices have been preprocessed to center the columns. Without loss of generality, we assume that p_1 ≤ p_2. We are concerned with the design and analysis of a scalar similarity index s(X, Y) that can be used to compare representations within and across neural networks, in order to help visualize and understand the effect of different factors of variation in deep learning.

2. What Should Similarity Be Invariant To?

This section discusses the invariance properties of similarity indexes and their implications for measuring similarity of neural network representations. We argue that both intuitive notions of similarity and the dynamics of neural network training call for a similarity index that is invariant to orthogonal transformation and isotropic scaling, but not to invertible linear transformation.

2.1. Invariance to Invertible Linear Transformation

A similarity index is invariant to invertible linear transformation if s(X, Y) = s(XA, YB) for any full rank A and B. If activations X are followed by a fully-connected layer f(X) = σ(XW + β), then transforming the activations by a full rank matrix A as X′ = XA and transforming the weights by the inverse A^{−1} as W′ = A^{−1}W preserves the output of f(X). This transformation does not appear to change how the network operates, so intuitively, one might prefer a similarity index that is invariant to invertible linear transformation, as argued by Raghu et al. (2017).

However, a limitation of invariance to invertible linear transformation is that any invariant similarity index gives the same result for any representation of width greater than or equal to the dataset size, i.e. p_2 ≥ n. We provide a simple proof in Appendix A.

Theorem 1. Let X and Y be n × p matrices. Suppose s is invariant to invertible linear transformation in the first argument, i.e. s(X, Z) = s(XA, Z) for arbitrary Z and any A with rank(A) = p. If rank(X) = rank(Y) = n, then s(X, Z) = s(Y, Z).

There is thus a practical problem with invariance to invertible linear transformation: some neural networks, especially convolutional networks, have more neurons in some layers than there are examples in the training dataset (Springenberg et al., 2015; Lee et al., 2018; Zagoruyko & Komodakis, 2016). It is somewhat unnatural that a similarity index could require more examples than were used for training.
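To make the practical consequence of Theorem 1 concrete, here is a minimal NumPy sketch (an illustration, not code from the paper) of R^2_CCA, the mean squared canonical correlation summarized later in Table 1, which is invariant to invertible linear transformation. Once the width p reaches the number of examples n, the index assigns the same score to an invertible transformation of X as to an entirely unrelated random representation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_sq_cca(X, Y):
    """R^2_CCA = ||Q_Y^T Q_X||_F^2 / p_1 (see Table 1): mean squared
    canonical correlation, invariant to invertible linear transformation."""
    X = X - X.mean(axis=0)               # center columns
    Y = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(X)              # orthonormal basis for span(X)
    Qy, _ = np.linalg.qr(Y)
    p1 = min(X.shape[1], Y.shape[1])
    return np.linalg.norm(Qy.T @ Qx, 'fro') ** 2 / p1

n = 50                                    # number of examples
for p in (10, 200):                       # width below vs. above n
    X = rng.normal(size=(n, p))
    Y_related = X @ rng.normal(size=(p, p))    # invertible transform of X
    Y_unrelated = rng.normal(size=(n, p))      # unrelated representation
    print(p, mean_sq_cca(X, Y_related), mean_sq_cca(X, Y_unrelated))
# p = 10:  the transformed copy scores 1.0, the unrelated matrix far less.
# p = 200: both score identically (Theorem 1); the index is uninformative.
```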
A deeper issue is that neural network training is not invariant to arbitrary invertible linear transformation of inputs or activations. Even in the linear case, gradient descent converges first along the eigenvectors corresponding to the largest eigenvalues of the input covariance matrix (LeCun et al., 1991), and in cases of overparameterization or early stopping, the solution reached depends on the scale of the input. Similar results hold for gradient descent training of neural networks in the infinite width limit (Jacot et al., 2018). The sensitivity of neural network training to linear transformation is further demonstrated by the popularity of batch normalization (Ioffe & Szegedy, 2015).

[Figure 1. First principal components of representations of networks trained from different random initializations are similar. Each example from the CIFAR-10 test set is shown as a dot colored according to the value of the first two principal components of an intermediate layer of one network (left) and plotted on the first two principal components of the same layer of an architecturally identical network trained from a different initialization (right). Panel axes: Net A PC 1/PC 2 and Net B PC 1/PC 2.]

Invariance to invertible linear transformation implies that the scale of directions in activation space is irrelevant. Empirically, however, scale information is both consistent across networks and useful across tasks. Neural networks trained from different random initializations develop representations with similar large principal components, as shown in Figure 1. Consequently, Euclidean distances between examples, which depend primarily upon large principal components, are similar across networks. These distances are meaningful, as demonstrated by the success of perceptual loss and style transfer (Gatys et al., 2016; Johnson et al., 2016; Dumoulin et al., 2017). A similarity index that is invariant to invertible linear transformation ignores this aspect of the representation, and assigns the same score to networks that match only in large principal components as to networks that match only in small principal components.

2.2. Invariance to Orthogonal Transformation

Rather than requiring invariance to any invertible linear transformation, one could require a weaker condition: invariance to orthogonal transformation, i.e. s(X, Y) = s(XU, YV) for full-rank orthonormal matrices U and V such that U^T U = I and V^T V = I.

Indexes invariant to orthogonal transformation do not share the limitations of indexes invariant to invertible linear transformation. When p_2 > n, indexes invariant to orthogonal transformation remain well-defined. Moreover, orthogonal transformations preserve scalar products and Euclidean distances between examples.
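A quick numerical check of the last claim (illustrative only): multiplying the activations by a random orthogonal matrix leaves the matrix of inter-example dot products XX^T, and hence all pairwise Euclidean distances, unchanged, while a generic invertible matrix does not.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 8
X = rng.normal(size=(n, p))                    # toy activations

Q, _ = np.linalg.qr(rng.normal(size=(p, p)))   # random orthogonal matrix
A = rng.normal(size=(p, p))                    # generic (invertible) matrix

def gram(Z):
    """Matrix of inter-example dot products ZZ^T."""
    return Z @ Z.T

print(np.allclose(gram(X @ Q), gram(X)))       # True: dot products preserved
print(np.allclose(gram(X @ A), gram(X)))       # False: not preserved in general
```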
Invariance to orthogonal transformation seems desirable for neural networks trained by gradient descent. Invariance to orthogonal transformation implies invariance to permutation, which is needed to accommodate symmetries of neural networks (Chen et al., 1993; Orhan & Pitkow, 2018). In the linear case, orthogonal transformation of the input does not affect the dynamics of gradient descent training (LeCun et al., 1991), and for neural networks initialized with rotationally symmetric weight distributions, e.g. i.i.d. Gaussian weight initialization, training with fixed orthogonal transformations of activations yields the same distribution of training trajectories as untransformed activations, whereas an arbitrary linear transformation would not.

Given a similarity index s(·, ·) that is invariant to orthogonal transformation, one can construct a similarity index s′(·, ·) that is invariant to any invertible linear transformation by first orthonormalizing the columns of X and Y, and then applying s(·, ·). Given thin QR decompositions X = Q_X R_X and Y = Q_Y R_Y, one can construct a similarity index s′(X, Y) = s(Q_X, Q_Y), where s′(·, ·) is invariant to invertible linear transformation because orthonormal bases with the same span are related to each other by orthonormal transformation (see Appendix B).

2.3. Invariance to Isotropic Scaling

We expect similarity indexes to be invariant to isotropic scaling, i.e. s(X, Y) = s(αX, βY) for any α, β ∈ R^+. That said, a similarity index that is invariant to both orthogonal transformation and non-isotropic scaling, i.e. rescaling of individual features, is invariant to any invertible linear transformation. This follows from the existence of the singular value decomposition of the transformation matrix. Generally, we are interested in similarity indexes that are invariant to isotropic but not necessarily non-isotropic scaling.

3. Comparing Similarity Structures

Our key insight is that instead of comparing multivariate features of an example in the two representations (e.g. via regression), one can first measure the similarity between every pair of examples in each representation separately, and then compare the similarity structures. In neuroscience, such matrices representing the similarities between examples are called representational similarity matrices (Kriegeskorte et al., 2008a). We show below that, if we use an inner product to measure similarity, the similarity between representational similarity matrices reduces to another intuitive notion of pairwise feature similarity.

Dot Product-Based Similarity. A simple formula relates dot products between examples to dot products between features:

⟨vec(XX^T), vec(YY^T)⟩ = tr(XX^T YY^T) = ||Y^T X||_F^2.    (1)

The elements of XX^T and YY^T are dot products between the representations of the ith and jth examples, and indicate the similarity between these examples according to the respective networks. The left-hand side of (1) thus measures the similarity between the inter-example similarity structures. The right-hand side yields the same result by measuring the similarity between features from X and Y, by summing the squared dot products between every pair.
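The identity in Equation (1) is easy to verify numerically. The sketch below (illustrative, using randomly generated activations) compares the three quantities for centered X and Y.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p1, p2 = 30, 6, 9
X = rng.normal(size=(n, p1))
X -= X.mean(axis=0)                                 # centered activations
Y = rng.normal(size=(n, p2))
Y -= Y.mean(axis=0)

lhs = np.dot((X @ X.T).ravel(), (Y @ Y.T).ravel())  # <vec(XX^T), vec(YY^T)>
mid = np.trace(X @ X.T @ Y @ Y.T)                   # tr(XX^T YY^T)
rhs = np.linalg.norm(Y.T @ X, 'fro') ** 2           # ||Y^T X||_F^2
print(np.allclose(lhs, mid), np.allclose(mid, rhs)) # True True
```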
Hilbert-Schmidt Independence Criterion. Equation 1 implies that, for centered X and Y:

(1/(n − 1)^2) tr(XX^T YY^T) = ||cov(X^T, Y^T)||_F^2.    (2)

The Hilbert-Schmidt Independence Criterion (Gretton et al., 2005) generalizes Equations 1 and 2 to inner products from reproducing kernel Hilbert spaces, where the squared Frobenius norm of the cross-covariance matrix becomes the squared Hilbert-Schmidt norm of the cross-covariance operator. Let K_ij = k(x_i, x_j) and L_ij = l(y_i, y_j), where k and l are two kernels. The empirical estimator of HSIC is:

HSIC(K, L) = (1/(n − 1)^2) tr(KHLH),    (3)

where H is the centering matrix H_n = I_n − (1/n)11^T. For linear kernels k(x, y) = l(x, y) = x^T y, HSIC yields (2).

Gretton et al. (2005) originally proposed HSIC as a test statistic for determining whether two sets of variables are independent. They prove that the empirical estimator converges to the population value at a rate of 1/√n, and Song et al. (2007) provide an unbiased estimator. When k and l are universal kernels, HSIC = 0 implies independence, but HSIC is not an estimator of mutual information. HSIC is equivalent to maximum mean discrepancy between the joint distribution and the product of the marginal distributions, and HSIC with a specific kernel family is equivalent to distance covariance (Sejdinovic et al., 2013).

Centered Kernel Alignment. HSIC is not invariant to isotropic scaling, but it can be made invariant through normalization. This normalized index is known as centered kernel alignment (Cortes et al., 2012; Cristianini et al., 2002):

CKA(K, L) = HSIC(K, L) / sqrt(HSIC(K, K) HSIC(L, L)).    (4)

Similarity Index | Formula | Invariant to Invertible Linear Transform | Invariant to Orthogonal Transform | Invariant to Isotropic Scaling
Linear Reg. (R^2_LR) | ||Q_Y^T X||_F^2 / ||X||_F^2 | Y only | ✓ | ✓
CCA (R^2_CCA) | ||Q_Y^T Q_X||_F^2 / p_1 | ✓ | ✓ | ✓
CCA (ρ̄_CCA) | ||Q_Y^T Q_X||_* / p_1 | ✓ | ✓ | ✓
SVCCA (R^2_SVCCA) | ||(U_Y T_Y)^T U_X T_X||_F^2 / min(||T_X||_F^2, ||T_Y||_F^2) | If same subspace kept | ✓ | ✓
SVCCA (ρ̄_SVCCA) | ||(U_Y T_Y)^T U_X T_X||_* / min(||T_X||_F^2, ||T_Y||_F^2) | If same subspace kept | ✓ | ✓
PWCCA | Σ_{i=1}^{p_1} α_i ρ_i / ||α||_1, α_i = Σ_j |⟨h_i, x_j⟩| | ✗ | ✗ | ✓
Linear HSIC | ||Y^T X||_F^2 / (n − 1)^2 | ✗ | ✓ | ✗
Linear CKA | ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F) | ✗ | ✓ | ✓
RBF CKA | tr(KHLH) / sqrt(tr(KHKH) tr(LHLH)) | ✗ | ✓ | ✓*

Table 1. Summary of similarity methods investigated. Q_X and Q_Y are orthonormal bases for the columns of X and Y. U_X and U_Y are the left-singular vectors of X and Y, sorted in descending order according to the corresponding singular values. || · ||_* denotes the nuclear norm. T_X and T_Y are truncated identity matrices that select left-singular vectors such that the cumulative variance explained reaches some threshold. For RBF CKA, K and L are kernel matrices constructed by evaluating the RBF kernel between the examples as in Section 3, and H is the centering matrix H_n = I_n − (1/n)11^T. See Appendix C for more detail about each technique. *Invariance of RBF CKA to isotropic scaling depends on the procedure used to select the RBF kernel bandwidth parameter. In our experiments, we selected the bandwidth as a fraction of the median distance, which ensures that the similarity index is invariant to isotropic scaling.

For a linear kernel, CKA is equivalent to the RV coefficient (Robert & Escoufier, 1976) and to Tucker's congruence coefficient (Tucker, 1951; Lorenzo-Seva & Ten Berge, 2006).

Kernel Selection. Below, we report results of CKA with a linear kernel and with the RBF kernel k(x_i, x_j) = exp(−||x_i − x_j||_2^2 / (2σ^2)). For the RBF kernel, there are several possible strategies for selecting the bandwidth σ, which controls the extent to which similarity of small distances is emphasized over large distances. We set σ as a fraction of the median distance between examples. In practice, we find that RBF and linear kernels give similar results across most experiments, so we use linear CKA unless otherwise specified. Our framework extends to any valid kernel, including kernels equivalent to neural networks (Lee et al., 2018; Jacot et al., 2018; Garriga-Alonso et al., 2019; Novak et al., 2019).
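For concreteness, the following NumPy sketch implements linear and RBF CKA from Equations (3) and (4). It is an illustrative implementation rather than the authors' released code; the constant 1/(n − 1)^2 is dropped because it cancels in the CKA ratio, and the bandwidth heuristic (controlled by a hypothetical sigma_frac argument) is one reasonable reading of the median-distance rule described above.

```python
import numpy as np

def center_gram(K):
    """Apply the centering matrix H = I - (1/n) 11^T on both sides: HKH."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def hsic(K, L):
    """Empirical HSIC of Equation (3), without the 1/(n-1)^2 factor,
    which cancels in the CKA ratio below."""
    return np.trace(center_gram(K) @ center_gram(L))

def cka(K, L):
    """Centered kernel alignment, Equation (4)."""
    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))

def linear_kernel(X):
    return X @ X.T

def rbf_kernel(X, sigma_frac=1.0):
    """RBF kernel with bandwidth set to a fraction of the median distance."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    sigma = sigma_frac * np.sqrt(np.median(sq_dists))
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 64))                  # synthetic stand-in for one layer
Y = np.tanh(X @ rng.normal(size=(64, 32)))      # synthetic stand-in for another layer
print(cka(linear_kernel(X), linear_kernel(Y)))  # linear CKA
print(cka(rbf_kernel(X), rbf_kernel(Y)))        # RBF CKA
```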
4. Related Similarity Indexes

In this section, we briefly review linear regression, canonical correlation, and other related methods in the context of measuring similarity between neural network representations. We let Q_X and Q_Y represent any orthonormal bases for the columns of X and Y, i.e. Q_X = X(X^T X)^{−1/2}, Q_Y = Y(Y^T Y)^{−1/2}, or orthogonal transformations thereof. Table 1 summarizes the formulae and invariance properties of the indexes used in experiments. For a comprehensive general review of linear indexes for measuring multivariate similarity, see Ramsay et al. (1984).

Linear Regression. A simple way to relate neural network representations is via linear regression. One can fit every feature in X as a linear combination of features from Y. A suitable summary statistic is the total fraction of variance explained by the fit:

R^2_LR = 1 − min_B ||X − YB||_F^2 / ||X||_F^2 = ||Q_Y^T X||_F^2 / ||X||_F^2.    (5)

We are unaware of any application of linear regression to measuring similarity of neural network representations, although Romero et al. (2015) used a least squares loss between activations of two networks to encourage thin and deep "student" networks to learn functions similar to wide and shallow "teacher" networks.

Canonical Correlation Analysis (CCA). Canonical correlation finds bases for two matrices such that, when the original matrices are projected onto these bases, the correlation is maximized. For 1 ≤ i ≤ p_1, the ith canonical correlation coefficient ρ_i is given by:

ρ_i = max_{w_X^i, w_Y^i} corr(X w_X^i, Y w_Y^i)
subject to ∀_{j<i}: X w_X^i ⊥ X w_X^j and Y w_Y^i ⊥ Y w_Y^j.
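In practice, the canonical correlations ρ_i can be obtained as the singular values of Q_Y^T Q_X. The sketch below (an illustrative implementation based on that standard identity, not the authors' code) computes them along with the two CCA summary statistics from Table 1, R^2_CCA = Σ_i ρ_i^2 / p_1 and ρ̄_CCA = Σ_i ρ_i / p_1.

```python
import numpy as np

def cca_summaries(X, Y):
    """Canonical correlations via the singular values of Q_Y^T Q_X,
    plus the Table 1 summary statistics R^2_CCA and rho-bar_CCA."""
    X = X - X.mean(axis=0)                   # center columns
    Y = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(X)                  # orthonormal bases
    Qy, _ = np.linalg.qr(Y)
    rho = np.linalg.svd(Qy.T @ Qx, compute_uv=False)  # canonical correlations
    p1 = min(X.shape[1], Y.shape[1])
    return np.sum(rho ** 2) / p1, np.sum(rho) / p1

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
Y = X @ rng.normal(size=(10, 20)) + 0.5 * rng.normal(size=(200, 20))  # noisy linear map
r2_cca, rho_bar = cca_summaries(X, Y)
print(r2_cca, rho_bar)   # high but below 1 for this noisy linear relationship
```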