Similarity of Neural Network Representations Revisited

Simon Kornblith, Mohammad Norouzi, Honglak Lee, Geoffrey Hinton (Google Brain)

Abstract

Recent work has sought to understand the behavior of neural networks by comparing representations between layers and between different trained models. We examine methods for comparing neural network representations based on canonical correlation analysis (CCA). We show that CCA belongs to a family of statistics for measuring multivariate similarity, but that neither CCA nor any other statistic that is invariant to invertible linear transformation can measure meaningful similarities between representations of higher dimension than the number of data points. We introduce a similarity index that measures the relationship between representational similarity matrices and does not suffer from this limitation. This similarity index is equivalent to centered kernel alignment (CKA) and is also closely connected to CCA. Unlike CCA, CKA can reliably identify correspondences between representations in networks trained from different initializations.

1. Introduction

Across a wide range of machine learning tasks, deep neural networks enable learning powerful feature representations automatically from data. Despite impressive empirical advances of deep neural networks in solving various tasks, the problem of understanding and characterizing the neural network representations learned from data remains relatively under-explored. Previous work (e.g. Advani & Saxe (2017); Amari et al. (2018); Saxe et al. (2014)) has made progress in understanding the theoretical dynamics of the neural network training process. These studies are insightful, but fundamentally limited, because they ignore the complex interaction between the training dynamics and structured data. A window into the network's representation can provide more information about the interaction between machine learning algorithms and data than the value of the loss function alone.

This paper investigates the problem of measuring similarities between deep neural network representations. An effective method for measuring representational similarity could help answer many interesting questions, including: (1) Do deep neural networks with the same architecture trained from different random initializations learn similar representations? (2) Can we establish correspondences between layers of different network architectures? (3) How similar are the representations learned using the same network architecture from different datasets?

We build upon previous studies investigating similarity between the representations of neural networks (Laakso & Cottrell, 2000; Li et al., 2015; Raghu et al., 2017; Morcos et al., 2018; Wang et al., 2018). We are also inspired by the extensive neuroscience literature that uses representational similarity analysis (Kriegeskorte et al., 2008a; Edelman, 1998) to compare representations across brain areas (Haxby et al., 2001; Freiwald & Tsao, 2010), individuals (Connolly et al., 2012), species (Kriegeskorte et al., 2008b), and behaviors (Elsayed et al., 2016), as well as between brains and neural networks (Yamins et al., 2014; Khaligh-Razavi & Kriegeskorte, 2014; Sussillo et al., 2015).
Our key contributions are summarized as follows:

• We discuss the invariance properties of similarity indexes and their implications for measuring similarity of neural network representations.

• We motivate and introduce centered kernel alignment (CKA) as a similarity index and analyze the relationship between CKA, linear regression, canonical correlation analysis (CCA), and related methods (Raghu et al., 2017; Morcos et al., 2018).

• We show that CKA is able to determine the correspondence between the hidden layers of neural networks trained from different random initializations and with different widths, scenarios where previously proposed similarity indexes fail.

• We verify that wider networks learn more similar representations, and show that the similarity of early layers saturates at fewer channels than later layers. We demonstrate that early layers, but not later layers, learn similar representations on different datasets.

Problem Statement. Let X ∈ R^{n×p_1} denote a matrix of activations of p_1 neurons for n examples, and Y ∈ R^{n×p_2} denote a matrix of activations of p_2 neurons for the same n examples. We assume that these matrices have been preprocessed to center the columns. Without loss of generality, we assume that p_1 ≤ p_2. We are concerned with the design and analysis of a scalar similarity index s(X, Y) that can be used to compare representations within and across neural networks, in order to help visualize and understand the effect of different factors of variation in deep learning.

2. What Should Similarity Be Invariant To?

This section discusses the invariance properties of similarity indexes and their implications for measuring similarity of neural network representations. We argue that both intuitive notions of similarity and the dynamics of neural network training call for a similarity index that is invariant to orthogonal transformation and isotropic scaling, but not to invertible linear transformation.

2.1. Invariance to Invertible Linear Transformation

A similarity index is invariant to invertible linear transformation if s(X, Y) = s(XA, YB) for any full rank A and B. If activations X are followed by a fully-connected layer f(X) = σ(XW + β), then transforming the activations by a full rank matrix A as X′ = XA and transforming the weights by the inverse A^{−1} as W′ = A^{−1}W preserves the output of f(X). This transformation does not appear to change how the network operates, so intuitively, one might prefer a similarity index that is invariant to invertible linear transformation, as argued by Raghu et al. (2017).

However, a limitation of invariance to invertible linear transformation is that any invariant similarity index gives the same result for any representation of width greater than or equal to the dataset size, i.e. p_2 ≥ n. We provide a simple proof in Appendix A.

Theorem 1. Let X and Y be n × p matrices. Suppose s is invariant to invertible linear transformation in the first argument, i.e. s(X, Z) = s(XA, Z) for arbitrary Z and any A with rank(A) = p. If rank(X) = rank(Y) = n, then s(X, Z) = s(Y, Z).

There is thus a practical problem with invariance to invertible linear transformation: some neural networks, especially convolutional networks, have more neurons in some layers than there are examples in the training dataset (Springenberg et al., 2015; Lee et al., 2018; Zagoruyko & Komodakis, 2016). It is somewhat unnatural that a similarity index could require more examples than were used for training.
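To make the practical consequence of Theorem 1 concrete, here is a minimal NumPy sketch (an illustration, not code from the paper) of R^2_CCA, the mean squared canonical correlation summarized later in Table 1, which is invariant to invertible linear transformation. Once the width p reaches the number of examples n, the index assigns the same score to an invertible transformation of X as to an entirely unrelated random representation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_sq_cca(X, Y):
    """R^2_CCA = ||Q_Y^T Q_X||_F^2 / p_1 (see Table 1): mean squared
    canonical correlation, invariant to invertible linear transformation."""
    X = X - X.mean(axis=0)               # center columns
    Y = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(X)              # orthonormal basis for span(X)
    Qy, _ = np.linalg.qr(Y)
    p1 = min(X.shape[1], Y.shape[1])
    return np.linalg.norm(Qy.T @ Qx, 'fro') ** 2 / p1

n = 50                                    # number of examples
for p in (10, 200):                       # width below vs. above n
    X = rng.normal(size=(n, p))
    Y_related = X @ rng.normal(size=(p, p))    # invertible transform of X
    Y_unrelated = rng.normal(size=(n, p))      # unrelated representation
    print(p, mean_sq_cca(X, Y_related), mean_sq_cca(X, Y_unrelated))
# p = 10:  the transformed copy scores 1.0, the unrelated matrix far less.
# p = 200: both score identically (Theorem 1); the index is uninformative.
```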
A deeper issue is that neural network training is not invariant to arbitrary invertible linear transformation of inputs or activations. Even in the linear case, gradient descent converges first along the eigenvectors corresponding to the largest eigenvalues of the input covariance matrix (LeCun et al., 1991), and in cases of overparameterization or early stopping, the solution reached depends on the scale of the input. Similar results hold for gradient descent training of neural networks in the infinite width limit (Jacot et al., 2018). The sensitivity of neural network training to linear transformation is further demonstrated by the popularity of batch normalization (Ioffe & Szegedy, 2015).

[Figure 1. First principal components of representations of networks trained from different random initializations are similar. Each example from the CIFAR-10 test set is shown as a dot colored according to the value of the first two principal components of an intermediate layer of one network (left) and plotted on the first two principal components of the same layer of an architecturally identical network trained from a different initialization (right). Panel axes: Net A PC 1/PC 2 and Net B PC 1/PC 2.]

Invariance to invertible linear transformation implies that the scale of directions in activation space is irrelevant. Empirically, however, scale information is both consistent across networks and useful across tasks. Neural networks trained from different random initializations develop representations with similar large principal components, as shown in Figure 1. Consequently, Euclidean distances between examples, which depend primarily upon large principal components, are similar across networks. These distances are meaningful, as demonstrated by the success of perceptual loss and style transfer (Gatys et al., 2016; Johnson et al., 2016; Dumoulin et al., 2017). A similarity index that is invariant to invertible linear transformation ignores this aspect of the representation, and assigns the same score to networks that match only in large principal components as to networks that match only in small principal components.

2.2. Invariance to Orthogonal Transformation

Rather than requiring invariance to any invertible linear transformation, one could require a weaker condition: invariance to orthogonal transformation, i.e. s(X, Y) = s(XU, YV) for full-rank orthonormal matrices U and V such that U^T U = I and V^T V = I.

Indexes invariant to orthogonal transformation do not share the limitations of indexes invariant to invertible linear transformation. When p_2 > n, indexes invariant to orthogonal transformation remain well-defined. Moreover, orthogonal transformations preserve scalar products and Euclidean distances between examples.
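A quick numerical check of the last claim (illustrative only): multiplying the activations by a random orthogonal matrix leaves the matrix of inter-example dot products XX^T, and hence all pairwise Euclidean distances, unchanged, while a generic invertible matrix does not.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 8
X = rng.normal(size=(n, p))                    # toy activations

Q, _ = np.linalg.qr(rng.normal(size=(p, p)))   # random orthogonal matrix
A = rng.normal(size=(p, p))                    # generic (invertible) matrix

def gram(Z):
    """Matrix of inter-example dot products ZZ^T."""
    return Z @ Z.T

print(np.allclose(gram(X @ Q), gram(X)))       # True: dot products preserved
print(np.allclose(gram(X @ A), gram(X)))       # False: not preserved in general
```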
Invariance to orthogonal transformation seems desirable for neural networks trained by gradient descent. Invariance to orthogonal transformation implies invariance to permutation, which is needed to accommodate symmetries of neural networks (Chen et al., 1993; Orhan & Pitkow, 2018). In the linear case, orthogonal transformation of the input does not affect the dynamics of gradient descent training (LeCun et al., 1991), and for neural networks initialized with rotationally symmetric weight distributions, e.g. i.i.d. Gaussian weight initialization, training with fixed orthogonal transformations of activations yields the same distribution of training trajectories as untransformed activations, whereas an arbitrary linear transformation would not.

Given a similarity index s(·, ·) that is invariant to orthogonal transformation, one can construct a similarity index s′(·, ·) that is invariant to any invertible linear transformation by first orthonormalizing the columns of X and Y, and then applying s(·, ·). Given thin QR decompositions X = Q_X R_X and Y = Q_Y R_Y, one can construct a similarity index s′(X, Y) = s(Q_X, Q_Y), where s′(·, ·) is invariant to invertible linear transformation because orthonormal bases with the same span are related to each other by orthonormal transformation (see Appendix B).

2.3. Invariance to Isotropic Scaling

We expect similarity indexes to be invariant to isotropic scaling, i.e. s(X, Y) = s(αX, βY) for any α, β ∈ R^+. That said, a similarity index that is invariant to both orthogonal transformation and non-isotropic scaling, i.e. rescaling of individual features, is invariant to any invertible linear transformation. This follows from the existence of the singular value decomposition of the transformation matrix. Generally, we are interested in similarity indexes that are invariant to isotropic but not necessarily non-isotropic scaling.

3. Comparing Similarity Structures

Our key insight is that instead of comparing multivariate features of an example in the two representations (e.g. via regression), one can first measure the similarity between every pair of examples in each representation separately, and then compare the similarity structures. In neuroscience, such matrices representing the similarities between examples are called representational similarity matrices (Kriegeskorte et al., 2008a). We show below that, if we use an inner product to measure similarity, the similarity between representational similarity matrices reduces to another intuitive notion of pairwise feature similarity.

Dot Product-Based Similarity. A simple formula relates dot products between examples to dot products between features:

⟨vec(XX^T), vec(YY^T)⟩ = tr(XX^T YY^T) = ||Y^T X||_F^2.    (1)

The elements of XX^T and YY^T are dot products between the representations of the ith and jth examples, and indicate the similarity between these examples according to the respective networks. The left-hand side of (1) thus measures the similarity between the inter-example similarity structures. The right-hand side yields the same result by measuring the similarity between features from X and Y, by summing the squared dot products between every pair.
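The identity in Equation (1) is easy to verify numerically. The sketch below (illustrative, using randomly generated activations) compares the three quantities for centered X and Y.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p1, p2 = 30, 6, 9
X = rng.normal(size=(n, p1))
X -= X.mean(axis=0)                                 # centered activations
Y = rng.normal(size=(n, p2))
Y -= Y.mean(axis=0)

lhs = np.dot((X @ X.T).ravel(), (Y @ Y.T).ravel())  # <vec(XX^T), vec(YY^T)>
mid = np.trace(X @ X.T @ Y @ Y.T)                   # tr(XX^T YY^T)
rhs = np.linalg.norm(Y.T @ X, 'fro') ** 2           # ||Y^T X||_F^2
print(np.allclose(lhs, mid), np.allclose(mid, rhs)) # True True
```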
Hilbert-Schmidt Independence Criterion. Equation 1 implies that, for centered X and Y:

(1/(n − 1)^2) tr(XX^T YY^T) = ||cov(X^T, Y^T)||_F^2.    (2)

The Hilbert-Schmidt Independence Criterion (Gretton et al., 2005) generalizes Equations 1 and 2 to inner products from reproducing kernel Hilbert spaces, where the squared Frobenius norm of the cross-covariance matrix becomes the squared Hilbert-Schmidt norm of the cross-covariance operator. Let K_ij = k(x_i, x_j) and L_ij = l(y_i, y_j), where k and l are two kernels. The empirical estimator of HSIC is:

HSIC(K, L) = (1/(n − 1)^2) tr(KHLH),    (3)

where H is the centering matrix H_n = I_n − (1/n)11^T. For linear kernels k(x, y) = l(x, y) = x^T y, HSIC yields (2).

Gretton et al. (2005) originally proposed HSIC as a test statistic for determining whether two sets of variables are independent. They prove that the empirical estimator converges to the population value at a rate of 1/√n, and Song et al. (2007) provide an unbiased estimator. When k and l are universal kernels, HSIC = 0 implies independence, but HSIC is not an estimator of mutual information. HSIC is equivalent to maximum mean discrepancy between the joint distribution and the product of the marginal distributions, and HSIC with a specific kernel family is equivalent to distance covariance (Sejdinovic et al., 2013).

Centered Kernel Alignment. HSIC is not invariant to isotropic scaling, but it can be made invariant through normalization. This normalized index is known as centered kernel alignment (Cortes et al., 2012; Cristianini et al., 2002):

CKA(K, L) = HSIC(K, L) / sqrt(HSIC(K, K) HSIC(L, L)).    (4)

Similarity Index | Formula | Invariant to Invertible Linear Transform | Invariant to Orthogonal Transform | Invariant to Isotropic Scaling
Linear Reg. (R^2_LR) | ||Q_Y^T X||_F^2 / ||X||_F^2 | Y only | ✓ | ✓
CCA (R^2_CCA) | ||Q_Y^T Q_X||_F^2 / p_1 | ✓ | ✓ | ✓
CCA (ρ̄_CCA) | ||Q_Y^T Q_X||_* / p_1 | ✓ | ✓ | ✓
SVCCA (R^2_SVCCA) | ||(U_Y T_Y)^T U_X T_X||_F^2 / min(||T_X||_F^2, ||T_Y||_F^2) | If same subspace kept | ✓ | ✓
SVCCA (ρ̄_SVCCA) | ||(U_Y T_Y)^T U_X T_X||_* / min(||T_X||_F^2, ||T_Y||_F^2) | If same subspace kept | ✓ | ✓
PWCCA | Σ_{i=1}^{p_1} α_i ρ_i / ||α||_1, α_i = Σ_j |⟨h_i, x_j⟩| | ✗ | ✗ | ✓
Linear HSIC | ||Y^T X||_F^2 / (n − 1)^2 | ✗ | ✓ | ✗
Linear CKA | ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F) | ✗ | ✓ | ✓
RBF CKA | tr(KHLH) / sqrt(tr(KHKH) tr(LHLH)) | ✗ | ✓ | ✓*

Table 1. Summary of similarity methods investigated. Q_X and Q_Y are orthonormal bases for the columns of X and Y. U_X and U_Y are the left-singular vectors of X and Y, sorted in descending order according to the corresponding singular values. || · ||_* denotes the nuclear norm. T_X and T_Y are truncated identity matrices that select left-singular vectors such that the cumulative variance explained reaches some threshold. For RBF CKA, K and L are kernel matrices constructed by evaluating the RBF kernel between the examples as in Section 3, and H is the centering matrix H_n = I_n − (1/n)11^T. See Appendix C for more detail about each technique. *Invariance of RBF CKA to isotropic scaling depends on the procedure used to select the RBF kernel bandwidth parameter. In our experiments, we selected the bandwidth as a fraction of the median distance, which ensures that the similarity index is invariant to isotropic scaling.

For a linear kernel, CKA is equivalent to the RV coefficient (Robert & Escoufier, 1976) and to Tucker's congruence coefficient (Tucker, 1951; Lorenzo-Seva & Ten Berge, 2006).

Kernel Selection. Below, we report results of CKA with a linear kernel and with the RBF kernel k(x_i, x_j) = exp(−||x_i − x_j||_2^2 / (2σ^2)). For the RBF kernel, there are several possible strategies for selecting the bandwidth σ, which controls the extent to which similarity of small distances is emphasized over large distances. We set σ as a fraction of the median distance between examples. In practice, we find that RBF and linear kernels give similar results across most experiments, so we use linear CKA unless otherwise specified. Our framework extends to any valid kernel, including kernels equivalent to neural networks (Lee et al., 2018; Jacot et al., 2018; Garriga-Alonso et al., 2019; Novak et al., 2019).
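For concreteness, the following NumPy sketch implements linear and RBF CKA from Equations (3) and (4). It is an illustrative implementation rather than the authors' released code; the constant 1/(n − 1)^2 is dropped because it cancels in the CKA ratio, and the bandwidth heuristic (controlled by a hypothetical sigma_frac argument) is one reasonable reading of the median-distance rule described above.

```python
import numpy as np

def center_gram(K):
    """Apply the centering matrix H = I - (1/n) 11^T on both sides: HKH."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def hsic(K, L):
    """Empirical HSIC of Equation (3), without the 1/(n-1)^2 factor,
    which cancels in the CKA ratio below."""
    return np.trace(center_gram(K) @ center_gram(L))

def cka(K, L):
    """Centered kernel alignment, Equation (4)."""
    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))

def linear_kernel(X):
    return X @ X.T

def rbf_kernel(X, sigma_frac=1.0):
    """RBF kernel with bandwidth set to a fraction of the median distance."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    sigma = sigma_frac * np.sqrt(np.median(sq_dists))
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 64))                  # synthetic stand-in for one layer
Y = np.tanh(X @ rng.normal(size=(64, 32)))      # synthetic stand-in for another layer
print(cka(linear_kernel(X), linear_kernel(Y)))  # linear CKA
print(cka(rbf_kernel(X), rbf_kernel(Y)))        # RBF CKA
```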
4. Related Similarity Indexes

In this section, we briefly review linear regression, canonical correlation, and other related methods in the context of measuring similarity between neural network representations. We let Q_X and Q_Y represent any orthonormal bases for the columns of X and Y, i.e. Q_X = X(X^T X)^{−1/2}, Q_Y = Y(Y^T Y)^{−1/2}, or orthogonal transformations thereof. Table 1 summarizes the formulae and invariance properties of the indexes used in experiments. For a comprehensive general review of linear indexes for measuring multivariate similarity, see Ramsay et al. (1984).

Linear Regression. A simple way to relate neural network representations is via linear regression. One can fit every feature in X as a linear combination of features from Y. A suitable summary statistic is the total fraction of variance explained by the fit:

R^2_LR = 1 − min_B ||X − YB||_F^2 / ||X||_F^2 = ||Q_Y^T X||_F^2 / ||X||_F^2.    (5)

We are unaware of any application of linear regression to measuring similarity of neural network representations, although Romero et al. (2015) used a least squares loss between activations of two networks to encourage thin and deep "student" networks to learn functions similar to wide and shallow "teacher" networks.

Canonical Correlation Analysis (CCA). Canonical correlation finds bases for two matrices such that, when the original matrices are projected onto these bases, the correlation is maximized. For 1 ≤ i ≤ p_1, the ith canonical correlation coefficient ρ_i is given by:

ρ_i = max_{w_X^i, w_Y^i} corr(X w_X^i, Y w_Y^i)
subject to ∀_{j<i}: X w_X^i ⊥ X w_X^j and Y w_Y^i ⊥ Y w_Y^j.
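In practice, the canonical correlations ρ_i can be obtained as the singular values of Q_Y^T Q_X. The sketch below (an illustrative implementation based on that standard identity, not the authors' code) computes them along with the two CCA summary statistics from Table 1, R^2_CCA = Σ_i ρ_i^2 / p_1 and ρ̄_CCA = Σ_i ρ_i / p_1.

```python
import numpy as np

def cca_summaries(X, Y):
    """Canonical correlations via the singular values of Q_Y^T Q_X,
    plus the Table 1 summary statistics R^2_CCA and rho-bar_CCA."""
    X = X - X.mean(axis=0)                   # center columns
    Y = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(X)                  # orthonormal bases
    Qy, _ = np.linalg.qr(Y)
    rho = np.linalg.svd(Qy.T @ Qx, compute_uv=False)  # canonical correlations
    p1 = min(X.shape[1], Y.shape[1])
    return np.sum(rho ** 2) / p1, np.sum(rho) / p1

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
Y = X @ rng.normal(size=(10, 20)) + 0.5 * rng.normal(size=(200, 20))  # noisy linear map
r2_cca, rho_bar = cca_summaries(X, Y)
print(r2_cca, rho_bar)   # high but below 1 for this noisy linear relationship
```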