An Empirical Study of Large-Batch Stochastic Gradient Descent with Structured Covariance Noise

Yeming Wen*,1,2, Kevin Luk*,3, Maxime Gazeau*,3, Guodong Zhang1,2, Harris Chan1,2, Jimmy Ba1,2
1 University of Toronto, 2 Vector Institute, 3 Borealis AI

Abstract

The choice of batch-size in a stochastic optimization algorithm plays a substantial role in both optimization and generalization. Increasing the batch-size typically improves optimization but degrades generalization. To address the problem of improving generalization while maintaining optimal convergence in large-batch training, we propose to add covariance noise to the gradients. We demonstrate that the learning performance of our method is more accurately captured by the structure of the covariance matrix of the noise than by the variance of the gradients. Moreover, for the convex quadratic, we prove that it can be characterized by the Frobenius norm of the noise matrix. Our empirical studies with standard deep learning model architectures and datasets show that our method not only improves generalization performance in large-batch training, but does so in a way where the optimization performance remains desirable and the training duration is not elongated.

[Figure 1: Noise structure in a simple two-dimensional regression problem, shown as four panels over weights $W_1$ and $W_2$. Top-left: one-step SGD update. Top-right: one-step SGD update with isotropic Gaussian ($\sigma = 0.1$) noise. Bottom-left: one-step SGD update with full Fisher noise. Bottom-right: one-step SGD update with diagonal Fisher noise. The full Fisher noise almost recovers the SGD noise. Observe that the full Fisher noise direction is perpendicular to the contours of the loss surface. Moreover, full Fisher exhibits slower convergence than diagonal Fisher; we refer to Section 4 for a more detailed analysis.]

1 Introduction

From a strictly mathematical perspective, training neural networks is a high-dimensional non-convex optimization problem and the dynamics of the training process are incredibly complicated. Despite this, Stochastic Gradient Descent (SGD) and its variants have proven to be extremely effective for training neural networks in practice. Much of the recent success of deep learning in application tasks such as image recognition (He et al., 2016), speech recognition (Amodei et al., 2016), natural language processing (Wu et al., 2016), and game playing (Mnih et al., 2015) can be seen as a testament to the effectiveness of SGD.

The choice of batch-size plays an important role in the learning behavior of SGD. Taking larger batch-sizes ensures better gradient estimation, which typically leads to faster training convergence. However, there is a tradeoff from the viewpoint of generalization: the intrinsic noise stemming from mini-batch gradients provides regularization effects (Chaudhari and Soatto, 2017; Smith and Le, 2017), and by increasing batch-sizes, we lose such generalization benefits. It is then an interesting question to ask whether large-batch training can be engineered in a way that significantly improves generalization while not sacrificing too much training convergence.
This is exactly the central objective of our paper. To address this question, we propose to add a noise term whose covariance structure is given by the diagonal Fisher matrix to the large-batch gradient updates.

We now discuss the motivations underlying our approach. Under the standard log-likelihood loss assumption, the difference between large-batch gradients and small-batch gradients can be modeled as Fisher noise. We can expect that adding this noise directly to large-batch gradients will yield small-batch performance. While this may resolve the generalization issues associated with large-batch training (Keskar et al., 2016; Hoffer et al., 2017), the resulting convergence performance is undesirable. To attain our ultimate goal of designing a method which enjoys desirable optimization and generalization performance simultaneously, we reduce the noise level by changing the covariance structure from full Fisher to diagonal Fisher.

Variance is commonly regarded as a criterion of optimization performance. However, for the method proposed in this paper, studying the gradient variance is not sufficient to deduce any information about the optimization behavior. Rather, it is the structure of the covariance matrix of the noise which plays a critical role. For large-batch training with diagonal Fisher, we find that despite having a high gradient variance, it still attains ideal optimization performance.

Outline and Summary of Main Contributions. We begin in Section 2 by introducing the basic framework and necessary definitions. We consider different covariance structures for large-batch training in Section 2.2 and then propose the choice of diagonal Fisher. Sections 3 and 4 constitute the central contributions of the paper. The primary takeaways are:

• Gradient variance is not an accurate indicator of optimization behavior. In Fig. 3, we find empirically that the convergence of large-batch with diagonal Fisher is much faster than that of large-batch with full Fisher and small-batch. However, in Fig. 2, we find that all these regimes share roughly the same average gradient variance.

• The main theoretical contribution is Theorem 4.1. We show that, over the convex quadratic setting, the convergence can be characterized by the Frobenius norm of the noise covariance matrix. In Fig. 5(a), we show empirically that this carries over to the non-convex deep learning context.

In Section 5, we apply our methodology to address the "generalization gap" problem. We show that within the same number of training epochs, large-batch with diagonal Fisher can attain generalization performance roughly comparable to that of small-batch. Related works are discussed in Section 6 and we close the paper in Section 7.

2 Preliminaries and Approach

2.1 Preliminary Background

Excess Risk Decomposition. We work in the standard framework of supervised learning. Let $\mathcal{D}$ be the unknown joint probability distribution over the data domain $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the input space and $\mathcal{Y}$ is the target space. We have a training set $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ of $N$ input-target samples drawn i.i.d. from $\mathcal{D}$. The family of classifiers of interest to us are neural network outputs $f(x_i, \theta)$, where $\theta \in \mathbb{R}^d$ are the parameters of the network. Let $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be the loss function measuring the disagreement between outputs $f(x_i, \theta)$ and targets $y_i$. For convenience, we use the notation $\ell_i(\theta)$ to denote $\ell(f(x_i, \theta), y_i)$.
The expected risk and empirical risk functions are defined to be
\[
\mathcal{L}(\theta) := \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(f(x,\theta), y)], \qquad L(\theta) := \frac{1}{N}\sum_{i=1}^{N} \ell_i(\theta).
\]
The standard technique to analyze the interplay between optimization and generalization in statistical learning theory is excess risk decomposition. The excess risk, after $k$ iterations, is defined as
\[
\Delta_k := \mathcal{L}(\theta_k) - \inf_{\theta\in\mathbb{R}^d} \mathcal{L}(\theta).
\]
Following Bottou and Bousquet (2008); Chen et al. (2018), the expected excess risk can be upper-bounded by
\[
\mathbb{E}_S[\Delta_k] \le \underbrace{\mathbb{E}_S[\mathcal{L}(\theta_k) - L(\theta_k)]}_{\mathcal{E}_{\mathrm{gen}}} + \underbrace{\mathbb{E}_S[L(\theta_k) - L(\theta^*)]}_{\mathcal{E}_{\mathrm{opt}}}, \tag{1}
\]
where $\theta^* = \mathrm{argmin}_\theta L(\theta)$ is the empirical risk minimizer. The terms $\mathcal{E}_{\mathrm{gen}}$ and $\mathcal{E}_{\mathrm{opt}}$ are the expected generalization and optimization errors respectively. Optimization algorithms are often studied from one perspective only: either optimization or generalization. The decomposition in Eqn. 1 indicates that both aspects should be analyzed together (Bottou and Bousquet, 2008; Chen et al., 2018), since the goal of a good optimization-generalization algorithm in machine learning is to minimize the excess risk in the least number of iterations.

2.2 Motivations and Approach

We begin by formalizing the setup. Let $B_L$ denote the large-batch and $M_L = |B_L|$ denote its size. We consider the following modification of large-batch SGD updates:
\[
\theta_{k+1} = \theta_k - \alpha_k \nabla L_{M_L}(\theta_k) + \alpha_k C(\theta_k)\xi_k, \tag{2}
\]
where $\alpha_k$ is the learning rate, $\xi_k \sim \mathcal{N}(0, I_d)$ is multivariate Gaussian with mean zero and identity covariance, and $\nabla L_{M_L}(\theta_k) = \frac{1}{M_L}\sum_{i\in B_L} \nabla\ell_i(\theta_k)$ is the large-batch gradient. We can interpret Eqn. 2 as modifying large-batch SGD by injecting Gaussian noise with mean zero and covariance $C(\theta_k)C(\theta_k)^\top$ into the gradients. The central goal of this paper is to determine a suitable matrix $C(\theta_k)$ such that the excess risk of the algorithm in Eqn. 2 is minimized; in more concrete terms, such that it achieves low optimization and generalization error simultaneously within a reasonable computational budget.

2.2.1 Intrinsic SGD Noise

Let $B \subset S$ be a mini-batch drawn uniformly and without replacement from $S$ and $M = |B|$ be the size of this mini-batch. We can write the SGD update rule as
\[
\theta_{k+1} = \theta_k - \alpha_k \nabla L_M(\theta_k) = \theta_k - \alpha_k \nabla L(\theta_k) + \alpha_k \underbrace{(\nabla L(\theta_k) - \nabla L_M(\theta_k))}_{\delta_k},
\]
where $\nabla L(\theta_k) = \frac{1}{N}\sum_{i=1}^{N} \nabla\ell_i(\theta_k)$ is the full-batch gradient. The difference $\delta_k = \nabla L(\theta_k) - \nabla L_M(\theta_k)$ is the intrinsic noise stemming from mini-batch gradients. The covariance of $\delta_k$ is given by
\[
\frac{N-M}{M}\,\frac{1}{N} \sum_{i=1}^{N} (\nabla L(\theta_k) - \nabla\ell_i(\theta_k))(\nabla L(\theta_k) - \nabla\ell_i(\theta_k))^\top. \tag{3}
\]
This result can be found in Hu et al. (2017); Hoffer et al. (2017); this type of noise has also been studied in Zhu et al. (2018). For the purposes of this paper, we take the loss to be the negative log-likelihood, $\ell_i(\theta_k) = -\log p(y_i|x_i, \theta_k)$, where $p(y|x,\theta)$ is the density function of the model's predictive distribution. Moreover, we assume that the gradient covariance matrix above can be approximated by
\[
\frac{N-M}{M}\,\underbrace{\frac{1}{N}\sum_{i=1}^{N} \nabla\log p(y_i|x_i,\theta_k)\,\nabla\log p(y_i|x_i,\theta_k)^\top}_{F(\theta_k)}, \tag{4}
\]
where $(x_i, y_i)$ are sampled from the empirical data distribution. In the literature, the matrix $F(\theta_k)$ above is often referred to as the empirical Fisher matrix (Martens, 2014). We make this approximation for two reasons. First, computing the full-batch gradient $\nabla L(\theta_k)$ at every iteration is not computationally feasible. Secondly, from Fig. 3, we find empirically that the training dynamics of a large-batch regime with empirical Fisher noise are very close to those of a small-batch regime (which, by the above analysis, should be captured by large-batch with the empirical covariance in Eqn. 3), suggesting that it is a reasonable assumption to make. For the remainder of this paper, unless otherwise specified, all mentions of "Fisher matrix" or $F(\theta)$ refer to the empirical Fisher. For completeness, we provide explicit expressions of diagonal Fisher for feed-forward and convolutional network architectures in Appendix A.3.
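To make the estimator in Eqn. 4 concrete, here is a minimal NumPy sketch. It assumes that per-example log-likelihood gradients have already been stacked into a matrix `G`; how `G` is obtained is framework-specific (the trick of Goodfellow (2015) mentioned later is one option). This is an illustrative sketch under our own assumptions, not the authors' code.

```python
import numpy as np

def empirical_fisher_diag(G):
    """diag(F) in Eqn. 4: F = (1/M) sum_i g_i g_i^T, so its diagonal is the
    per-coordinate mean of the squared per-example gradients.

    G: (M, d) array, one log-likelihood gradient per sampled example.
    """
    return np.mean(G ** 2, axis=0)

def intrinsic_noise_std(G, N):
    """Componentwise standard deviation of the mini-batch noise model in
    Eqn. 4: covariance ~ ((N - M) / M) * F, with M = G.shape[0] samples
    out of N training points."""
    M = G.shape[0]
    return np.sqrt((N - M) / M * empirical_fisher_diag(G))
```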
2.2.2 Naive Choices of Covariance Matrices

We begin by considering the choice $C(\theta_k) = 0$. In this case, Eqn. 2 is just standard large-batch gradient descent. Since large batches provide better gradient estimation, we can expect better training error per parameter update. However, from the perspective of generalization, it has been observed in LeCun et al. (1998); Keskar et al. (2016); Hoffer et al. (2017) that using larger batch-sizes can lead to a decay in the generalization performance of the model.

Now, let $B_S$, $B_L$ denote the small-batch and large-batch, and $M_S = |B_S|$, $M_L = |B_L|$ denote their sizes. Consider $C(\theta_k)$ to be
\[
C(\theta_k) = \sqrt{\frac{M_L - M_S}{M_L M_S}}\,\sqrt{F(\theta_k)}. \tag{5}
\]
If the intrinsic SGD noise is reasonably approximated as a Gaussian distribution with mean zero and covariance $C(\theta_k)C(\theta_k)^\top$, with $C(\theta_k)$ as above, then Eqn. 2 with this choice of $C(\theta_k)$ should exhibit behavior similar to small-batch. If this is the case, then we can expect Eqn. 2 to exhibit poor convergence. Indeed, as shown on a 2D convex example in Fig. 1, choosing $C(\theta_k)$ as in Eqn. 5 essentially recovers SGD behavior. Furthermore, on the CIFAR-10 image classification task trained with ResNet-44 in Fig. 3, we find that adding this noise significantly worsens the training convergence. Thus, choosing $C(\theta_k)$ as in Eqn. 5 does not satisfy our objective of simultaneously attaining desirable convergence and generalization for large-batch training.

2.2.3 Using Diagonal Fisher

We now propose to take a "middle ground" and choose $C(\theta_k)$ to be
\[
C(\theta_k) = \sqrt{\frac{M_L - M_S}{M_L M_S}}\,\sqrt{\mathrm{diag}(F(\theta_k))}. \tag{6}
\]
A formal statement is given in Algorithm 1. Changing from full Fisher to diagonal Fisher has important implications for both optimization and generalization behavior. In our empirical analysis in Sections 3 and 5, we show that Algorithm 1 can achieve both desirable convergence and generalization performance within an epoch training budget, which implies that the excess risk is minimized. With regards to computational complexity, computing diagonal Fisher only introduces minor overhead; Goodfellow (2015) shows that it can be done at the cost of one forward pass.
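As a concrete illustration of the update in Eqn. 2 with the covariance choice of Eqn. 6, the following PyTorch-style sketch performs one noisy parameter update. The `fisher_diag` tensors are assumed to be precomputed estimates of $\mathrm{diag}(F(\theta_k))$, shaped like the parameters; this is a hedged sketch of the idea, not the authors' implementation.

```python
import torch

def diag_fisher_noisy_step(params, grads, fisher_diag, lr, M_L=4096, M_S=128):
    """One step of Eqn. 2: theta <- theta - lr * grad + lr * C(theta) xi,
    with C(theta) = sqrt((M_L - M_S)/(M_L * M_S)) * sqrt(diag F) (Eqn. 6)."""
    scale = (M_L - M_S) / (M_L * M_S)
    with torch.no_grad():
        for p, g, f in zip(params, grads, fisher_diag):
            xi = torch.randn_like(p)              # xi_k ~ N(0, I)
            noise = torch.sqrt(scale * f) * xi    # componentwise C(theta) xi
            p.add_(lr * (-g + noise))
```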
3 Variance and Optimization

The objective of this section is to examine the optimization performance of four regimes: large-batch with $C(\theta_k) = 0$ (standard large-batch), large-batch with $C(\theta_k)$ equal to diagonal Fisher as in Eqn. 6, large-batch with $C(\theta_k)$ equal to full Fisher as in Eqn. 5, and small-batch. In the experiments, we fix the large-batch size to 4096 and the small-batch size to 128. For conciseness, we write LB and SB for large-batch and small-batch respectively.

Algorithm 1 Adding diagonal Fisher noise to large-batch SGD. Differences from standard large-batch SGD are highlighted in blue.
  Require: number of iterations $K$, initial step-size $\alpha_0$, large-batch $B_L$ of size $M_L$, small-batch $B_S$ of size $M_S$, initial condition $\theta_0 \in \mathbb{R}^d$
  for $k = 1$ to $K$ do
    $\xi_k \sim \mathcal{N}(0, I_d)$
    $\epsilon_k = \alpha_k \sqrt{\frac{M_L - M_S}{M_L M_S}}\,\sqrt{\mathrm{diag}(F(\theta_k))}\,\xi_k$
    $\theta_{k+1} = \theta_k - \alpha_k \nabla L_{M_L}(\theta_k) + \epsilon_k$
  end for

In Fig. 3, we compare the training error (measured per parameter update) of ResNet44 on CIFAR-10 for the four regimes. The same learning rate is used across all four regimes (better training error can be obtained if we tune the learning rate further for the LB regimes). We find that LB with diagonal Fisher trains much faster than LB with full Fisher and SB. In fact, LB with diagonal Fisher attains convergence similar to LB, demonstrating that adding this particular form of noise does not much hamper optimization performance.

We now analyze the gradient variance of each of the four regimes. We define the gradient variance to be the trace of the covariance matrix of the gradients. In the experiment depicted in Fig. 2, we provide an estimate of the gradient variance of the four regimes. The experiment is performed as follows: we freeze a partially-trained network and compute Monte-Carlo estimates of the gradient variance with respect to each parameter over different mini-batches. This variance is then averaged over the parameters within each layer.

[Figure 2: Average variance of gradients for LB, SB, LB with full Fisher and LB with diagonal Fisher, shown on a log scale for several layers.]

[Figure 3: Training error per parameter update for SB, LB, LB with full Fisher, LB with diagonal Fisher and LB with true Fisher.]

In Fig. 2, we find that LB with diagonal Fisher, LB with full Fisher, and SB all share roughly the same gradient variance, while LB has a much lower one. However, the optimization behaviors are completely different. These experiments show that the number of iterations needed to reach a small optimization error is not purely determined by the gradient variance.

Many recent works have suggested adding isotropic noise to the gradient dynamics to escape from saddle points or local minima (Ge et al., 2015; Jin et al., 2017). We believe that the number of iterations needed to escape saddle points is determined by the nature of the noise. A similar observation was made in Zhu et al. (2018), where the authors studied how using Fisher covariance noise relates to the efficiency of escaping from local minima and compared it to isotropic noise. In the next section, we analyze the efficiency of the algorithm in Eqn. 2 for a convex quadratic when the covariance matrix is given by the (exact true) Fisher matrix. We prove that it requires fewer iterations to reach the global optimum compared to isotropic noise. We reiterate that this choice of diffusion matrix is completely specific to the convex quadratic example.
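The gradient-variance measurement behind Fig. 2 can be reproduced along the following lines. This is a generic PyTorch sketch under our own assumptions about the data pipeline (an iterable of (x, y) mini-batches, and a model whose parameters all receive gradients), not the authors' experiment code.

```python
import torch

def layerwise_gradient_variance(model, loss_fn, batches):
    """Monte-Carlo estimate of per-parameter gradient variance over
    mini-batches, averaged within each parameter tensor (roughly one
    value per layer). The network is assumed frozen, i.e. no updates."""
    sums, sq_sums, n = None, None, 0
    for x, y in batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        if sums is None:
            sums = [torch.zeros_like(g) for g in grads]
            sq_sums = [torch.zeros_like(g) for g in grads]
        for s, s2, g in zip(sums, sq_sums, grads):
            s.add_(g)
            s2.add_(g ** 2)
        n += 1
    # Var[g] = E[g^2] - E[g]^2, then averaged over each tensor's entries.
    return [((s2 / n) - (s / n) ** 2).mean().item()
            for s, s2 in zip(sums, sq_sums)]
```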
4 Convex Quadratic Example

4.1 Motivations

In this section, we analyze the optimization behavior of our proposed algorithm for the convex quadratic model. Analyzing the convex quadratic model serves as a good proxy for understanding the complex dynamics of neural network training. Although the optimization landscape of neural networks is non-convex, using such a model marginalizes away inessential features and enables us to tractably analyze optimization phenomena of neural networks. There is ample evidence supporting this: the recent work of Zhang et al. (2019) uses the convex quadratic model to accurately predict critical batch-sizes for commonly used optimizers in neural networks. Short-horizon bias phenomena of optimized learning rates were studied in Wu et al. (2018) for the convex quadratic, and their theoretical insights were successfully translated to neural networks. Furthermore, a convex quadratic objective can always be obtained by first linearizing around a given parameter vector and then taking a second-order Taylor approximation. Recent empirical work (Lee et al., 2019) has demonstrated that linearized approximations do indeed match the training phenomena of large yet realistic networks. Therefore, approximating the loss surface of a neural network with a convex quadratic has proven to be a fertile "testing ground" when introducing new methodologies in deep learning. Analyzing the toy quadratic problem has led to important insights; for example, in learning rate scheduling (Schaul et al., 2013) and in formulating SGD as approximate Bayesian inference (Mandt et al., 2017).

[Figure 4: Trajectory using full Fisher versus diagonal Fisher noise for the algorithm in Eqn. 2 used to minimize a two-dimensional quadratic function (panels: Full Fisher, Diagonal Fisher). The blue dot indicates the initial parameter value and the green dot shows the final parameter value. We used a learning rate of 0.1 for 500 iterations (plotting every 10 iterations). Observe that adding diagonal Fisher noise to the gradient achieves faster convergence than full Fisher.]

4.2 Analysis

For strongly-convex objective functions and diminishing step-sizes, the expected optimality gap is bounded in terms of the second-order moment of the gradients (Bottou et al., 2018). However, in practice, different algorithms having the same gradient moments might not need the same number of iterations to converge to the minimum. Consider the loss function
\[
L(\theta) = \frac{1}{2}\theta^\top A \theta,
\]
where $A$ is a symmetric, positive-definite matrix. We focus on the algorithm in Eqn. 2 and consider a constant $d \times d$ covariance matrix $C$. The following theorem, adapted from Bottou et al. (2018), analyzes the convergence of this optimization method. The proof is relegated to Section A.1 of the Appendix.

Theorem 4.1. Let $\lambda_{\max}$ and $\lambda_{\min}$ denote the maximum and minimum eigenvalues of $A$ respectively. For a chosen $\alpha_0 \le \lambda_{\max}^{-1}$, suppose that we run the algorithm in Eqn. 2 according to the decaying step-size sequence
\[
\alpha_k = \frac{2}{(k+\gamma)\lambda_{\min}}
\]
for all $k \in \mathbb{N}_{>0}$, where $\gamma$ is chosen such that $\alpha_k \le \alpha_0$. Then for all $k \in \mathbb{N}$,
\[
\mathbb{E}[L(\theta_k)] \le \frac{\nu}{k+\gamma}, \qquad \text{where } \nu = \max\left\{\frac{2\,\mathrm{Tr}(C^\top A C)}{\lambda_{\min}^2},\ \gamma L(\theta_0)\right\}.
\]

We make a couple of observations concerning this bound. First, the convergence rate is optimal when $C = 0$, which is expected. In this case, there is no noise and hence we obtain no regularization benefits, which leads to poor generalization. A more formal discussion is given at the end of Section A.2 in the Supplementary Material, where we show that if we employ a scaling factor $C_\lambda := \lambda C$, then as $\lambda \to 0$ the expected generalization error becomes worse. The second observation is that the term of importance in this theorem is $\mathrm{Tr}(C^\top A C)$. While the overall convergence rate of the algorithm is $O(1/k)$, the discrepancy in convergence performance for different choices of the matrix $C$ rests entirely on this term.
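Theorem 4.1 and the role of $\mathrm{Tr}(C^\top A C)$ are easy to check numerically. The following self-contained NumPy sketch runs the iteration of Eqn. 2 on a random quadratic with the theorem's decaying step size, comparing $C = \sqrt{A}$ against $C = \sqrt{\mathrm{diag}(A)}$. The dimensions, seed, and iteration counts are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
B = rng.standard_normal((d, d))
A = B @ B.T + d * np.eye(d)                   # symmetric positive-definite

w, V = np.linalg.eigh(A)
C_full = V @ np.diag(np.sqrt(w)) @ V.T        # C = sqrt(A)
C_diag = np.diag(np.sqrt(np.diag(A)))         # C = sqrt(diag(A))

def mean_final_loss(C, steps=5000, runs=10):
    lam_min, lam_max = w[0], w[-1]
    gamma = 2 * lam_max / lam_min             # keeps alpha_k <= 1 / lam_max
    total = 0.0
    for _ in range(runs):                     # average over noise draws
        theta = np.ones(d)
        for k in range(1, steps + 1):
            alpha = 2 / ((k + gamma) * lam_min)
            theta = theta - alpha * (A @ theta) \
                          + alpha * (C @ rng.standard_normal(d))
        total += 0.5 * theta @ A @ theta
    return total / runs

print("Tr(C'AC)  full:", np.trace(C_full @ A @ C_full),
      " diag:", np.trace(C_diag @ A @ C_diag))
print("E[L]      full:", mean_final_loss(C_full),
      " diag:", mean_final_loss(C_diag))
```

The diagonal choice has the smaller $\mathrm{Tr}(C^\top A C)$ and, correspondingly, the smaller expected loss after the same number of iterations.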
Table 1: Number of iterations needed for the algorithm in Eqn. 2 to reach an $\epsilon$ error for a strongly-convex quadratic loss function. Since $\|\mathrm{diag}(A)\|^2_{\mathrm{Frob}}$ is smaller than $\|A\|^2_{\mathrm{Frob}}$, fewer iterations are required when choosing $C = \sqrt{\mathrm{diag}(A)}$.

covariance matrix C        steps k to reach epsilon error
sqrt(A)                    ||A||^2_Frob / epsilon
sqrt(diag(A))              ||diag(A)||^2_Frob / epsilon

The number of iterations for the algorithm in Eqn. 2 to reach the unique minimum depends entirely on $\mathrm{Tr}(C^\top A C)$ and not on the second-order moment. We analyze two specific cases which are relevant for us: the first where $C$ is the square-root of $A$, $C = \sqrt{A}$, and the second where $C$ is the square-root of the diagonal of $A$, $C = \sqrt{\mathrm{diag}(A)}$. The second-order moment of the noise perturbation is the same for both and is given by
\[
\mathbb{E}_\xi[\|C\xi_k\|^2] = \mathrm{Tr}(C^\top C) = \mathrm{Tr}(A). \tag{7}
\]
However, $\mathrm{Tr}(C^\top A C)$ differs. In the case $C = \sqrt{A}$, we get
\[
\mathrm{Tr}(C^\top A C) = \mathrm{Tr}(A^2) = \|A\|^2_{\mathrm{Frob}},
\]
and in the case $C = \sqrt{\mathrm{diag}(A)}$,
\[
\mathrm{Tr}(C^\top A C) = \mathrm{Tr}(\mathrm{diag}(A)^2) = \|\mathrm{diag}(A)\|^2_{\mathrm{Frob}}.
\]
Thus, the difference in training performance between the two cases can be measured by the difference of their respective Frobenius norms, and fewer iterations are needed with the choice $\sqrt{\mathrm{diag}(A)}$. This suggests that the off-diagonal elements of $A$ play a role in the optimization performance. In Fig. 4, we provide a visualization of the difference between $C = \sqrt{A}$ and $C = \sqrt{\mathrm{diag}(A)}$ over a two-dimensional quadratic function. We summarize our observations in Table 1: different choices of the covariance matrix $C$ impact the number of iterations required to reach an $\epsilon$ error.

Limitations: We mention a couple of limitations of the current analysis. First, the matrix $A$ above is the Hessian, which is also, in this specific setup, equal to the (exact true) Fisher defined as the expectation of the outer product of log-likelihood gradients,
\[
\mathbb{E}_{P_x, P_{y|x}}[\nabla\log p(y|x,\theta)\,\nabla\log p(y|x,\theta)^\top]. \tag{8}
\]
The expectation here is taken with respect to the data distribution $P_x$ for inputs $x$ and the model's predictive distribution $P_{y|x}$ for targets $y$. For much of this paper, we have been working with the empirical Fisher instead. We believe this to be a reasonable approximation; in Fig. 3, we find that the training curves for LB + full Fisher and LB + true Fisher are almost identical. Furthermore, if we assume that the implicit conditional distribution over the network's output is close to the conditional distribution of targets from the training distribution, then the covariance of the gradients closely matches the Hessian (Martens, 2014). The recent work of Zhang et al. (2019) shows that this relationship indeed holds tightly in empirical settings. Secondly, we have focused solely on the optimization performance of our proposed method. While it would be desirable to accompany this with a complete theoretical analysis of generalization performance, this is beyond the current scope of the paper. However, in Appendix A.2, we use the framework of uniform stability to provide some theoretical insights on how different choices of the covariance matrix impact generalization.

5 Experiments

5.1 Experimentation Details

Batch Normalization. For all experiments involving LB, we adopt Ghost Batch Normalization (GBN) (Hoffer et al., 2017); hence LB throughout stands for LB with Ghost Batch Normalization. This allows a fair comparison between LB and SB, as it ensures that batch normalization statistics are computed over the same number of training examples. Using standard batch normalization for large batches can lead to degradation in model quality (Hoffer et al., 2017).
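For reference, Ghost Batch Normalization computes normalization statistics over fixed-size "ghost" sub-batches of the large batch (Hoffer et al., 2017). Below is a minimal training-mode sketch for 2D inputs, under our simplifications: the affine parameters and running statistics of a full implementation are omitted.

```python
import torch

def ghost_batch_norm(x, ghost_size=128, eps=1e-5):
    """Normalize (batch, features) input x using statistics computed per
    ghost sub-batch of `ghost_size` examples, not over the whole batch."""
    out = []
    for g in x.split(ghost_size, dim=0):
        mean = g.mean(dim=0, keepdim=True)
        var = g.var(dim=0, unbiased=False, keepdim=True)
        out.append((g - mean) / torch.sqrt(var + eps))
    return torch.cat(out, dim=0)
```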
Models and Datasets. The network architectures we use are fully-connected networks, shallow convolutional networks (LeNet (LeCun et al., 1998), AlexNet (Krizhevsky et al., 2012)), and deep convolutional networks (VGG16 (Simonyan and Zisserman, 2014), ResNet44 (He et al., 2016), and ResNet44x2, in which the number of filters is doubled). These models are evaluated on standard deep-learning datasets: MNIST, Fashion-MNIST (LeCun et al., 1998; Xiao et al., 2017), CIFAR-10 and CIFAR-100 (Krizhevsky and Hinton, 2009).

5.2 Frobenius Norm

For the convex quadratic setting of Section 4, Theorem 4.1 tells us that the number of iterations to reach the optimum is characterized by the Frobenius norm. Hence, the optimization difference between using large-batch with diagonal Fisher versus full Fisher lies in the difference of their respective Frobenius norms. We now give an empirical verification of this phenomenon in the non-convex setting of deep neural networks. We compute the Frobenius norms during the training of the ResNet44 network on CIFAR-10. Fig. 5(a) shows that the full Fisher matrix has a much larger Frobenius norm than the diagonal Fisher matrix, which suggests that using diagonal Fisher noise should give faster convergence than full Fisher noise in the deep neural network setting. Indeed, Fig. 3 shows that LB with full Fisher converges at the same rate as SB, whereas LB with diagonal Fisher converges much faster; in fact, roughly at the same rate as LB. This indicates that adding diagonal Fisher noise to LB does not degrade the optimization performance of LB.

[Figure 5: (a) Frobenius norms of full Fisher and diagonal Fisher along the training trajectory; the model is ResNet44 trained on CIFAR-10. (b) Maximum eigenvalue of the Hessian at the end of training for LB with Ghost Batch Normalization, LB with Ghost Batch Normalization + diagonal Fisher, and SB, across ResNet44 and VGG16 on CIFAR-10 and CIFAR-100. Error bars are computed over 3 random seeds.]
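The Frobenius-norm comparison of Fig. 5(a) reduces to a small computation once per-example gradients are available. A NumPy sketch for a toy dimension d follows; `G` is an assumed matrix of per-example gradients, and for real networks one would use the Gram-matrix identity in the comment rather than forming F explicitly.

```python
import numpy as np

def fisher_frobenius_norms(G):
    """Frobenius norms of the full and diagonal empirical Fisher,
    F = G^T G / M, from an (M x d) matrix of per-example gradients."""
    M = G.shape[0]
    F = G.T @ G / M
    full = np.linalg.norm(F)           # ||F||_Frob
    diag = np.linalg.norm(np.diag(F))  # ||diag(F)||_Frob
    # Without forming F: ||F||_Frob = ||G @ G.T||_Frob / M, which only
    # needs the M x M Gram matrix.
    return full, diag
```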
5.3 Maximum Eigenvalue of Hessian

While the relationship between the curvature of the loss surface and generalization is not completely explicit, numerous works have suggested that the maximum eigenvalue of the Hessian is possibly correlated with generalization performance (Keskar et al., 2016; Chaudhari et al., 2016; Chaudhari and Soatto, 2017; Yoshida and Miyato, 2017; Xing et al., 2018). In this line of research, the magnitudes of the eigenvalues of the Hessian may be interpreted as a heuristic measure of generalization: the smaller the magnitude, the better the model generalizes. To situate our method within this philosophy, we compute the maximum eigenvalue of the Hessian of the final model for the following three regimes: SB, LB, and LB with diagonal Fisher. We point out that we do not experiment with LB with full Fisher due to its exceedingly long training time. This can be seen from Fig. 3, where for ResNet44 trained on CIFAR-10, LB with full Fisher does not achieve good convergence even after 8000 parameter updates.

We provide the details of this experiment. Computing the maximum eigenvalue without any modification to the model gives inconsistent estimates, even between different runs of the same training configuration. To make the maximum eigenvalue of the Hessian comparable over different training trajectories, the Hessian needs to be invariant under typical weight reparameterizations, for example, affine transformations (Liao et al., 2018). To achieve this, we make the following modifications to the trained model: (1) for layers with batch normalization, we push the batch-norm layer parameters and running statistics into the layerwise parameter space so that the layer is invariant under affine transformations; and (2) for layers without batch normalization, reparameterization changes the prediction confidence while the prediction remains the same. Hence, we train a temperature parameter on the cross-entropy loss on the test data set; this encourages the model to make calibrated predictions (prevents it from being over-confident).

In Fig. 5(b), we give error bars for the maximum eigenvalue of the Hessian over different runs of the regimes we experiment with, which indicates that the modification gives a roughly consistent estimate. The central takeaway is that across all models with which we experiment, we consistently find that the maximum eigenvalue of the Hessian for LB with diagonal Fisher is lower than that of LB and, in some cases, comparable to SB.
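The maximum Hessian eigenvalue reported in Fig. 5(b) can be estimated with power iteration on Hessian-vector products. Below is a generic PyTorch sketch of ours; the reparameterization corrections described above (folding batch-norm statistics into the parameters, temperature calibration) are assumed to have been applied to the model beforehand.

```python
import torch

def hessian_max_eigenvalue(loss, params, iters=50):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`
    via power iteration on Hessian-vector products (double backprop)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = None
    for _ in range(iters):
        norm = torch.sqrt(sum((u * u).sum() for u in v))
        v = [u / norm for u in v]
        gv = sum((g * u).sum() for g, u in zip(grads, v))
        Hv = torch.autograd.grad(gv, params, retain_graph=True)
        eig = sum((h * u).sum() for h, u in zip(Hv, v))  # Rayleigh quotient
        v = [h.detach() for h in Hv]
    return eig.item()
```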
5.4 Generalization Gap

In this last part of our empirical studies, we apply our methodology to address the "generalization gap" problem in stochastic optimization. Here, we experiment with four regimes: SB, LB, LB with Fisher Trace, and LB with diagonal Fisher. All regimes are trained for the same number of epochs. LB with Fisher Trace refers to LB with the noise
\[
\sqrt{\mathrm{Tr}(F(\theta_k))/d}\cdot\xi_k, \qquad \xi_k \sim \mathcal{N}(0, I_d),
\]
injected at every iteration; its purpose is to provide an experimental comparison which adds isotropic noise with the same variance as LB with diagonal Fisher. There is typically no optimal learning-rate scaling rule for large-batch training across different models and datasets (Shallue et al., 2018); hence, we tune the learning rate schedule to obtain optimal results for each method.

The final validation accuracy numbers are reported in Table 2. While it is true that using LB with diagonal Fisher cannot completely close the "generalization gap" in some cases, it yields definite improvements over LB within an epoch-training budget. Such a training budget typically favors small-batch training, as small batches perform more parameter updates (Shallue et al., 2018). This highlights that our approach is a data-efficient way to improve generalization for large-batch training.

Table 2: Validation accuracy on classification tasks for SB, LB, LB + Fisher Trace, and LB + diagonal Fisher. Results are averaged over 3 random seeds. All methods in each row are trained with the same number of epochs. While it is not feasible to run LB with full Fisher for all the models below, we note that it reaches roughly the same validation accuracy (93.22) as SB in the case of ResNet44 (CIFAR-10).

Dataset      Model       SB      LB      LB+Fisher Trace   LB+Diag
MNIST        MLP         98.10   97.95   98.08             98.10
MNIST        LeNet       99.10   98.88   99.02             99.11
FASHION      LeNet       91.15   88.89   90.29             90.79
CIFAR-10     AlexNet     87.80   86.42   N/A               87.61
CIFAR-100    AlexNet     59.21   56.79   N/A               59.10
CIFAR-10     VGG16       93.25   91.81   92.91             93.19
CIFAR-100    VGG16       72.83   69.45   71.35             72.11
CIFAR-10     ResNet44    93.42   91.93   92.33             92.88
CIFAR-100    ResNet44x2  75.55   73.13   73.77             74.26

In addition, we experimented with other regimes, such as injecting multiplicative Gaussian noise with constant covariance as in Hoffer et al. (2017) and replacing diagonal Fisher with the block-diagonal Kronecker-Factored Approximate Curvature (K-FAC) (Martens and Grosse, 2015), which corresponds to the matrix-variate Gaussian noise of Zhang et al. (2018). We delegate these results to Section A.8 of the Appendix.

6 Related Works

Variance and Optimization. In the context of large-scale learning, stochastic algorithms are very popular compared to full-batch methods due to their lower computational overhead (Bottou et al., 2018; Bottou, 1991). The tradeoff is that stochastic algorithms exhibit slower convergence asymptotically due to the inherent noise present in their gradients (Moulines and Bach, 2011; Bottou et al., 2018; Wen et al., 2018). For smooth and strongly-convex functions, variance reduction is a common technique to improve convergence rates (Johnson and Zhang, 2013; Defazio et al., 2014). In contrast, for non-convex optimization, increasing the variance by adding noise is often beneficial. Unlike the convex setting, the loss surface is much more complicated and there is an abundance of global minima (Choromanska et al., 2015). Adding noise can significantly improve training since it enables the dynamics to escape saddle points or shallow local minima (Ge et al., 2015; Jin et al., 2017). More specifically for deep learning, injecting annealed gradient noise has been shown to speed up the training of very deep neural networks (Neelakantan et al., 2015).

Variance and Generalization. The inherent noise in stochastic optimization methods is also conducive to generalization performance. There are vast bodies of literature devoted to this in deep learning; for example, scaling the learning rate or batch-size (Smith and Le, 2017; Goyal et al., 2017; Hoffer et al., 2017) to augment gradient noise in order to encourage better generalization. More direct approaches studying the covariance structure of mini-batch gradients have also been explored (Jastrzebski et al., 2017; Xing et al., 2018; Zhu et al., 2018; Li et al., 2015). A closely-related approach to ours is Stochastic Gradient Langevin Dynamics (SGLD) (Gelfand et al., 1992; Welling and Teh, 2011), a modification of SGD where annealed isotropic Gaussian noise is injected into the gradients. The recent systematic empirical study of Shallue et al. (2018) demonstrates that the links between optimization, generalization, and the choice of batch-size in SGD are extremely complex. It underscores the necessity of a more foundational understanding of the interaction between batch-sizes, model architectures, and other optimization metaparameters.

7 Conclusion

In this paper, we explored using covariance noise to design optimization algorithms for deep neural networks that could potentially exhibit ideal learning behavior. We proposed to add diagonal Fisher noise to large-batch gradient updates. Our empirical studies showed that this yields significant improvements in generalization while retaining desirable convergence performance. Furthermore, we demonstrated that the structure of the noise covariance matrix encodes much more information about optimization than the variance of gradients. An immediate question which arises is to better understand how the mathematical structure of the noise covariance matrix ties to generalization.
For example, in the special cases of diagonal Fisher and full Fisher, our experiments appear to indicate that the generalization performance is somewhat comparable. This seems to suggest that the diagonal elements contribute much more to the generalization performance than the off-diagonal elements. It is an interesting theoretical question as to why this may be the case.

References

Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pages 173–182, 2016.

Léon Bottou. Stochastic gradient learning in neural networks. In Proceedings of Neuro-Nîmes. EC2, 1991.

Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.

Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.

Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv preprint arXiv:1710.11029, 2017.

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.

Y. Chen, C. Jin, and B. Yu. Stability and convergence trade-off of iterative optimization algorithms. arXiv e-prints, April 2018.

A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. Journal of Machine Learning Research, 38:192–204, 2015.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

Crispin Gardiner. Stochastic Methods, volume 4. Springer Berlin, 2009.

Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.

Saul B. Gelfand, Peter C. Doerschuk, and Mohamed Nahhas-Mohandes. Theory and application of annealing algorithms for continuous optimization. In Proceedings of the 24th Conference on Winter Simulation, WSC '92, pages 494–499, New York, NY, USA, 1992. ACM.

Ian J. Goodfellow. Efficient per-example gradient computations. CoRR, abs/1510.01799, 2015.

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

Roger Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning, pages 573–582, 2016.
Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. CoRR, abs/1509.01240, 2015. URL http://arxiv.org/abs/1509.01240.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1731–1741, 2017.

Wenqing Hu, Chris Junchi Li, Lei Li, and Jian-Guo Liu. On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562, 2017.

Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.

Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724–1732, 2017.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.

Qianxiao Li, Cheng Tai, et al. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2015.

Qianli Liao, Brando Miranda, Andrzej Banburski, Jack Hidary, and Tomaso Poggio. A surprising linear relationship predicts test performance in deep networks. arXiv preprint arXiv:1807.09659, 2018.

Kevin Luk and Roger Grosse. A coordinate-free construction of scalable natural gradient, 2018.

Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research, 18(1):4873–4907, 2017.

James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417, 2015.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Wenlong Mou, Liwei Wang, Xiyu Zhai, and Kai Zheng. Generalization bounds of SGLD for non-convex learning: Two theoretical viewpoints. arXiv preprint arXiv:1707.05947, 2017.

Eric Moulines and Francis R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.

Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.

Grigorios A. Pavliotis. Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations, volume 60. Springer, 2014.

Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. CoRR, abs/1702.03849, 2017.

Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. In International Conference on Machine Learning, pages 343–351, 2013.
Christopher J. Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600, 2018.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Samuel L. Smith and Quoc V. Le. A Bayesian perspective on generalization and stochastic gradient descent. arXiv e-prints, October 2017.

Samuel L. Smith and Quoc V. Le. Understanding generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.

Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.

George E. Uhlenbeck and Leonard S. Ornstein. On the theory of the Brownian motion. Physical Review, 36(5):823, 1930.

Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.

Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, and Roger Grosse. Flipout: Efficient pseudo-independent weight perturbations on mini-batches. arXiv preprint arXiv:1803.04386, 2018.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Yuhuai Wu, Mengye Ren, Renjie Liao, and Roger B. Grosse. Understanding short-horizon bias in stochastic meta-optimization. arXiv preprint arXiv:1803.02021, 2018.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with SGD. arXiv preprint arXiv:1802.08770, 2018.

Yuichi Yoshida and Takeru Miyato. Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941, 2017.

Guodong Zhang, Shengyang Sun, David Duvenaud, and Roger Grosse. Noisy natural gradient as variational inference. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 5852–5861, 2018.

Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl, Christopher J. Shallue, and Roger Grosse. Which algorithmic choices matter at which batch sizes? Insights from a noisy quadratic model. In Advances in Neural Information Processing Systems, 2019.

Z. Zhu, J. Wu, B. Yu, L. Wu, and J. Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from minima and regularization effects. arXiv e-prints, 2018.
A SUPPLEMENTARY MATERIAL

A.1 Proof of Theorem 4.1

The proof of this theorem follows the spirit of Bottou et al. (2018). The algorithm
\[
\theta_{k+1} = \theta_k - \alpha_k \nabla L(\theta_k) + \alpha_k C\xi_{k+1}, \qquad \xi_{k+1} \sim \mathcal{N}(0, I_d), \tag{9}
\]
falls into the Robbins-Monro setting where the true gradient is perturbed by random noise. This perturbation can be considered a martingale difference in the sense that $\mathbb{E}[C\xi_{k+1} \mid \mathcal{F}_k] = 0$, where $(\mathcal{F}_k)_{k\in\mathbb{N}}$ is the increasing filtration generated by the sequence of parameters $(\theta_k)_{k\in\mathbb{N}}$. When the step-size is constant, $\alpha_k = \alpha$ for all $k$, the algorithm corresponds to the Euler discretization of a gradient flow with random perturbation.

We begin the proof by considering the equality
\[
L(\theta_{k+1}) = L(\theta_k) + \langle \nabla L(\theta_k), \theta_{k+1} - \theta_k\rangle + \frac{1}{2}(\theta_{k+1} - \theta_k)^\top \nabla^2 L(\theta_k)(\theta_{k+1} - \theta_k).
\]
Using the fact that $\nabla L(\theta_k) = A\theta_k$ and $\nabla^2 L(\theta_k) = A$, together with the definition of $\theta_{k+1}$, we can rewrite the above equation as
\[
L(\theta_{k+1}) = L(\theta_k) + \langle A\theta_k,\, -\alpha_k A\theta_k + \alpha_k C\xi_{k+1}\rangle + \frac{1}{2}\|\alpha_k A\theta_k - \alpha_k C\xi_{k+1}\|_A^2.
\]
Now, taking the conditional expectation $\mathbb{E}[\,\cdot \mid \mathcal{F}_k]$ on both sides of the equality, we obtain by independence of the noise $\xi_{k+1}$ from $\mathcal{F}_k$
\[
\mathbb{E}[L(\theta_{k+1}) \mid \mathcal{F}_k] = L(\theta_k) - \alpha_k \|A\theta_k\|_2^2 + \frac{\alpha_k^2}{2}\|A\theta_k\|_A^2 + \frac{\alpha_k^2}{2}\mathbb{E}[\|C\xi_{k+1}\|_A^2]. \tag{10}
\]
A simple computation shows
\[
\mathbb{E}[\|C\xi_{k+1}\|_A^2] = \mathbb{E}[(C\xi_{k+1})^\top A (C\xi_{k+1})] = \mathbb{E}[\xi_{k+1}^\top C^\top A C \xi_{k+1}] = \mathrm{Tr}(C^\top A C). \tag{11}
\]
Moreover, we have
\[
\|A\theta_k\|_A^2 \le \lambda_{\max} \|A\theta_k\|_2^2, \tag{12}
\]
where $\lambda_{\max}$ denotes the maximum eigenvalue of $A$ (and $\lambda_{\min}$ correspondingly denotes the minimum eigenvalue of $A$). This comes from the fact that for any symmetric, positive-definite matrix $B$,
\[
\|x\|_B^2 \le \lambda_{\max,B}\|x\|_2^2, \tag{13}
\]
where $\lambda_{\max,B}$ denotes the maximum eigenvalue of $B$. Using the results in Eqns. 11 and 12, as well as the assumption on the step-size schedule, $\alpha_k \le \alpha_0 \le \frac{1}{\lambda_{\max}}$ for all $k$, we rewrite Eqn. 10 as
\[
\mathbb{E}[L(\theta_{k+1}) \mid \mathcal{F}_k] \le L(\theta_k) + \left(\frac{\alpha_k}{2}\lambda_{\max} - 1\right)\alpha_k\|A\theta_k\|_2^2 + \frac{\alpha_k^2}{2}\mathrm{Tr}(C^\top A C)
\le L(\theta_k) - \frac{\alpha_k}{2}\|A\theta_k\|_2^2 + \frac{\alpha_k^2}{2}\mathrm{Tr}(C^\top A C). \tag{14}
\]
Now, taking $x = A\theta_k$ and $B = A^{-1}$ in Eqn. 13, we have
\[
\|A\theta_k\|_{A^{-1}}^2 \le \lambda_{\max,A^{-1}} \|A\theta_k\|_2^2.
\]
Note that $\lambda_{\max,A^{-1}} = \lambda_{\min}^{-1}$, so simplifying the above,
\[
\|A\theta_k\|_2^2 \ge \lambda_{\min}\|A\theta_k\|_{A^{-1}}^2 = \lambda_{\min}(\theta_k^\top A)A^{-1}(A\theta_k) = \lambda_{\min}\|\theta_k\|_A^2 = 2\lambda_{\min}L(\theta_k).
\]
Using the above inequality in Eqn. 14 and then taking expectations yields
\[
\mathbb{E}[L(\theta_{k+1})] \le (1 - \alpha_k\lambda_{\min})\,\mathbb{E}[L(\theta_k)] + \frac{\alpha_k^2}{2}\mathrm{Tr}(C^\top A C).
\]
We proceed by induction to prove the final result. By the definition of $\nu$, the result is obvious for $k = 0$. For the inductive step, suppose that the induction hypothesis holds for $k$, i.e., with
\[
\alpha_k = \frac{2}{(k+\gamma)\lambda_{\min}}, \qquad \mathbb{E}[L(\theta_k)] \le \frac{\nu}{k+\gamma}.
\]
We prove the $k+1$ case:
\[
\mathbb{E}[L(\theta_{k+1})] \le \left(1 - \frac{2}{k+\gamma}\right)\frac{\nu}{k+\gamma} + \frac{2}{(k+\gamma)^2\lambda_{\min}^2}\mathrm{Tr}(C^\top A C) \le \frac{\nu}{k+\gamma+1}.
\]
This follows from the definition of $\nu$ and the inequality $(k+\gamma-1)(k+\gamma+1) \le (k+\gamma)^2$. This concludes the proof.
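The key identity in Eqn. 11 is easy to sanity-check by Monte Carlo. A small NumPy verification of ours (dimensions and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
B = rng.standard_normal((d, d))
A = B @ B.T + d * np.eye(d)              # symmetric positive-definite
C = rng.standard_normal((d, d))

xi = rng.standard_normal((d, 200_000))   # columns are samples of xi
Y = C @ xi
mc = np.mean(np.einsum('ik,ij,jk->k', Y, A, Y))  # E[(C xi)^T A (C xi)]
print(mc, np.trace(C.T @ A @ C))         # the two values nearly agree
```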
A.2 Relationship Between Noise Covariance Structures and Generalization

As in the previous section, we work entirely in the convex quadratic setting. In this case, Eqn. 2 becomes
\[
\theta_{k+1} = \theta_k - \alpha_k \nabla L(\theta_k) + \alpha_k C\xi_k, \qquad \xi_k \sim \mathcal{N}(0, I_d). \tag{15}
\]
Our aim in this section is to provide some theoretical discussion of how the choice of covariance structure $C$ influences generalization behavior.

Uniform stability. Uniform stability (Bousquet and Elisseeff, 2002) is one of the most common techniques used in statistical learning theory to study the generalization of a learning algorithm. Intuitively speaking, uniform stability measures how sensitive an algorithm is to perturbations of the sampling data. The more stable an algorithm is, the better its generalization will be. Recently, uniform stability has been investigated for stochastic gradient methods (Hardt et al., 2015) and also for Stochastic Gradient Langevin Dynamics (SGLD) (Mou et al., 2017; Raginsky et al., 2017). We present the precise definition.

Definition A.1 (Uniform stability). A randomized algorithm $\mathcal{A}$ is $\epsilon$-stable if for all data sets $S$ and $S'$ which differ in at most one sample, we have
\[
\sup_{(x,y)} |\mathbb{E}_{\mathcal{A}}[L(\theta_S) - L(\theta_{S'})]| \le \epsilon,
\]
where $L(\theta_S)$ and $L(\theta_{S'})$ highlight the dependence of the parameters on the sampling datasets. The supremum is taken over input-target pairs $(x, y)$ belonging to the sample domain.

The following theorem from Bousquet and Elisseeff (2002) shows that uniform stability implies generalization.

Theorem A.2 (Generalization in expectation). Let $\mathcal{A}$ be a randomized algorithm which is $\epsilon$-uniformly stable. Then
\[
|\mathbb{E}_{\mathcal{A}}[\mathcal{E}_{\mathrm{gen}}]| \le \epsilon,
\]
where $\mathcal{E}_{\mathrm{gen}}$ is the expected generalization error as defined in Eqn. 1 of Section 2.

Continuous-time dynamics. We would like to use the uniform stability framework to analyze the generalization behavior of Eqn. 9. To do this, we borrow ideas from the recent work of Mou et al. (2017), which gives uniform stability bounds for SGLD in non-convex learning. While the authors of that work give uniform stability bounds in both the discrete-time and continuous-time settings, we work with the continuous setting since it conveys the relevant ideas while minimizing technical complications. The key takeaway from Mou et al. (2017) is that the uniform stability of SGLD may be bounded as
\[
\epsilon_{\mathrm{SGLD}} \le \sup_{S,S'} \sqrt{H^2(\pi_t, \pi_t')}. \tag{16}
\]
Here, $\pi_t$ and $\pi_t'$ are the distributions of the parameters $\theta$ trained on the datasets $S$ and $S'$, and $H^2$ denotes the squared Hellinger distance. We now proceed to mirror the approach of Mou et al. (2017) for Eqn. 9. Our usage of stochastic differential equations will be very soft, but we refer the reader to Gardiner (2009); Pavliotis (2014) for the necessary background. For the two datasets $S$ and $S'$, the continuous-time analogues of Eqn. 9 are the Ornstein-Uhlenbeck processes (Uhlenbeck and Ornstein, 1930):
\[
d\theta_S(t) = -A_S\theta_S(t)\,dt + \sqrt{\alpha}\,C_S\,dW(t), \qquad
d\theta_{S'}(t) = -A_{S'}\theta_{S'}(t)\,dt + \sqrt{\alpha}\,C_{S'}\,dW(t).
\]
The solution is given by
\[
\theta_S(t) = e^{-A_S t}\theta_S(0) + \sqrt{\alpha}\int_0^t e^{-A_S(t-u)}C_S\,dW(u).
\]
In fact, this yields the Gaussian distribution
\[
\theta_S(t) \sim \mathcal{N}(\mu_S(t), \Sigma_S(t)),
\]
where $\mu_S(t) = e^{-A_S t}\theta_S(0)$ and $\Sigma_S(t)$ satisfies the Riccati equation
\[
\frac{d}{dt}\Sigma_S(t) = -(A_S\Sigma_S(t) + \Sigma_S(t)A_S) + \alpha C_S C_S^\top.
\]
Observe that $A_S$ is symmetric and positive-definite, which means that it admits a diagonalization $A_S = P_S D_S P_S^{-1}$. Solving the equation for the covariance matrix gives
\[
\Sigma_S(t) = \alpha P_S\left(\int_0^t e^{-D_S(t-u)}\, P_S^{-1} C_S C_S^\top P_S^{-\top}\, e^{-D_S(t-u)}\,du\right) P_S^\top. \tag{17}
\]
We are now in a position to directly apply the framework of Mou et al. (2017). Choosing $\pi_t$ and $\pi_t'$ in Eqn. 16 to be the Gaussians $\mathcal{N}(\mu_S(t), \Sigma_S(t))$ and $\mathcal{N}(\mu_{S'}(t), \Sigma_{S'}(t))$ respectively, we obtain a uniform stability bound for Eqn. 9. We compute the right-hand side of the bound to obtain insights on generalization. Using the standard formula for the Hellinger distance between two Gaussians, we have
\[
H^2(\pi_t, \pi_t') = 1 - \frac{\det(\Sigma_S)^{1/4}\det(\Sigma_{S'})^{1/4}}{\det\!\left(\frac{\Sigma_S+\Sigma_{S'}}{2}\right)^{1/2}}\,\Lambda_{S,S'}, \tag{18}
\]
where
\[
\Lambda_{S,S'} = \exp\left\{-\frac{1}{8}(\mu_S - \mu_{S'})^\top\left(\frac{\Sigma_S+\Sigma_{S'}}{2}\right)^{-1}(\mu_S - \mu_{S'})\right\}.
\]

Choosing the noise covariance. From Eqn. 18 above, it is evident that to ensure good generalization error for Eqn. 9, we want to choose a covariance $C_S$ such that the Hellinger distance $H^2$ is minimized. Since we are working within the uniform stability framework, a good choice of $C_S$ should be one for which Eqn. 9 becomes less data-dependent. This is intuitive after all: the less data-dependent an algorithm is, the better suited it should be for generalization.

We study Eqn. 18. Note that as $t \to \infty$, the exponential term goes to 1. Hence, we focus our attention on the ratio of the determinants. Suppose that we choose $C_S = \sqrt{A_S}$, and note that $A_S$ is the Fisher in this convex quadratic example. Simplifying the determinant of $\Sigma_S(t)$ in this case,
\[
\det(\Sigma_S(t)) = \left(\frac{\alpha}{2}\right)^d \det(I_d - e^{-2D_S t}).
\]
Suppose instead that we choose $C = I_d$. Proceeding analogously,
\[
\det(\Sigma_S(t)) = \left(\frac{\alpha}{2}\right)^d \frac{\det(I_d - e^{-2D_S t})}{\det(D_S)}.
\]
We can think of choosing $C = I_d$ or $C = \sqrt{A}$ as extreme cases, and it is interesting to observe that the Hellinger distance is more sensitive to dataset perturbation when $C = I_d$. Our proposed method in this paper chooses $C = \sqrt{\mathrm{diag}(A)}$, and our experiments seem to suggest that the square-root of the diagonal captures much of the generalization behavior of full Fisher. Understanding precisely why this is the case poses an interesting research direction to pursue in the future.

A simple scaling argument also highlights the importance of the trade-off between optimization and generalization. Consider $C_\lambda = \lambda C$. Theorem 4.1 then suggests taking $\lambda$ small to reduce the variance and improve convergence. However, in that case $\Sigma_\lambda = \lambda^2\Sigma$, where $\Sigma$ is given by Eqn. 17 for $C$, and
\[
H^2(\pi_t, \pi_t') = 1 - \frac{\det(\Sigma_S)^{1/4}\det(\Sigma_{S'})^{1/4}}{\det\!\left(\frac{\Sigma_S+\Sigma_{S'}}{2}\right)^{1/2}}\,\Lambda_{S,S',\lambda},
\]
where
\[
\Lambda_{S,S',\lambda} = \exp\left\{-\frac{1}{8\lambda^2}(\mu_S - \mu_{S'})^\top\left(\frac{\Sigma_S+\Sigma_{S'}}{2}\right)^{-1}(\mu_S - \mu_{S'})\right\}.
\]
The Hellinger distance gets close to one in the limit of small $\lambda$ (which intuitively corresponds to the large-batch situation).
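Eqn. 18 is the standard closed form for the squared Hellinger distance between two Gaussians; a direct NumPy transcription of ours, useful for checking how $H^2$ varies with the means and covariances:

```python
import numpy as np

def hellinger_sq(mu1, S1, mu2, S2):
    """Squared Hellinger distance between N(mu1, S1) and N(mu2, S2),
    the quantity appearing in Eqn. 18."""
    Sm = 0.5 * (S1 + S2)
    ratio = (np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25
             / np.linalg.det(Sm) ** 0.5)
    dmu = mu1 - mu2
    expo = np.exp(-0.125 * dmu @ np.linalg.solve(Sm, dmu))
    return 1.0 - ratio * expo
```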
A.3 Fisher Information Matrix for Deep Neural Networks

In this section, we give a formal description of the Fisher information matrix for both feed-forward networks and convolutional networks. In addition, we give the diagonal expression for both. Note that these expressions are valid for both the empirical and the exact Fisher; in the empirical case, the expectation is taken over the empirical data distribution, whereas in the exact case, the expectation over targets $y$ is taken with respect to the model's predictive distribution.

A.4 Feed-forward networks

Consider a feed-forward network with $L$ layers. At each layer $i \in \{1, \ldots, L\}$, the network computation is given by
\[
z_i = W_i a_{i-1}, \qquad a_i = \phi_i(z_i),
\]
where $a_{i-1}$ is an activation vector, $z_i$ is a pre-activation vector, $W_i$ is the weight matrix, and $\phi_i : \mathbb{R} \to \mathbb{R}$ is a nonlinear activation function applied coordinate-wise. Let $w$ be the parameter vector of the network obtained by vectorizing and then concatenating all the weight matrices $W_i$,
\[
w = [\mathrm{vec}(W_1)^\top\ \mathrm{vec}(W_2)^\top\ \ldots\ \mathrm{vec}(W_L)^\top]^\top.
\]
Furthermore, let $\mathcal{D}v = \nabla_v \log p(y|x,w)$ denote the log-likelihood gradient. Using backpropagation, we have a decomposition of the log-likelihood gradient $\mathcal{D}W_i$ into the outer product
\[
\mathcal{D}W_i = g_i a_{i-1}^\top,
\]
where $g_i = \mathcal{D}z_i$ are the pre-activation derivatives. The Fisher matrix $F(w)$ of this feed-forward network is an $L \times L$ block matrix where each $(i,j)$ block is given by
\[
F_{i,j}(w) = \mathbb{E}[\mathrm{vec}(\mathcal{D}W_i)\,\mathrm{vec}(\mathcal{D}W_j)^\top] = \mathbb{E}[a_{i-1}a_{j-1}^\top \otimes g_i g_j^\top]. \tag{19}
\]

Diagonal version. We give an expression for the diagonal of $F_{i,i}(w)$ here; the diagonal of $F(w)$ follows immediately. Let $a_{i-1}^2$ and $g_i^2$ be the element-wise squares of $a_{i-1}$ and $g_i$ respectively. Then, in vectorized form,
\[
\mathrm{diag}(F_{i,i}(w)) = \mathbb{E}[\mathrm{vec}((a_{i-1}^2)(g_i^2)^\top)],
\]
where $(a_{i-1}^2)(g_i^2)^\top$ is the outer product of $a_{i-1}^2$ and $g_i^2$.
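In code, this diagonal block is a batched outer product of squared activations and squared pre-activation derivatives. A PyTorch sketch for one layer, as a Monte-Carlo average over a mini-batch (our illustration rather than the paper's implementation):

```python
import torch

def layer_diag_fisher(a_prev, g):
    """Estimate diag(F_{i,i}) = E[(a_{i-1}^2)(g_i^2)^T] for one layer.

    a_prev: (batch, fan_in) input activations a_{i-1}
    g:      (batch, fan_out) pre-activation derivatives g_i
    Returns a (fan_out, fan_in) tensor shaped like the weight matrix W_i,
    since per example DW_i = g_i a_{i-1}^T, so (DW_i)^2 = (g_i^2)(a_{i-1}^2)^T.
    """
    return torch.einsum('bi,bj->ij', g ** 2, a_prev ** 2) / a_prev.shape[0]
```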
A.5 Convolutional networks

In order to write down the Fisher matrix for convolutional networks, it suffices to consider only the convolution layers, as the pooling and response normalization layers typically do not contain (many) trainable weights. We focus our analysis on a single layer. Much of the presentation here follows Grosse and Martens (2016); Luk and Grosse (2018).

A convolution layer $l$ takes as input a layer of activations $a_{j,t}$, where $j \in \{1, \ldots, J\}$ indexes the input map and $t \in \mathcal{T}$ indexes the spatial location. Here $\mathcal{T}$ denotes the set of spatial locations, which we typically take to be a 2D grid. We assume that the convolution is performed with a stride of 1 and padding equal to the kernel radius $R$, so that the set of spatial locations is shared between the input and output feature maps. The layer is parameterized by a set of weights $w_{i,j,\delta}$, where $i \in \{1, \ldots, I\}$ indexes the output map and $\delta \in \Delta$ indexes the spatial offset. The numbers of spatial locations and spatial offsets are denoted $|\mathcal{T}|$ and $|\Delta|$ respectively. The computation of the convolution layer is given by
\[
z_{i,t} = \sum_{\delta\in\Delta} w_{i,j,\delta}\,a_{j,t+\delta}. \tag{20}
\]
The pre-activations $z_{i,t}$ are then passed through a nonlinear activation function $\phi_l$. The log-likelihood derivatives of the weights are computed through backpropagation:
\[
\mathcal{D}w_{i,j,\delta} = \sum_{t\in\mathcal{T}} a_{j,t+\delta}\,\mathcal{D}z_{i,t}.
\]
The Fisher matrix here is then
\[
\mathbb{E}\left[\left(\sum_{t\in\mathcal{T}} a_{j,t+\delta}\,\mathcal{D}z_{i,t}\right)\left(\sum_{t'\in\mathcal{T}} a_{j',t'+\delta'}\,\mathcal{D}z_{i',t'}\right)\right].
\]

Diagonal version. To give the diagonal version, it is convenient to express the computation of the convolution layer in matrix notation. First, we represent the activations $a_{j,t}$ as a $J \times |\mathcal{T}|$ matrix $A_{l-1}$, the pre-activations $z_{i,t}$ as an $I \times |\mathcal{T}|$ matrix $Z_l$, and the weights $w_{i,j,\delta}$ as an $I \times J|\Delta|$ matrix $W_l$. Furthermore, by extracting the patches surrounding each spatial location $t \in \mathcal{T}$ and flattening these patches into column vectors, we can form a $J|\Delta| \times |\mathcal{T}|$ matrix $A^{\mathrm{exp}}_{l-1}$, which we call the expanded activations. Then the computation in Eqn. 20 can be reformulated as the matrix multiplication
\[
Z_l = W_l A^{\mathrm{exp}}_{l-1}.
\]
Readers familiar with convolutional networks will immediately recognize this as the Conv2D operation. At a specific spatial location $t \in \mathcal{T}$, consider the $J|\Delta|$-dimensional column vector of $A^{\mathrm{exp}}_{l-1}$ and the $I$-dimensional column vector of $Z_l$, denoted $a^{(:,t)}_{l-1}$ and $z^{(t)}_l$ respectively. The matrix $W_l$ maps $a^{(:,t)}_{l-1}$ to $z^{(t)}_l$. In this case, we find ourselves in exactly the same setting as the feed-forward case given earlier. The diagonal is simply
\[
\mathbb{E}\left[\mathrm{vec}\!\left((a^{(:,t)}_{l-1})^2\,\big((\mathcal{D}z^{(t)}_l)^2\big)^{\!\top}\right)\right].
\]

A.6 Kronecker-Factored Approximate Curvature (K-FAC)

Later, in Section A.8, we compare the diagonal approximation of the Fisher matrix to the Kronecker-factored approximate curvature (K-FAC) (Martens and Grosse, 2015) approximation of the Fisher matrix. We give a brief overview of the K-FAC approximation in the case of feed-forward networks. Recall that the Fisher matrix for a feed-forward network is an $L \times L$ block matrix whose $(i,j)$ blocks are given by Eqn. 19. Consider the diagonal $(i,i)$ blocks. If we approximate the activations $a_{i-1}$ and pre-activation derivatives $g_i$ as statistically independent, we have
\[
F_{i,i}(w) = \mathbb{E}[\mathrm{vec}(\mathcal{D}W_i)\,\mathrm{vec}(\mathcal{D}W_i)^\top] = \mathbb{E}[a_{i-1}a_{i-1}^\top \otimes g_i g_i^\top] \approx \mathbb{E}[a_{i-1}a_{i-1}^\top] \otimes \mathbb{E}[g_i g_i^\top].
\]
Let $A_{i-1} = \mathbb{E}[a_{i-1}a_{i-1}^\top]$ and $G_i = \mathbb{E}[g_i g_i^\top]$. The K-FAC approximation $\hat{F}$ of the Fisher matrix $F$ is the block-diagonal matrix
\[
\hat{F} = \mathrm{blockdiag}\left(A_0 \otimes G_1,\ A_1 \otimes G_2,\ \ldots,\ A_{L-1} \otimes G_L\right).
\]
The K-FAC approximation of the Fisher matrix can be summarized as follows: (1) keep only the diagonal blocks corresponding to individual layers, and (2) make the probabilistic modeling assumption that the activations and pre-activation derivatives are statistically independent.

Table 3: Validation accuracy on classification tasks using BatchChange, Multiplicative, K-FAC and Fisher Trace. Results are averaged over 3 random seeds. For the reader's convenience, we report again the results of Diag-F.

Dataset      Model       SB      BatchChange   Multiplicative   K-FAC   Fisher Trace   Diag-F
CIFAR-10     VGG16       93.25   93.18         90.98            93.06   92.91          93.19
CIFAR-100    VGG16       72.83   72.44         68.77            71.86   71.35          72.11
CIFAR-10     ResNet44    93.42   93.02         91.28            92.81   92.33          92.88
CIFAR-100    ResNet44x2  75.55   75.16         71.98            73.84   73.77          74.26

A.7 Supplementary Experiment Details

Learning rate: We tuned the learning rate schedule for each method in Table 2 of Section 5 to obtain the best performance. As a result, for both LB and LB with diagonal Fisher, we scale up the learning rate and use a linear warmup strategy over the first 10 epochs. For LB, the optimal learning rate on CIFAR-10 and CIFAR-100 with ResNet44 is 3.2, while it is 1.6 for LB with diagonal Fisher. With the VGG16 network on CIFAR-10 and CIFAR-100, the optimal learning rate is 1.6 for both methods. We decay the learning rate by 0.1 at epochs 100 and 150 for all of the above methods.

Noise Termination: For all training regimes involving noise injection, we found that terminating the noise a quarter of the way through the training trajectory and using standard LB for the remainder of training achieves the best performance. This finding is consistent with the BatchChange result in Table 3, which suggests that noise only helps generalization at the beginning of training.
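The schedule just described can be written compactly. A sketch with the numbers reported above (the peak value is per-model; 1.6 is the VGG16 setting and 3.2 the LB ResNet44 setting):

```python
def lr_schedule(epoch, peak_lr=1.6, warmup_epochs=10, decay_epochs=(100, 150)):
    """Linear warmup over the first 10 epochs, then 0.1x decays at
    epochs 100 and 150, as described under 'Learning rate' above."""
    if epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs
    return peak_lr * 0.1 ** sum(epoch >= e for e in decay_epochs)
```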
A.8 Validation Accuracy Results

We provide additional validation accuracy results to complement Table 2 of Section 5. The additional regimes are:

• BatchChange: We use SB for the first 50 epochs and then LB for the remainder. This experimental setup was inspired by Smith et al. (2017).

• Multiplicative: We multiply the gradients by Gaussian noise with a constant diagonal covariance structure. This experimental setup was inspired by Hoffer et al. (2017).

• K-FAC: Instead of choosing diagonal Fisher as the noise covariance structure, we use the block-diagonal approximation of the Fisher given by K-FAC.

• Fisher Trace: Instead of choosing diagonal Fisher as the noise covariance structure, we use the square-root of the trace of the Fisher, $\sqrt{\mathrm{Tr}(F(\theta_k))}$.

The results are reported in Table 3 above.

A.9 Sampling Full Fisher Noise

Sampling True Fisher Random Vectors. We describe a method to sample a random vector with true Fisher covariance efficiently. We obtain the prediction $f(x, \theta)$ by a forward pass. If we randomly draw labels from the model's predictive distribution and obtain the back-propagated gradients $\nabla_\theta L$, then we have
\[
\mathrm{Cov}(\nabla_\theta L, \nabla_\theta L) = \mathbb{E}_x[J_f^\top H_L J_f],
\]
which is the exact true Fisher (Martens, 2014). Here, $J_f$ is the Jacobian of the outputs with respect to the parameters and $H_L$ is the Hessian of the loss function with respect to the outputs.

Sampling Empirical Fisher Random Vectors. Let $M$ be the size of the mini-batch; from $M$ forward passes we obtain the back-propagated gradients $\nabla\ell_1, \ldots, \nabla\ell_M$ for each data point. Consider independent random variables $\sigma_1, \ldots, \sigma_M$ drawn from the Rademacher distribution, i.e., $P(\sigma_i = 1) = P(\sigma_i = -1) = \frac{1}{2}$. Then the mean $\mathbb{E}_\sigma[\sum_{i=1}^M \sigma_i\nabla\ell_i] = 0$, and the covariance is the empirical Fisher.
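A sketch of the Rademacher construction follows. The $1/\sqrt{M}$ scaling is our normalization, under which the covariance of the returned vector is exactly the empirical Fisher $\frac{1}{M}\sum_i \nabla\ell_i\nabla\ell_i^\top$.

```python
import torch

def sample_empirical_fisher_noise(per_example_grads):
    """Draw v with E[v] = 0 and Cov(v) = (1/M) sum_i grad_i grad_i^T.

    per_example_grads: (M, d) tensor of back-propagated per-example
    gradients; sigma_i are i.i.d. Rademacher signs as described above.
    """
    M = per_example_grads.shape[0]
    sigma = (torch.randint(0, 2, (M,), device=per_example_grads.device)
             .to(per_example_grads.dtype) * 2 - 1)
    # Independence of the sigma_i gives the stated mean and covariance.
    return per_example_grads.t() @ sigma / M ** 0.5
```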