Explaining Neural Scaling Laws

Yasaman Bahri*,1, Ethan Dyer*,1, Jared Kaplan*,2, Jaehoon Lee*,1, and Utkarsh Sharma*,†,2

1 Google DeepMind, Mountain View, CA
2 Department of Physics and Astronomy, Johns Hopkins University

yasamanbahri@gmail.com, edyer@google.com, jaredk@jhu.edu, jaehlee@google.com, usharma7@jhu.edu

arXiv:2102.06701v2 [cs.LG] 29 Apr 2024

Abstract

The population loss of trained deep neural networks often follows precise power-law scaling relations with either the size of the training dataset or the number of parameters in the network. We propose a theory that explains the origins of and connects these scaling laws. We identify variance-limited and resolution-limited scaling behavior for both dataset and model size, for a total of four scaling regimes. The variance-limited scaling follows simply from the existence of a well-behaved infinite data or infinite width limit, while the resolution-limited regime can be explained by positing that models are effectively resolving a smooth data manifold. In the large width limit, this can be equivalently obtained from the spectrum of certain kernels, and we present evidence that large width and large dataset resolution-limited scaling exponents are related by a duality. We exhibit all four scaling regimes in the controlled setting of large random feature and pretrained models and test the predictions empirically on a range of standard architectures and datasets. We also observe several empirical relationships between datasets and scaling exponents under modifications of task and architecture aspect ratio. Our work provides a taxonomy for classifying different scaling regimes, underscores that there can be different mechanisms driving improvements in loss, and lends insight into the microscopic origins of and relationships between scaling exponents.

For a large variety of models and datasets, neural network performance has been empirically observed to scale as a power law with model size and dataset size [1–4]. These exponents determine how quickly performance, as measured by the population loss, improves with more data and larger models. We would like to understand why these power laws emerge. For example, what features of the data and models determine the values of the power-law exponents? Is there a taxonomy behind scaling regimes, where regimes are governed by different underlying mechanisms? Are dataset and model size scaling exponents related in any way? And finally, which aspects of scaling behavior might exhibit universal signatures, and which aspects are strongly dependent on the "microscopic" aspects of the problem? A theoretically and empirically grounded understanding of these questions could provide guidance for machine learning in the modern era of large models and training data.

In this work, we present a theoretical framework for understanding scaling laws in trained deep neural networks. We identify four related scaling regimes with respect to the number of model parameters P and the dataset size D. With respect to each of D, P, we define both a variance-limited regime and a resolution-limited regime.

* All authors contributed to all aspects of this work.
† A portion of work completed during an internship at Google.
Figure 1: (a) Four scaling regimes. Here we exhibit the four regimes we focus on in this work. (top-left, bottom-right) Variance-limited scaling of underparameterized models with dataset size and overparameterized models with number of parameters (width) exhibit universal scaling (α_D = α_W = 1) independent of the architecture or underlying dataset. (top-right, bottom-left) Resolution-limited overparameterized models with dataset size or underparameterized models with model size exhibit scaling with exponents that depend on the details of the data distribution. These four regimes are also found in random feature (Figure 2a) and pretrained models (see supplement). (b) Resolution-limited models interpolate the data manifold. Linear interpolation between two training points in a four-dimensional input space (top). We show a teacher model and four student models, each trained on different-sized datasets. In all cases teacher and student approximately agree on the training endpoints, but as the training set size increases they increasingly match everywhere. (bottom) We show 4/α_D versus the data manifold dimension (input dimension for teacher-student models, intrinsic dimension for standard datasets). We find that the teacher-student models follow the predicted relation d = 4/α_D (dark dashed line), while the relationship for a four-layer convolutional neural network (solid) and Wide ResNet architecture (hollow) on standard datasets is less clear.

1 Scaling laws for neural networks

1.1 Variance-limited regime
In the limit of infinite data or an arbitrarily wide model, some aspects of neural network training simplify. Specifically, if we fix one of D, P and study scaling with respect to the other parameter as it becomes arbitrarily large, then the difference between the finite test loss and its limiting value scales as 1/x, i.e. as a power law with exponent 1, with x = D or x = √P ∝ width in deep networks, and x = D or P in linear models.

1.2 Resolution-limited regime
In this regime, one of D or P is effectively infinite, and we study scaling as the other parameter increases. In this case, a variety of works have empirically observed power-law scalings 1/x^α, typically with 0 < α < 1 for both x = P or D. We derive exponents in this regime precisely in the setting of random feature models (cf. the next section). Empirically, we find that our theoretical predictions for exponents hold in pretrained, fine-tuned models even though these lie outside our theoretical setting. For more general nonlinear models, we propose a refinement of naive bounds into estimates via expansions that hold asymptotically. These rely on the idea that additional data (in the infinite model-size limit) or added model parameters (in the infinite data limit) are used by the model to carve up the data manifold into smaller components. For smooth manifolds, losses, and networks, the test loss will depend on the linear size of a sub-region, while it is the d-dimensional sub-region volume that scales inversely with P or D, giving rise to α ∝ 1/d. (A visualization of this successively better approximation with dataset size is shown in Figure 1b for models trained to predict data generated by a random fully-connected network.) To test this empirically, we make measurements of the resolution-limited exponents in neural networks and of the intrinsic dimension of the data manifold, shown in Figure 1b.
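In practice, exponents such as α_D are read off from measured losses by a straight-line fit in log-log space. The sketch below is ours, with synthetic numbers standing in for measured test losses:

```python
import numpy as np

# Minimal sketch of exponent extraction: fit a line to log(loss) versus
# log(D) over the apparent power-law range.  The "loss" values here are
# synthetic stand-ins, generated from a known exponent of 0.40.
rng = np.random.default_rng(0)
D = np.array([1e3, 2e3, 4e3, 8e3, 1.6e4, 3.2e4])
loss = 2.5 * D ** -0.40 * np.exp(0.03 * rng.standard_normal(D.size))

slope, _ = np.polyfit(np.log(D), np.log(loss), 1)
print(f"alpha_D ~ {-slope:.2f}")
# In the variance-limited regimes one instead fits log(L - L_inf), where
# L_inf is an estimate of the infinite-data or infinite-width loss.
```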
The decay of test error in ridge regression under certain settings has been studied in prior work, including [20–22]. During the completion of this work, [23] presented a specific solvable model of learning exhibiting nontrivial power-law scaling for power-law (Zipf) distributed features. This does not directly relate to the settings studied in this work, or present bounds that supersede our results. Concurrent to our work, [11] presented a derivation of the resolution-limited scaling with dataset size, also stemming from nearest-neighbor distance scaling on data manifolds. However, they do not discuss requirements on model versus dataset size or how this scaling behavior fits into other asymptotic scaling regimes.
A few recent works, appearing after the completion of this manuscript, also investigate the scaling of test error in related settings. [24] studies the decay of test error with dataset size for kernel regression in a high-dimensional limit with Gaussian design. [25] further examines a teacher-student framework similar to ours, deriving joint scaling laws using techniques from random matrix theory. Finally, [26] theoretically examines kernel regression through classical statistical estimators and random matrix theory.
In the variance-limited regime, scaling laws in the context of random feature models [27–29], deep linear models [30, 31], one hidden layer networks [32–34], and wide neural networks treated as Gaussian processes or trained in the NTK regime [16, 17, 35, 36] have been studied. In particular, this behavior was used in [2] to motivate a particular ansatz for simultaneous scaling with data and model size. The resolution-limited analysis can perhaps be viewed as an attempt to quantify the ideal-world generalization error of [37].
This work makes use of classic results connecting the spectrum of a smooth kernel to the geometry it is defined over [38–41] and on the scaling of iteratively refined approximations to smooth manifolds [42–44].

2 Four scaling regimes
Throughout this work we will be interested in how the average test loss L(D, P) depends on the dataset size D and the number of model parameters P. Unless otherwise noted, L denotes the test loss averaged over initialization of the parameters and draws of a size-D training set. Some of our results only pertain directly to the scaling with width w ∝ √P, but we expect many of the intuitions apply more generally. We use the notation α_D, α_P, and α_W to indicate scaling exponents with respect to dataset size, parameter count, and width. All proofs appear in the supplement.

2.1 Variance-limited exponents
In the limit of large D the outputs of an appropriately trained network approach a limiting form, with corrections which scale as D^{-1}. Similarly, recent work shows that wide networks have a smooth large-P limit [15], where fluctuations scale as 1/√P. If the loss is sufficiently smooth, then its value will approach the asymptotic loss with corrections proportional to the variance (1/D or 1/√P). In Theorem 1 we present sufficient conditions on the loss to ensure this variance-dominated scaling. We note that these conditions are satisfied by mean squared error and cross-entropy loss, though we conjecture the result holds even more generally.

Theorem 1. Let ℓ(f) be the test loss as a function of network output (L = E[ℓ(f)]), and let f_T be the network output after T training steps, thought of as a random variable over weight initialization, draws of the training dataset, and optimization seed. Further let f_T be concentrating, with E[(f_T − E[f_T])^k] = O(ϵ) for all k ≥ 2. If ℓ is a finite-degree polynomial, or has bounded second derivative, or is 2-Hölder, then E[ℓ(f_T)] − ℓ(E[f_T]) = O(ϵ).
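A small Monte Carlo illustration of Theorem 1 (our sketch, not the paper's code): model the trained output f_T as concentrating around its mean with variance ϵ and use a smooth, logistic ("cross-entropy-like") test loss; the gap E[ℓ(f_T)] − ℓ(E[f_T]) is then proportional to ϵ.

```python
import numpy as np

rng = np.random.default_rng(0)

def ell(f):
    return np.log1p(np.exp(-f))          # smooth loss with bounded second derivative

f_mean = 0.7                             # stand-in for E[f_T]
for eps in [1e-1, 1e-2, 1e-3]:
    # f_T concentrates around its mean with variance eps
    f_T = f_mean + np.sqrt(eps) * rng.standard_normal(1_000_000)
    gap = ell(f_T).mean() - ell(f_mean)
    print(f"eps={eps:.0e}  gap={gap:.2e}  gap/eps={gap / eps:.3f}")
# gap/eps settles near ell''(f_mean)/2 (up to Monte Carlo noise), i.e. the
# deviation of the mean loss from the loss of the mean output is O(eps),
# as the theorem states.
```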
2.1.1 Dataset scaling
Consider a neural network and its associated training loss L_train(θ). For every value of the weights, the training loss, thought of as a random variable over draws of a training set of size D, concentrates around the population loss with a variance which scales as O(D^{-1}). This is because, if the optimization procedure is sufficiently smooth, the trained weights, network output, and higher moments will approach their infinite-D values, with E_D[(f_T − E_D[f_T])^k] = O(D^{-1}). Here the subscript D on the expectation indicates an average over draws of the training set. This scaling, together with Theorem 1, gives the variance-limited scaling of the loss with dataset size.
This concentration result with respect to dataset size has appeared for linear models in [27] and for single hidden layer networks with high-dimensional input data in [32–34]. In the supplement, we prove this for GD and SGD with polynomial loss, as well as present informal arguments more generally. Additionally, we present examples violating the smoothness assumption and exhibiting different scaling.

2.1.2 Large width scaling
We can make a very similar argument in the w → ∞ limit. It has been shown that the predictions from an infinitely wide network, either under Bayesian inference [12, 13] or when trained via gradient descent [15, 16], approach a limiting distribution at large width, equivalent to a linear model with random features. Furthermore, corrections to the infinite width behavior are controlled by the variance of the full model around the linear model predictions. This variance (and higher moments) has been shown to scale as 1/w [17, 35, 45], E_w[(f_T − E_w[f_T])^k] = O(w^{-1}). Theorem 1 then implies the loss will differ from its w = ∞ limit by a term proportional to 1/w.
We note that there has also been work studying the combined large depth and large width limit, where [46] found a well-defined infinite size limit with controlled fluctuations in randomly initialized deep neural networks. In any such context where the trained model predictions concentrate, we expect the loss to scale with the variance of the model output. In the case of linear models, studied below, the variance is O(P^{-1}) rather than O(P^{-1/2}), and we see the associated variance scaling in this case.

2.2 Resolution-limited exponents
In this section we consider training and test data drawn uniformly from a compact d-dimensional manifold, x ∈ M_d, and targets given by some smooth function y = F(x) on this manifold.

2.2.1 Overparameterized dataset scaling
Consider the double limit of an overparameterized model with large training set size, P ≫ D ≫ 1. We further consider well-trained models, i.e. models that interpolate all training data. The goal is to understand L(D). If we assume that the learned model f is sufficiently smooth, then the dependence of the loss on D can be bounded in terms of the dimension of the data manifold M_d.
Informally, if our train and test data are drawn i.i.d. from the same manifold, then the distance from a test point to the closest training data point decreases as we add more and more training data points. In particular, this distance scales as O(D^{-1/d}) [47]. Furthermore, if f, F are both sufficiently smooth, they cannot differ too much over this distance. If in addition the loss function, L, is a smooth function vanishing when f = F, we have L = O(D^{-1/d}). This is summarized in the following theorem.

Theorem 2. Let L(f), f, and F be Lipschitz with constants K_L, K_f, and K_F. Further let 𝒟 be a training dataset of size D sampled i.i.d. from M_d, and let f(x) = F(x) for all x ∈ 𝒟. Then L(D) = O(K_L max(K_f, K_F) D^{-1/d}).

2.2.2 Underparameterized parameter scaling
We will again assume that F varies smoothly on an underlying compact d-dimensional manifold M_d. We can obtain a bound on L(P) if we imagine that f approximates F as a piecewise function with roughly P regions (see [10]). Here, we instead make use of the argument from the overparameterized, resolution-limited regime above. If we construct a sufficiently smooth estimator for F by interpolating among P randomly chosen points from the (arbitrarily large) training set, then by the argument above the loss will be bounded by O(P^{-1/d}).

Theorem 3. Let L(f), f, and F be Lipschitz with constants K_L, K_f, and K_F. Further let f(x) = F(x) for P points sampled i.i.d. from M_d. Then L(P) = O(K_L max(K_f, K_F) P^{-1/d}).

Figure 2: (a) Random feature models exhibit all four scaling regimes. Here we consider linear teacher-student models with random features trained with MSE loss to convergence. We see both variance-limited scaling (top-left, bottom-right) and resolution-limited scaling (top-right, bottom-left). Data is varied by downsampling MNIST by the specified pool size. (b) Duality and spectra in random feature models. Here we show the relation between the decay of the kernel spectra, α_K, and the scaling of the loss with number of data points, α_D, and with number of parameters, α_P. (top) The spectra of kernels derived from random fully-connected deep neural networks on pooled MNIST appear well described by a power-law decay. (bottom) The theoretical relation α_D = α_P = α_K is given by the dashed black line.
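The geometric fact underlying Theorems 2 and 3 is easy to check numerically. Below is an illustrative sketch (ours, using the unit hypercube as a stand-in manifold): the typical nearest-neighbor distance from a test point to the training set shrinks as D^{-1/d}.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_nn_distance(D, d, n_test=256):
    train = rng.random((D, d))
    test = rng.random((n_test, d))
    # distance from each test point to its nearest training point
    dists = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=-1)
    return dists.min(axis=1).mean()

d = 4
Ds = np.array([250, 500, 1000, 2000, 4000, 8000])
nn = np.array([mean_nn_distance(D, d) for D in Ds])
slope = np.polyfit(np.log(Ds), np.log(nn), 1)[0]
print(f"fitted exponent {slope:.2f}   vs   predicted -1/d = {-1/d:.2f}")
```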
2.2.3 From bounds to estimates
Theorems 2 and 3 are phrased as bounds, but we expect the stronger statement that these bounds also generically serve as estimates, so that e.g. L(D) = Ω(D^{-c/d}) for c ≥ 2, and similarly for parameter scaling. If we assume that F and f are analytic functions on M_d, and that the loss function L(f, F) is analytic in f − F and minimized at f = F, then the loss at a given test input, x_test, can be expanded around the nearest training point, x̂_train,

    L(x_test) = \sum_{m=n}^{\infty} a_m(x̂_train) (x_test − x̂_train)^m ,

where the first term is of finite order n ≥ 2 because the loss vanishes at the training point. (For simplicity we have used a very compressed notation for the multi-tensor contractions in the higher-order terms.) As the typical distance between nearest-neighbor points scales as D^{-1/d} on a d-dimensional manifold (an observation also made in [18]), the loss will be dominated by the leading term, L ∝ D^{-n/d}, at large D. Note that if the model provides an accurate piecewise linear approximation, we will generically find n ≥ 4.

2.3 Explicit realization in linear random feature models
In the preceding sections we have conjectured typical-case scaling relations for a model's test loss. We have further given intuitive arguments for this behavior which relied on smoothness assumptions on the loss and training procedure. In this section, we provide a concrete realization of all four scaling regimes within the context of linear models constructed from random features. Of particular interest is the resolution-limited regime, where the scaling of the loss is a consequence of the linear model kernel spectrum – the scaling of overparameterized models with dataset size and underparameterized models with parameters is a consequence of a classic result, originally due to [38], bounding the spectrum of sufficiently smooth kernel functions by the dimension of the manifold they act on.
Linear predictors serve as a model system for learning. Such models are used frequently in practice when more expressive models are unnecessary or infeasible [48–50] and also serve as an instructive test bed to study training dynamics [28, 31, 51]. Furthermore, in the large width limit, deep neural networks behave as Gaussian processes [12–14, 52–54], and in the low learning rate regime of gradient-descent optimization [16, 55, 56], deep neural networks behave as a particular class of linear models [15, 16, 57]. Hence, linear predictors constructed from random features provide an accurate description of deep neural networks in the large width limit. Here we discuss linear models in general terms, though the results immediately hold for the special cases of wide, deep neural networks.
We focus on teacher-student models, in which the teacher generates samples from which the student model learns. We will assume student weights initialized to zero and trained with mean squared error (MSE) loss to their global optimum. We consider a linear teacher F and student f,

    F(x) = \sum_{M=1}^{S} ω_M F_M(x) ,  \qquad  f(x) = \sum_{\mu=1}^{P} θ_\mu f_\mu(x) .    (1)

Here {F_M} are a (potentially infinite) collection of features. The teacher weights, ω_M, are sampled from a normal distribution, ω ∼ N(0, 1/S), and are averaged over in the test loss. The student has learnable parameters {θ_μ} and is built out of a subset of the teacher features. To vary the number of parameters in this simple model, we construct P features, f_{μ=1,...,P}, by introducing a projector \mathcal{P} onto a P-dimensional subspace of the teacher features, f_μ = \sum_M \mathcal{P}_{μM} F_M. We train by sampling a training set of size D and minimizing the MSE loss, L_train = \frac{1}{2D} \sum_{a=1}^{D} (f(x_a) − F(x_a))^2. We are interested in the test loss averaged over draws of our teacher weights and training dataset.
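This setup is easy to simulate directly. The sketch below is our own illustration of Eq. (1), with Gaussian inputs and random ReLU features as stand-in choices (the paper's experiments instead use pooled MNIST); the student is fit by least squares on the first P features, i.e. the projector simply selects a feature subset.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, S = 16, 1024                        # input dimension, number of teacher features
W = rng.standard_normal((d_in, S)) / np.sqrt(d_in)

def features(x):                          # shared random features F_M(x)
    return np.maximum(x @ W, 0.0)

def test_loss(D, P, n_test=2000):
    omega = rng.standard_normal(S) / np.sqrt(S)            # teacher weights ~ N(0, 1/S)
    x_tr = rng.standard_normal((D, d_in))
    x_te = rng.standard_normal((n_test, d_in))
    F_tr, F_te = features(x_tr), features(x_te)
    theta = np.linalg.pinv(F_tr[:, :P]) @ (F_tr @ omega)   # minimum-norm MSE fit
    return 0.5 * np.mean((F_te[:, :P] @ theta - F_te @ omega) ** 2)

for D in [32, 64, 128, 256]:              # overparameterized regime: P >> D
    print(D, test_loss(D, P=S // 2))
```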
The infinite data test loss, L(P) := lim_{D→∞} L(D, P), takes the form

    L(P) = \frac{1}{2S} \mathrm{Tr}\left[ C − C \mathcal{P}^T (\mathcal{P} C \mathcal{P}^T)^{-1} \mathcal{P} C \right] .    (2)

Here we have introduced the feature-feature second moment matrix, C = E_x[F(x) F^T(x)]. If the teacher and student features had the same span, this would vanish, but due to the mismatch the loss is nonzero. On the other hand, if we keep a finite number of training points but allow the student to use all of the teacher features, the test loss, L(D) := lim_{P→S} L(D, P), takes the form

    L(D) = \frac{1}{2} E_x\left[ K(x, x) − \vec{K}(x)^T \bar{K}^{-1} \vec{K}(x) \right] .    (3)

Here, K(x, x′) is the data-data second moment matrix, \vec{K} indicates restricting one argument to the D training points, while \bar{K} indicates restricting both. This test loss vanishes as the number of training points becomes infinite but is non-zero for finite training size. We present a full derivation of these expressions in the supplement. In the remainder of this section, we explore the scaling of the test loss with dataset and model size.

2.3.1 Variance-limited scaling
To derive the limiting expressions (2) and (3) for the loss, one makes use of the fact that the sample expectation of the second moment matrix over the finite dataset, as well as over the finite feature set, is close to the full covariance: \frac{1}{D} \sum_{a=1}^{D} F(x_a) F^T(x_a) = C + δC and \frac{1}{P} f(x) \cdot f(x′) = K + δK, with the fluctuations satisfying E_D[δC^2] = O(D^{-1}) and E_P[δK^2] = O(P^{-1}), where expectations are taken over draws of a dataset of size D and over feature sets. Using these expansions yields the variance-limited scaling, L(D, P) − L(P) = O(D^{-1}) and L(D, P) − L(D) = O(P^{-1}), in the underparameterized and overparameterized settings, respectively.
In Figure 2a we see evidence of these scaling relations for features built from randomly initialized ReLU deep neural networks on coarse-grained versions of the MNIST dataset obtained by local averaging over the images. We see that in the variance-limited regimes the scaling exponent is independent of the modification to the training data. In the supplement, we provide an in-depth derivation of this behavior and expressions for the leading contributions to L(D, P) − L(P) and L(D, P) − L(D).

2.3.2 Resolution-limited scaling
We now would like to analyze the scaling behavior of our linear model in the resolution-limited regimes, that is, the scaling with P when 1 ≪ P ≪ D and the scaling with D when 1 ≪ D ≪ P. In these cases, the scaling is controlled by the shared spectrum of C or K. This spectrum is often well described by a power law, where the eigenvalues λ_i satisfy λ_i = i^{-(1+α_K)}. See Figure 2b for example spectra on pooled MNIST.
In this case, we will argue that the losses also obey a power-law scaling, with the exponents controlled by the spectral decay factor, 1 + α_K,

    L(D) ∝ D^{-α_K} ,  \qquad  L(P) ∝ P^{-α_K} .    (4)

In other words, in this setting, α_P = α_D = α_K. This is supported empirically in Figure 2b. For other derivations of dataset scaling for kernel regression, see [18, 19]. We then argue that when the kernel function K is sufficiently smooth on a manifold of dimension d, α_K ∝ d^{-1}, thus realizing the more general resolution-limited picture described above.

2.3.3 From spectra to scaling laws for the loss
To be concrete, let us focus on the overparameterized loss. If we introduce the notation e_i for the eigenvectors of C and ē_i for the eigenvectors of \frac{1}{D} \sum_{a=1}^{D} F(x_a) F^T(x_a), the loss becomes

    L(D) = \frac{1}{2} \sum_{i=1}^{S} λ_i \left( 1 − \sum_{j=1}^{D} (e_i \cdot ē_j)^2 \right) .    (5)
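Before turning to the asymptotics, the claim α_D ≈ α_K can be checked numerically. The sketch below (ours, not the paper's code) generates Gaussian features whose covariance has a power-law spectrum λ_i = i^{-(1+α_K)} and fits the decay of the overparameterized test loss with D; finite S and the limited D range bias the fit slightly.

```python
import numpy as np

rng = np.random.default_rng(1)
S, alpha_K = 8000, 0.6
lam = np.arange(1, S + 1) ** (-(1.0 + alpha_K))        # power-law spectrum of C

def loss_at(D, n_test=500):
    omega = rng.standard_normal(S) / np.sqrt(S)         # teacher weights
    Phi_tr = rng.standard_normal((D, S)) * np.sqrt(lam)        # training features
    Phi_te = rng.standard_normal((n_test, S)) * np.sqrt(lam)   # test features
    theta = np.linalg.pinv(Phi_tr) @ (Phi_tr @ omega)   # min-norm interpolant
    return 0.5 * np.mean((Phi_te @ (theta - omega)) ** 2)

Ds = np.array([50, 100, 200, 400])
L = np.array([np.mean([loss_at(D) for _ in range(5)]) for D in Ds])
fit = -np.polyfit(np.log(Ds), np.log(L), 1)[0]
print(f"fitted alpha_D = {fit:.2f}   target alpha_K = {alpha_K}")
```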
Before discussing the general asymptotic behavior of (5), we can gain some intuition by considering the case of large α_K. In this case, ē_j ≈ e_j (see e.g. [58]), and we can simplify (5) to

    L(D) ∝ \sum_{i=D+1}^{\infty} \frac{1}{i^{1+α_K}} = \frac{1}{α_K} D^{-α_K} + O(D^{-α_K − 1}) .    (6)

More generally, in the supplement, following [19, 59], we use replica theory methods to derive L(D) ∝ D^{-α_K} and L(P) ∝ P^{-α_K} without requiring the large-α_K limit.

2.3.4 Data manifolds and kernels
In Section 2.2, we discussed a simple argument that resolution-limited exponents scale as α ∝ 1/d, where d is the dimension of the data manifold. Our goal now is to explain how this connects with the linearized models and kernels discussed above: how does the spectrum of eigenvalues of a kernel relate to the dimension of the data manifold?
The key point is that sufficiently smooth kernels must have an eigenvalue spectrum with a bounded tail. Specifically, a C^t kernel on a d-dimensional space must have eigenvalues λ_n ≲ \frac{1}{n^{1+t/d}} [40]. In the generic case where the covariance matrices we have discussed can be interpreted as kernels on a manifold, and they have spectra saturating the bound, linearized models will inherit scaling exponents given by the dimension of the manifold.
As a simple example, consider a d-torus. In this case we can study the Fourier series decomposition, and examine the case of a kernel K(x − y). This must take the form K = \sum_{n_I} [a_{n_I} \sin(n_I \cdot (x − y)) + b_{n_I} \cos(n_I \cdot (x − y))], where n_I = (n_1, ..., n_d) are integer indices and a_{n_I}, b_{n_I} are the overall Fourier coefficients. To guarantee that K is a C^t function, we must have a_{n_I}, b_{n_I} ≲ \frac{1}{n^{d+t}}, where n^d = N indexes the number of a_{n_I} in decreasing order. But this means that in this simple case, the tail eigenvalues of the kernel must be bounded by \frac{1}{N^{1+t/d}} as N → ∞.

Figure 3: Effect of data distribution on scaling exponents. For CIFAR-100 superclassed to N classes (left), we find that the number of target classes does not have a visible effect on the scaling exponent. (right) For CIFAR-10 with the addition of Gaussian noise to inputs, we find the strength of the noise has a strong effect on performance scaling with dataset size. All models are WRN-28-10.

2.4 Duality
We argued above that for kernels with pure power-law spectra, the asymptotic scaling of the underparameterized loss with respect to model size and the overparameterized loss with respect to dataset size share a common exponent. In the linear setup at hand, the relation between the underparameterized parameter dependence and the overparameterized dataset dependence is even stronger. The underparameterized and overparameterized losses are directly related by exchanging the projection onto random features with the projection onto random training points. Note that the sample-wise double descent observed in [51] is a concrete realization of this duality for a simple data distribution. In the supplement, we present examples exhibiting the duality of the loss dependence on model and dataset size outside of the asymptotic regime.
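A small numerical illustration of this duality (our sketch, using the same synthetic power-law features as above; here the underparameterized student keeps the P leading features, which plays the role of the feature projection): the overparameterized loss as a function of D and the underparameterized loss as a function of P display the same power-law trend when D and P are exchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
S, alpha_K = 2000, 0.7
lam = np.arange(1, S + 1) ** (-(1.0 + alpha_K))

def sample_features(n):
    return rng.standard_normal((n, S)) * np.sqrt(lam)

def over_loss(D, n_test=400):
    # all S features, min-norm interpolation of D samples
    omega = rng.standard_normal(S) / np.sqrt(S)
    Phi, Phi_te = sample_features(D), sample_features(n_test)
    theta = np.linalg.pinv(Phi) @ (Phi @ omega)
    return 0.5 * np.mean((Phi_te @ (theta - omega)) ** 2)

def under_loss(P, D_big=2000, n_test=400):
    # first P features, least-squares fit on a large dataset
    omega = rng.standard_normal(S) / np.sqrt(S)
    Phi, Phi_te = sample_features(D_big), sample_features(n_test)
    theta = np.linalg.lstsq(Phi[:, :P], Phi @ omega, rcond=None)[0]
    return 0.5 * np.mean((Phi_te[:, :P] @ theta - Phi_te @ omega) ** 2)

for n in [50, 100, 200, 400]:
    print(n, f"L(D={n}) = {over_loss(n):.3e}", f"L(P={n}) = {under_loss(n):.3e}")
```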
3 Experiments

3.1 Deep teacher-student models
Our theory can be tested very directly in the teacher-student framework, in which a teacher deep neural network generates synthetic data used to train a student network. Here, it is possible to generate unlimited training samples and, crucially, to controllably tune the dimension of the data manifold. We accomplish the latter by scanning over the dimension of the inputs to the teacher. We have found that when scanning over both model size and dataset size, the interpolation exponents closely match the prediction of 4/d. The dataset size scaling is shown in Figure 1, while model size scaling experiments appear in the supplement and have previously been observed in [10].

3.2 Variance-limited scaling in the wild
Variance-limited scaling (Section 2.1) can be universally observed in real datasets. Figure 1a (top-left, bottom-right) measures the variance-limited dataset scaling exponent α_D and width scaling exponent α_W. In both cases, we find striking agreement with the theoretically predicted values α_D, α_W = 1 across a variety of datasets, neural network architectures, batch sizes in stochastic gradient descent, and loss types. Our testbed includes deep fully-connected and convolutional networks with ReLU or Erf nonlinearities and MSE or cross-entropy losses. The supplement contains experimental details.

3.3 Resolution-limited scaling in the wild
In addition to teacher-student models, we explored resolution-limited scaling behavior in the context of standard classification datasets. Wide ResNet (WRN) models [60] were trained for a fixed number of steps with cosine decay. In Figure 1b we also include data from a four-hidden-layer convolutional neural network (CNN) detailed in the supplement. As detailed above, we find dataset-dependent scaling behavior in this context.
We further investigated the effect of the data distribution on the resolution-limited exponent, α_D, by tuning the number of target classes and the input noise (Figure 3). To probe the effect of the number of classes, we constructed tasks derived from CIFAR-100 by grouping classes into broader semantic categories. We found that performance depends on the number of categories, but α_D is insensitive to this number. In contrast, the addition of Gaussian noise had a more pronounced effect on α_D. This suggests a picture in which the neural network learns to model the input data manifold, independent of the classification task, consistent with observations in [61, 62]. We also explored the effect of aspect ratio on dataset scaling, finding that the exponent magnitude increases with width up to a critical width, while the dependence on depth is milder (see supplement).

4 Discussion
We have presented a framework for categorizing neural network scaling laws, along with derivations that help to explain their origins. Crucially, our predictions agree with empirical findings in settings which have often proven challenging for theory – deep neural networks on real datasets. The variance-scaling regime yields, for smooth test losses, a universal prediction of α_D = 1 (for D ≫ P) and α_W = 1 (for w ≫ D). The resolution-limited regime yields exponents whose numerical value is variable and data and model dependent.
There are a number of intriguing directions for future work. The invariance of the dataset scaling exponent to superclassing (Figure 3) suggests that deep networks may be largely learning properties of the input data manifold – akin to unsupervised learning – rather than significant task-specific structure, which may shed light on the versatility of learned deep network representations for different downstream tasks.
Another direction for future research is to more explicitly derive within the theory the effects of "feature learning." While the random feature linear models we have discussed are in exact correspondence with deep neural networks in the large-width limit and have been a useful theoretical testbed across a variety of problems, the kernels associated with networks of finite depth and finite width evolve dynamically during the course of training.

4.1 Limitations
One limitation is that our theoretical results are asymptotic, while experiments are performed with finite models and datasets. This is apparent in the resolution-limited regime, which requires a hierarchy (D ≫ P or P ≫ D). In Figures 1a and 2a top-right (bottom-left), we see a breakdown of the predicted scaling behavior as D (P) becomes large and the hierarchy is lost. Furthermore, in the resolution-limited regime for deep networks, our theoretical tools rely on positing the existence of a data manifold. A precise definition of the data manifold, however, is lacking, forcing us to use imperfect proxies, such as the nearest-neighbor distances of final embedding layers from a trained network.

5 Outlook
Modern deep learning, in the era of large datasets, models, and computational power, has often made progress through extensive amounts of experimentation and trial-and-error. A theoretical understanding of deep learning that is grounded in experiments and strives to bridge the gap between mathematically rigorous theory on the one hand, and realistic settings on the other, could be scientifically important for guiding the field. Our treatment of neural scaling laws in this work touches on classic aspects of generalization within learning theory but derives new results through realistic data assumptions and by identifying and deriving a taxonomy for scaling regimes. Our approach is guided by the theoretical simplicity, realistic modeling, and experimental verification that is characteristic of theory construction in physics; we also leverage results from statistical physics approaches to deep learning in our derivations.
Looking further afield, it is an interesting question whether qualitatively new behavior can emerge in large neural models trained on rich datasets, or whether such models are natural extensions of smaller-scale models. The exploration of so-called emergent abilities within neural-based language models is an active area of research. Further investigation into these questions through the theoretical methods and scientific approaches of physics – calling for a realistic modeling of data and neural representations – may help shed light on our understanding of learning in deep neural networks.

Acknowledgements. The authors would like to thank Guy Gur-Ari, Boris Hanin, Tom Henighan, Danny Hernandez, Aitor Lewkowycz, Sam McCandlish, Preetum Nakkiran, Behnam Neyshabur, Jeffrey Pennington, Vinay Ramasesh, Dan Roberts, Jonathan Rosenfeld, Jascha Sohl-Dickstein, and Lechao Xiao for conversations during the completion of this work. US completed a portion of this work during an internship at Google. JK and US were supported in part by Open Philanthropy.

References

[1] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
[2] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. [3] Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales. In International Conference on Learning Representations, 2020. [4] Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020. [5] Jonathan S Rosenfeld, Jonathan Frankle, Michael Carbin, and Nir Shavit. On the predictability of pruning across scales. In International Conference on Machine Learning, pages 9075–9083. PMLR, 2021. [6] Subutai Ahmad and Gerald Tesauro. Scaling and generalization in neural networks: a case study. In Advances in neural information processing systems, pages 160–168, 1989. [7] David Cohn and Gerald Tesauro. Can neural networks do better than the vapnik-chervonenkis bounds? In Advances in Neural Information Processing Systems, pages 911–917, 1991. [8] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020. [9] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. An empirical analysis of compute-optimal large language model training. Advances in Neural Information Processing Systems, 35:30016–30030, 2022. [10] Utkarsh Sharma and Jared Kaplan. Scaling laws from the data manifold dimension. Journal of Machine Learning Research, 23(9):1–34, 2022. URL http://jmlr.org/papers/v23/20-1111.html. [11] Devansh Bisla, Apoorva Nandini Saridena, and Anna Choromanska. A theoretical-empirical approach to estimating sample complexity of dnns. arXiv preprint arXiv:2105.01867, 2021. 11 [12] Radford M. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, Dept. of Computer Science, 1994. [13] Jaehoon Lee, Yasaman Bahri, Roman Novak, Sam Schoenholz, Jeffrey Pennington, and Jascha Sohldickstein. Deep neural networks as Gaussian processes. In International Conference on Learning Representations, 2018. [14] Alexander G. de G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018. [15] Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural Tangent Kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, 2018. [16] Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, 2019. [17] Ethan Dyer and Guy Gur-Ari. Asymptotics of wide networks from feynman diagrams. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=S1gFvANKDS. 
[18] Stefano Spigler, Mario Geiger, and Matthieu Wyart. Asymptotic learning curves of kernel methods: empirical data versus teacher–student paradigm. Journal of Statistical Mechanics: Theory and Experiment, 2020(12):124001, 2020. [19] Blake Bordelon, Abdulkadir Canatar, and Cengiz Pehlevan. Spectrum dependent learning curves in kernel regression and wide neural networks. In International Conference on Machine Learning, pages 1024–1034. PMLR, 2020. [20] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331–368, 2007. [21] Ingo Steinwart, Don R Hush, Clint Scovel, et al. Optimal rates for regularized least squares regression. In COLT, pages 79–93, 2009. [22] Simon Fischer and Ingo Steinwart. Sobolev norm learning rates for regularized least-squares algorithms. The Journal of Machine Learning Research, 21(1):8464–8501, 2020. [23] Marcus Hutter. Learning curve theory. arXiv preprint arXiv:2102.04074, 2021. [24] Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. Advances in Neural Information Processing Systems, 34:10131–10143, 2021. [25] Alexander Maloney, Daniel A Roberts, and James Sully. A solvable model of neural scaling laws. arXiv preprint arXiv:2210.16859, 2022. [26] Alexander Wei, Wei Hu, and Jacob Steinhardt. More than a toy: Random matrix models predict how real-world neural representations generalize. In International Conference on Machine Learning, pages 23549–23588. PMLR, 2022. [27] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. Advances in neural information processing systems, 21, 2008. [28] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019. 12 [29] Stéphane d’Ascoli, Maria Refinetti, Giulio Biroli, and Florent Krzakala. Double trouble in double descent: Bias and variance (s) in the lazy regime. In International Conference on Machine Learning, pages 2280–2290. PMLR, 2020. [30] Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017. [31] Madhu S Advani, Andrew M Saxe, and Haim Sompolinsky. High-dimensional dynamics of generalization error in neural networks. Neural Networks, 132:428–446, 2020. [32] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019. [33] Ben Adlam and Jeffrey Pennington. The Neural Tangent Kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning, pages 74–84. PMLR, 2020. [34] Ben Adlam and Jeffrey Pennington. Understanding double descent requires a fine-grained bias-variance decomposition. Advances in Neural Information Processing Systems, 33, 2020. [35] Anders Andreassen and Ethan Dyer. Asymptotics of wide convolutional neural networks. arxiv preprint arXiv:2008.08675, 2020. [36] Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d’Ascoli, Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with number of parameters in deep learning. 
Journal of Statistical Mechanics: Theory and Experiment, 2020(2):023401, 2020. [37] Preetum Nakkiran, Behnam Neyshabur, and Hanie Sedghi. The deep bootstrap framework: Good online learners are good offline generalizers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=guetrIHLFGI. [38] Hermann Weyl. Das asymptotische verteilungsgesetz der eigenwerte linearer partieller differentialgleichungen (mit einer anwendung auf die theorie der hohlraumstrahlung). Mathematische Annalen, 71(4): 441–479, 1912. [39] JB Reade. Eigenvalues of positive definite kernels. SIAM Journal on Mathematical Analysis, 14(1): 152–157, 1983. [40] Thomas Kühn. Eigenvalues of integral operators with smooth positive definite kernels. Archiv der Mathematik, 49(6):525–534, 1987. [41] JC Ferreira and VA Menegatto. Eigenvalues of integral operators defined by smooth positive definite kernels. Integral Equations and Operator Theory, 64(1):61–81, 2009. [42] Michael L Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer Science & Business Media, 1999. [43] Peter J Bickel, Bo Li, et al. Local polynomial regression on unknown manifolds. In Complex datasets and inverse problems, pages 177–186. Institute of Mathematical Statistics, 2007. [44] David de Laat. Approximating manifolds by meshes: asymptotic bounds in higher codimension. Master’s Thesis, University of Groningen, Groningen, 2011. [45] Sho Yaida. Non-Gaussian processes and neural networks at finite widths. In Mathematical and Scientific Machine Learning Conference, 2020. 13 [46] Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum? id=SJgndT4KwB. [47] Elizaveta Levina and Peter J Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in neural information processing systems, pages 777–784, 2005. [48] P McCullagh and John A Nelder. Generalized Linear Models, volume 37. CRC Press, 1989. [49] Ryan M Rifkin and Ross A Lippert. Notes on regularized least squares, 2007. [50] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media, 2009. [51] Preetum Nakkiran. More data can hurt for linear regression: Sample-wise double descent. arXiv preprint arXiv:1912.07242, 2019. [52] Roman Novak, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are gaussian processes. In International Conference on Learning Representations, 2019. [53] Adrià Garriga-Alonso, Laurence Aitchison, and Carl Edward Rasmussen. Deep convolutional networks as shallow gaussian processes. In International Conference on Learning Representations, 2019. [54] Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019. [55] Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, and Guy Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218, 2020. [56] Wei Huang, Weitao Du, Richard Yi Da Xu, and Chunrui Liu. Implicit bias of deep linear networks in the large learning rate phase. arXiv preprint arXiv:2011.12547, 2020. 
[57] Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, pages 2937–2947, 2019. [58] Andreas Loukas. How close are the eigenvectors of the sample and actual covariance matrices? In International Conference on Machine Learning, pages 2228–2237. PMLR, 2017. [59] Abdulkadir Canatar, Blake Bordelon, and Cengiz Pehlevan. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nature communications, 12(1):1–12, 2021. [60] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference, 2016. [61] Preetum Nakkiran and Yamini Bansal. Distributional generalization: A new kind of generalization. arXiv preprint arXiv:2009.08092, 2020. [62] Will Grathwohl, Kuan-Chieh Wang, Joern-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum? id=Hkxzx0NtDB. [63] Roman Novak, Lechao Xiao, Jiri Hron, Jaehoon Lee, Alexander A. Alemi, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. Neural Tangents: Fast and easy infinite neural networks in python. In International Conference on Learning Representations, 2020. URL https://github.com/google/neural-tangents. 14 [64] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax. [65] Vaishaal Shankar, Alex Chengyu Fang, Wenshuo Guo, Sara Fridovich-Keil, Ludwig Schmidt, Jonathan Ragan-Kelley, and Benjamin Recht. Neural kernels without tangents. In International Conference on Machine Learning, 2020. [66] Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. International Conference on Learning Representations, 2017. [67] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, 2018. [68] Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. Flax: A neural network library and ecosystem for JAX, 2020. URL http://github. com/google/flax. [69] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016. [70] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, highperformance deep learning library. Advances in Neural Information Processing Systems, 32:8026–8037, 2019. [71] Christopher KI Williams and Francesco Vivarelli. Upper and lower bounds on the learning curve for gaussian processes. Machine Learning, 40(1):77–102, 2000. [72] Dörthe Malzahn and Manfred Opper. A variational approach to learning curves. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14, pages 463–469. MIT Press, 2002. URL https://proceedings.neurips.cc/paper/2001/file/ 26f5bd4aa64fdadf96152ca6e6408068-Paper.pdf. 
[73] Peter Sollich and Anason Halees. Learning curves for gaussian process regression: Approximations and bounds. Neural computation, 14(6):1393–1428, 2002.
[74] Giorgio Parisi. A sequence of approximated solutions to the SK model for spin glasses. Journal of Physics A: Mathematical and General, 13(4):L115, 1980.
[75] Peter Sollich. Learning curves for gaussian processes. In Proceedings of the 11th International Conference on Neural Information Processing Systems, pages 344–350, 1998.
[76] Dörthe Malzahn and Manfred Opper. Learning curves for gaussian processes regression: A framework for good approximations. Advances in neural information processing systems, pages 273–279, 2001.
[77] Dörthe Malzahn and Manfred Opper. Learning curves and bootstrap estimates for inference with gaussian processes: A statistical mechanics study. Complexity, 8(4):57–63, 2003.
[78] Matthew J Urry and Peter Sollich. Replica theory for learning curves for gaussian processes on random graphs. Journal of Physics A: Mathematical and Theoretical, 45(42):425005, 2012.
[79] Omry Cohen, Or Malka, and Zohar Ringel. Learning curves for overparametrized deep neural networks: A field theory perspective. Phys. Rev. Res., 3:023034, Apr 2021. doi: 10.1103/PhysRevResearch.3.023034. URL https://link.aps.org/doi/10.1103/PhysRevResearch.3.023034.
[80] Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Generalisation error in learning with random features and the hidden manifold model. In International Conference on Machine Learning, pages 3452–3462. PMLR, 2020.
[81] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
[82] Alnur Ali, J Zico Kolter, and Ryan J Tibshirani. A continuous-time view of early stopping for least squares regression. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1370–1378, 2019.
[83] Jaehoon Lee, Samuel Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, and Jascha Sohl-Dickstein. Finite versus infinite neural networks: an empirical study. Advances in Neural Information Processing Systems, 33, 2020.
[84] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2661–2671, 2019.
[85] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. arXiv preprint arXiv:1912.11370, 6(2):8, 2019.

Supplemental Material

A Experimental setup

A.1 Figure 1 (top-left)
Experiments utilize relatively small models, with the number of trainable parameters P ∼ O(1000), trained with full-batch gradient descent (GD) and a small learning rate on datasets of size D ≫ P. Each data point in the figure represents an average over subsets of size D sampled from the full dataset. Experiments are done using the Neural Tangents [63] library, based on JAX [64]. All experiments except those denoted (CNN) use 3-layer, width-8 fully-connected networks. The convolutional neural network (CNN) architecture used is the Myrtle-5 network [65] with 8 channels. The ReLU activation function with critical initialization [13, 66, 67] was used. Unless otherwise specified, cross-entropy loss was used. We performed full-batch gradient descent updates for all dataset sizes without L2 regularization. 20 different training data sampling seeds were averaged for each point. For fully-connected networks, input pooling of size 4 was performed for the CIFAR-10/100 datasets and pooling of size 2 was performed for the MNIST and Fashion-MNIST datasets. This was to reduce the number of parameters in the input layer (number of pixels × width), which can be quite large even for small-width networks.
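For reference, a minimal sketch (ours, not the authors' code) of this kind of small fully-connected network in the Neural Tangents stax API; reading "3-layer, width-8" as three width-8 hidden layers and using W_std = sqrt(2) to approximate the critical ("He") initialization for ReLU are our assumptions.

```python
from neural_tangents import stax

init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(8, W_std=2**0.5), stax.Relu(),
    stax.Dense(8, W_std=2**0.5), stax.Relu(),
    stax.Dense(8, W_std=2**0.5), stax.Relu(),
    stax.Dense(10, W_std=1.0),
)
# init_fn builds finite-width parameters, apply_fn evaluates the network, and
# kernel_fn returns the corresponding infinite-width NNGP/NTK kernels.
```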
A.2 Figure 1 (top-right)
All experiments were performed using a Flax [68] implementation of Wide ResNet 28-10 [60]. Models were trained for 78,125 total steps with a cosine learning rate decay [69] and an augmentation policy consisting of random flips and crops. We report final loss, though we found no qualitative difference between using final loss, best loss, final accuracy, or best accuracy (see Figure S1).

Figure S1: Alternate metrics and stopping conditions (panels for CIFAR-10, CIFAR-100, SVHN, and Fashion MNIST). We find similar scaling behavior for both the loss and error, as well as for final and best (early stopped) metrics.

A.3 Figure 1 (bottom-left)
The setup was identical to Figure 1 (top-right), except that the model considered was a depth-10 residual network with varying width.

A.4 Figure 1 (bottom-right)
Experiments are done using Neural Tangents. All experiments use 100 training samples and two-hidden-layer fully-connected networks of varying width (ranging from w = 64 to w = 11,585) with ReLU nonlinearities unless specified as Erf. Full-batch gradient descent and cross-entropy loss were used unless specified as MSE, and the figure shows curves from a random assortment of training times ranging from 100 to 500 steps (equivalently, epochs). Training was done with learning rates small enough so as to avoid catapult dynamics [55] and no L2 regularization; in such a setting, the infinite-width learning dynamics is known to be equivalent to that of linearized models [16]. Consequently, for each random initialization of the parameters, the test loss of the finite-width linearized model was additionally computed in the identical training setting. This value approximates the limiting behavior L(∞) known theoretically and is subtracted off from the final test loss of the (nonlinear) neural network before averaging over 50 random initializations to yield each of the individual data points in the figure.

A.5 Deep teacher-student models
The teacher-student scaling with dataset size (Figure S2) was performed with fully-connected teacher and student networks with two hidden layers and widths 96 and 192, respectively, using PyTorch [70]. The inputs were random vectors sampled uniformly from a hypercube of dimension d = 2, 3, ..., 9. To mitigate noise, we ran the experiment on eight different random seeds, fixing the random seed for the teacher and student as we scanned over dataset sizes. We also used a fixed test dataset, and a fixed training set, which was subsampled for the experiments with smaller D. The student networks were trained using MSE loss and the Adam optimizer with a maximum learning rate of 3 × 10^{-3}, a cosine learning rate decay, a batch size of 64, and 40,000 steps of training. The test losses were measured with early stopping. We combine test losses from different random seeds by averaging the logarithm of the loss from each seed.
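A minimal PyTorch sketch of this teacher-student training loop (our reimplementation, not the authors' code; the hypercube range, output dimension, and the shortened number of steps are illustrative assumptions):

```python
import torch
import torch.nn as nn

def mlp(d_in, width, depth=2):
    layers, h = [], d_in
    for _ in range(depth):
        layers += [nn.Linear(h, width), nn.ReLU()]
        h = width
    layers += [nn.Linear(h, 1)]
    return nn.Sequential(*layers)

d, D = 4, 4096                              # input dimension, dataset size
teacher, student = mlp(d, 96), mlp(d, 192)  # frozen random teacher, trainable student
teacher.requires_grad_(False)

x = 2 * torch.rand(D, d) - 1                # uniform on a hypercube ([-1, 1]^d here)
y = teacher(x)

opt = torch.optim.Adam(student.parameters(), lr=3e-3)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=2000)
for step in range(2000):                    # shortened; the paper trains far longer
    idx = torch.randint(0, D, (64,))        # batch size 64
    loss = ((student(x[idx]) - y[idx]) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step(); sched.step()
print(float(loss))
```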
In our experiments, we always use inputs that are uniformly sampled from a d-dimensional hypercube, following the setup of [10]. They also utilized several intrinsic dimension (ID) estimation methods and found the estimates were close to the input dimension, so we simply use the latter for comparisons. For the dataset-size scans, we used randomly initialized teachers with width 96 and students with width 192; we found similar results with other network sizes. The final scaling exponents and input dimensions are shown in the bottom of Figure 1b. We used the same experiments for the top of that figure, interpolating the behavior of both the teacher and a set of students between two fixed training points. The students only differed by the size of their training sets but had the same random seeds and were trained in the same way. In that figure, the input-space dimension was four. Finally, we also used a similar setup to study variance-limited exponents and scaling. In that case we used much smaller models, with 16-dimensional hidden layers, and a correspondingly larger learning rate. We then studied scaling with D again, with results pictured in Figure 1a.

A.6 CNN architecture for resolution-limited scaling

Figure 1b includes data from CNN architectures trained on image datasets. The architectures are summarized in Table 1. We used the Adam optimizer for training with cross-entropy loss. Each network was trained for long enough to achieve either a clear minimum or a plateau in test loss. Specifically, CIFAR-10, MNIST, and Fashion MNIST were trained for 50 epochs, CIFAR-100 was trained for 100 epochs, and SVHN was trained for 10 epochs. The default Keras training parameters were used. In the case of SVHN, we included the additional images as training data. We averaged (in log space) over 20 runs for CIFAR-100 and CIFAR-10, 16 runs for MNIST, 12 runs for Fashion MNIST, and 5 runs for SVHN. The results of these experiments are shown in Figure S3.

[Figure S2: Scaling trends of MSE loss with dataset size for teacher/student models with input dimensions 2 through 9. The exponents extracted from these fits and their associated input-space dimensionalities are shown in Figure 1b.]

Table 1: CNN architectures for CIFAR-10, MNIST and Fashion MNIST; CIFAR-100; and SVHN.
  CIFAR-10 / MNIST / Fashion MNIST: CNN, window (3, 3), width 50; 2D Max Pooling (2, 2); CNN, window (3, 3), width 100; 2D Max Pooling (2, 2); CNN, window (3, 3), width 100; Dense, width 64; Dense, width 10.
  CIFAR-100: CNN, window (3, 3), width 50; 2D Max Pooling (2, 2); CNN, window (3, 3), width 100; 2D Max Pooling (2, 2); CNN, window (3, 3), width 200; Dense, width 256; Dense, width 100.
  SVHN: CNN, window (3, 3), width 64; 2D Max Pooling (2, 2); CNN, window (3, 3), width 64; 2D Max Pooling (2, 2); Dense, width 128; Dense, width 10.

The measurement of input-space dimensionality for these experiments was done using the nearest-neighbour algorithm, described in detail in Appendices B and C of [10]. We used 2, 3, and 4 nearest neighbors and averaged over the three.

A.7 Teacher-student experiment for scaling of loss with model size

We replicated the teacher-student setup of [10] to demonstrate the scaling of loss with model size. The resulting variation of 4/αP with input-space dimensionality is shown in Figure S4. In our implementation we averaged (in log space) over 15 iterations, with a fixed, randomly generated teacher.
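A hedged PyTorch sketch of the teacher-student scans described in A.5 and A.7: uniform hypercube inputs, a frozen, randomly initialized teacher of width 96, and students of width 192 trained with MSE and Adam. Step counts and the early-stopping criterion are simplified relative to the full runs, and all names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, teacher_width, student_width = 4, 96, 192

def mlp(width, d_in):
    return nn.Sequential(nn.Linear(d_in, width), nn.ReLU(),
                         nn.Linear(width, width), nn.ReLU(),
                         nn.Linear(width, 1))

teacher = mlp(teacher_width, d)
for p in teacher.parameters():
    p.requires_grad_(False)                    # frozen, randomly initialized teacher

x_test = torch.rand(4096, d) * 2 - 1           # uniform samples from the hypercube
y_test = teacher(x_test)
x_pool = torch.rand(2 ** 13, d) * 2 - 1        # fixed pool, subsampled for smaller D
y_pool = teacher(x_pool)

def train_student(D, steps=4000, lr=3e-3, batch=64):
    x, y = x_pool[:D], y_pool[:D]
    student = mlp(student_width, d)
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    best = float("inf")
    for t in range(steps):
        idx = torch.randint(0, D, (min(batch, D),))
        opt.zero_grad()
        loss_fn(student(x[idx]), y[idx]).backward()
        opt.step()
        if t % 200 == 0:                        # crude early stopping on test loss
            with torch.no_grad():
                best = min(best, loss_fn(student(x_test), y_test).item())
    return best

for D in [2 ** k for k in range(6, 13)]:
    print(D, train_student(D))
```

Fitting the printed losses against D in log-log space (as in the previous sketch) gives the dataset exponents plotted in Figure 1b and Figure S2.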
A.8 Effect of aspect ratio on scaling exponents

We trained Wide ResNet architectures of various widths and depths on CIFAR-10 across dataset sizes. We found that the effect of depth on dataset scaling was mild for the range studied, while width affected the scaling behavior up to a saturating width, beyond which the scaling behavior was fixed. See Figure S5.

[Figure S3: Scaling trends of cross-entropy loss with dataset size for CIFAR-10, FashionMNIST, MNIST, CIFAR-100, and SVHN, with the fitted exponent αD shown per panel. The exponents extracted from these fits and their associated input-space dimensionalities are shown in Figure 1b.]

[Figure S4: The variation of αP with the input-space dimension, plotted as 4/αP versus d. The exponent αP is the scaling exponent of loss with model size for the teacher-student setup.]

[Figure S5: Effect of aspect ratio on dataset scaling. We find that for WRN-d-k trained on CIFAR-10, varying depth from 10 to 40 has a relatively mild effect on scaling behavior, while varying the width multiplier, k, from 1 to 12 has a more noticeable effect, up until a saturating width.]

B Proof of Theorem 1

We now prove Theorem 1, repeated below for convenience.

Theorem 1. Let ℓ(f) be the test loss as a function of network output (L = E[ℓ(f)]), and let fT be the network output after T training steps, thought of as a random variable over weight initialization, draws of the training dataset, and optimization seed. Further let fT be concentrating, with E[(fT − E[fT])^k] = O(ϵ) ∀k ≥ 2. If ℓ is a finite-degree polynomial, or has bounded second derivative, or is 2-Hölder, then E[ℓ(fT)] − ℓ(E[fT]) = O(ϵ).

Proof. Case 1 – finite degree polynomial: In this case, we can write

\[
\ell(f_T) - \ell(\mathbb{E}[f_T]) = \sum_{k=1}^{K} \frac{\ell^{(k)}(\mathbb{E}[f_T])}{k!}\,\big(f_T - \mathbb{E}[f_T]\big)^k ,
\tag{S1}
\]

where K is the polynomial degree and ℓ(k) is the k-th derivative of ℓ. Taking the expectation of (S1) and using the moment scaling proves the result.

Case 2 – bounded second derivative: The quadratic mean value theorem states that for any fT there exists a c such that

\[
\ell(f_T) - \ell(\mathbb{E}[f_T]) = \big(f_T - \mathbb{E}[f_T]\big)\,\ell'(\mathbb{E}[f_T]) + \tfrac{1}{2}\,\ell''(c)\,\big(f_T - \mathbb{E}[f_T]\big)^2 .
\tag{S2}
\]

Taking the expectation of (S2) and using the fact that ℓ′′(c) is bounded yields the desired result.

Case 3 – 2-Hölder: Lastly, the loss being 2-Hölder means we may write

\[
\ell(f_T) - \ell(\mathbb{E}[f_T]) \le \big|\ell(f_T) - \ell(\mathbb{E}[f_T])\big| \le K_\ell\,\big(f_T - \mathbb{E}[f_T]\big)^2 .
\tag{S3}
\]

Again, taking the expectation of this inequality completes the proof.
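The content of Theorem 1 can be checked numerically in a toy setting (this is an illustration, not part of the proof): draw fT directly as a random variable concentrating with variance ε around a fixed mean and verify that the gap E[ℓ(fT)] − ℓ(E[fT]) is O(ε) for a smooth loss. The particular loss and distribution below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
loss = lambda f: np.log(1.0 + np.exp(-f))      # smooth loss with bounded second derivative

f_mean = 1.5
for eps in [1e-1, 1e-2, 1e-3]:
    # f_T concentrating around f_mean: all centered moments of order >= 2 are O(eps).
    f_T = f_mean + np.sqrt(eps) * rng.normal(size=4_000_000)
    gap = loss(f_T).mean() - loss(f_mean)
    print(f"eps={eps:.0e}  E[l(f_T)] - l(E[f_T]) = {gap:.2e}  ratio = {gap / eps:.3f}")
# The ratio is roughly constant (~ l''(f_mean)/2), i.e. the gap scales as O(eps).
```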
Remarks on loss variance

Theorem 1 concerns the mean loss; however, we would also like to understand whether this scaling holds for typical instances. This can be understood by examining how the variance of the loss, or alternatively E[|ℓ(fT) − ℓ(E[fT])|], scales.

For Case 3 – 2-Hölder loss, we can rerun the argument of Theorem 1, using (S3) to yield E[|ℓ(fT) − ℓ(E[fT])|] = O(ϵ).

For Cases 1 and 2, we can attempt to apply the same argument as in the proof. This almost works. In particular, using Hölder’s inequality, E[(fT − E[fT])^k] = O(ϵ) ∀k ≥ 2 implies E[|fT − E[fT]|^k] = O(ϵ) ∀k ≥ 2. Taking the absolute value and expectation of (S1) or (S2) then gives

\[
\mathbb{E}\big[\,|\ell(f_T) - \ell(\mathbb{E}[f_T])|\,\big] \le |\ell'(\mathbb{E}[f_T])|\;\mathbb{E}\big[\,|f_T - \mathbb{E}[f_T]|\,\big] + O(\epsilon) .
\tag{S4}
\]

In general, the above assumptions on ℓ and fT imply only that E[|fT − E[fT]|] = O(√ϵ), and thus typical instances of the loss will exhibit a less dramatic scaling with ϵ than the mean. If we further assume, however, that fT on average has been trained to be sufficiently close to a local minimum of the loss, such that |ℓ′(E[fT])| = O(√ϵ), then typical instances will also obey the O(ϵ) scaling.

C Variance-limited dataset scaling

In this section, we expand on our discussion of the variance-limited dataset scaling, L(D) − limD→∞ L(D) = O(D−1). We first explain some intuition for why this behavior might be expected for sufficiently smooth losses. We then derive it explicitly for losses that are polynomial in the weights. Finally, we present non-smooth examples where the scaling can be violated, either by an unbounded loss or by an unbounded first derivative.

C.1 Intuition

At a high level, the intuition is as follows. For any fixed value of the weights, θ, the training loss with D training points (thought of as a random variable over draws of the dataset), Ltrain[θ], concentrates around the population loss Lpop[θ], with variance that scales as O(D−1). Our optimization procedure can be thought of as a map from initial weights and training loss to final weights, Op : (θ0, Ltrain[θ]) → θT. If this map is sufficiently smooth – for instance satisfying the assumptions of Theorem 1, or well approximated by a Taylor series about all ED[Ltrain[θt]] – then the output, θT, will also concentrate around its infinite-D limit with variance scaling as O(D−1). Finally, if the population loss is also sufficiently smooth, the test loss for a model trained on D data points, averaged over draws of the dataset, L(D) = ED[Lpop[θT]], satisfies L(D) − limD→∞ L(D) = O(D−1). We now walk through this in a little more detail.

Early time

We can follow this intuition a bit more explicitly for the first few steps of gradient descent. As the training loss at initialization, Ltrain[θ0], is a sample average over D i.i.d. draws, it concentrates around the population loss Lpop[θ0] with variance O(D−1). As a result, the initial gradient, g0 = ∂Ltrain/∂θ0, will also concentrate with O(D−1) variance, and so will the weights at time step 1, θ1 = θ0 − ηg0. The training loss at time step 1 is then given by (setting η = 1)

\[
L_{\text{train}}[\theta_1] = L_{\text{train}}[\theta_0 - g_0] .
\tag{S5}
\]

If Ltrain is sufficiently smooth around θ0 − ED[g0], then Ltrain[θ1] concentrates around its mean over dataset draws with O(D−1) variance. We can keep bouncing back and forth between gradient (or equivalently weights) and training loss for any number of steps T which does not scale with D. Plugging the final θT into the population loss and taking the expectation over draws of the training set gives L(D) = ED[Lpop[θT]]. If Lpop is also sufficiently smooth, this yields L(D) − limD→∞ L(D) = O(D−1).
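The early-time picture can be illustrated numerically: fix a small number T of full-batch GD steps, draw many training sets of size D for a simple smooth per-sample loss, and check that the variance of θT over dataset draws falls as 1/D. The one-dimensional quadratic example below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, lr, reps = 10, 0.2, 1000

def theta_T(D):
    # 1-d model, per-sample loss l(theta; x) = (theta - x)^2 / 2, data x ~ N(1, 1).
    x = rng.normal(loc=1.0, scale=1.0, size=D)
    theta = 0.0
    for _ in range(T):                      # T full-batch GD steps, with T fixed as D grows
        theta -= lr * np.mean(theta - x)    # gradient of the training loss
    return theta

for D in [100, 1000, 10000, 100000]:
    samples = np.array([theta_T(D) for _ in range(reps)])
    print(D, D * samples.var())             # roughly constant => Var[theta_T] = O(1/D)
```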
Here we have used the term sufficiently smooth. A sufficient set of criteria is given in Theorem 1; however, this is likely too restrictive. Indeed, any pair of train and population losses for which a Taylor series (or asymptotic series with optimal truncation) gives an O(D−1) error around ED[θt=0...T] will have this behavior.

Local minimum

The above intuition relied on training for a number of steps that was fixed as D is taken large. Here we present some alternative intuition for the variance-limited scaling at late times, as training approaches a local minimum in the loss. For simplicity we discuss a one-dimensional loss.

Consider a local minimum, θ∗, of the population loss. As D is taken large, with high probability the training loss will have a local minimum, θ̄∗, such that |θ∗ − θ̄∗| = O(D−1). One way to see this is to note that for a generic local minimum the first derivative changes sign, i.e. we can find θ1, θ2 such that θ1 < θ∗ < θ2 and either L′pop[θ1] < 0, L′pop[θ2] > 0 or L′pop[θ2] < 0, L′pop[θ1] > 0. To be concrete, let us focus on the first case (the argument is identical in either case). As D becomes large, the probability that the training loss at θ1 and θ2 differs significantly from the population loss approaches zero. This can be seen from Markov’s inequality,

\[
P\Big(\big|L'_{\text{train}}[\theta] - L'_{\text{pop}}[\theta]\big| > a\Big) \le \frac{\mathrm{Var}_D\!\big(L'_{\text{train}}[\theta]\big)}{a^2} ,
\]

or more dramatically from Hoeffding’s inequality (assuming bounded Ltrain − Lpop, lying in an interval of size I),

\[
P\Big(\big|L'_{\text{train}}[\theta] - L'_{\text{pop}}[\theta]\big| > a\Big) \le 2\,e^{-2 D a^2 / I^2} .
\tag{S6}
\]

Here, to have non-vanishing probability as we take D large, L′train[θ1] and L′train[θ2] must be closer than O(D−1) to their population counterparts. If θ1 and θ2 are taken to be O(D0), then L′train must change sign, indicating an extremum of Ltrain; however, we can do even better. If we assume Ltrain is Lipschitz about θ∗, then we can still ensure a sign change even if |θ1 − θ∗|, |θ2 − θ∗| = O(D−1). Using concentration of L′′train[θ] ensures the extremum is a local minimum. For non-generic minima (i.e. vanishing first derivatives) we can apply the same arguments to higher-order derivatives (assuming they exist) of Lpop. Thus for a local minimum of Lpop, with high probability Ltrain will have a corresponding minimum within a distance O(D−1).

If we now consider an initialization θ0 and training procedure such that training converges to the local minimum of the training loss, θ̄∗, and the population loss is sufficiently smooth about θ∗ (e.g. Lipschitz), then ED[Ltrain[θ̄∗] − Lpop[θ∗]] = ED[Ltrain[θ̄∗] − Lpop[θ̄∗]] + ED[Lpop[θ̄∗] − Lpop[θ∗]]. The first term vanishes, while the second is O(D−1). If we further assume that this happens on average over choices of θ0, then we expect L(D) − limD→∞ L(D) = O(D−1).

Stochastic gradient descent (SGD)

At first blush it may be surprising that the variance-limited scaling holds even for mini-batch training: batch noise comes in at a much higher scale than any variance due to the finite training set size. Mini-batching does change the final test loss; however, if we fix the SGD procedure or average over SGD seeds, we can still ask, as we take D large, how the loss for a model trained under SGD on a training set of size D differs from that for a model trained under SGD on an infinite training set. To see this, we first consider averaging over minibatches of size B, where points are drawn i.i.d. with replacement.
If we denote the batch at step t by Bt and the average over independent draws of this batch by EB[•], then we can translate moments with respect to batch draws into empirical averages over the entire training set. Explicitly, consider ca and da, potentially correlated, but each drawn i.i.d. within a batch. We have that

\begin{align}
\mathbb{E}_B\!\left[\frac{1}{B}\sum_{a\in B_t} c_a\right] &= \frac{1}{D}\sum_{a=1}^{D} c_a , \nonumber\\
\mathbb{E}_B\!\left[\left(\frac{1}{B}\sum_{a\in B_t} c_a\right)\!\left(\frac{1}{B}\sum_{a'\in B_t} d_{a'}\right)\right]
&= \left(1-\frac{1}{B}\right)\!\left(\frac{1}{D}\sum_{a=1}^{D} c_a\right)\!\left(\frac{1}{D}\sum_{a'=1}^{D} d_{a'}\right)
+ \frac{1}{B}\left(\frac{1}{D}\sum_{a=1}^{D} c_a d_a\right). \tag{S7}
\end{align}

This means that, after averaging over draws of the SGD batches, rather than thinking about a function of mini-batch averages we can equivalently consider a modified function, with explicit dependence on the batch size, that is only a function of empirical means over the training set. We can thus recycle the above intuition for the scaling of smooth functions of empirical means.

The above relied on independently drawing every sample of every batch. At the other extreme, we can consider drawing batches without shuffling and increasing the training set size by B datapoints at a time, so as to keep the initial set of batches in an epoch fixed. In this case, the first deviation in training between a dataset of size D and one of size D + B happens at the last batch of the first epoch, after processing D datapoints. As an extreme example, consider the case where D > BT. In this case, as we only take T steps, the loss is constant for all D > BT, so limD→∞ L(D; T; B) = L(BT; T; B) and thus L(D > BT) − limD→∞ L(D) = 0 (and in particular is trivially O(D−1)).

C.2 Polynomial loss

Before discussing neural network training, we review the concentration behavior of polynomials of sample means.

Lemma 1. Let c̄(i) = (1/D) Σ_{a=1}^{D} c_a^{(i)} for i = 0 . . . J be empirical means over D i.i.d. draws of c_a^{(i)}, and let c^{(i)} denote the distributional mean. Further, let X = (c̄(0))^{k0}(c̄(1))^{k1} · · · (c̄(J))^{kJ} be a monomial in the sample means. Then X concentrates, with moments O(D−1):

\[
\mathbb{E}_D\!\left[\Big(X - (c^{(0)})^{k_0}(c^{(1)})^{k_1}\cdots(c^{(J)})^{k_J}\Big)^{\!n}\right] = O\!\left(D^{-1}\right).
\tag{S8}
\]

Here, ED[•] denotes the average over independent draws of D samples.

Proof. To establish this we can proceed by direct computation,

\[
\mathbb{E}_D\!\left[\Big(X - (c^{(0)})^{k_0}(c^{(1)})^{k_1}\cdots(c^{(J)})^{k_J}\Big)^{\!n}\right]
= \sum_{p=0}^{n} \binom{n}{p}(-1)^{n-p}\,\mathbb{E}_D\!\left[X^p\right]\Big((c^{(0)})^{k_0}(c^{(1)})^{k_1}\cdots(c^{(J)})^{k_J}\Big)^{n-p}.
\tag{S9}
\]

Each term in the sum can be computed using

\begin{align}
\mathbb{E}_D\!\left[X^p\right] &= \mathbb{E}_D\!\left[(\bar c^{(0)})^{p k_0}(\bar c^{(1)})^{p k_1}\cdots(\bar c^{(J)})^{p k_J}\right] \nonumber\\
&= \frac{1}{D^{\,p\sum_i k_i}} \sum_{\{a^{(i)}_\alpha\}} \mathbb{E}_D\!\left[c^{(0)}_{a^{(0)}_1}\cdots c^{(0)}_{a^{(0)}_{p k_0}}\; c^{(1)}_{a^{(1)}_1}\cdots c^{(1)}_{a^{(1)}_{p k_1}}\cdots c^{(J)}_{a^{(J)}_1}\cdots c^{(J)}_{a^{(J)}_{p k_J}}\right] \nonumber\\
&= \frac{1}{D^{\,p\sum_i k_i}} \sum_{\{a^{(i)}_\alpha \neq a^{(j)}_\beta\}} \mathbb{E}_D\!\left[c^{(0)}_{a^{(0)}_1}\cdots c^{(J)}_{a^{(J)}_{p k_J}}\right] + O\!\left(D^{-1}\right) \nonumber\\
&= \frac{D(D-1)\cdots\big(D - (p\textstyle\sum_i k_i - 1)\big)}{D^{\,p\sum_i k_i}}\,(c^{(0)})^{p k_0}(c^{(1)})^{p k_1}\cdots(c^{(J)})^{p k_J} + O\!\left(D^{-1}\right) \nonumber\\
&= (c^{(0)})^{p k_0}(c^{(1)})^{p k_1}\cdots(c^{(J)})^{p k_J} + O\!\left(D^{-1}\right). \nonumber
\end{align}

Plugging this into (S9) establishes the lemma. In the above, we use the multi-index notation {a_α^{(i)}} for the collection of indices on the c^{(i)}, and the notation {a_α^{(i)} ≠ a_β^{(j)}} for the subset of terms in the sum where all indices take different values.

Lemma 1 immediately implies that the mean of a polynomial of the c̄(i) concentrates around its infinite-data limit,

\[
\mathbb{E}_D\!\left[\Big(g\big(\bar c^{(0)}, \bar c^{(1)}, \ldots, \bar c^{(K)}\big) - g\big(c^{(0)}, c^{(1)}, \ldots, c^{(K)}\big)\Big)^{\!n}\right] = O\!\left(D^{-1}\right),
\tag{S10}
\]

for g ∈ P_K[c̄(0), c̄(1), . . . , c̄(K)].
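A quick Monte Carlo spot-check of Lemma 1 (illustrative only): for a monomial X in sample means, the moments centered on the product of distributional means remain O(1/D). The particular distributions below are arbitrary, and the sample means are drawn directly from their exact laws to keep the check lightweight.

```python
import numpy as np

rng = np.random.default_rng(0)

def centered_moment(D, n, reps=200000):
    # Sample means of two correlated i.i.d. families:
    # c^(0)_a ~ Exp(1) (mean 1) and c^(1)_a = 0.5*c^(0)_a + eta_a, eta_a ~ N(2, 1) (mean 2.5).
    cbar0 = rng.gamma(shape=D, scale=1.0 / D, size=reps)        # exact law of the sample mean
    cbar1 = 0.5 * cbar0 + rng.normal(loc=2.0, scale=1.0 / np.sqrt(D), size=reps)
    X = cbar0 ** 2 * cbar1                                      # monomial with k0 = 2, k1 = 1
    return np.mean((X - 1.0 ** 2 * 2.5) ** n)                   # centered on the product of true means

for D in [50, 200, 800, 3200]:
    print(D, D * centered_moment(D, 2), D * centered_moment(D, 3))
# D * moment stays bounded as D grows (the n = 2 value is roughly constant,
# and higher moments fall even faster), consistent with the O(1/D) claim of Lemma 1.
```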
With this out of the way, we can proceed to analyzing the scaling of trained neural networks. Here we consider the simplified setting where the network map, f, and the loss ℓ evaluated on each training example, x_a = (x_a, y_a), are polynomials of degree J and K in the weights, θµ:

\[
f(x) = \sum_{i=1}^{J} b^{(i)}_{\mu_1\mu_2\ldots\mu_i}(x)\,\theta_{\mu_1}\theta_{\mu_2}\cdots\theta_{\mu_i} ,
\qquad
\ell(x_a) = \sum_{i=1}^{K} c^{(i)}_{\mu_1\mu_2\ldots\mu_i}(x_a)\,\theta_{\mu_1}\theta_{\mu_2}\cdots\theta_{\mu_i} .
\tag{S11}
\]

The training loss can then be written as

\[
L_{\text{train}} = \sum_{i=1}^{K} \bar c^{(i)}_{\mu_1\mu_2\ldots\mu_i}\,\theta_{\mu_1}\theta_{\mu_2}\cdots\theta_{\mu_i} ,
\qquad
\bar c^{(i)} = \frac{1}{D}\sum_{a=1}^{D} c^{(i)}(x_a) .
\tag{S12}
\]

Here we have used the convention that repeated weight indices µj are summed over.

Gradient descent

As a result of the gradient descent weight update, θ_{t+1} = θ_t − η ∂L_train/∂θ, the weights at time T are a polynomial of degree (K − 1)^T in the c̄(i):

\[
\theta_T \in \mathcal{P}_{(K-1)^T}\!\left[\bar c^{(0)}, \bar c^{(1)}, \ldots, \bar c^{(K)}\right].
\tag{S13}
\]

The coefficients of this polynomial depend on the initial weights, θ0. Plugging these weights back into the network output, we have that the network function at time T is again a polynomial in the c̄(i), now with degree J(K − 1)^T:

\[
f_T(x) \in \mathcal{P}_{J(K-1)^T}\!\left[\bar c^{(0)}, \bar c^{(1)}, \ldots, \bar c^{(K)}\right].
\tag{S14}
\]

Thus, again using Lemma 1, fT concentrates with variance O(D−1),

\[
\mathbb{E}_D\!\left[(f_T - \mathbb{E}_D[f_T])^2\right] = O\!\left(D^{-1}\right),
\tag{S15}
\]

and by Theorem 1 the loss will obey the variance-limited scaling.

Stochastic gradient descent

We now consider the same setup of polynomial loss, but trained via stochastic gradient descent (SGD). We consider SGD batches drawn i.i.d. with replacement and are interested in the test loss averaged over SGD draws, with fixed batch size B. We proceed by proving the following lemma, which allows us to reuse a similar argument to the GD case.

Lemma 2. Let c̃(i;t) = (1/B) Σ_{a∈Bt} c_a^{(i)} for i = 0 . . . J be mini-batch averages over B i.i.d. draws of c_a^{(i)}. Further, let X = (c̃(0;t0))^{k0}(c̃(1;t1))^{k1} · · · (c̃(J;tJ))^{kJ} be a monomial in the mini-batch means. Then

\[
\mathbb{E}_B[X] \in \mathcal{P}_{\sum_{i=0}^{J} k_i}\!\left[\bar d^{(0)}, \bar d^{(1)}, \ldots, \bar d^{\left(\prod_{i=0}^{J}(k_i+1)-1\right)}\right],
\]

where the d̄(i) are empirical means over the full training set of i.i.d. random variables, as in Lemma 1, and EB[•] denotes the expectation over draws of SGD batches of size B.

Proof. Expectations over draws of batches at different time steps are independent. Thus, w.l.o.g., we can consider t := t0 = t1 = · · · = tJ. We can again proceed by direct computation, expanding the mini-batch sums,

\[
\mathbb{E}_B[X] = \frac{1}{B^{\sum_{i=0}^{J} k_i}} \sum_{\{a^{(i)}_\alpha\}\in B_t} \mathbb{E}_B\!\left[c^{(0)}_{a^{(0)}_1}\cdots c^{(0)}_{a^{(0)}_{k_0}}\; c^{(1)}_{a^{(1)}_1}\cdots c^{(1)}_{a^{(1)}_{k_1}}\cdots c^{(J)}_{a^{(J)}_1}\cdots c^{(J)}_{a^{(J)}_{k_J}}\right].
\tag{S16}
\]

To proceed, we must keep track of terms in the sum where the a_α^{(i)} take the same or different values. If all a_α^{(i)} are different, the expectation over batch draws fully factorizes. More generally, (S16) can be decomposed as a sum over products.

One way of keeping track of the index combinatorics is to introduce a set of graphs, Γ, where each graph γ ∈ Γ has k0 vertices of type 0, k1 vertices of type 1, . . . , and kJ vertices of type J (one vertex for each a_α^{(i)} index). Any pair of vertices may have zero or one edge between them, and for any three vertices v1, v2, v3 with edges (v1, v2) and (v2, v3) there must also be an edge (v1, v3). The set Γ consists of all possible ways of connecting these vertices consistent with these rules. For each graph, γ, we denote connected components by σ and denote the number of vertices of type i within the connected component σ by m_σ^{(i)}.
With this we can write the sum (S16) as

\begin{align}
\mathbb{E}_B[X] &= \sum_{\gamma\in\Gamma} S_\gamma(B) \prod_{\sigma\in\gamma} \mathbb{E}_B\!\left[\frac{1}{B}\sum_{a\in B_t} \big(c^{(0)}_a\big)^{m^{(0)}_\sigma}\big(c^{(1)}_a\big)^{m^{(1)}_\sigma}\cdots\big(c^{(J)}_a\big)^{m^{(J)}_\sigma}\right] \nonumber\\
&= \sum_{\gamma\in\Gamma} S_\gamma(B) \prod_{\sigma\in\gamma} \frac{1}{D}\sum_{a=1}^{D} \big(c^{(0)}_a\big)^{m^{(0)}_\sigma}\big(c^{(1)}_a\big)^{m^{(1)}_\sigma}\cdots\big(c^{(J)}_a\big)^{m^{(J)}_\sigma} \tag{S17}\\
&= \sum_{\gamma\in\Gamma} S_\gamma(B) \prod_{\sigma\in\gamma} \bar d^{\,(\{m^{(0)}_\sigma, m^{(1)}_\sigma, \ldots, m^{(J)}_\sigma\})} . \nonumber
\end{align}

Here Sγ(B) is a combinatoric factor associated to each graph, not relevant for the argument. The m_σ^{(i)} take values 0 to ki, so the multi-index can take on Π_{i=0}^{J}(ki + 1) different values, which we re-index to d̄(0), d̄(1), . . . , d̄(Π_{i=0}^{J}(ki+1)−1). Meanwhile, the degree of (S17) in the d̄(i) is bounded by the total number of vertices in each graph, i.e. Σ_{i=0}^{J} ki. This establishes the lemma.

For a polynomial loss of degree K, the mini-batch training loss at each time step takes the form

\[
L^{(t)}_{\text{train}} = \sum_{i=1}^{K} \tilde c^{(i;t)}_{\mu_1\mu_2\ldots\mu_i}\,\theta_{\mu_1}\theta_{\mu_2}\cdots\theta_{\mu_i} ,
\qquad
\tilde c^{(i;t)} = \frac{1}{B}\sum_{a\in B_t} c^{(i)}(x_a) .
\tag{S18}
\]

The update rule, θ_{t+1} = θ_t − η ∂L_train^{(t)}/∂θ, ensures that θ_T is a polynomial of degree (K − 1)^T in the c̃(i;0), c̃(i;1), · · · , c̃(i;T):

\[
\theta_T \in \mathcal{P}_{(K-1)^T}\!\left[\tilde c^{(0;0)}, \tilde c^{(0;1)}, \ldots, \tilde c^{(0;T)},\, \tilde c^{(1;0)}, \tilde c^{(1;1)}, \ldots, \tilde c^{(1;T)},\, \ldots,\, \tilde c^{(K;0)}, \tilde c^{(K;1)}, \ldots, \tilde c^{(K;T)}\right],
\tag{S19}
\]

and consequently, denoting the test loss evaluated at θT by L[θT],

\[
L[\theta_T] \in \mathcal{P}_{K(K-1)^T}\!\left[\tilde c^{(0;0)}, \tilde c^{(0;1)}, \ldots, \tilde c^{(K;T)}\right].
\tag{S20}
\]

Using Lemma 2, the expectation of L[θT] over draws of SGD batches is given by

\[
\mathbb{E}_B\big[L[\theta_T]\big] \in \mathcal{P}_{K(K-1)^T}\!\left[\bar d^{(0)}, \ldots, \bar d^{\left(K^{K}(K-1)^{TK}\right)}\right].
\tag{S21}
\]

Finally, denoting ED[EB[L[θT]]] by L(D; B) and applying Lemma 1 gives

\[
L(D; B) - \lim_{D\to\infty} L(D; B) = O\!\left(D^{-1}\right).
\tag{S22}
\]

C.3 Non-smooth examples

Here we present two worked examples where an unbounded or non-smooth loss leads to violations of the variance-dominated scaling. In the first example, the system obeys the variance-dominated scaling at early times but exhibits different behavior for times larger than the dataset size. In the second example, the system violates the variance-dominated scaling even after two gradient descent steps, as a result of an unbounded derivative of the loss.

Example 1 – unbounded loss at late times

Consider a dataset with two varieties of data points, drawn with probabilities α and 1 − α, and one-dimensional quadratic losses, ℓ1 (concave up) and ℓ2 (concave down), on these two varieties:

\[
\ell_1(\theta) = \tfrac{1}{2}\theta^2 , \qquad \ell_2(\theta) = -\tfrac{1}{2}\theta^2 .
\tag{S23}
\]

If, in a slight abuse of notation, we further denote the training loss on a sample with n1 points of type one and D − n1 points of type two by ℓ_{n1}, and the population loss at a given value of the weight by Lpop, we have

\[
\ell_{n_1} = \left(\frac{n_1}{D} - \frac{1}{2}\right)\theta^2 , \qquad
L_{\text{pop}} = \left(\alpha - \frac{1}{2}\right)\theta^2 .
\tag{S24}
\]

For this example we take α > 1/2. In this case, the minimum of the population loss is at zero, while the minimum of the training loss can be at zero or at ±∞, depending on whether the training sample has n1 greater than or less than D/2. We can thus create a situation where, at late training times, θT does not concentrate around the minimum of the population loss.

As we work through this example explicitly, we will see the following. (i) A mismatch larger than O(D−1) between the population minimum and the minimum found by training on a sample set of size D requires times T larger than a constant multiple of D. (ii) The quantity we study throughout this work is the difference between the infinite-data limit of the test loss and its finite-data value, L(D) − limD→∞ L(D).
The minimum of the infinite-data limit of the test loss is not the same as the minimum of the population loss, min limD→∞ L(D) ≠ minθ Lpop; in this example one diverges while the other is finite. In particular, this example evades the scaling result because, for times larger than D, L(D) has a diverging limit.

Explicitly, we study the evolution of the model under gradient flow,

\[
\dot\theta = -2\left(\frac{n_1}{D} - \frac{1}{2}\right)\theta , \qquad
\theta_T = e^{-2\left(\frac{n_1}{D} - \frac{1}{2}\right)T}\,\theta_0 .
\tag{S25}
\]

The test loss averaged over draws of the dataset is given by

\[
L(D; T) = \mathbb{E}_{n_1}\!\left[\left(\alpha - \tfrac{1}{2}\right)\theta_T^2\right]
= e^{2T}\left(\alpha - \tfrac{1}{2}\right)\theta_0^2\,\Big(1 - \alpha\big(1 - e^{-4T/D}\big)\Big)^{\!D} .
\tag{S26}
\]

If we consider this loss at large D and fixed T, we get

\[
L(D; T) = e^{-4\left(\alpha - \frac{1}{2}\right)T}\left(\alpha - \tfrac{1}{2}\right)\theta_0^2
\left(1 + \frac{8 T^2 \alpha(1-\alpha)}{D} + O\!\left(D^{-2}\right)\right),
\tag{S27}
\]

and thus L(D; T) − limD→∞ L(D; T) = O(D−1), as expected. If on the other hand we take T ≫ D, we have

\[
L(D; T \gg D) = e^{2T}\left(\alpha - \tfrac{1}{2}\right)(1-\alpha)^{D}\,\theta_0^2 ,
\tag{S28}
\]

and the limit limD,T→∞ L(D; T ≫ D) diverges. Lastly, we note that if we take T = βD with β < |log(1 − α)|/2, we can approach the large-D limit with non-generic, tuneable exponential convergence.

Example 2 – unbounded derivative

Again consider a two-variety setup, this time with equal probabilities and per-sample losses

\[
\ell_1(\theta) = \frac{1}{2}\theta^2 + \frac{1}{2\alpha}|\theta|^{\alpha} , \qquad
\ell_2(\theta) = \frac{1}{2}\theta^2 - \frac{1}{2\alpha}|\theta|^{\alpha} .
\tag{S29}
\]

We will consider different values of α > 0. The training loss and population loss are then

\[
\ell_{n_1} = \frac{1}{2}\theta^2 + \frac{1}{\alpha}\left(\frac{n_1}{D} - \frac{1}{2}\right)|\theta|^{\alpha} , \qquad
L_{\text{pop}} = \frac{1}{2}\theta^2 .
\tag{S30}
\]

We consider a model initialized to θ0 = 1 and trained for two steps of gradient descent with learning rate 1:

\[
g_t = \theta_t + \left(\frac{n_1}{D} - \frac{1}{2}\right)\theta_t\,|\theta_t|^{\alpha-2} , \qquad
\theta_{t+1} = \theta_t - g_t .
\tag{S31}
\]

Two update steps give θ2 = |n1/D − 1/2|^α. The test loss is given by the population loss evaluated at θ2, averaged over draws of the training set:

\begin{align}
L(D) &= \mathbb{E}_{n_1}\!\left[\tfrac{1}{2}\theta_2^2\right]
= \frac{1}{2^{D+1}}\sum_{n_1=0}^{D}\binom{D}{n_1}\left|\frac{n_1}{D} - \frac{1}{2}\right|^{2\alpha} \tag{S32}\\
&= \sqrt{\frac{D}{2\pi}}\int_{-\infty}^{\infty} e^{-2D\left(x-\frac{1}{2}\right)^2}\left|x - \frac{1}{2}\right|^{2\alpha} dx + O\!\left(D^{-1}\right)
= \frac{\Gamma\!\left(\alpha + \frac{1}{2}\right)}{2^{1+\alpha}\sqrt{\pi}}\,D^{-\alpha} + O\!\left(D^{-1}\right). \tag{S33}
\end{align}

Here we have approximated the binomial distribution at large D by a normal distribution using Stirling’s approximation. Note that if α ≥ 1, then L(D) − limD→∞ L(D) = O(D−1), i.e. the finite-sample loss approaches the infinite-data loss with the predicted variance-limited scaling. For 0 < α < 1, we get a different scaling controlled by α. Note that the gradient, expression (S31), is singular at the origin for α precisely in this range. In summary, this example achieves a different scaling exponent through a diverging gradient.
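Example 2 is easy to check numerically: sample n1 ∼ Binomial(D, 1/2), apply the two-step result θ2 = |n1/D − 1/2|^α, and compare the Monte Carlo average of the population loss at θ2 with the asymptotic form (S33). The sketch below is illustrative; the choice α = 0.5 is arbitrary within the range 0 < α < 1.

```python
import numpy as np
from math import gamma, sqrt, pi

rng = np.random.default_rng(0)

def L_of_D(D, alpha, reps=200000):
    n1 = rng.binomial(D, 0.5, size=reps)
    theta2 = np.abs(n1 / D - 0.5) ** alpha      # weight after two GD steps, from (S31)
    return np.mean(0.5 * theta2 ** 2)           # population loss L_pop = theta^2 / 2

alpha = 0.5
for D in [100, 400, 1600, 6400]:
    pred = gamma(alpha + 0.5) / (2 ** (1 + alpha) * sqrt(pi)) * D ** (-alpha)
    print(D, L_of_D(D, alpha), pred)            # Monte Carlo vs the asymptotic form (S33)
```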
D Proof of Theorems 2 and 3

In this section we detail the proofs of Theorems 2 and 3. The key observation is that nearest-neighbor distances for D points sampled i.i.d. from a d-dimensional manifold have mean E_{D,x}[|x − x̂|] = O(D^{−1/d}), where x̂ is the nearest neighbor of x and the expectation is over datapoints and draws of the dataset; see e.g. [47]. The theorem statements are copied here for convenience. In the main text, in an abuse of notation, we used L(f) to indicate the value of the test loss as a function of the network f, and L(D) to indicate the test loss averaged over the population, draws of the dataset, model initializations, and training. To be more explicit below, we will use the notation ℓ(f(x)) to indicate the test loss of a single network evaluated at a single test point.

Theorem 2. Let ℓ(f), f and F be Lipschitz with constants KL, Kf, and KF, and ℓ(F) = 0. Further, let D be a training dataset of size D sampled i.i.d. from Md, and let f(x) = F(x) ∀x ∈ D. Then L(D) = O(KL max(Kf, KF) D^{−1/d}).

Proof. Consider a network trained on a particular draw of the training data. For each test point, x, let x̂ denote the nearest training data point. Then, by the above Lipschitz assumptions and the vanishing of the loss on the true target, we have ℓ(f(x)) ≤ KL|f(x) − F(x)| ≤ KL(Kf + KF)|x − x̂|. With this, the average test loss is bounded as

\[
L(D) \le K_L (K_f + K_F)\, \mathbb{E}_{D,x}\big[\,|x - \hat x|\,\big] = O\!\left(K_L \max(K_f, K_F)\, D^{-1/d}\right).
\tag{S34}
\]

In the last equality, we used the above-mentioned scaling of nearest-neighbor distances.

Theorem 3. Let ℓ(f), f and F be Lipschitz with constants KL, Kf, and KF. Further, let f(x) = F(x) for P points sampled i.i.d. from Md. Then L(P) = O(KL max(Kf, KF) P^{−1/d}).

Proof. Denote by P the P points, z, for which f(z) = F(z). For each test point x, let x̂ denote the closest point in P, x̂ = argmin_{z∈P} |x − z|. Adopting this notation, the result follows by the same argument as Theorem 2.

E Random feature models

Here we present random feature models in more detail. We begin by reviewing exact expressions for the loss and then go on to derive its asymptotic properties. We again consider training a model f(x) = Σ_{µ=1}^{P} θµ fµ(x), where the fµ are drawn from some larger pool of features, {F_M}: fµ(x) = Σ_{M=1}^{S} P_{µM} F_M(x). Note that if the {F_M(x)} form a complete set of functions over the data distribution, then any target function y(x) can be expressed as y = Σ_{M=1}^{S} ω_M F_M(x). The extra constraint in a teacher-student model is specifying the distribution of the ω_M. The variance-limited scaling goes through with or without the teacher-student assumption; however, it is crucial for analyzing the resolution-limited behavior.

As in Section 2.3 of the main text, we consider models with weights initialized to zero and trained to convergence with mean squared error loss,

\[
L_{\text{train}} = \frac{1}{2D}\sum_{a=1}^{D}\big(f(x_a) - y_a\big)^2 .
\tag{S35}
\]

The data and feature second moments play a central role in our analysis. We introduce the notation

\begin{align}
\mathcal{C} &= \mathbb{E}_x\!\left[F(x)F^{T}(x)\right], &
\bar{\mathcal{C}} &= \frac{1}{D}\sum_{a=1}^{D} F(x_a)F^{T}(x_a) , &
\mathcal{K}(x, x') &= \frac{1}{S} F^{T}(x)F(x') , &
\bar{\mathcal{K}} &= \mathcal{K}\big|_{\mathcal{D}_{\text{train}}} , \tag{S36}\\
C &= P\,\mathcal{C}\,P^{T} , &
\bar C &= P\,\bar{\mathcal{C}}\,P^{T} , &
K(x, x') &= \frac{1}{P} f^{T}(x)f(x') , &
\bar K &= K\big|_{\mathcal{D}_{\text{train}}} . \nonumber
\end{align}

Here the script notation indicates the full feature space, while the block letters are restricted to the student features; the bar represents restriction to the training dataset. We will also indicate kernels with one index in the training set as 𝒦⃗(x) := 𝒦(x, x_{a=1...D}) and K⃗(x) := K(x, x_{a=1...D}).

After this notation spree, the test loss can be written for underparameterized models, P ≤ D, as

\[
L(D, P) = \frac{1}{2S}\,\mathbb{E}_D\,\mathrm{Tr}\!\left[\mathcal{C} + \bar{\mathcal{C}} P^{T}\bar C^{-1} C\,\bar C^{-1} P\bar{\mathcal{C}} - 2\,\bar{\mathcal{C}} P^{T}\bar C^{-1} P\,\mathcal{C}\right],
\tag{S37}
\]

and for overparameterized models (at the unique minimum found by GD, SGD, or projected Newton’s method) as

\[
L(D, P) = \frac{1}{2}\,\mathbb{E}_{x,D}\!\left[\mathcal{K}(x,x) + \vec K(x)^{T}\bar K^{-1}\bar{\mathcal{K}}\,\bar K^{-1}\vec K(x) - 2\,\vec K(x)^{T}\bar K^{-1}\vec{\mathcal{K}}(x)\right].
\tag{S38}
\]

Here the expectation ED[•] is with respect to i.i.d. draws of a dataset of size D from the input distribution, while Ex[•] is an ordinary expectation over the input distribution. Note that expression (S37) is also valid for overparameterized models, and (S38) is valid for underparameterized models, if the inverses are replaced with the Moore-Penrose pseudo-inverse. Also note that the two expressions can be related by exchanging the projection onto finite features with the projection onto the training dataset, and the sums over teacher features with the expectation over the data manifold. This realizes the duality between dataset and features discussed above.
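The regimes and the dataset-feature duality encoded in (S37) and (S38) can also be probed by direct simulation of a linear random-feature model. The sketch below (Gaussian teacher features with a power-law spectrum, a random projection onto P student features, ridgeless least squares on D samples) is a simplified stand-in for the models of Figure 2a; all parameter choices and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
S, alpha_K = 1000, 0.5                            # teacher feature count and spectral decay
lam = np.arange(1, S + 1) ** (-(1 + alpha_K))     # eigenvalues of the feature covariance

def sample_features(n):
    # F(x) with Ex[F F^T] = diag(lam); rows are datapoints.
    return rng.normal(size=(n, S)) * np.sqrt(lam)

def test_loss(D, P, n_test=1000):
    w = rng.normal(size=S) / np.sqrt(S)           # teacher weights, w ~ N(0, 1/S)
    F_tr, F_te = sample_features(D), sample_features(n_test)
    y_tr, y_te = F_tr @ w, F_te @ w
    proj = rng.normal(size=(S, P)) / np.sqrt(S)   # random projection onto student features
    f_tr, f_te = F_tr @ proj, F_te @ proj
    theta, *_ = np.linalg.lstsq(f_tr, y_tr, rcond=None)   # ridgeless / min-norm fit
    return 0.5 * np.mean((f_te @ theta - y_te) ** 2)

# Resolution-limited scans: D with P effectively large, and P with D effectively large.
for D in [25, 50, 100, 200, 400]:
    print("D =", D, "L =", np.mean([test_loss(D, P=800) for _ in range(3)]))
for P in [25, 50, 100, 200, 400]:
    print("P =", P, "L =", np.mean([test_loss(D=800, P=P) for _ in range(3)]))
# The two scans roughly track each other (dataset-feature duality), with slope ~ alpha_K.
```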
E.1 Asymptotic expressions

We are interested in (S37) and (S38) in the limits of large P and D.

Variance-limited scaling

We begin with the underparameterized case. In the limit of a large amount of data, the sample estimate of the feature second moment matrix, 𝒞̄, approaches the true second moment matrix, 𝒞. Explicitly, if we define the difference δ𝒞 by 𝒞̄ = 𝒞 + δ𝒞, we have

\begin{align}
\mathbb{E}_D[\delta\mathcal{C}] &= 0 , \nonumber\\
\mathbb{E}_D[\delta\mathcal{C}_{M_1N_1}\,\delta\mathcal{C}_{M_2N_2}] &= \frac{1}{D}\Big(\mathbb{E}_x\big[F_{M_1}(x)F_{N_1}(x)F_{M_2}(x)F_{N_2}(x)\big] - \mathcal{C}_{M_1N_1}\mathcal{C}_{M_2N_2}\Big) , \tag{S39}\\
\mathbb{E}_D[\delta\mathcal{C}_{M_1N_1}\cdots\delta\mathcal{C}_{M_nN_n}] &= O\!\left(D^{-2}\right) \quad \forall\, n > 2 . \nonumber
\end{align}

The key takeaway from (S39) is that the dependence on D is manifest. Using these expressions in (S37) yields

\begin{align}
L(D, P) &= \frac{1}{2S}\,\mathrm{Tr}\!\left[\mathcal{C} - \mathcal{C} P^{T} C^{-1} P\,\mathcal{C}\right] \nonumber\\
&\quad + \frac{1}{2DS}\sum_{M_{1,2},\,N_{1,2}} T_{M_1N_1M_2N_2}\Big[\delta_{M_1M_2}\big(P^{T}C^{-1}P\big)_{N_1N_2}
+ \big(C^{-1}P\,\mathcal{C}^{2}P^{T}C^{-1}\big)_{M_1M_2} C^{-1}_{N_1N_2} \tag{S40}\\
&\hspace{4cm} - 2\,\big(\mathcal{C}P^{T}C^{-1}P\big)_{M_1M_2}\big(P^{T}C^{-1}P\big)_{N_1N_2}\Big] + O\!\left(D^{-2}\right). \nonumber
\end{align}

Here we have introduced the notation T_{M1N1M2N2} = Ex[F_{M1}(x)F_{N1}(x)F_{M2}(x)F_{N2}(x)]. As above, defining

\[
L(P) := \lim_{D\to\infty} L(D, P) = \frac{1}{2S}\,\mathrm{Tr}\!\left[\mathcal{C} - \mathcal{C} P^{T} C^{-1} P\,\mathcal{C}\right],
\tag{S41}
\]

we see that though L(D, P) − L(P) is a somewhat cumbersome quantity to compute, involving the average of a quartic tensor over the data distribution, its dependence on D is simple.

For the overparameterized case, we can similarly expand (S38) using K = 𝒦 + δK, with fluctuations satisfying

\begin{align}
\mathbb{E}_P[\delta K] &= 0 , \nonumber\\
\mathbb{E}_P[\delta K_{a_1b_1}\,\delta K_{a_2b_2}] &= \frac{1}{P}\Big(\mathbb{E}_P\big[f_\mu(x_{a_1})f_\mu(x_{b_1})f_\mu(x_{a_2})f_\mu(x_{b_2})\big] - \mathcal{K}_{a_1b_1}\mathcal{K}_{a_2b_2}\Big) , \tag{S42}\\
\mathbb{E}_P[\delta K_{a_1b_1}\cdots\delta K_{a_nb_n}] &= O\!\left(P^{-2}\right) \quad \forall\, n > 2 . \nonumber
\end{align}

This gives the expansion

\[
L(D, P) = \frac{1}{2}\,\mathbb{E}_{x,D}\!\left[\mathcal{K}(x,x) - \vec{\mathcal{K}}(x)^{T}\bar{\mathcal{K}}^{-1}\vec{\mathcal{K}}(x)\right] + O\!\left(P^{-1}\right),
\tag{S43}
\]

and

\[
L(D) = \frac{1}{2}\,\mathbb{E}_{x,D}\!\left[\mathcal{K}(x,x) - \vec{\mathcal{K}}(x)^{T}\bar{\mathcal{K}}^{-1}\vec{\mathcal{K}}(x)\right].
\tag{S44}
\]

Resolution-limited scaling

We now move on to studying the parameter scaling of L(P) and the dataset scaling of L(D). We explicitly analyze the dataset scaling of L(D), with the parameter scaling following via the dataset-parameter duality. Much work has been devoted to evaluating the expression (S44) [71–73]. One approach is to use the replica trick – a tool originating in the study of disordered systems which computes the expectation of the logarithm of a random variable via simpler moment contributions and analyticity assumptions [74]. The replica trick has a long history as a technique for studying the generalization properties of kernel methods [19, 75–80]. We will most closely follow the work of [59], who use the replica method to derive an expression for the test loss of linear feature models in terms of the eigenvalues λi of the kernel 𝒞 and ω̄, the coefficient vector of the target labels in terms of the model features:

\[
L(D) = \frac{\kappa^2}{1-\gamma}\sum_i \frac{\lambda_i\,\bar\omega_i^2}{(\kappa + D\lambda_i)^2} ,
\qquad
\kappa = \sum_i \frac{\kappa\lambda_i}{\kappa + D\lambda_i} , \qquad
\gamma = \sum_i \frac{D\lambda_i^2}{(\kappa + D\lambda_i)^2} .
\tag{S45}
\]

This is the ridgeless, noise-free limit of equation (4) of [59]. Here we analyze the asymptotic behavior of these expressions for eigenvalues satisfying a power-law decay, λi = i^{−(1+αK)}, and for targets coming from a teacher-student setup, w ∼ N(0, 1/S).

[Figure S6: Duality between dataset size and feature number in pretrained features. Using pretrained embedding features of EfficientNet-B5 [81] at different levels of regularization, the loss as a function of dataset size and the loss as a function of the feature dimension track each other, both for small regularization (left) and for tuned regularization (right). Note that regularization strength with trained-feature kernels can be mapped to inverse training time [82, 83]; thus (left) corresponds to long training time and exhibits double-descent behavior, while (right) corresponds to optimal early stopping.]
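The implicit equations in (S45) are straightforward to evaluate numerically for the power-law spectrum and constant ω̄² considered here: iterate the κ equation to a fixed point, then evaluate γ and L(D). The sketch below is illustrative; the spectral truncation S and iteration count are arbitrary.

```python
import numpy as np

S, alpha_K = 100000, 0.5
lam = np.arange(1, S + 1, dtype=float) ** (-(1 + alpha_K))
w2 = np.full(S, 1.0 / S)                        # E[omega_bar_i^2] = 1/S (teacher-student)

def replica_loss(D, iters=500):
    kappa = 1.0
    for _ in range(iters):                      # fixed point of kappa = sum_i kappa*lam_i/(kappa + D*lam_i)
        kappa = np.sum(kappa * lam / (kappa + D * lam))
    gamma = np.sum(D * lam ** 2 / (kappa + D * lam) ** 2)
    return kappa ** 2 / (1 - gamma) * np.sum(lam * w2 / (kappa + D * lam) ** 2)

Ds = np.array([100, 300, 1000, 3000, 10000])
L = np.array([replica_loss(D) for D in Ds])
slope = np.polyfit(np.log(Ds), np.log(L), 1)[0]
print("alpha_D ~", -slope)      # close to alpha_K, as in (S47)
```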
To begin, we note that for teacher-student models in the limit of many features, the overlap coefficients ω̄ are equal to the teacher weights up to a rotation, ω̄i = O_{iM} w_M. As we are choosing an isotropic Gaussian initialization, we are insensitive to this rotation and, in particular, Ew[ω̄i²] = 1/S. See Figure S8 for empirical support of the average constancy of the ω̄i in the teacher-student setting, and the contrast with realistic labels.

With this simplification, we now compute the asymptotic scaling of (S45) by approximating the sums with integrals and expanding the resulting expressions at large D. We use the identities

\begin{align}
\int_{1}^{\infty} dx\; \frac{x^{-n(1+\alpha)}}{\big(\kappa + D x^{-(1+\alpha)}\big)^{m}}
&= \kappa^{-m}\,\frac{\Gamma\!\left(n - \frac{1}{1+\alpha}\right)}{(1+\alpha)\,\Gamma\!\left(n + \frac{\alpha}{1+\alpha}\right)}\;
{}_2F_1\!\left(m,\; n - \frac{1}{1+\alpha};\; n + \frac{\alpha}{1+\alpha};\; -\frac{D}{\kappa}\right), \tag{S46}\\
{}_2F_1(a, b; c; -y) &\propto y^{-a} + B\,y^{-b} + \ldots \nonumber
\end{align}

Here ₂F₁ is the hypergeometric function, and the second line gives its asymptotic form at large y; B is a constant which does not affect the asymptotic scaling. Using these relations yields

\[
\kappa \propto D^{-\alpha_K} , \qquad \gamma \propto D^{0} , \qquad L(D) \propto D^{-\alpha_K} ,
\tag{S47}
\]

as promised. Here we have dropped sub-leading terms at large D. The parameter scaling, L(P), follows via the data-parameter duality.

Duality beyond asymptotics

Expressions (S37) and (S38) are related by exchanging the projection onto a finite feature set with the projection onto a finite dataset, even without taking any asymptotic limits. We thus expect the dependence of the test loss on parameter count and dataset size to be related quite generally in linear feature models. See Section F for further details.

F Learned features

In this section, we consider linear models with features coming from pretrained neural networks. Such features are useful for transfer learning applications (e.g. [84, 85]). In Figures S6 and S7, we take pretrained embedding features from an EfficientNet-B5 model [81] using TF Hub (https://www.tensorflow.org/hub). The EfficientNet model is pretrained on the ImageNet dataset with an input image size of (456, 456); to extract features for the (32, 32) CIFAR-10 images, we use bilinear resizing.

[Figure S7: Four scaling regimes exhibited by pretrained embedding features. Using pretrained embedding features of EfficientNet-B5 [81] for fixed low regularization (left) and tuned regularization (right), we can identify four regimes of scaling using real CIFAR-10 labels. Panels show variance-limited and resolution-limited scaling with dataset size D and feature size P, with fitted exponents in the legends.]

We then train a linear classifier on top of the penultimate pretrained features. To explore the effect of feature size P and dataset size D, we randomly subset the feature dimension and the training dataset size and average over 5 random seeds. Predictions on test points are obtained via kernel ridge regression with a linear kernel.
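A hedged sketch of this measurement using plain NumPy linear-kernel ridge regression is given below. The feature matrix and labels are random placeholders standing in for the EfficientNet-B5 embeddings and the CIFAR-10 labels (which require the TF Hub pipeline described above), so the printed losses only demonstrate the mechanics of the joint (D, P) scan.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholders: in the actual experiment, `features` holds EfficientNet-B5 penultimate
# embeddings of CIFAR-10 and `labels` the corresponding one-hot CIFAR-10 labels.
N, S, n_classes = 6000, 2048, 10
features = rng.normal(size=(N, S))
labels = np.eye(n_classes)[rng.integers(0, n_classes, size=N)]
x_tr, y_tr, x_te, y_te = features[:5000], labels[:5000], features[5000:], labels[5000:]

def ridge_loss(D, P, ridge=1e-6, seed=0):
    r = np.random.default_rng(seed)
    rows = r.choice(len(x_tr), size=D, replace=False)   # random subset of the training set
    cols = r.choice(S, size=P, replace=False)           # random subset of the feature dimension
    X, Xt = x_tr[rows][:, cols], x_te[:, cols]
    K = X @ X.T + ridge * np.eye(D)                     # linear-kernel ridge regression
    preds = Xt @ (X.T @ np.linalg.solve(K, y_tr[rows]))
    return 0.5 * np.mean((preds - y_te) ** 2)

# Duality scan: loss vs dataset size at full feature dimension, and vice versa.
for D in [128, 256, 512, 1024, 2048]:
    print("D =", D, np.mean([ridge_loss(D, P=S, seed=s) for s in range(5)]))
for P in [128, 256, 512, 1024, 2048]:
    print("P =", P, np.mean([ridge_loss(D=2048, P=P, seed=s) for s in range(5)]))
# With the real embeddings and labels, the two scans trace out the duality of Figure S6.
```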
We note that the ridge regularization parameter can be mapped to an inverse early-stopping time [82, 83] of a corresponding ridgeless model trained via gradient descent: inference with a low regularization parameter corresponds to training for a long time, while a tuned regularization parameter is equivalent to optimal early stopping.

In Figure S7 we see evidence of all four scaling regimes for low regularization (left four panels) and optimal regularization (right four panels). We speculate that the deviation from the predicted variance-limited exponent αP = αD = 1 in the case of fixed low regularization (late time) is possibly due to the double-descent resonance at D = P, which interferes with the power-law fit.

In Figure S6, we observe the duality between dataset size D (solid) and feature size P (dashed): the loss as a function of the number of features is identical to the loss as a function of dataset size, both for the optimal loss (tuned regularization) and for the late-time loss (low regularization).

In Figure S8, we also compare properties of random features (using the infinite-width limit) and learned features from trained Wide ResNet (WRN) 28-10 models. We note that teacher-student models, where the feature class matches the target function, and ordinary, fully trained models on real data (Figure 1a) have significantly larger exponents than models with fixed features and realistic targets. The measured ω̄i – the coefficients of the task labels under the i-th feature (S45) – are approximately constant as a function of the index i for all teacher-student settings. However, for real targets, the ω̄i are only constant for the well-performing Myrtle-10 and WRN trained features (last two columns).

[Figure S8: Loss on teacher targets scales better than on real targets for both untrained and trained features. Columns: FC, CNN-VEC, and Myrtle-10 infinite-width kernels, and a kernel built from features of the penultimate layer of pretrained WRN 28-10 models on CIFAR-10. The first row is the loss as a function of dataset size D for teacher-student targets versus real targets, with the observed dataset scaling exponent denoted in the legend. The second row is the normalized partial sum of kernel eigenvalues; the partial sum’s scaling exponent is measured to capture the effect of the finite dataset size when the empirical αK is close to zero.
The third row shows ω̄i for teacher-student and real targets, compared against the kernel eigenvalue decay. We see that the teacher-student ω̄i are approximately constant.]
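For reference, below is a sketch of how the quantities plotted in Figure S8 can be computed from an empirical kernel: eigendecompose the Gram matrix, accumulate the normalized partial sums of eigenvalues (second row), and read off ω̄i from projections of the targets onto the kernel eigenvectors (third row). The normalization of ω̄i used here is one convention among several and may differ from the one used in the figure; the toy kernel and teacher-like target are purely illustrative.

```python
import numpy as np

def kernel_spectrum_and_overlaps(K, y, eps=1e-12):
    """K: (D, D) kernel Gram matrix; y: (D,) targets.
    Returns eigenvalues (descending), normalized partial sums of eigenvalues,
    and squared target coefficients omega_bar_i^2 in the kernel eigenbasis."""
    evals, evecs = np.linalg.eigh(K)                 # ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]
    partial = np.cumsum(evals) / np.sum(evals)       # cf. second row of Figure S8
    omega_bar = (evecs.T @ y) / np.sqrt(np.clip(evals, eps, None))
    return evals, partial, omega_bar ** 2            # cf. third row of Figure S8

# Toy usage: a power-law kernel and a teacher-like target drawn from the same features.
rng = np.random.default_rng(0)
D, alpha_K = 1000, 0.5
U, _ = np.linalg.qr(rng.normal(size=(D, D)))
lam = np.arange(1, D + 1.0) ** (-(1 + alpha_K))
K = U @ (lam[:, None] * U.T)
y_teacher = U @ (rng.normal(size=D) * np.sqrt(lam))
evals, partial, w2 = kernel_spectrum_and_overlaps(K, y_teacher)
print(np.mean(w2[:10]), np.mean(w2[100:110]), np.mean(w2[800:810]))
# Roughly comparable magnitudes across i for the teacher-like target; for real labels
# on a poorly matched kernel, the omega_bar_i^2 would instead decay with i.
```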