Scaling Laws and Interpretability of Learning from Repeated Data arXiv:2205.10487v1 [cs.LG] 21 May 2022 Danny Hernandez∗ Tom Brown, Tom Conerly, Nova DasSarma, Dawn Drain, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Tom Henighan, Tristan Hume, Scott Johnston, Ben Mann, Chris Olah, Catherine Olsson, Dario Amodei, Nicholas Joseph, Jared Kaplan, Sam McCandlish Anthropic Abstract Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the model is exposed to repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a small fraction of it is repeated many times. We find a strong double descent phenomenon, in which repeated data can lead test loss to increase midway through training. A predictable range of repetition frequency leads to surprisingly severe degradation in performance. For instance, performance of an 800M parameter model can be degraded to that of a 2x smaller model (400M params) by repeating 0.1% of the data 100 times, despite the other 90% of the training tokens remaining unique. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model’s capacity, and this may be where the peak of degradation occurs. Finally, we connect these observations to recent mechanistic interpretability work — attempting to reverse engineer the detailed computations performed by the model — by showing that data repetition disproportionately damages copying and internal structures associated with generalization, such as induction heads, providing a possible mechanism for the shift from generalization to memorization. Taken together, these results provide a hypothesis for why repeating a relatively small fraction of data in large language models could lead to disproportionately large harms to performance. 1 Introduction Large, high-quality text datasets are crucial for training large language models [Brown et al., 2020, Rae et al., 2021]. Such datasets often contain many copies of substantially overlapping documents, which ∗ Correspondence to: danny@anthropic.com All authors are at Anthropic. Author contributions are listed at the end of the paper. Constructing Repeated Datasets that are Subset of the Original Dataset Training Composition sample to be repeated Fraction Unique Original Text Dataset Unique Dataset
 (one epoch) Fraction Repeated Test Set Repeated Dataset
 (many epochs) Figure 1 Experimental Setup. From a large original text dataset (left), we draw 90% of our desired training dataset in a non-repeated fashion, and 10% as repeats of a tiny portion of the original dataset (right). We hold constant that 10% of total training tokens will come from repeats, but we vary the repeated fraction in our runs. In other words, the sample to be repeated might be very small, like 0.01% of the total training tokens repeated 1000x, or relatively large, like 1% of the total training tokens repeated 10x. A small, held-back portion of the original dataset (yellow in left figure), not including any repeated data, is used as a test set and is the test loss reported in all subsequent figures. greatly impairs the performance of language models on downstream tasks [Lee et al., 2021]. However, it is not well understood why data repetition impacts performance to such a large extent. In this paper we study data repetition in language models through two lenses: the macroscopic lens of scaling laws, and the microscopic lens of mechanistic interpretability [Elhage et al., 2021, Olsson et al., 2022]. For the first lens, we trained transformer [Vaswani et al., 2017] language models on mostly unique data plus a small fraction of repeated data (Figure 1), varying the repeated dataset size, model size, and fraction of tokens trained on repeated data. We find a strong double-descent phenomenon [Advani and Saxe, 2017, Belkin et al., 2018, Nakkiran et al., 2019], such that there is a defined range of repetition frequency for which performance is harmed to a surprisingly large extent. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model’s capacity, and this may be where the peak of degradation occurs. The location of the region suggests that large models like GPT-3, Gopher, and PALM [Brown et al., 2020, Rae et al., 2021, Bi et al., 2020] need to be careful about overfitting their high quality distributions like Wikipedia and books. For the second lens, mechanistic interpretability (attempting to reverse engineer the detailed computations performed by the model) we show that repeated data disproportionately damages induction heads. Induction heads use a circuit of 2 attention heads to "complete the pattern by copying and completing sequences" [Olsson et al., 2022]. The damage to induction heads is observed through degradation in copying, prefix matching, and through inspection. Together, the two lenses provide an integrated picture of how repeated data might be causing the network (or part of it) to shift from generalization to memorization, and mechanistically how this could be harming performance of the overall language model. 1.1 Summary of Results To systematically study repeated data, we trained transformer [Vaswani et al., 2017] language models on mostly unique data plus a small fraction of repeated data (Figure 1), varying the repeated dataset size, model size, and fraction of tokens trained on repeated data over 2-3 orders of magnitude. All models were trained for 100B tokens. We examined the resulting models using both scaling laws and mechanistic interpretability tools. 
Our main findings were as follows: 2 Overfitting Repeated Subset Coincides with Performance Hit loss loss 3.5 3 test, with repetition 2.5 test, without repetition 2 train, repeated subset 1.5 1 2 5 10M 2 5 100M 2 5 1B parameters Figure 2 Models of different sizes show a degradation in performance at a specific range of repeats that shrinks with model size (left panel). At its peak the degradation sometimes reaches the equivalent of a 2x decrease in model size. The right panel shows that divergence (blue line) from a healthy, straight scaling law (red) lines up with when the models start to dramatically overfit the repeated subset (green curve). The blue line on the right corresponds to a vertical slice of models in the left diagram trained on the repeated subset for 120 epochs. All these models were trained on 90% unique data and 10% repeated tokens. • Repeated data induces a strong double-descent phenomenon [Advani and Saxe, 2017, Belkin et al., 2018, Nakkiran et al., 2019], in which data repeated a few times does not cause much damage to language model performance, data repeated very many times also does not cause much damage, but there is a peak in the middle where damage is surprisingly large. For instance, when we train an 800M parameter transformer with 10% of training tokens drawn from the repeated subset (yellow curve in Figure 2) we find the loss can be nearly as high as for the 340M parameter transformer (light green curve). We see an epoch-wise [Nakkiran et al., 2019] double descent learning curve in Figure 3 is driving this performance degradation. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model’s capacity, and this may be where the peak of degradation occurs. Figure 2 on the right shows that the peak performance hit coincides with where the train loss on the repeated data approaches zero, similar to previously observed double-descent phenomena. This also provides a practical diagnostic for when repeated data is likely to be harming the model. • Repeated data can cause a divergence from power-law scaling. For the blue curve in Figure 2 right (122 repeated epochs), we see only a moderate impact to performance (line on log-log graph) until the model is scaled up to 100M parameters, after which we see a large divergence from power law scaling of cross entropy loss. Extrapolating the region of large degradation in Figure 4 predicts meaningful degradation of repeating data only 2 times for large (GPT-3 size) models, though the region would be shifted if the models were trained to the compute optimal frontier [Hoffmann et al., 2022]. • Repeated data causes a disproportionately large performance hit to copying, a mechanism for in-context learning. We constructed a simple copying eval, the loss on the first paragraph of Harry Potter copied 11 times. We observe that using 3% repeated data at the worst number of repeated epochs caused up to a 3x reduction in effective model size (performance equal to model with 3x fewer parameters) on this task whereas it only caused at most a 15% reduction in effective model size on test loss. • The disproportionate performance hit to copying coincides with a disproportionate degradation of induction heads. In line with [Olsson et al., 2022] we evaluated the models on their prefix matching score, repeated sequences of random tokens and observed the degree to which attention heads attend to earlier tokens that are preceded by a token that matches the present token. 
We observe that using 3% repeated data at the worst number of repeated epochs caused on average a 32% reduction in effective model size on this task whereas it only caused at most a 15% reduction in effective model size on test loss. • Repeated text data causes a small but still disproportionate performance drop out of distribution, as measured by cross entropy loss on Python code. Unlike our the Harry Potter copying and prefix matching evals we mostly see the performance drop with higher levels of repetition, 50-90%. 3 Double Descent on 800M Parameter Model Trained on 90% Repeated Data repeated epochs 7 6 5.5 5 439 4.5 244 4 610 3.5 6,100 test loss 5 1,100 4 11,000 110,000 1,100,000 3 repeated epochs 109 1 6 test loss Manifests as Long Plateau when Trained on less Repeated Data 11,000,000 1 61 61,000 3 610,000 6,100,000 2.5 2 2 5 1B 2 5 10B 2 5 5 tokens 1B 2 5 10B 2 5 tokens Figure 3 Learning curves for test loss on 800M models with 90% repeated data (left) and 50% repeated data (right), each with varying numbers of repeats/sizes of the repeated fraction. The graph on the left shows characteristic double descent curves. Repeated epochs corresponds to the number of epochs on the repeated tokens, the rest of the data is seen only once. For several models, test loss drops as normal during the beginning of training, but then starts to rise during the middle of training before dropping again. In the graph on the right with only 50% repeated data, we see that the double descent bumps have turned into long plateaus for highly affected models. • One and two-layer attention only models trained on repeated data are worse at exactly copying and fuzzily copying (for instance correctly predicting Dursleys given that Dursley has appeared previously) proper names on inspection. When we inspect per tokens losses of smaller models we can see this degradation in a simple, understandable form of copying in a paragraph of text. • Training on repeated Python code creates a similar behavior. When training on Python we also observe a double descent phenomenon and a predictable poor performance region in terms of model size and repeated epochs, though the shape of both curves are somewhat different. • Pre-training on repeated data damages models. Pre-training with repeated data leads to worse performance than both training from scratch and fine-tuning from a control model pre-trained on the original text dataset. During fine-tuning, the repeated data model forgets the repeated dataset, so we consider the model pre-trained with repeated data to be strictly worse than the model fine-tuned from the unique dataset. 2 Results Repeated data induces a strong double descent phenomenon. The results from training models on different sizes, fractions of repeated data, and frequency of repeats are shown in Figures 2 and 3. Figure 2 (left) shows that when we train on 10% repeated data and vary the frequency of repetition (or equivalently the number of epochs of repeated data), there is a specific range of repetition frequency for which damage to model performance is maximized. The range depends on the model size but for a 800M parameter model it occurs at roughly 100x repeats of 0.1% of the data, and degrades performance nearly to that of a 340M parameter model. This is a large degradation given that only 10% of the data is repeated. The peak coincides with the advent of memorization on the repeated data (Figure 2 right) – a possible indicator of a double descent phenomenon. 
Figure 3 shows learning curves for different repetition frequencies and for 50% and 90% of the data being repeated. In the extreme case of 90% repeated data and the correct frequency of repetition (100x-10,000x), we confirm the presence of a literal double descent curve in which the loss decreases, increases, and then decreases again (Figure 3 left). As we lower the fraction of repeated data to 50%, the curve becomes a long plateau rather than double descent, but it appears to be fundamentally an epoch-wise double descent phenomenon [Nakkiran et al., 2019]. These peaks and plateaus again coincide with the training loss on the repeated data approaching zero as shown in Figure 2. As in [Nakkiran et al., 2019] we see double descent effects caused by both increasing model size and epochs. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model’s capacity, and this may 4 10% Repeated Data Can Lead Poor Scaling to Emerge repeated epochs model size multiplier 1 1 12 0.9 48 122 0.8 1,220 12,200 122,000 0.7 0.6 2 5 10M 2 5 100M 2 5 1B parameters Figure 4 On the left we plot the same results as in Figure 2, re-parameterized in terms of the effective model size multiplier implied by the test loss (performance equal to a model with x times as many parameters). For a given number of repetitions, degradation occurs only for a specific range of model sizes. For example, for the blue curve (122 repeated epochs), we see almost no performance deviation from a power law scaling law (line on log-log graph) until the model is scaled up to 100M parameters, after which we see a divergence. We see the same divergence around 400M parameters for 12,200 repeated epochs. The right graph shows a large, predictable region over which the degradation occurs, and suggests that large models like GPT-3, Gopher, and PALM [Brown et al., 2020, Rae et al., 2021, Bi et al., 2020] need to be careful about overfitting their high quality distributions like Wikipedia and books – although note that this holds constant the number of total training tokens. The blue and green curves correspond to the right and left sides of the double descent region where we observe 50% of the maximum effect. They are an aggregation of that curve for the scans where we trained on 3%, 10%, 20%, 50%, and 90% repeated data. The details of both fits are in Appendix A. A large number of runs needed to be aggregated to produce a clean fit for region of reduced performance. be where the peak of degradation occurs, for a more thorough discussion of this question see the discussion (section 5). Repeated data can cause a divergence from power-law scaling. Figure 4 zooms in on the degradation of performance, measured as a function of model size for different repetition frequencies of the repeated data. For example, models trained for 1,220 repeats and 10% repeated data show a dip in performance to the equivalent of a model 0.55x as large, when the model size is 10M to 100M parameters. As the model size continues to increase, performance recovers to 0.8x model-size equivalent for a 1B parameter model. For a smaller number of repeats (122 repeats), the dip occurs later, centered around 1B parameters. The right panel of Figure 4 shows the range over which we observe at least 50% of the maximum degradation; this corresponds to a “band” or region in the (model size, repetition frequency) plane. 
Both boundaries of the region are a good fit to a power law relating frequency of repetition to the number of parameters of the model, namely: E = k ∗ Nα where E corresponds to epochs of repetition and N corresponds to the parameters in the model. it is notable that the lines in figure 2b are relatively parallel. The fits for the above lines are given in the table below: k α right boundary 5.1e7 -.50 left boundary 4.2e6 -.56 Note that extrapolating these boundaries leads to a prediction of significant degradation from repeating data as little as 2x on state-of-the-art language models with hundreds of billions of parameters, although this applies for a constant number of training tokens (100B). In practice large models are trained for more than this[Hoffmann et al., 2022], and as shown in Figure 3, training past the double descent peak is helpful, so the degradation would likely not be quite as bad. When looking at Figure 3 we see that the the poor performance 5 Model Size Multiplier: Loss on 11x copies of a Paragraph 1 Parameters 5,310,000 5 model size multiplier model size multiplier 1 Model Size Multiplier: Test Loss 12,600,000 42,500,000 2 101,000,000 197,000,000 0.1 340,000,000 5 805,000,000 2 0.01 Parameters 5,310,000 5 12,600,000 42,500,000 2 101,000,000 197,000,000 0.1 340,000,000 5 805,000,000 2 0.01 3 4 5 6 7 8 9 10 2 3 4 5 6 3 fraction repeated 4 5 6 7 8 9 10 2 3 4 5 6 fraction repeated Figure 5 We constructed a simple measure of the model’s copying ability, consisting of the loss on the first paragraph of Harry Potter repeated 11 times. We measured the double descent peak performance for a given model size and fraction of repeated data and compared that to a fit of these evaluations on the control model (trained on unique text) scan to generate an effective model size. We observe that 3% repeated data at the pessimal number of repeated epochs caused a 3x reduction in effective model size on this task for a for several model sizes, whereas it only caused at most a 1.15x reduction in effective model size on test loss. We see much larger effects on the copying evaluation than on overall performance for repeated data fractions between 3% and 20%. The model size multiplier for copying is based on interpolation and the model size multiplier for test loss is based on a power law fit (see Appendix C for more details). region would be shifted left for large models trained on the compute efficient frontier (the pareto frontier of compute and performance) [Kaplan et al., 2020]. Overall it seems that in addition to being robust to task, model size, and architecture as shown in previous work [Advani and Saxe, 2017, Belkin et al., 2018, Nakkiran et al., 2019] double descent as a general phenomenon appears to be robust to occurring in a sub-distribution and that it can have a large effect on overall performance even while being a modest fraction of training tokens. Repeated data causes a disproportionately large performance hit to copying, a mechanism for incontext learning. The ability of a language model to copy text (in the sense of being provided with a context consisting of a passage repeated several times, and testing whether the model can repeat it once more) is a potential measure of generalization, as copying is independent of the content of the text. Also, recent interpretability work has suggested that copying may be implemented by crisp internal algorithmic structures ([Olsson et al., 2022]), again suggesting generalization. 
It thus seems valuable to investigate what happens to copying during a memorization-related degradation in performance, which we have shown above occurs in our experiments. To do this constructed a simple evaluation in which copying is heavily emphasized: we measure the loss on the first paragraph of Harry Potter copied 11 times. The models trained on repeated data performed much worse on this evaluation (Figure 5), substantially out of proportion to the degradation on the loss itself. In other words, copying is preferentially harmed by training on repeated data. For example, a 3% fraction of repeated data leads to a 1.15x reduction in effective model size (performance equal to model with 1.15 fewer parameters) on the general loss, but a much larger 3x effective model size reduction in terms of copying ability. As can be seen in Figure 5, the damage to copying is greater than the damage to overall loss across the entire range of repeated data fractions. This suggests that the shift to memorization caused by repeated data is selectively harming at some behaviors associated with generalization. To get another view on the same phenomenon, we measured the loss of various models on the Xth consecutive copy of the Harry Potter paragraph, where X runs from 1 to 12. As shown in Figure 7 (left), for most models the loss gradually decreases with increasing numbers of copies of the paragraph (i.e. the model has an easier time predicting an additional copies after seeing more consecutive copies), but at the peak of the double descent phenomenon, the loss is much higher and, strikingly, does not decrease at all with additional copies of the paragraph. This large aberration shows how strong the selective effect of the double descent phenomenon on copying is. General in-context learning is also harmed at the pessimal number of repeated epochs (Figure 7 right), though to a lesser extent than copying. 6 Model Size Multiplier: Prefix Matching Score 1 Parameters 5 1,570,000 2 5,310,000 0.1 12,600,000 5 42,500,000 101,000,000 2 197,000,000 0.01 340,000,000 5 805,000,000 model size multiplier model size multiplier 1 Model Size Multiplier: Test Loss 2 Parameters 5 1,570,000 2 5,310,000 0.1 12,600,000 5 42,500,000 101,000,000 2 197,000,000 0.01 340,000,000 5 805,000,000 2 0.001 0.001 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 3 100 fraction repeated 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 100 fraction repeated Figure 6 Comparison of degradation of prefix matching score with repeated data, compared to general degradation of the test loss. We measured the double descent peak performance for a given model size and fraction of repeated data and compared that to a fit of the prefix matching score on the control model scan to generate an effective model size. We observe that 3% repeated data causes on average 21 a 1.47 model size multiplier on prefix matching score while causing less than a 1.15x model size reduction in effective model size on test loss. Again we see much larger effects on the prefix matching score than on overall performance for repeated data fractions between 3% and 20%. The model size multiplier for prefix matching is based on a linear fit (see Appendix C for more details of fit). The test loss shown on the right is the same graph as in Figure 5, but with differently scaled axes for ease of comparison. The disproportionate performance hit to copying coincides with a disproportionate degradation of induction heads. 
Having connected the damage associated with repeated data with a measure of generalization (in-context copying of text), we next took the connection one step further, by trying to also probe the potential mechanistic basis of copying. [Olsson et al., 2022] identifies “induction heads” as a possible basis for copying and in-context learning behavior in general, so we decided to measure these and try to connect them back to the repeated data double descent phenomenon. [Olsson et al., 2022] defines induction heads by their ability to facilitate simple copying given a repeated random sequence of tokens (though in practice this definition ends up including heads with more complex behaviors too). Induction heads use a circuit of 2 attention heads to "complete the pattern by copying and completing sequences." This can be split up into attending to the relevant token (prefix matching) and increasing the logit corresponding to the attended-to token. We decided to probe the prefix matching score as measure of mechanistic structure that is distinct from the behavior of copying itself. Figure 6 shows the same setup as Figure 5 except for prefix matching score instead of copying loss. As can be seen in the figure, preferential damage to prefix matching score is not present across the whole range of repeated data fraction as it is for copying, but at low fractions of data repeated, there is still preferential damage. For example, at 3% repeated tokens, there is a 2x effective parameter decrease in prefix matching score, but only a 1.15x effective parameter decrease in general (test) loss. As another example, we find it interesting that the sharp drop in prefix matching score for a 1.5M parameter model with 50% repetition corresponded to a complete breakdown of paragraph level copying. This complete breakdown of paragraph level copying corresponds to a 1.5M parameter model having the effective overall performance of a 30,000 parameter model, while having an equivalent prefix matching score to a model with effectively 2,000 parameters. Although not as conclusive as the previous results, these clearly show that prefix matching is preferentially degraded in some cases. One and two-layer attention only models are worse at copying and fuzzily copying proper names on inspection. To examine the effect on induction heads and in-context learning even more closely, we looked at more granular copying in one and two layer attention-only transformers, for which interpreting the internal structure (and especially induction heads) is known to be particularly straightforward [Elhage et al., 2021, Olsson et al., 2022]. That is, we can reverse engineer a large portion of attentiononly-transformers (no MLP’s) with a circuits-level understanding (understanding how individual neurons act together to produce useful behavior) [Cammarata et al., 2020]. These small models also exhibit the same double-descent phenomenon as larger models (Appendix B). 7 50% Repeated Data Completely Breaks Paragraph Level Copying for 2L Repetition Disrupts in Context Learning (2 Layer 50% Repeated Data) 9 paragraph copies 5 per token loss (test) 2 3 4 loss 4 3.5 5 6 3 7 8 9 2.5 Repeated Epochs 8 1 4.5 1,220,000 122,000 7 12,200 6 1,220 122 5 1 4 10 11 1 100 10k 1M 1 epochs on repeated tokens 2 5 10 2 5 100 2 5 1000 2 5 token index Figure 7 Degradation of copying and in-context learning at the peak of the double descent curve. 
On the left we show the 2-layer models trained on 50% repeated data from Figure 5, evaluated on the first paragraph of Harry Potter copied X times where X runs from 1 to 11. In Appendix D, we explore shortening the length of the paragraph to verify the problem is with copying rather than long contexts. The right shows per token losses on the test set. Both graphs show dramatically reduced performance (higher copying loss, lower benefit to in-context learning) at the peak of the double descent. Figure 8 Visualization of the difference in loss on the first paragraph of Harry Potter for control and 10%repeated-data runs of a 1-layer attention-only model. Orange highlights correspond to the control model performing better, purple corresponds to the repeated data performing, and the intensity corresponds to the magnitude of the difference in per token losses. Proper names (which are a good target for copying when they occur more than once) are underlined in yellow on second or later occurance; it is clear that the control model performs better on these. Often the difference is dramatic: for the last three appearances of “Potters” the control model puts a >97% chance on “ters” given “Pot”, whereas the repeated data model puts <4% chance on that token. For 1-layer attention only models, where copying takes the form of skip-trigrams, we can easily see that the repeated data model is worse at a form of copying associated with these skip trigrams. Namely, we compare the probabilities that the repeated data and control models assign to each token in a paragraph, and focus especially on proper names which occur repeatedly in the paragraph (Figure 8). The most obvious way to correctly predict these re-occurring names is by copying, and we see that in most cases the control model (trained on unique text) performs much better than the one with repeated data (yellow underlines). Very specifically, predicting repeated names requires exactly a skip-trigram pattern [Elhage et al., 2021] which is the algorithmic operation 1-layer attention-only models are known to perform. For example, the following skip-trigrams are useful in the Harry Potter paragraph in Figure 8: [a][b] . . . [a] => [b] [ P ot][ter] . . . [ P ot] => [ter] [a][b] . . . [a] => [b0 ] [ P ot][ter] . . . [ P ot] => [ters] 8 Figure 9 Same as Figure 9, but for 2-layer attention-only models. Proper names (which are a good target for copying when they occur more than once) are underlined in yellow on second or later occurance. Here the repeated-data model sometimes does better on repeated proper names, but there are still clear examples of the control performing much better. These examples are highlighted in green and discussed. On the token [ley] in the second appearance of [D][urs][ley] the control model places a 92% likelihood on [ley] whereas the repeated data model places a 10% likelihood. On the token [leys] in the second appearance of [D][urs][leys] the control model places a 44% likelihood on [leys] whereas the repeated data model places a 4.9% likelihood. On the [ley] in [ un][D][urs][ley][ish] the control model places a 68% likelihood on [ley] whereas the repeated data model places a 0.4% likelihood. 
1.2 Off Distribution: Model Size Multiplier Ratio of Python to Text multiplier ratio of python to text multiplier ratio of python to text Off Distribution: Model Size Multiplier Ratio of Python to Text Parameters 1,570,000 1 5,310,000 0.8 12,600,000 42,500,000 0.6 101,000,000 197,000,000 340,000,000 0.4 805,000,000 1 2 5 10 2 5 100 1 0.95 0.9 0.85 0.8 1 fraction repeated 2 5 10 2 5 100 fraction repeated Figure 10 We observe that training on high levels of repeated data causes a small disproportionate drop on out-of-distribution performance (Python loss). The effect is noisy, but since we do not see a model size effect we take the average in the figure on the right (harmonic mean of multipliers). For large repeated fractions of 50% and 90% we see model size multipliers of .84 and .75. We also plotted the same visualization for a 2-layer attention-only model (which is known to contain simple induction heads), and find the control model is better at fuzzy copying (Figure 9). Visually, it is less obvious (compared to the 1-layer case) that the 2-layer repeated model is worse at names, and there are a few examples where it puts 1.1x higher odds on the correct token. But on the other hand there are dramatic cases of the control model doing 500x times better (odds ratio on correct token) for fuzzy copying, like unDursleyish, which is exactly the kind of degradation we’d expect to see from disrupting induction heads. We attempted to leverage logit attribution (which earlier tokens contributed to the prediction of the current token through a "direct path" with this attention head) to see if the difference was primarily due to the induction head being less active or other heads interfering with it [Olsson et al., 2022]. We were unable to find clear evidence of either, but we include our exploration of a 2 layer attention only model in Appendix B. Repeated data causes a smaller, disproportionate performance drop on our out-of-distribution evaluations. 9 Repeated Python Data also causes Double Descent 2.2 Parameters 2 1,570,000 python test loss 1.8 5,310,000 1.6 12,600,000 1.4 42,500,000 101,000,000 1.2 197,000,000 340,000,000 1 805,000,000 0.8 1 100 10k 1M epochs on repeated tokens Figure 11 Double descent phenomenon for models trained on python. Training on Python gives similar results to what Figure 2 and Figure 4 show for language models. Here 50% of the dataset consists of repeats and 50% is unique. On the left side is degradation in performance, occurring over a specific range of repetition that varies with model size. On the right, we again see a large region of poor performance as we did in Figure 4, although the fit is noisier. Again the blue and green curves correspond to the right and left sides of the double descent curve where we observe 50% of the maximum effect. Given that we overfit the model, we expected it to perform worse off distribution, which we do observe (Figure 10). We notice almost an opposite pattern to what we observed in the induction head results. We see most of the disproportionate drop at 50% and 90% rather than 1-10%. We observe a double descent phenomenon in sparse sweep of models trained on python, but we the Python scans exhibit a somewhat different overall shape. To add more generality to our results, we repeated the same experiments on a Python dataset instead of natural language (Figure 11). 
If we use the same method to fit the poor performance region, we see a broadly similar fit and a second epoch for today’s large models (approximately 200B parameters) is still robustly in the reduced performance region for python. However the fit is noisier than the fit for text and the two lines are no longer parallel. The noise may partially be explained by the Python fits being averaged over half as many settings for the fraction of tokens that are repeated data. It could also be that we need a higher resolution Python scan to get a cleaner estimate for the poor performance region. Finally, the Python data was trained on approximately 2 epochs as described in the methods section (so it included some repetition on the main dataset as well, not just the repeated subset). Python also may have more unintentional repetition than text, from copying and pasting of example code and forking of codebases. Such repetition could change the shape of the region of poor performance. More analysis of the Python experiments is shown in Appendix A. Pre-training on repeated data hurts fine-tuned performance We find that the negative impact of repeated data persists after fine-tuning natural-language models on Python (Figure 12). It is noteworthy that the performance hit once fine-tuned is much smaller. An 800M model pre-trained on 50% repeated data from the double descent peak had its effective parameters reduced by 10x in Figure 15 in Appendix A. When we fine-tune from the repeated model we see a 1.6x reduction in effective parameters compared to training from scratch. This is still meaningful damage to the model, but it is recovered substantially. Since the repeated model forgets the repeated dataset after a modest amount of fine-tuning (Figure 12, we consider the fine-tuned model with repeated data pre-training to be dominated by the fine-tuned model from the unique dataset. 3 Methods The decoder-only transformer models were trained on an 8192 token context with the same settings as described in [Askell et al., 2021] for 100B tokens. Our language experiments utilized a 400B token dataset with 55% heavily filtered common crawl data (220B tokens), 32% internet books (128B tokens), and some smaller distributions including OpenWebText, Wikipedia, and Stack Exchange; most of which we sourced from The Pile [Gao et al., 2021], and leveraged the 50,304 vocabulary GPT-2 encoding [Radford et al., 2019, Wolf et al., 2019]. 10 Pretraining on Repeated Data Hurts Finetuned Performance finetune without repetition 5 4.5 4 finetune with repetition 3.5 scan 1.2 1.1 dataset text test loss repeated loss 3 from scratch 1 loss python test loss Majority of Overfitting on Repeated Data is Lost Quickly 0.9 2.5 2 1.5 0.8 10M 2 5 100M 2 5 1B 0 parameters 20B 40B 60B 80B python tokens finetuned Figure 12 Effect of repeated data during pre-training on fine-tuning. Models were pre-trained on 90% repeated data (red lines) or on totally unique data (blue lines), and then fine-tuned on Python (always unique data). The repetition frequency was chosen to maximize the performance hit. The model pre-trained on repeated data encounters a sizable performance hit during fine-tuning (left panel), causing it to not only perform worse than the model pre-trained on unique data, but also worse than a model trained from scratch (green line). The right panel shows fine-tuning curves of the two models. 
The model pretrained on repeated data performs much worse for several billion tokens (red line), but eventually catches up to the model pretrained on unique data (blue line). Code models were trained or fine-tuned on 45B tokens of Python for 2.2 epochs. Fine-tuning experiments had the same hyperparameters as pre-training experiments, but with learning rates reduced by a factor of 2 and reduced warmups. We varied model size, repeated dataset size, and the fraction of tokens trained on repeated data by 3, 2.5, and 2 orders of magnitude respectively. 4 Related Work Scaling Laws A scaling law lens consists of finding a small set of hyperparameters that have large, predictable impacts on model performance, and was present throughout this work (at least one of the hyperparameters is generally model size, compute, or dataset size). The predictive nature of scaling laws makes them useful in a broad number of research and engineering settings. The implications of scaling laws are sufficiently broad and understandable that understanding them is relevant to policy makers [Ganguli et al., 2022]. Predictable scaling trends in neural networks were first studied with [Hestness et al., 2017]. [Kaplan et al., 2020] demonstrated that test loss performance on language modeling tasks scales as a predictable function of model size, dataset size, and compute. The scaling law lens has become more popular over time. For instance scaling laws have been shown in many modalities (e.g., images, video, math, etc.) [Henighan et al., 2020], acoustics [Droppo and Elibol, 2021], transfer to code, [Hernandez et al., 2021], and few-shot adaptation of vision models [Prato et al., 2021]. Existing scaling laws have been revisited as training setups change; for instance, [Hoffmann et al., 2022] found that many recent large models have been under-trained. Our work uses the scaling law lens on an aspect of dataset quality and supplements the lens with an interpretability lens, and we believe our work is novel in both these respects. Mechanistic Interpretability A mechanistic interpretability lens was used in this work. Mechanistic interpretability refers to attempting to reverse engineer the detailed computations performed by the model. The mechanistic interpretability lens is useful for pure scientific understanding and has the potential to anticipate safety issues from future more powerful models. There is a relatively detailed understanding of mechanistic interpretability for convolutional image models [Cammarata et al., 2020], some understanding for multimodal models [Goh et al., 2021, Radford et al., 2021], and such an understanding is starting to be built up for Transformers trained on language [Elhage et al., 2021, Olsson et al., 2022]. For a more thorough background on interpretability progress see the related work section of [Elhage et al., 2021]. These results are an example of a “bridge” between microscopic phenomena inside the network and macroscopic trends in the loss, and we’re only aware of one other example of such a bridge [Olsson et al., 2022]. 11 Double Descent Double descent was first shown in generality by Belkin et al. [Belkin et al., 2018] where it was observed for decision trees, random features, and 2-layer neural networks. Similar behavior has been observed in [Opper, 1995, Malzahn and Opper, 2001, Advani and Saxe, 2017, Geiger et al., 2019, Nakkiran et al., 2019]. For a more thorough background on double descent see Nakkiran et al. [Nakkiran et al., 2019]. 
We extend the double descent phenomenon to a setting we see as more practical since data repetition in various forms appears to be a universal, long-term issue; whereas modern large language models are generally outside of the parameters and data regime of previously observed double descent phenomenon. Rise of Engineering Large, Diverse Language Datasets Algorithmic innovation [Hernandez and Brown, 2020], compute [Amodei et al., 2018], and data are three of the major factors that drive the advance of AI. The engineering and science of large, diverse language datasets is relatively new. Pre-2017 many language models were trained on a single distribution of text, such as news articles [Jozefowicz et al., 2016], Wikipedia [Merity et al., 2016], or fiction books [Kiros et al., 2015]. GPT-2 [Radford et al., 2019] leveraged webtext, outbound Reddit links with at least 3 upvotes in order to use human curation/filtration to ensure quality in addition to a broad distribution. GPT-2’s capabilities are largely attributed to its scaled-up size and dataset (10x the parameters and 10x the data of GPT) [Radford et al., 2019]. The next generation of language models, [Brown et al., 2020, Rae et al., 2021, Hoffmann et al., 2022], leveraged large, diverse datasets that consist of many sub-distributions. Constructing such datasets includes a large number of decisions: choosing sampling weights, quality filtering, de-duplication, fuzzy de-duplication, epochs per dataset, and more. There has not yet been substantial public work that quantitatively shows the impact of such decisions, but the dataset ablations in Appendix A of the Gopher [Rae et al., 2021] paper are notable. They clearly show the benefit of their dataset mixture, quality filter, exact de-duplication, and fuzzy de-duplication for 1.4B parameter models. Our work aims to provide some insights and potential diagnostics for researchers and engineers designing large datasets for language models. 5 Discussion 5.1 Why does repeating a small fraction of data damage performance so much? We showed that a dataset with only 10% repeated tokens can reduce model performance by an effective 2x in parameter count, much more than if that 10% of the data had simply never been trained on. The repeated data thus degrades model performance out of proportion to its share in the dataset. Why does this occur, and why only for a specific amount of repetition? One plausible hypothesis comes from looking at the model’s “incentives” to memorize vs generalize. To informally explore this hypothesis consider the following rough numbers, a 800M parameter model typically has a loss of roughly 2.0 nats/token, a 400M parameter model has a loss of roughly 2.2 nats/token, and fully memorized data will have a loss of 0 nats/token. Now suppose a 800M model is trained on 90% unique data and 10% tokens consisting of repeated data. We can ask whether it is a “good tradeoff” for the model to memorize the repeated data (leading to 0 loss on 10% of the dataset), at the cost of degrading performance by the equivalent of a 2x multiple in model size (which raises loss on the other 90% from 2 to 2.2). Some simple arithmetic suggests that it is: 0.9 ∗ 2.2 + 0.1 ∗ 0 = 1.98 < 2.0. Another way to say this is that zero loss is such a huge drop compared to the differences in entropy between model sizes that driving the loss to zero on even a tiny subset can incentivize enormous degradation in quality. 
This however leaves open the question of when this tradeoff is necessary or possible – and here is where the double descent phenomenon comes in. If a lot of data is repeated only a few times (say 5% of the data repeated 2x) then the model may not have the capacity to memorize it, and also does not see it enough times during training to do so. If a tiny amount of data is repeated very many times (say 0.01% of the data repeated 1000x), then the model will memorize it, but because it is so small the model need not use much capacity to do so, so the degradation in quality will likely be small. There is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model’s capacity, and this may be where the peak of degradation occurs. 5.2 Generalization, memorization, and induction heads Our results show that overfitting on the repeated data results in worse test loss, and this co-occurs with a disproportionate degradation in the model’s induction heads (prefix matching score) and its ability to copy text. Copying sequences can be seen as a form of generalization, as it requires algorithmic operations that are independent of the content of the data. [Elhage et al., 2021, Olsson et al., 2022] provided evidence for induction heads as the mechanism implementing copying and other pattern-matching. For the 2 layer model 12 shown in Figure 7 it seems as if the pressure to memorize the repeated dataset has led a skip tri-gram head to replace the induction head entirely. Thus our results tell a story where a type of generalization and its internal implementation are disrupted when the model memorizes repeated data – a vivid illustration of the memorization-generalization trade-off. Future work could take this even further, by measuring the number of parameters devoted to memorization and trying to observe them competing for space with induction heads. Finally, it is worth noting that the co-occurence of copying degradation and induction head degradation is itself some additional evidence for induction heads as the source of in-context learning; Olsson et al. [Olsson et al., 2022] was not fully conclusive and our results further bolster the case. 5.3 Bridging mechanistic interpretability and scaling laws The results connecting memorization to the degradation of mechanistic interpretability structures [Olsson et al., 2022] are an example of a “bridge” between microscopic phenomena inside the network and macroscopic trends in the loss. We view such connections as very fruitful tools for research, because they allow us to see the same thing through different lenses: the macroscopic behavior demonstrates the significance of the microscopic mechanisms, and the microscopic mechanisms help explain how and why the macroscopic phenomena occur. Switching back and forth between the two allows for a deeper understanding of both, as well as more robust diagnostics if something goes wrong. We are aware of at least one other instance of such a bridge – the correspondence between the formation of induction heads and the boost in in-context learning near the beginning of training [Elhage et al., 2021, Olsson et al., 2022] – but such connections remain rare so far, and we believe that finding more of them is a promising route to more deeply understanding neural nets. 5.4 Repeated data and fine-tuning We hypothesized repetition might help explain why models trained from scratch sometimes outperformed models that were pre-trained and then fine-tuned [Hernandez et al., 2021]. 
For our purposes, we define ossification as any pre-training that leads a fine-tuned model to perform worse than a model trained from scratch (given a fixed compute and data budget). It required relatively extreme repetition in pre-training (90% training on repeated tokens at peak of double descent curve, 73x reduction in effective model size) to see a large ossification effect (1.6x reduction in effective model size) within our fine-tuning setup. We still think repetition might explain a large fraction of ossification when we consider training on various types of repetition we did not study here (sentence level, paragraph level, similar documents, distribution, etc). Overall, our finding that repetition can induce ossification provides medium causal evidence to this hypothesis. We think ossification is an interesting phenomenon that merits further study. 5.5 Limitations We attempt to discuss limitations throughout the text where appropriate, but for the reader’s convenience, we enumerate them here. We attempt to list them in a loosely descending order of importance. 1. We used a fixed number of tokens for all models (similar to the GPT-3 model sweep), because these models were trained prior to the release of Chinchilla, which showed the compute frontier (pareto frontier of performance and compute) is quite different than previously understood [Brown et al., 2020, Hoffmann et al., 2022]. 2. Our fits for region of poor performance were relatively noisy, and we only observed a clean trend by aggregating them. This is discussed in the Results section and further explored in Appendix A. 3. The data we repeated was a random subset of the original dataset, and is thus not directly applicable to the situation where higher quality data (such as Wikipedia) is intentionally repeated to improve quality. Nevertheless, it seems plausible that the results would carry over. 4. We measured loss, rather than downstream NLP evaluations. Overfitting does not always entail worse performance on downstream tasks [Ouyang et al., 2022], so it is possible that the degradation we observe does not carry over to these tasks. 5. We did not explore the effects of early stopping, dropout, weight decay, or other regularization. 6. We did not investigate simpler systems than 1L attention-only models, which might contain more complete mechanistic insights. 13 5.6 Future Directions Below are some future directions we think are promising: 1. A compute efficient frontier scan to predict the poor performance region. 2. Varying the type of repetition. We could inject repeated sentences or paragraphs at the beginning or end of some fraction of contexts, or repeat chunks of documents in a different order. We could also explore cases where the repeated data has a different distribution than the unique data. 3. Further interpretability work. Are there neurons that tell the model what distribution it is in: unique or repeated? Are there neurons through which we can observe and edit the repeated sequences? 4. Drill down on memorization and generalization. Could we measure the number of model parameters taken up by memorization vs generalization, either behaviorally or by using mechanistic interpretability to identify parameters that are storing memorized data? Can we measure how this varies across the double descent, and thus watch the competition between memorized data and induction heads for model capacity? 5. Could repetition and double descent help explain loss spikes during training? 
If a model can largely memorize a particularly easy batch in a single gradient step then a very skinny double descent could present as a loss spike. 6 Conclusion We’ve shown that small fractions of repeated data, if repeated at the right frequency, can cause surprisingly severe degradation to model performance. We show that this degradation scales predictably, occurs across datasets, and is associated with disproprotionate damage to internal mechanisms associated with generalization, such as induction heads. In practical terms, these results provide a tool for predicting and diagnosing data-repetition-related problems in language models. In more conceptual terms, they are an example of a bridge between the macroscopic domain of scaling laws and the microscopic domain of mechanistic interpretability, as well as a lens for gaining a more detailed understanding of how generalization and memorization work. We believe these conceptual themes are promising ones, and hope to see more work that employs them. Acknowledgments We thank Ethan Perez, Jan Leike, and Martin Wattenberg for helpful feedback on the draft. We thank Daniela Amodei, Jamie Kerr, Jia Yuan Loke, Rebecca Raible, and Tim Telleen-Lawton for support with the project. Author Contributions Danny Hernandez led the project performed the majority of experiments, analysis, and writing. Tom Brown led engineering efforts for the scaling team, including efficient pre-training and gave helpful feedback on the paper. Tom Conerly made engineering contributions on the scaling team. Nova DasSarma managed the underlying cluster infrastructure. Dawn Drain helped with pre-training research and infrastructure. Sheer El-Showk helped with pretraining research and dataset construction. Nelson Elhage contributed significantly to interpretability tooling, provided support on that tooling, and gave helpful feedback. Zac Hatfield-Dodds helped with codebase maintenance and with engineering Tom Henighan helped with pretraining the underlying language models, with dataset creation, with managing the cluster during some phases of the project, and gave helpful feedback on the paper. Tristan Hume contributed to interpretability tooling that was leveraged in this work. 14 Scott Johnston helped with pretraining research. Ben Mann contributed to pretraining and cluster management. Chris Olah lead the interpretability team, which provided tooling and support for this work. Catherine Olsson contributed to interpretability tooling, provided support on that tooling, and provided interpretability research advice. Dario Amodei contributed greatly to the framing and writing of the work and advised the project. Nicholas Joseph helped design and build a framework for efficient training of large language models, gave helpful feedback on the paper, and advised the project. Jared Kaplan led pre-training efforts initially and advised the project. Sam McCandlish led pre-training efforts and advised the project. 15 Power Law fit For Control Scan on Language Data Power Law fit For Control Scan on Python Data 3.4 3.2 fit actual fit 1.4 2.8 test loss test loss 1.6 actual 3 2.6 2.4 1.2 1 2.2 0.8 2 2 5 10M 2 5 100M 2 5 2 1B 5 10M parameters 2 5 100M 2 5 1B parameters Figure 13 We see a power laws provide good fits for both language and Python data. We can use these fit to re-parameterize loss for our models trained on repeated data into model size multipliers. 
A Model Size Multiplier and Poor Performance Region Fits In order to fit the poor performance regions we first fit power laws to our control scans on language and Python so that we can re-parameterize loss in terms of model size multipliers. These fits are shown in Figure 13 When we graph repeated epochs vs model size multiplier with a given fraction of repeated data in Figure 15, we observed that our 1% repeated data graphs were quite noisy, so we excluded the 1% scans from the fits. The 3% repeated data graphs looked reasonable, in that the double descent peak looked large compared to the noise, so we included that all higher fractions in our fits. We estimate how many repeated epochs half of the maximum effect size (on a log scale) would be observed using linear interpolation on the left and right side of the double descent peak for each fraction of repeated data. We then averaged these curves to make an overall estimate for the left and right boundaries of the poor performance region shown in Figure 4 and Figure 11. For text this produces a relatively clean overall fit, but the the individual curves for text are relatively noisy as shown in 14. Some potential explanations for the noise are i) Given the resolution of our scan we do not always get a good estimate of the peak effect for a given curve (the peak can easily be between two points we measured ii) our linear interpolation also introduces error as our underlying curves only have 6 points. 50% of Max Double Descent Effect for Text (Left Boundary) 50% of Max Double Descent Effect for Text (Right Boundary) 2 fraction 2 10 20 5 50 90 2 100 5 repeated epochs repeated epochs 1000 fraction 100k 3 3 10 5 20 2 50 10k 90 5 2 1000 2 5 2 5 10M 2 5 100M 2 5 2 1B parameters 5 10M 2 5 100M 2 5 1B parameters Figure 14 We estimate how many repeated epochs half of the maximum effect size (on a log scale) would be observed using linear interpolation on the left and right side of the double descent peak for each fraction of repeated data. We then averaged these curves to make an overall estimate for the left and right boundaries of the poor performance region shown in Figure 4 Overall we think the region of poor performances we showed in Figure 4 is relatively robust in that it is useful to think about the sub distribution double descent phenomena there. However, we would not claim that we have produced extremely accurate estimates for the exact boundaries, even in our setup, and the boundaries could vary meaningfully given a different setup, especially differences in regularization. 16 Model Size Multiplier: 1% Repeated Data 1.02 Parameters 5,310,000 1 12,600,000 0.99 42,500,000 101,000,000 0.98 197,000,000 340,000,000 0.97 Parameters 1 1,570,000 805,000,000 model size multiplier 1.01 model size multiplier Model Size Multiplier: 3% Repeated Data 1,570,000 0.98 5,310,000 0.96 12,600,000 0.94 42,500,000 101,000,000 0.92 197,000,000 0.9 340,000,000 0.88 805,000,000 0.86 0.96 0.84 1 100 10k 1 100 repeated epochs Model Size Multiplier: 10% Repeated Data Model Size Multiplier: 50% Repeated Data Parameters 1 0.9 5,310,000 12,600,000 42,500,000 0.8 101,000,000 197,000,000 0.7 Parameters 1 1,570,000 model size multiplier model size multiplier 10k repeated epochs 340,000,000 805,000,000 0.6 1,570,000 5 5,310,000 12,600,000 42,500,000 2 101,000,000 0.1 197,000,000 340,000,000 5 805,000,000 2 1 100 10k 1 100 repeated epochs 10k 1M repeated epochs Figure 15 it is easier to see the sharpness of the double descent peaks in this diagram than Figure 2. 
The 1% runs were much noisier than the rest, so we excluded them from our fits.

Figure 16 (panels: "50% of Max Double Descent Effect for Python (Left Boundary)" and "50% of Max Double Descent Effect for Python (Right Boundary)"; repeated epochs vs. parameters, one curve per fraction of repeated data) We estimate the number of repeated epochs at which half of the maximum effect size (on a log scale) would be observed for our Python models, using linear interpolation on the left and right sides of the double descent peak for each fraction of repeated data. We then averaged these two curves to make an overall estimate for the left and right boundaries of the poor performance region shown in Figure 11.

For Python, the aggregate shown in Figure 11 is quite a bit noisier. A lot of the noise is explained by aggregating only two scans rather than 5, but we see the individual scans for Python are also noisier, as shown in Figure 16.

B Appendix: Logit Attribution Analysis, 2 Layer Models

For attention-only models we can directly attribute contributions of the attention heads to the logits. We attempted to use this technique to better understand how the induction heads were disrupted for 2 layer models. For instance, it could be that they were firing more weakly, or it could be that activity from other attention heads was interfering with their ability to copy. Overall it feels like both effects happen weakly, and that it was easier to understand the disruption to induction heads through the per-token losses shown in Figures 8 and 9 than through logit attribution.

Figure 17 For attention-only models we can directly attribute contributions of the attention heads to the logits, as shown in [Elhage et al., 2021, Olsson et al., 2022]. Both models were evaluated on the first paragraph of Harry Potter copied twice. The induction head appeared to be head 0, shown in red for both models. The control model's logit attribution is shown for the first two paragraphs, and the third paragraph shown is from the repeated model at the double descent peak for comparison.

Figure 18 For attention-only models we can directly attribute contributions of the attention heads to the logits, as shown in [Elhage et al., 2021, Olsson et al., 2022]. Similar to Figure 17, both models were evaluated on the first paragraph of Harry Potter copied twice, but here the contribution of all attention heads is shown. The other attention heads in the repeated data model appear more active (several of the reddish tokens in the second paragraph are brown in the third paragraph).

Figure 19 (title: "1-Layer Attention only Models also show Double Descent"; test loss vs. epochs on repeated tokens) We still observe double descent on repeated data with 1 layer attention-only models, so it is possible we'd observe double descent on repeated data for simpler model types.
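For readers who want to see what the direct logit attribution used in this appendix looks like in code, here is a minimal sketch assuming cached per-head residual-stream outputs and ignoring layer norm. The function name, array shapes, and random inputs are hypothetical illustrations, not the interpretability tooling used in the paper.

```python
import numpy as np

def per_head_logit_attribution(head_outputs, W_U, target_tokens):
    """
    Direct logit attribution for an attention-only model, in the spirit of
    Elhage et al. (2021): because the residual stream is a sum of the embedding
    and each head's output, the correct-token logit decomposes into a per-head
    contribution plus an embedding term (ignored here).

    head_outputs  : [n_heads, seq_len, d_model], each head's write into the residual stream
    W_U           : unembedding matrix [d_model, vocab_size]
    target_tokens : [seq_len] correct-next-token ids

    Returns [n_heads, seq_len]: head h's direct contribution to the correct-token
    logit at each position.
    """
    # Project every head's residual-stream write onto the unembedding direction
    # of the correct next token at each position.
    correct_dirs = W_U[:, target_tokens]            # [d_model, seq_len]
    return np.einsum("hsd,ds->hs", head_outputs, correct_dirs)

# Toy shapes only; real head outputs would come from a cached forward pass.
rng = np.random.default_rng(0)
n_heads, seq_len, d_model, vocab = 12, 8, 64, 100
attribution = per_head_logit_attribution(
    rng.normal(size=(n_heads, seq_len, d_model)),
    rng.normal(size=(d_model, vocab)),
    rng.integers(vocab, size=seq_len),
)
print(attribution.shape)  # (12, 8)
```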
C Appendix: Copying and Prefix Matching Score Fits

Figure 20 (left panel: "Harry Potter 1st Paragraph Repeated 11 Times: Control", loss vs. parameters; right panel: "10% Repeated Data: Loss on HP 1st Paragraph copied 11 times", loss vs. repeated epochs, one curve per model size) In order to do the model size interpolation used in Figure 5, we use the loss on Harry Potter's first paragraph copied 11 times for our control models (no repeated data). It is relatively well behaved, but it was not obvious how to extrapolate the curve. On the right, as a sanity check, we make sure we still see peaks moving left as model size increases, approximately lining up with what was observed in Figure 2.

Figure 21 (left panel: "Prefix Matching Score with 90% repetition", prefix matching score vs. parameters; right panel: "Model Size Multiplier: Prefix Matching Score Averaged", model size multiplier vs. fraction repeated) For the model size multiplier in Figure 6, we use a linear fit on the prefix matching score for the control models, shown on the left. On the right, similar to Figure 10, we show that if we take an average over model size (harmonic mean of the multiplier), we get a relatively clean relationship.

D Appendix: Harry Potter Copying Evaluation with Fewer Characters

Figure 22 (title: "50% Repeated Data, 2L, First 125 Characters of Paragraph"; loss vs. epochs on repeated tokens, one curve per number of paragraph copies from 1 to 11) In order to make sure the copying eval was not merely evaluating in-context learning, we tried a much shorter copied sequence (approximately 10x shorter, 125 characters instead of 1463). We still observe approximately no learning from repeated copying for the 2L model trained on 50% repeated data at the double descent peak.

References

[Advani and Saxe, 2017] Advani, M. S. and Saxe, A. M. (2017). High-dimensional dynamics of generalization error in neural networks.

[Amodei et al., 2018] Amodei, D., Hernandez, D., Sastry, G., Clark, J., Brockman, G., and Sutskever, I. (2018). AI and compute. Downloaded from https://blog.openai.com/ai-and-compute.

[Askell et al., 2021] Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Kernion, J., Ndousse, K., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, C., and Kaplan, J. (2021). A general language assistant as a laboratory for alignment.

[Belkin et al., 2018] Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2018). Reconciling modern machine learning practice and the bias-variance trade-off.

[Bi et al., 2020] Bi, B., Li, C., Wu, C., Yan, M., Wang, W., Huang, S., Huang, F., and Si, L. (2020). Palm: Pre-training an autoencoding and autoregressive language model for context-conditioned generation.

[Brown et al., 2020] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020). Language models are few-shot learners.
[Cammarata et al., 2020] Cammarata, N., Carter, S., Goh, G., Olah, C., Petrov, M., Schubert, L., Voss, C., Egan, B., and Lim, S. K. (2020). Thread: Circuits. Distill. https://distill.pub/2020/circuits.

[Droppo and Elibol, 2021] Droppo, J. and Elibol, O. (2021). Scaling laws for acoustic models.

[Elhage et al., 2021] Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread. https://transformer-circuits.pub/2021/framework/index.html.

[Ganguli et al., 2022] Ganguli, D., Hernandez, D., Lovitt, L., DasSarma, N., Henighan, T., Jones, A., Joseph, N., Kernion, J., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Elhage, N., Showk, S. E., Fort, S., Hatfield-Dodds, Z., Johnston, S., Kravec, S., Nanda, N., Ndousse, K., Olsson, C., Amodei, D., Amodei, D., Brown, T., Kaplan, J., McCandlish, S., Olah, C., and Clark, J. (2022). Predictability and surprise in large generative models.

[Gao et al., 2021] Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. (2021). The pile: An 800gb dataset of diverse text for language modeling.

[Geiger et al., 2019] Geiger, M., Spigler, S., d'Ascoli, S., Sagun, L., Baity-Jesi, M., Biroli, G., and Wyart, M. (2019). Jamming transition as a paradigm to understand the loss landscape of deep neural networks. Physical Review E, 100(1):012115.

[Goh et al., 2021] Goh, G., Cammarata, N., Voss, C., Carter, S., Petrov, M., Schubert, L., Radford, A., and Olah, C. (2021). Multimodal neurons in artificial neural networks. Distill. https://distill.pub/2021/multimodal-neurons.

[Henighan et al., 2020] Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T. B., Dhariwal, P., Gray, S., Hallacy, C., Mann, B., Radford, A., Ramesh, A., Ryder, N., Ziegler, D. M., Schulman, J., Amodei, D., and McCandlish, S. (2020). Scaling laws for autoregressive generative modeling.

[Hernandez and Brown, 2020] Hernandez, D. and Brown, T. B. (2020). Measuring the algorithmic efficiency of neural networks. CoRR, abs/2005.04305.

[Hernandez et al., 2021] Hernandez, D., Kaplan, J., Henighan, T., and McCandlish, S. (2021). Scaling laws for transfer. arXiv preprint arXiv:2102.01293.

[Hestness et al., 2017] Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. (2017). Deep learning scaling is predictable, empirically.

[Hoffmann et al., 2022] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. v. d., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. (2022). Training compute-optimal large language models.

[Jozefowicz et al., 2016] Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and Wu, Y. (2016). Exploring the limits of language modeling.

[Kaplan et al., 2020] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling laws for neural language models.
[Kiros et al., 2015] Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Torralba, A., Urtasun, R., and Fidler, S. (2015). Skip-thought vectors.

[Lee et al., 2021] Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Carlini, N. (2021). Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499.

[Malzahn and Opper, 2001] Malzahn, D. and Opper, M. (2001). A variational approach to learning curves. In Dietterich, T., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems, volume 14. MIT Press.

[Merity et al., 2016] Merity, S., Xiong, C., Bradbury, J., and Socher, R. (2016). Pointer sentinel mixture models.

[Nakkiran et al., 2019] Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. (2019). Deep double descent: Where bigger models and more data hurt.

[Olsson et al., 2022] Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. (2022). In-context learning and induction heads. Transformer Circuits Thread. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.

[Opper, 1995] Opper, M. (1995). Statistical mechanics of learning: Generalization.

[Ouyang et al., 2022] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. (2022). Training language models to follow instructions with human feedback.

[Prato et al., 2021] Prato, G., Guiroy, S., Caballero, E., Rish, I., and Chandar, S. (2021). Scaling laws for the few-shot adaptation of pre-trained image classifiers.

[Radford et al., 2021] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models from natural language supervision.

[Radford et al., 2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

[Rae et al., 2021] Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., Rutherford, E., Hennigan, T., Menick, J., Cassirer, A., Powell, R., Driessche, G. v. d., Hendricks, L. A., Rauh, M., Huang, P.-S., Glaese, A., Welbl, J., Dathathri, S., Huang, S., Uesato, J., Mellor, J., Higgins, I., Creswell, A., McAleese, N., Wu, A., Elsen, E., Jayakumar, S., Buchatskaya, E., Budden, D., Sutherland, E., Simonyan, K., Paganini, M., Sifre, L., Martens, L., Li, X. L., Kuncoro, A., Nematzadeh, A., Gribovskaya, E., Donato, D., Lazaridou, A., Mensch, A., Lespiau, J.-B., Tsimpoukelli, M., Grigorev, N., Fritz, D., Sottiaux, T., Pajarskas, M., Pohlen, T., Gong, Z., Toyama, D., d'Autume, C. d. M., Li, Y., Terzi, T., Mikulik, V., Babuschkin, I., Clark, A., Casas, D. d. L., Guy, A., Jones, C., Bradbury, J., Johnson, M., Hechtman, B., Weidinger, L., Gabriel, I., Isaac, W., Lockhart, E., Osindero, S., Rimell, L., Dyer, C., Vinyals, O., Ayoub, K., Stanway, J., Bennett, L., Hassabis, D., Kavukcuoglu, K., and Irving, G. (2021). Scaling language models: Methods, analysis, and insights from training gopher.

[Vaswani et al., 2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need.

[Wolf et al., 2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. (2019). Huggingface's transformers: State-of-the-art natural language processing.