Scaling Laws and Interpretability of Learning from Repeated Data arXiv:2205.10487v1 [cs.LG] 21 May 2022 Danny Hernandez∗ Tom Brown, Tom Conerly, Nova DasSarma, Dawn Drain, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Tom Henighan, Tristan Hume, Scott Johnston, Ben Mann, Chris Olah, Catherine Olsson, Dario Amodei, Nicholas Joseph, Jared Kaplan, Sam McCandlish Anthropic Abstract Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the model is exposed to repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a small fraction of it is repeated many times. We find a strong double descent phenomenon, in which repeated data can lead test loss to increase midway through training. A predictable range of repetition frequency leads to surprisingly severe degradation in performance. For instance, performance of an 800M parameter model can be degraded to that of a 2x smaller model (400M params) by repeating 0.1% of the data 100 times, despite the other 90% of the training tokens remaining unique. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model’s capacity, and this may be where the peak of degradation occurs. Finally, we connect these observations to recent mechanistic interpretability work — attempting to reverse engineer the detailed computations performed by the model — by showing that data repetition disproportionately damages copying and internal structures associated with generalization, such as induction heads, providing a possible mechanism for the shift from generalization to memorization. Taken together, these results provide a hypothesis for why repeating a relatively small fraction of data in large language models could lead to disproportionately large harms to performance. 1 Introduction Large, high-quality text datasets are crucial for training large language models [Brown et al., 2020, Rae et al., 2021]. Such datasets often contain many copies of substantially overlapping documents, which ∗ Correspondence to: danny@anthropic.com All authors are at Anthropic. Author contributions are listed at the end of the paper. Constructing Repeated Datasets that are Subset of the Original Dataset Training Composition sample to be repeated Fraction Unique Original Text Dataset Unique Dataset
 (one epoch) Fraction Repeated Test Set Repeated Dataset
 (many epochs) Figure 1 Experimental Setup. From a large original text dataset (left), we draw 90% of our desired training dataset in a non-repeated fashion, and 10% as repeats of a tiny portion of the original dataset (right). We hold constant that 10% of total training tokens will come from repeats, but we vary the repeated fraction in our runs. In other words, the sample to be repeated might be very small, like 0.01% of the total training tokens repeated 1000x, or relatively large, like 1% of the total training tokens repeated 10x. A small, held-back portion of the original dataset (yellow in left figure), not including any repeated data, is used as a test set and is the test loss reported in all subsequent figures. greatly impairs the performance of language models on downstream tasks [Lee et al., 2021]. However, it is not well understood why data repetition impacts performance to such a large extent. In this paper we study data repetition in language models through two lenses: the macroscopic lens of scaling laws, and the microscopic lens of mechanistic interpretability [Elhage et al., 2021, Olsson et al., 2022]. For the first lens, we trained transformer [Vaswani et al., 2017] language models on mostly unique data plus a small fraction of repeated data (Figure 1), varying the repeated dataset size, model size, and fraction of tokens trained on repeated data. We find a strong double-descent phenomenon [Advani and Saxe, 2017, Belkin et al., 2018, Nakkiran et al., 2019], such that there is a defined range of repetition frequency for which performance is harmed to a surprisingly large extent. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model’s capacity, and this may be where the peak of degradation occurs. The location of the region suggests that large models like GPT-3, Gopher, and PALM [Brown et al., 2020, Rae et al., 2021, Bi et al., 2020] need to be careful about overfitting their high quality distributions like Wikipedia and books. For the second lens, mechanistic interpretability (attempting to reverse engineer the detailed computations performed by the model) we show that repeated data disproportionately damages induction heads. Induction heads use a circuit of 2 attention heads to "complete the pattern by copying and completing sequences" [Olsson et al., 2022]. The damage to induction heads is observed through degradation in copying, prefix matching, and through inspection. Together, the two lenses provide an integrated picture of how repeated data might be causing the network (or part of it) to shift from generalization to memorization, and mechanistically how this could be harming performance of the overall language model. 1.1 Summary of Results To systematically study repeated data, we trained transformer [Vaswani et al., 2017] language models on mostly unique data plus a small fraction of repeated data (Figure 1), varying the repeated dataset size, model size, and fraction of tokens trained on repeated data over 2-3 orders of magnitude. All models were trained for 100B tokens. We examined the resulting models using both scaling laws and mechanistic interpretability tools. 
Our main findings were as follows: 2 Overfitting Repeated Subset Coincides with Performance Hit loss loss 3.5 3 test, with repetition 2.5 test, without repetition 2 train, repeated subset 1.5 1 2 5 10M 2 5 100M 2 5 1B parameters Figure 2 Models of different sizes show a degradation in performance at a specific range of repeats that shrinks with model size (left panel). At its peak the degradation sometimes reaches the equivalent of a 2x decrease in model size. The right panel shows that divergence (blue line) from a healthy, straight scaling law (red) lines up with when the models start to dramatically overfit the repeated subset (green curve). The blue line on the right corresponds to a vertical slice of models in the left diagram trained on the repeated subset for 120 epochs. All these models were trained on 90% unique data and 10% repeated tokens. • Repeated data induces a strong double-descent phenomenon [Advani and Saxe, 2017, Belkin et al., 2018, Nakkiran et al., 2019], in which data repeated a few times does not cause much damage to language model performance, data repeated very many times also does not cause much damage, but there is a peak in the middle where damage is surprisingly large. For instance, when we train an 800M parameter transformer with 10% of training tokens drawn from the repeated subset (yellow curve in Figure 2) we find the loss can be nearly as high as for the 340M parameter transformer (light green curve). We see an epoch-wise [Nakkiran et al., 2019] double descent learning curve in Figure 3 is driving this performance degradation. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model’s capacity, and this may be where the peak of degradation occurs. Figure 2 on the right shows that the peak performance hit coincides with where the train loss on the repeated data approaches zero, similar to previously observed double-descent phenomena. This also provides a practical diagnostic for when repeated data is likely to be harming the model. • Repeated data can cause a divergence from power-law scaling. For the blue curve in Figure 2 right (122 repeated epochs), we see only a moderate impact to performance (line on log-log graph) until the model is scaled up to 100M parameters, after which we see a large divergence from power law scaling of cross entropy loss. Extrapolating the region of large degradation in Figure 4 predicts meaningful degradation of repeating data only 2 times for large (GPT-3 size) models, though the region would be shifted if the models were trained to the compute optimal frontier [Hoffmann et al., 2022]. • Repeated data causes a disproportionately large performance hit to copying, a mechanism for in-context learning. We constructed a simple copying eval, the loss on the first paragraph of Harry Potter copied 11 times. We observe that using 3% repeated data at the worst number of repeated epochs caused up to a 3x reduction in effective model size (performance equal to model with 3x fewer parameters) on this task whereas it only caused at most a 15% reduction in effective model size on test loss. • The disproportionate performance hit to copying coincides with a disproportionate degradation of induction heads. In line with [Olsson et al., 2022] we evaluated the models on their prefix matching score, repeated sequences of random tokens and observed the degree to which attention heads attend to earlier tokens that are preceded by a token that matches the present token. 
We observe that using 3% repeated data at the worst number of repeated epochs caused on average a 32% reduction in effective model size on this task whereas it only caused at most a 15% reduction in effective model size on test loss. • Repeated text data causes a small but still disproportionate performance drop out of distribution, as measured by cross entropy loss on Python code. Unlike our the Harry Potter copying and prefix matching evals we mostly see the performance drop with higher levels of repetition, 50-90%. 3 Double Descent on 800M Parameter Model Trained on 90% Repeated Data repeated epochs 7 6 5.5 5 439 4.5 244 4 610 3.5 6,100 test loss 5 1,100 4 11,000 110,000 1,100,000 3 repeated epochs 109 1 6 test loss Manifests as Long Plateau when Trained on less Repeated Data 11,000,000 1 61 61,000 3 610,000 6,100,000 2.5 2 2 5 1B 2 5 10B 2 5 5 tokens 1B 2 5 10B 2 5 tokens Figure 3 Learning curves for test loss on 800M models with 90% repeated data (left) and 50% repeated data (right), each with varying numbers of repeats/sizes of the repeated fraction. The graph on the left shows characteristic double descent curves. Repeated epochs corresponds to the number of epochs on the repeated tokens, the rest of the data is seen only once. For several models, test loss drops as normal during the beginning of training, but then starts to rise during the middle of training before dropping again. In the graph on the right with only 50% repeated data, we see that the double descent bumps have turned into long plateaus for highly affected models. • One and two-layer attention only models trained on repeated data are worse at exactly copying and fuzzily copying (for instance correctly predicting Dursleys given that Dursley has appeared previously) proper names on inspection. When we inspect per tokens losses of smaller models we can see this degradation in a simple, understandable form of copying in a paragraph of text. • Training on repeated Python code creates a similar behavior. When training on Python we also observe a double descent phenomenon and a predictable poor performance region in terms of model size and repeated epochs, though the shape of both curves are somewhat different. • Pre-training on repeated data damages models. Pre-training with repeated data leads to worse performance than both training from scratch and fine-tuning from a control model pre-trained on the original text dataset. During fine-tuning, the repeated data model forgets the repeated dataset, so we consider the model pre-trained with repeated data to be strictly worse than the model fine-tuned from the unique dataset. 2 Results Repeated data induces a strong double descent phenomenon. The results from training models on different sizes, fractions of repeated data, and frequency of repeats are shown in Figures 2 and 3. Figure 2 (left) shows that when we train on 10% repeated data and vary the frequency of repetition (or equivalently the number of epochs of repeated data), there is a specific range of repetition frequency for which damage to model performance is maximized. The range depends on the model size but for a 800M parameter model it occurs at roughly 100x repeats of 0.1% of the data, and degrades performance nearly to that of a 340M parameter model. This is a large degradation given that only 10% of the data is repeated. The peak coincides with the advent of memorization on the repeated data (Figure 2 right) – a possible indicator of a double descent phenomenon. 
Figure 3 shows learning curves for different repetition frequencies and for 50% and 90% of the data being repeated. In the extreme case of 90% repeated data and the correct frequency of repetition (100x-10,000x), we confirm the presence of a literal double descent curve in which the loss decreases, increases, and then decreases again (Figure 3 left). As we lower the fraction of repeated data to 50%, the curve becomes a long plateau rather than double descent, but it appears to be fundamentally an epoch-wise double descent phenomenon [Nakkiran et al., 2019]. These peaks and plateaus again coincide with the training loss on the repeated data approaching zero as shown in Figure 2. As in [Nakkiran et al., 2019] we see double descent effects caused by both increasing model size and epochs. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model’s capacity, and this may 4 10% Repeated Data Can Lead Poor Scaling to Emerge repeated epochs model size multiplier 1 1 12 0.9 48 122 0.8 1,220 12,200 122,000 0.7 0.6 2 5 10M 2 5 100M 2 5 1B parameters Figure 4 On the left we plot the same results as in Figure 2, re-parameterized in terms of the effective model size multiplier implied by the test loss (performance equal to a model with x times as many parameters). For a given number of repetitions, degradation occurs only for a specific range of model sizes. For example, for the blue curve (122 repeated epochs), we see almost no performance deviation from a power law scaling law (line on log-log graph) until the model is scaled up to 100M parameters, after which we see a divergence. We see the same divergence around 400M parameters for 12,200 repeated epochs. The right graph shows a large, predictable region over which the degradation occurs, and suggests that large models like GPT-3, Gopher, and PALM [Brown et al., 2020, Rae et al., 2021, Bi et al., 2020] need to be careful about overfitting their high quality distributions like Wikipedia and books – although note that this holds constant the number of total training tokens. The blue and green curves correspond to the right and left sides of the double descent region where we observe 50% of the maximum effect. They are an aggregation of that curve for the scans where we trained on 3%, 10%, 20%, 50%, and 90% repeated data. The details of both fits are in Appendix A. A large number of runs needed to be aggregated to produce a clean fit for region of reduced performance. be where the peak of degradation occurs, for a more thorough discussion of this question see the discussion (section 5). Repeated data can cause a divergence from power-law scaling. Figure 4 zooms in on the degradation of performance, measured as a function of model size for different repetition frequencies of the repeated data. For example, models trained for 1,220 repeats and 10% repeated data show a dip in performance to the equivalent of a model 0.55x as large, when the model size is 10M to 100M parameters. As the model size continues to increase, performance recovers to 0.8x model-size equivalent for a 1B parameter model. For a smaller number of repeats (122 repeats), the dip occurs later, centered around 1B parameters. The right panel of Figure 4 shows the range over which we observe at least 50% of the maximum degradation; this corresponds to a “band” or region in the (model size, repetition frequency) plane. 
Both boundaries of the region are a good fit to a power law relating frequency of repetition to the number of parameters of the model, namely: E = k ∗ Nα where E corresponds to epochs of repetition and N corresponds to the parameters in the model. it is notable that the lines in figure 2b are relatively parallel. The fits for the above lines are given in the table below: k α right boundary 5.1e7 -.50 left boundary 4.2e6 -.56 Note that extrapolating these boundaries leads to a prediction of significant degradation from repeating data as little as 2x on state-of-the-art language models with hundreds of billions of parameters, although this applies for a constant number of training tokens (100B). In practice large models are trained for more than this[Hoffmann et al., 2022], and as shown in Figure 3, training past the double descent peak is helpful, so the degradation would likely not be quite as bad. When looking at Figure 3 we see that the the poor performance 5 Model Size Multiplier: Loss on 11x copies of a Paragraph 1 Parameters 5,310,000 5 model size multiplier model size multiplier 1 Model Size Multiplier: Test Loss 12,600,000 42,500,000 2 101,000,000 197,000,000 0.1 340,000,000 5 805,000,000 2 0.01 Parameters 5,310,000 5 12,600,000 42,500,000 2 101,000,000 197,000,000 0.1 340,000,000 5 805,000,000 2 0.01 3 4 5 6 7 8 9 10 2 3 4 5 6 3 fraction repeated 4 5 6 7 8 9 10 2 3 4 5 6 fraction repeated Figure 5 We constructed a simple measure of the model’s copying ability, consisting of the loss on the first paragraph of Harry Potter repeated 11 times. We measured the double descent peak performance for a given model size and fraction of repeated data and compared that to a fit of these evaluations on the control model (trained on unique text) scan to generate an effective model size. We observe that 3% repeated data at the pessimal number of repeated epochs caused a 3x reduction in effective model size on this task for a for several model sizes, whereas it only caused at most a 1.15x reduction in effective model size on test loss. We see much larger effects on the copying evaluation than on overall performance for repeated data fractions between 3% and 20%. The model size multiplier for copying is based on interpolation and the model size multiplier for test loss is based on a power law fit (see Appendix C for more details). region would be shifted left for large models trained on the compute efficient frontier (the pareto frontier of compute and performance) [Kaplan et al., 2020]. Overall it seems that in addition to being robust to task, model size, and architecture as shown in previous work [Advani and Saxe, 2017, Belkin et al., 2018, Nakkiran et al., 2019] double descent as a general phenomenon appears to be robust to occurring in a sub-distribution and that it can have a large effect on overall performance even while being a modest fraction of training tokens. Repeated data causes a disproportionately large performance hit to copying, a mechanism for incontext learning. The ability of a language model to copy text (in the sense of being provided with a context consisting of a passage repeated several times, and testing whether the model can repeat it once more) is a potential measure of generalization, as copying is independent of the content of the text. Also, recent interpretability work has suggested that copying may be implemented by crisp internal algorithmic structures ([Olsson et al., 2022]), again suggesting generalization. 
It thus seems valuable to investigate what happens to copying during a memorization-related degradation in performance, which we have shown above occurs in our experiments. To do this constructed a simple evaluation in which copying is heavily emphasized: we measure the loss on the first paragraph of Harry Potter copied 11 times. The models trained on repeated data performed much worse on this evaluation (Figure 5), substantially out of proportion to the degradation on the loss itself. In other words, copying is preferentially harmed by training on repeated data. For example, a 3% fraction of repeated data leads to a 1.15x reduction in effective model size (performance equal to model with 1.15 fewer parameters) on the general loss, but a much larger 3x effective model size reduction in terms of copying ability. As can be seen in Figure 5, the damage to copying is greater than the damage to overall loss across the entire range of repeated data fractions. This suggests that the shift to memorization caused by repeated data is selectively harming at some behaviors associated with generalization. To get another view on the same phenomenon, we measured the loss of various models on the Xth consecutive copy of the Harry Potter paragraph, where X runs from 1 to 12. As shown in Figure 7 (left), for most models the loss gradually decreases with increasing numbers of copies of the paragraph (i.e. the model has an easier time predicting an additional copies after seeing more consecutive copies), but at the peak of the double descent phenomenon, the loss is much higher and, strikingly, does not decrease at all with additional copies of the paragraph. This large aberration shows how strong the selective effect of the double descent phenomenon on copying is. General in-context learning is also harmed at the pessimal number of repeated epochs (Figure 7 right), though to a lesser extent than copying. 6 Model Size Multiplier: Prefix Matching Score 1 Parameters 5 1,570,000 2 5,310,000 0.1 12,600,000 5 42,500,000 101,000,000 2 197,000,000 0.01 340,000,000 5 805,000,000 model size multiplier model size multiplier 1 Model Size Multiplier: Test Loss 2 Parameters 5 1,570,000 2 5,310,000 0.1 12,600,000 5 42,500,000 101,000,000 2 197,000,000 0.01 340,000,000 5 805,000,000 2 0.001 0.001 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 3 100 fraction repeated 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 100 fraction repeated Figure 6 Comparison of degradation of prefix matching score with repeated data, compared to general degradation of the test loss. We measured the double descent peak performance for a given model size and fraction of repeated data and compared that to a fit of the prefix matching score on the control model scan to generate an effective model size. We observe that 3% repeated data causes on average 21 a 1.47 model size multiplier on prefix matching score while causing less than a 1.15x model size reduction in effective model size on test loss. Again we see much larger effects on the prefix matching score than on overall performance for repeated data fractions between 3% and 20%. The model size multiplier for prefix matching is based on a linear fit (see Appendix C for more details of fit). The test loss shown on the right is the same graph as in Figure 5, but with differently scaled axes for ease of comparison. The disproportionate performance hit to copying coincides with a disproportionate degradation of induction heads. 
Having connected the damage associated with repeated data with a measure of generalization (in-context copying of text), we next took the connection one step further, by trying to also probe the potential mechanistic basis of copying. [Olsson et al., 2022] identifies “induction heads” as a possible basis for copying and in-context learning behavior in general, so we decided to measure these and try to connect them back to the repeated data double descent phenomenon. [Olsson et al., 2022] defines induction heads by their ability to facilitate simple copying given a repeated random sequence of tokens (though in practice this definition ends up including heads with more complex behaviors too). Induction heads use a circuit of 2 attention heads to "complete the pattern by copying and completing sequences." This can be split up into attending to the relevant token (prefix matching) and increasing the logit corresponding to the attended-to token. We decided to probe the prefix matching score as measure of mechanistic structure that is distinct from the behavior of copying itself. Figure 6 shows the same setup as Figure 5 except for prefix matching score instead of copying loss. As can be seen in the figure, preferential damage to prefix matching score is not present across the whole range of repeated data fraction as it is for copying, but at low fractions of data repeated, there is still preferential damage. For example, at 3% repeated tokens, there is a 2x effective parameter decrease in prefix matching score, but only a 1.15x effective parameter decrease in general (test) loss. As another example, we find it interesting that the sharp drop in prefix matching score for a 1.5M parameter model with 50% repetition corresponded to a complete breakdown of paragraph level copying. This complete breakdown of paragraph level copying corresponds to a 1.5M parameter model having the effective overall performance of a 30,000 parameter model, while having an equivalent prefix matching score to a model with effectively 2,000 parameters. Although not as conclusive as the previous results, these clearly show that prefix matching is preferentially degraded in some cases. One and two-layer attention only models are worse at copying and fuzzily copying proper names on inspection. To examine the effect on induction heads and in-context learning even more closely, we looked at more granular copying in one and two layer attention-only transformers, for which interpreting the internal structure (and especially induction heads) is known to be particularly straightforward [Elhage et al., 2021, Olsson et al., 2022]. That is, we can reverse engineer a large portion of attentiononly-transformers (no MLP’s) with a circuits-level understanding (understanding how individual neurons act together to produce useful behavior) [Cammarata et al., 2020]. These small models also exhibit the same double-descent phenomenon as larger models (Appendix B). 7 50% Repeated Data Completely Breaks Paragraph Level Copying for 2L Repetition Disrupts in Context Learning (2 Layer 50% Repeated Data) 9 paragraph copies 5 per token loss (test) 2 3 4 loss 4 3.5 5 6 3 7 8 9 2.5 Repeated Epochs 8 1 4.5 1,220,000 122,000 7 12,200 6 1,220 122 5 1 4 10 11 1 100 10k 1M 1 epochs on repeated tokens 2 5 10 2 5 100 2 5 1000 2 5 token index Figure 7 Degradation of copying and in-context learning at the peak of the double descent curve. 
On the left we show the 2-layer models trained on 50% repeated data from Figure 5, evaluated on the first paragraph of Harry Potter copied X times where X runs from 1 to 11. In Appendix D, we explore shortening the length of the paragraph to verify the problem is with copying rather than long contexts. The right shows per token losses on the test set. Both graphs show dramatically reduced performance (higher copying loss, lower benefit to in-context learning) at the peak of the double descent. Figure 8 Visualization of the difference in loss on the first paragraph of Harry Potter for control and 10%repeated-data runs of a 1-layer attention-only model. Orange highlights correspond to the control model performing better, purple corresponds to the repeated data performing, and the intensity corresponds to the magnitude of the difference in per token losses. Proper names (which are a good target for copying when they occur more than once) are underlined in yellow on second or later occurance; it is clear that the control model performs better on these. Often the difference is dramatic: for the last three appearances of “Potters” the control model puts a >97% chance on “ters” given “Pot”, whereas the repeated data model puts <4% chance on that token. For 1-layer attention only models, where copying takes the form of skip-trigrams, we can easily see that the repeated data model is worse at a form of copying associated with these skip trigrams. Namely, we compare the probabilities that the repeated data and control models assign to each token in a paragraph, and focus especially on proper names which occur repeatedly in the paragraph (Figure 8). The most obvious way to correctly predict these re-occurring names is by copying, and we see that in most cases the control model (trained on unique text) performs much better than the one with repeated data (yellow underlines). Very specifically, predicting repeated names requires exactly a skip-trigram pattern [Elhage et al., 2021] which is the algorithmic operation 1-layer attention-only models are known to perform. For example, the following skip-trigrams are useful in the Harry Potter paragraph in Figure 8: [a][b] . . . [a] => [b] [ P ot][ter] . . . [ P ot] => [ter] [a][b] . . . [a] => [b0 ] [ P ot][ter] . . . [ P ot] => [ters] 8 Figure 9 Same as Figure 9, but for 2-layer attention-only models. Proper names (which are a good target for copying when they occur more than once) are underlined in yellow on second or later occurance. Here the repeated-data model sometimes does better on repeated proper names, but there are still clear examples of the control performing much better. These examples are highlighted in green and discussed. On the token [ley] in the second appearance of [D][urs][ley] the control model places a 92% likelihood on [ley] whereas the repeated data model places a 10% likelihood. On the token [leys] in the second appearance of [D][urs][leys] the control model places a 44% likelihood on [leys] whereas the repeated data model places a 4.9% likelihood. On the [ley] in [ un][D][urs][ley][ish] the control model places a 68% likelihood on [ley] whereas the repeated data model places a 0.4% likelihood. 
1.2 Off Distribution: Model Size Multiplier Ratio of Python to Text multiplier ratio of python to text multiplier ratio of python to text Off Distribution: Model Size Multiplier Ratio of Python to Text Parameters 1,570,000 1 5,310,000 0.8 12,600,000 42,500,000 0.6 101,000,000 197,000,000 340,000,000 0.4 805,000,000 1 2 5 10 2 5 100 1 0.95 0.9 0.85 0.8 1 fraction repeated 2 5 10 2 5 100 fraction repeated Figure 10 We observe that training on high levels of repeated data causes a small disproportionate drop on out-of-distribution performance (Python loss). The effect is noisy, but since we do not see a model size effect we take the average in the figure on the right (harmonic mean of multipliers). For large repeated fractions of 50% and 90% we see model size multipliers of .84 and .75. We also plotted the same visualization for a 2-layer attention-only model (which is known to contain simple induction heads), and find the control model is better at fuzzy copying (Figure 9). Visually, it is less obvious (compared to the 1-layer case) that the 2-layer repeated model is worse at names, and there are a few examples where it puts 1.1x higher odds on the correct token. But on the other hand there are dramatic cases of the control model doing 500x times better (odds ratio on correct token) for fuzzy copying, like unDursleyish, which is exactly the kind of degradation we’d expect to see from disrupting induction heads. We attempted to leverage logit attribution (which earlier tokens contributed to the prediction of the current token through a "direct path" with this attention head) to see if the difference was primarily due to the induction head being less active or other heads interfering with it [Olsson et al., 2022]. We were unable to find clear evidence of either, but we include our exploration of a 2 layer attention only model in Appendix B. Repeated data causes a smaller, disproportionate performance drop on our out-of-distribution evaluations. 9 Repeated Python Data also causes Double Descent 2.2 Parameters 2 1,570,000 python test loss 1.8 5,310,000 1.6 12,600,000 1.4 42,500,000 101,000,000 1.2 197,000,000 340,000,000 1 805,000,000 0.8 1 100 10k 1M epochs on repeated tokens Figure 11 Double descent phenomenon for models trained on python. Training on Python gives similar results to what Figure 2 and Figure 4 show for language models. Here 50% of the dataset consists of repeats and 50% is unique. On the left side is degradation in performance, occurring over a specific range of repetition that varies with model size. On the right, we again see a large region of poor performance as we did in Figure 4, although the fit is noisier. Again the blue and green curves correspond to the right and left sides of the double descent curve where we observe 50% of the maximum effect. Given that we overfit the model, we expected it to perform worse off distribution, which we do observe (Figure 10). We notice almost an opposite pattern to what we observed in the induction head results. We see most of the disproportionate drop at 50% and 90% rather than 1-10%. We observe a double descent phenomenon in sparse sweep of models trained on python, but we the Python scans exhibit a somewhat different overall shape. To add more generality to our results, we repeated the same experiments on a Python dataset instead of natural language (Figure 11). 
If we use the same method to fit the poor performance region, we see a broadly similar fit and a second epoch for today’s large models (approximately 200B parameters) is still robustly in the reduced performance region for python. However the fit is noisier than the fit for text and the two lines are no longer parallel. The noise may partially be explained by the Python fits being averaged over half as many settings for the fraction of tokens that are repeated data. It could also be that we need a higher resolution Python scan to get a cleaner estimate for the poor performance region. Finally, the Python data was trained on approximately 2 epochs as described in the methods section (so it included some repetition on the main dataset as well, not just the repeated subset). Python also may have more unintentional repetition than text, from copying and pasting of example code and forking of codebases. Such repetition could change the shape of the region of poor performance. More analysis of the Python experiments is shown in Appendix A. Pre-training on repeated data hurts fine-tuned performance We find that the negative impact of repeated data persists after fine-tuning natural-language models on Python (Figure 12). It is noteworthy that the performance hit once fine-tuned is much smaller. An 800M model pre-trained on 50% repeated data from the double descent peak had its effective parameters reduced by 10x in Figure 15 in Appendix A. When we fine-tune from the repeated model we see a 1.6x reduction in effective parameters compared to training from scratch. This is still meaningful damage to the model, but it is recovered substantially. Since the repeated model forgets the repeated dataset after a modest amount of fine-tuning (Figure 12, we consider the fine-tuned model with repeated data pre-training to be dominated by the fine-tuned model from the unique dataset. 3 Methods The decoder-only transformer models were trained on an 8192 token context with the same settings as described in [Askell et al., 2021] for 100B tokens. Our language experiments utilized a 400B token dataset with 55% heavily filtered common crawl data (220B tokens), 32% internet books (128B tokens), and some smaller distributions including OpenWebText, Wikipedia, and Stack Exchange; most of which we sourced from The Pile [Gao et al., 2021], and leveraged the 50,304 vocabulary GPT-2 encoding [Radford et al., 2019, Wolf et al., 2019]. 10 Pretraining on Repeated Data Hurts Finetuned Performance finetune without repetition 5 4.5 4 finetune with repetition 3.5 scan 1.2 1.1 dataset text test loss repeated loss 3 from scratch 1 loss python test loss Majority of Overfitting on Repeated Data is Lost Quickly 0.9 2.5 2 1.5 0.8 10M 2 5 100M 2 5 1B 0 parameters 20B 40B 60B 80B python tokens finetuned Figure 12 Effect of repeated data during pre-training on fine-tuning. Models were pre-trained on 90% repeated data (red lines) or on totally unique data (blue lines), and then fine-tuned on Python (always unique data). The repetition frequency was chosen to maximize the performance hit. The model pre-trained on repeated data encounters a sizable performance hit during fine-tuning (left panel), causing it to not only perform worse than the model pre-trained on unique data, but also worse than a model trained from scratch (green line). The right panel shows fine-tuning curves of the two models. 
The model pretrained on repeated data performs much worse for several billion tokens (red line), but eventually catches up to the model pretrained on unique data (blue line). Code models were trained or fine-tuned on 45B tokens of Python for 2.2 epochs. Fine-tuning experiments had the same hyperparameters as pre-training experiments, but with learning rates reduced by a factor of 2 and reduced warmups. We varied model size, repeated dataset size, and the fraction of tokens trained on repeated data by 3, 2.5, and 2 orders of magnitude respectively. 4 Related Work Scaling Laws A scaling law lens consists of finding a small set of hyperparameters that have large, predictable impacts on model performance, and was present throughout this work (at least one of the hyperparameters is generally model size, compute, or dataset size). The predictive nature of scaling laws makes them useful in a broad number of research and engineering settings. The implications of scaling laws are sufficiently broad and understandable that understanding them is relevant to policy makers [Ganguli et al., 2022]. Predictable scaling trends in neural networks were first studied with [Hestness et al., 2017]. [Kaplan et al., 2020] demonstrated that test loss performance on language modeling tasks scales as a predictable function of model size, dataset size, and compute. The scaling law lens has become more popular over time. For instance scaling laws have been shown in many modalities (e.g., images, video, math, etc.) [Henighan et al., 2020], acoustics [Droppo and Elibol, 2021], transfer to code, [Hernandez et al., 2021], and few-shot adaptation of vision models [Prato et al., 2021]. Existing scaling laws have been revisited as training setups change; for instance, [Hoffmann et al., 2022] found that many recent large models have been under-trained. Our work uses the scaling law lens on an aspect of dataset quality and supplements the lens with an interpretability lens, and we believe our work is novel in both these respects. Mechanistic Interpretability A mechanistic interpretability lens was used in this work. Mechanistic interpretability refers to attempting to reverse engineer the detailed computations performed by the model. The mechanistic interpretability lens is useful for pure scientific understanding and has the potential to anticipate safety issues from future more powerful models. There is a relatively detailed understanding of mechanistic interpretability for convolutional image models [Cammarata et al., 2020], some understanding for multimodal models [Goh et al., 2021, Radford et al., 2021], and such an understanding is starting to be built up for Transformers trained on language [Elhage et al., 2021, Olsson et al., 2022]. For a more thorough background on interpretability progress see the related work section of [Elhage et al., 2021]. These results are an example of a “bridge” between microscopic phenomena inside the network and macroscopic trends in the loss, and we’re only aware of one other example of such a bridge [Olsson et al., 2022]. 11 Double Descent Double descent was first shown in generality by Belkin et al. [Belkin et al., 2018] where it was observed for decision trees, random features, and 2-layer neural networks. Similar behavior has been observed in [Opper, 1995, Malzahn and Opper, 2001, Advani and Saxe, 2017, Geiger et al., 2019, Nakkiran et al., 2019]. For a more thorough background on double descent see Nakkiran et al. [Nakkiran et al., 2019]. 
We extend the double descent phenomenon to a setting we see as more practical since data repetition in various forms appears to be a universal, long-term issue; whereas modern large language models are generally outside of the parameters and data regime of previously observed double descent phenomenon. Rise of Engineering Large, Diverse Language Datasets Algorithmic innovation [Hernandez and Brown, 2020], compute [Amodei et al., 2018], and data are three of the major factors that drive the advance of AI. The engineering and science of large, diverse language datasets is relatively new. Pre-2017 many language models were trained on a single distribution of text, such as news articles [Jozefowicz et al., 2016], Wikipedia [Merity et al., 2016], or fiction books [Kiros et al., 2015]. GPT-2 [Radford et al., 2019] leveraged webtext, outbound Reddit links with at least 3 upvotes in order to use human curation/filtration to ensure quality in addition to a broad distribution. GPT-2’s capabilities are largely attributed to its scaled-up size and dataset (10x the parameters and 10x the data of GPT) [Radford et al., 2019]. The next generation of language models, [Brown et al., 2020, Rae et al., 2021, Hoffmann et al., 2022], leveraged large, diverse datasets that consist of many sub-distributions. Constructing such datasets includes a large number of decisions: choosing sampling weights, quality filtering, de-duplication, fuzzy de-duplication, epochs per dataset, and more. There has not yet been substantial public work that quantitatively shows the impact of such decisions, but the dataset ablations in Appendix A of the Gopher [Rae et al., 2021] paper are notable. They clearly show the benefit of their dataset mixture, quality filter, exact de-duplication, and fuzzy de-duplication for 1.4B parameter models. Our work aims to provide some insights and potential diagnostics for researchers and engineers designing large datasets for language models. 5 Discussion 5.1 Why does repeating a small fraction of data damage performance so much? We showed that a dataset with only 10% repeated tokens can reduce model performance by an effective 2x in parameter count, much more than if that 10% of the data had simply never been trained on. The repeated data thus degrades model performance out of proportion to its share in the dataset. Why does this occur, and why only for a specific amount of repetition? One plausible hypothesis comes from looking at the model’s “incentives” to memorize vs generalize. To informally explore this hypothesis consider the following rough numbers, a 800M parameter model typically has a loss of roughly 2.0 nats/token, a 400M parameter model has a loss of roughly 2.2 nats/token, and fully memorized data will have a loss of 0 nats/token. Now suppose a 800M model is trained on 90% unique data and 10% tokens consisting of repeated data. We can ask whether it is a “good tradeoff” for the model to memorize the repeated data (leading to 0 loss on 10% of the dataset), at the cost of degrading performance by the equivalent of a 2x multiple in model size (which raises loss on the other 90% from 2 to 2.2). Some simple arithmetic suggests that it is: 0.9 ∗ 2.2 + 0.1 ∗ 0 = 1.98 < 2.0. Another way to say this is that zero loss is such a huge drop compared to the differences in entropy between model sizes that driving the loss to zero on even a tiny subset can incentivize enormous degradation in quality. 
This however leaves open the question of when this tradeoff is necessary or possible – and here is where the double descent phenomenon comes in. If a lot of data is repeated only a few times (say 5% of the data repeated 2x) then the model may not have the capacity to memorize it, and also does not see it enough times during training to do so. If a tiny amount of data is repeated very many times (say 0.01% of the data repeated 1000x), then the model will memorize it, but because it is so small the model need not use much capacity to do so, so the degradation in quality will likely be small. There is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model’s capacity, and this may be where the peak of degradation occurs. 5.2 Generalization, memorization, and induction heads Our results show that overfitting on the repeated data results in worse test loss, and this co-occurs with a disproportionate degradation in the model’s induction heads (prefix matching score) and its ability to copy text. Copying sequences can be seen as a form of generalization, as it requires algorithmic operations that are independent of the content of the data. [Elhage et al., 2021, Olsson et al., 2022] provided evidence for induction heads as the mechanism implementing copying and other pattern-matching. For the 2 layer model 12 shown in Figure 7 it seems as if the pressure to memorize the repeated dataset has led a skip tri-gram head to replace the induction head entirely. Thus our results tell a story where a type of generalization and its internal implementation are disrupted when the model memorizes repeated data – a vivid illustration of the memorization-generalization trade-off. Future work could take this even further, by measuring the number of parameters devoted to memorization and trying to observe them competing for space with induction heads. Finally, it is worth noting that the co-occurence of copying degradation and induction head degradation is itself some additional evidence for induction heads as the source of in-context learning; Olsson et al. [Olsson et al., 2022] was not fully conclusive and our results further bolster the case. 5.3 Bridging mechanistic interpretability and scaling laws The results connecting memorization to the degradation of mechanistic interpretability structures [Olsson et al., 2022] are an example of a “bridge” between microscopic phenomena inside the network and macroscopic trends in the loss. We view such connections as very fruitful tools for research, because they allow us to see the same thing through different lenses: the macroscopic behavior demonstrates the significance of the microscopic mechanisms, and the microscopic mechanisms help explain how and why the macroscopic phenomena occur. Switching back and forth between the two allows for a deeper understanding of both, as well as more robust diagnostics if something goes wrong. We are aware of at least one other instance of such a bridge – the correspondence between the formation of induction heads and the boost in in-context learning near the beginning of training [Elhage et al., 2021, Olsson et al., 2022] – but such connections remain rare so far, and we believe that finding more of them is a promising route to more deeply understanding neural nets. 5.4 Repeated data and fine-tuning We hypothesized repetition might help explain why models trained from scratch sometimes outperformed models that were pre-trained and then fine-tuned [Hernandez et al., 2021]. 
For our purposes, we define ossification as any pre-training that leads a fine-tuned model to perform worse than a model trained from scratch (given a fixed compute and data budget). It required relatively extreme repetition in pre-training (90% training on repeated tokens at peak of double descent curve, 73x reduction in effective model size) to see a large ossification effect (1.6x reduction in effective model size) within our fine-tuning setup. We still think repetition might explain a large fraction of ossification when we consider training on various types of repetition we did not study here (sentence level, paragraph level, similar documents, distribution, etc). Overall, our finding that repetition can induce ossification provides medium causal evidence to this hypothesis. We think ossification is an interesting phenomenon that merits further study. 5.5 Limitations We attempt to discuss limitations throughout the text where appropriate, but for the reader’s convenience, we enumerate them here. We attempt to list them in a loosely descending order of importance. 1. We used a fixed number of tokens for all models (similar to the GPT-3 model sweep), because these models were trained prior to the release of Chinchilla, which showed the compute frontier (pareto frontier of performance and compute) is quite different than previously understood [Brown et al., 2020, Hoffmann et al., 2022]. 2. Our fits for region of poor performance were relatively noisy, and we only observed a clean trend by aggregating them. This is discussed in the Results section and further explored in Appendix A. 3. The data we repeated was a random subset of the original dataset, and is thus not directly applicable to the situation where higher quality data (such as Wikipedia) is intentionally repeated to improve quality. Nevertheless, it seems plausible that the results would carry over. 4. We measured loss, rather than downstream NLP evaluations. Overfitting does not always entail worse performance on downstream tasks [Ouyang et al., 2022], so it is possible that the degradation we observe does not carry over to these tasks. 5. We did not explore the effects of early stopping, dropout, weight decay, or other regularization. 6. We did not investigate simpler systems than 1L attention-only models, which might contain more complete mechanistic insights. 13 5.6 Future Directions Below are some future directions we think are promising: 1. A compute efficient frontier scan to predict the poor performance region. 2. Varying the type of repetition. We could inject repeated sentences or paragraphs at the beginning or end of some fraction of contexts, or repeat chunks of documents in a different order. We could also explore cases where the repeated data has a different distribution than the unique data. 3. Further interpretability work. Are there neurons that tell the model what distribution it is in: unique or repeated? Are there neurons through which we can observe and edit the repeated sequences? 4. Drill down on memorization and generalization. Could we measure the number of model parameters taken up by memorization vs generalization, either behaviorally or by using mechanistic interpretability to identify parameters that are storing memorized data? Can we measure how this varies across the double descent, and thus watch the competition between memorized data and induction heads for model capacity? 5. Could repetition and double descent help explain loss spikes during training? 
If a model can largely memorize a particularly easy batch in a single gradient step then a very skinny double descent could present as a loss spike. 6 Conclusion We’ve shown that small fractions of repeated data, if repeated at the right frequency, can cause surprisingly severe degradation to model performance. We show that this degradation scales predictably, occurs across datasets, and is associated with disproprotionate damage to internal mechanisms associated with generalization, such as induction heads. In practical terms, these results provide a tool for predicting and diagnosing data-repetition-related problems in language models. In more conceptual terms, they are an example of a bridge between the macroscopic domain of scaling laws and the microscopic domain of mechanistic interpretability, as well as a lens for gaining a more detailed understanding of how generalization and memorization work. We believe these conceptual themes are promising ones, and hope to see more work that employs them. Acknowledgments We thank Ethan Perez, Jan Leike, and Martin Wattenberg for helpful feedback on the draft. We thank Daniela Amodei, Jamie Kerr, Jia Yuan Loke, Rebecca Raible, and Tim Telleen-Lawton for support with the project. Author Contributions Danny Hernandez led the project performed the majority of experiments, analysis, and writing. Tom Brown led engineering efforts for the scaling team, including efficient pre-training and gave helpful feedback on the paper. Tom Conerly made engineering contributions on the scaling team. Nova DasSarma managed the underlying cluster infrastructure. Dawn Drain helped with pre-training research and infrastructure. Sheer El-Showk helped with pretraining research and dataset construction. Nelson Elhage contributed significantly to interpretability tooling, provided support on that tooling, and gave helpful feedback. Zac Hatfield-Dodds helped with codebase maintenance and with engineering Tom Henighan helped with pretraining the underlying language models, with dataset creation, with managing the cluster during some phases of the project, and gave helpful feedback on the paper. Tristan Hume contributed to interpretability tooling that was leveraged in this work. 14 Scott Johnston helped with pretraining research. Ben Mann contributed to pretraining and cluster management. Chris Olah lead the interpretability team, which provided tooling and support for this work. Catherine Olsson contributed to interpretability tooling, provided support on that tooling, and provided interpretability research advice. Dario Amodei contributed greatly to the framing and writing of the work and advised the project. Nicholas Joseph helped design and build a framework for efficient training of large language models, gave helpful feedback on the paper, and advised the project. Jared Kaplan led pre-training efforts initially and advised the project. Sam McCandlish led pre-training efforts and advised the project. 15 Power Law fit For Control Scan on Language Data Power Law fit For Control Scan on Python Data 3.4 3.2 fit actual fit 1.4 2.8 test loss test loss 1.6 actual 3 2.6 2.4 1.2 1 2.2 0.8 2 2 5 10M 2 5 100M 2 5 2 1B 5 10M parameters 2 5 100M 2 5 1B parameters Figure 13 We see a power laws provide good fits for both language and Python data. We can use these fit to re-parameterize loss for our models trained on repeated data into model size multipliers. 
A Model Size Multiplier and Poor Performance Region Fits In order to fit the poor performance regions we first fit power laws to our control scans on language and Python so that we can re-parameterize loss in terms of model size multipliers. These fits are shown in Figure 13 When we graph repeated epochs vs model size multiplier with a given fraction of repeated data in Figure 15, we observed that our 1% repeated data graphs were quite noisy, so we excluded the 1% scans from the fits. The 3% repeated data graphs looked reasonable, in that the double descent peak looked large compared to the noise, so we included that all higher fractions in our fits. We estimate how many repeated epochs half of the maximum effect size (on a log scale) would be observed using linear interpolation on the left and right side of the double descent peak for each fraction of repeated data. We then averaged these curves to make an overall estimate for the left and right boundaries of the poor performance region shown in Figure 4 and Figure 11. For text this produces a relatively clean overall fit, but the the individual curves for text are relatively noisy as shown in 14. Some potential explanations for the noise are i) Given the resolution of our scan we do not always get a good estimate of the peak effect for a given curve (the peak can easily be between two points we measured ii) our linear interpolation also introduces error as our underlying curves only have 6 points. 50% of Max Double Descent Effect for Text (Left Boundary) 50% of Max Double Descent Effect for Text (Right Boundary) 2 fraction 2 10 20 5 50 90 2 100 5 repeated epochs repeated epochs 1000 fraction 100k 3 3 10 5 20 2 50 10k 90 5 2 1000 2 5 2 5 10M 2 5 100M 2 5 2 1B parameters 5 10M 2 5 100M 2 5 1B parameters Figure 14 We estimate how many repeated epochs half of the maximum effect size (on a log scale) would be observed using linear interpolation on the left and right side of the double descent peak for each fraction of repeated data. We then averaged these curves to make an overall estimate for the left and right boundaries of the poor performance region shown in Figure 4 Overall we think the region of poor performances we showed in Figure 4 is relatively robust in that it is useful to think about the sub distribution double descent phenomena there. However, we would not claim that we have produced extremely accurate estimates for the exact boundaries, even in our setup, and the boundaries could vary meaningfully given a different setup, especially differences in regularization. 16 Model Size Multiplier: 1% Repeated Data 1.02 Parameters 5,310,000 1 12,600,000 0.99 42,500,000 101,000,000 0.98 197,000,000 340,000,000 0.97 Parameters 1 1,570,000 805,000,000 model size multiplier 1.01 model size multiplier Model Size Multiplier: 3% Repeated Data 1,570,000 0.98 5,310,000 0.96 12,600,000 0.94 42,500,000 101,000,000 0.92 197,000,000 0.9 340,000,000 0.88 805,000,000 0.86 0.96 0.84 1 100 10k 1 100 repeated epochs Model Size Multiplier: 10% Repeated Data Model Size Multiplier: 50% Repeated Data Parameters 1 0.9 5,310,000 12,600,000 42,500,000 0.8 101,000,000 197,000,000 0.7 Parameters 1 1,570,000 model size multiplier model size multiplier 10k repeated epochs 340,000,000 805,000,000 0.6 1,570,000 5 5,310,000 12,600,000 42,500,000 2 101,000,000 0.1 197,000,000 340,000,000 5 805,000,000 2 1 100 10k 1 100 repeated epochs 10k 1M repeated epochs Figure 15 it is easier to see the sharpness of the double descent peaks in this diagram than Figure 2. 
The 1% runs were much noisier than the rest, so we excluded them from our fits.

Figure 16 (panels: "50% of Max Double Descent Effect for Python (Left Boundary)" and "50% of Max Double Descent Effect for Python (Right Boundary)"; repeated epochs vs. parameters, one curve per fraction of repeated data) We estimate the number of repeated epochs at which half of the maximum effect size (on a log scale) would be observed for our Python models, using linear interpolation on the left and right sides of the double descent peak for each fraction of repeated data. We then averaged these two curves to make an overall estimate for the left and right boundaries of the poor performance region shown in Figure 11.

For Python, the aggregate shown in Figure 11 is quite a bit noisier. A lot of the noise is explained by aggregating only two scans rather than 5, but we see the individual scans for Python are also noisier, as shown in Figure 16.

B Appendix: Logit Attribution Analysis, 2 Layer Models

For attention-only models we can directly attribute contributions of the attention heads to the logits. We attempted to use this technique to better understand how the induction heads were disrupted for 2 layer models. For instance, it could be that they were firing more weakly, or it could be that activity from other attention heads was interfering with their ability to copy. Overall it feels like both effects happen weakly, and that it was easier to understand the disruption to induction heads through the per-token losses shown in Figures 8 and 9 than through logit attribution.

Figure 17 For attention-only models we can directly attribute contributions of the attention heads to the logits, as shown in [Elhage et al., 2021, Olsson et al., 2022]. Both models were evaluated on the first paragraph of Harry Potter copied twice. The induction head appeared to be head 0, shown in red for both models. The control model's logit attribution is shown for the first two paragraphs, and the third paragraph shown is from the repeated model at the double descent peak for comparison.

Figure 18 For attention-only models we can directly attribute contributions of the attention heads to the logits, as shown in [Elhage et al., 2021, Olsson et al., 2022]. Similar to Figure 17, both models were evaluated on the first paragraph of Harry Potter copied twice, but here the contribution of all attention heads is shown. The other attention heads in the repeated data model appear more active (several of the reddish tokens in the second paragraph are brown in the third paragraph).

Figure 19 (title: "1-Layer Attention only Models also show Double Descent"; test loss vs. epochs on repeated tokens) We still observe double descent on repeated data with 1 layer attention-only models, so it is possible we'd observe double descent on repeated data for simpler model types.
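For readers who want to see what the direct logit attribution used in this appendix looks like in code, here is a minimal sketch assuming cached per-head residual-stream outputs and ignoring layer norm. The function name, array shapes, and random inputs are hypothetical illustrations, not the interpretability tooling used in the paper.

```python
import numpy as np

def per_head_logit_attribution(head_outputs, W_U, target_tokens):
    """
    Direct logit attribution for an attention-only model, in the spirit of
    Elhage et al. (2021): because the residual stream is a sum of the embedding
    and each head's output, the correct-token logit decomposes into a per-head
    contribution plus an embedding term (ignored here).

    head_outputs  : [n_heads, seq_len, d_model], each head's write into the residual stream
    W_U           : unembedding matrix [d_model, vocab_size]
    target_tokens : [seq_len] correct-next-token ids

    Returns [n_heads, seq_len]: head h's direct contribution to the correct-token
    logit at each position.
    """
    # Project every head's residual-stream write onto the unembedding direction
    # of the correct next token at each position.
    correct_dirs = W_U[:, target_tokens]            # [d_model, seq_len]
    return np.einsum("hsd,ds->hs", head_outputs, correct_dirs)

# Toy shapes only; real head outputs would come from a cached forward pass.
rng = np.random.default_rng(0)
n_heads, seq_len, d_model, vocab = 12, 8, 64, 100
attribution = per_head_logit_attribution(
    rng.normal(size=(n_heads, seq_len, d_model)),
    rng.normal(size=(d_model, vocab)),
    rng.integers(vocab, size=seq_len),
)
print(attribution.shape)  # (12, 8)
```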
C Appendix: Copying and Prefix Matching Score Fits

Figure 20 (left panel: "Harry Potter 1st Paragraph Repeated 11 Times: Control", loss vs. parameters; right panel: "10% Repeated Data: Loss on HP 1st Paragraph copied 11 times", loss vs. repeated epochs, one curve per model size) In order to do the model size interpolation used in Figure 5, we use the loss on Harry Potter's first paragraph copied 11 times for our control models (no repeated data). It is relatively well behaved, but it was not obvious how to extrapolate the curve. On the right, as a sanity check, we make sure we still see peaks moving left as model size increases, approximately lining up with what was observed in Figure 2.

Figure 21 (left panel: "Prefix Matching Score with 90% repetition", prefix matching score vs. parameters; right panel: "Model Size Multiplier: Prefix Matching Score Averaged", model size multiplier vs. fraction repeated) For the model size multiplier in Figure 6, we use a linear fit on the prefix matching score for the control models, shown on the left. On the right, similar to Figure 10, we show that if we take an average over model size (harmonic mean of the multiplier), we get a relatively clean relationship.

D Appendix: Harry Potter Copying Evaluation with Fewer Characters

Figure 22 (title: "50% Repeated Data, 2L, First 125 Characters of Paragraph"; loss vs. epochs on repeated tokens, one curve per number of paragraph copies from 1 to 11) In order to make sure the copying eval was not merely evaluating in-context learning, we tried a much shorter copied sequence (approximately 10x shorter, 125 characters instead of 1463). We still observe approximately no learning from repeated copying for the 2L model trained on 50% repeated data at the double descent peak.

References

[Advani and Saxe, 2017] Advani, M. S. and Saxe, A. M. (2017). High-dimensional dynamics of generalization error in neural networks.

[Amodei et al., 2018] Amodei, D., Hernandez, D., Sastry, G., Clark, J., Brockman, G., and Sutskever, I. (2018). AI and compute. Downloaded from https://blog.openai.com/ai-and-compute.

[Askell et al., 2021] Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Kernion, J., Ndousse, K., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, C., and Kaplan, J. (2021). A general language assistant as a laboratory for alignment.

[Belkin et al., 2018] Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2018). Reconciling modern machine learning practice and the bias-variance trade-off.

[Bi et al., 2020] Bi, B., Li, C., Wu, C., Yan, M., Wang, W., Huang, S., Huang, F., and Si, L. (2020). Palm: Pre-training an autoencoding and autoregressive language model for context-conditioned generation.

[Brown et al., 2020] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020). Language models are few-shot learners.
[Cammarata et al., 2020] Cammarata, N., Carter, S., Goh, G., Olah, C., Petrov, M., Schubert, L., Voss, C., Egan, B., and Lim, S. K. (2020). Thread: Circuits. Distill. https://distill.pub/2020/circuits.

[Droppo and Elibol, 2021] Droppo, J. and Elibol, O. (2021). Scaling laws for acoustic models.

[Elhage et al., 2021] Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread. https://transformer-circuits.pub/2021/framework/index.html.

[Ganguli et al., 2022] Ganguli, D., Hernandez, D., Lovitt, L., DasSarma, N., Henighan, T., Jones, A., Joseph, N., Kernion, J., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Elhage, N., Showk, S. E., Fort, S., Hatfield-Dodds, Z., Johnston, S., Kravec, S., Nanda, N., Ndousse, K., Olsson, C., Amodei, D., Amodei, D., Brown, T., Kaplan, J., McCandlish, S., Olah, C., and Clark, J. (2022). Predictability and surprise in large generative models.

[Gao et al., 2021] Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. (2021). The pile: An 800gb dataset of diverse text for language modeling.

[Geiger et al., 2019] Geiger, M., Spigler, S., d'Ascoli, S., Sagun, L., Baity-Jesi, M., Biroli, G., and Wyart, M. (2019). Jamming transition as a paradigm to understand the loss landscape of deep neural networks. Physical Review E, 100(1):012115.

[Goh et al., 2021] Goh, G., Cammarata, N., Voss, C., Carter, S., Petrov, M., Schubert, L., Radford, A., and Olah, C. (2021). Multimodal neurons in artificial neural networks. Distill. https://distill.pub/2021/multimodal-neurons.

[Henighan et al., 2020] Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T. B., Dhariwal, P., Gray, S., Hallacy, C., Mann, B., Radford, A., Ramesh, A., Ryder, N., Ziegler, D. M., Schulman, J., Amodei, D., and McCandlish, S. (2020). Scaling laws for autoregressive generative modeling.

[Hernandez and Brown, 2020] Hernandez, D. and Brown, T. B. (2020). Measuring the algorithmic efficiency of neural networks. CoRR, abs/2005.04305.

[Hernandez et al., 2021] Hernandez, D., Kaplan, J., Henighan, T., and McCandlish, S. (2021). Scaling laws for transfer. arXiv preprint arXiv:2102.01293.

[Hestness et al., 2017] Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. (2017). Deep learning scaling is predictable, empirically.

[Hoffmann et al., 2022] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. v. d., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. (2022). Training compute-optimal large language models.

[Jozefowicz et al., 2016] Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and Wu, Y. (2016). Exploring the limits of language modeling.

[Kaplan et al., 2020] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling laws for neural language models.
[Kiros et al., 2015] Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Torralba, A., Urtasun, R., and Fidler, S. (2015). Skip-thought vectors.

[Lee et al., 2021] Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Carlini, N. (2021). Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499.

[Malzahn and Opper, 2001] Malzahn, D. and Opper, M. (2001). A variational approach to learning curves. In Dietterich, T., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems, volume 14. MIT Press.

[Merity et al., 2016] Merity, S., Xiong, C., Bradbury, J., and Socher, R. (2016). Pointer sentinel mixture models.

[Nakkiran et al., 2019] Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. (2019). Deep double descent: Where bigger models and more data hurt.

[Olsson et al., 2022] Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. (2022). In-context learning and induction heads. Transformer Circuits Thread. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.

[Opper, 1995] Opper, M. (1995). Statistical mechanics of learning: Generalization.

[Ouyang et al., 2022] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. (2022). Training language models to follow instructions with human feedback.

[Prato et al., 2021] Prato, G., Guiroy, S., Caballero, E., Rish, I., and Chandar, S. (2021). Scaling laws for the few-shot adaptation of pre-trained image classifiers.

[Radford et al., 2021] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models from natural language supervision.

[Radford et al., 2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

[Rae et al., 2021] Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., Rutherford, E., Hennigan, T., Menick, J., Cassirer, A., Powell, R., Driessche, G. v. d., Hendricks, L. A., Rauh, M., Huang, P.-S., Glaese, A., Welbl, J., Dathathri, S., Huang, S., Uesato, J., Mellor, J., Higgins, I., Creswell, A., McAleese, N., Wu, A., Elsen, E., Jayakumar, S., Buchatskaya, E., Budden, D., Sutherland, E., Simonyan, K., Paganini, M., Sifre, L., Martens, L., Li, X. L., Kuncoro, A., Nematzadeh, A., Gribovskaya, E., Donato, D., Lazaridou, A., Mensch, A., Lespiau, J.-B., Tsimpoukelli, M., Grigorev, N., Fritz, D., Sottiaux, T., Pajarskas, M., Pohlen, T., Gong, Z., Toyama, D., d'Autume, C. d. M., Li, Y., Terzi, T., Mikulik, V., Babuschkin, I., Clark, A., Casas, D. d. L., Guy, A., Jones, C., Bradbury, J., Johnson, M., Hechtman, B., Weidinger, L., Gabriel, I., Isaac, W., Lockhart, E., Osindero, S., Rimell, L., Dyer, C., Vinyals, O., Ayoub, K., Stanway, J., Bennett, L., Hassabis, D., Kavukcuoglu, K., and Irving, G. (2021). Scaling language models: Methods, analysis, and insights from training gopher.

[Vaswani et al., 2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need.

[Wolf et al., 2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. (2019). Huggingface's transformers: State-of-the-art natural language processing.