An Empirical Exploration in Quality Filtering of Text Data

Leo Gao
EleutherAI
lg@eleuther.ai

arXiv:2109.00698v2 [cs.CL] 6 Oct 2021

Abstract

While conventional wisdom suggests that more aggressively filtering data from low-quality sources like Common Crawl always monotonically improves the quality of training data, we find that aggressive filtering can in fact lead to a decrease in model quality on a wide array of downstream tasks for a GPT-like language model. We speculate that this is because optimizing sufficiently strongly for a proxy metric harms performance on the true objective, suggesting a need for more robust filtering objectives when attempting to filter more aggressively. We hope this work leads to detailed analysis of the effects of dataset filtering design choices on downstream model performance in future work.

1 Introduction

As language models increase in size, the need for large, high-quality text datasets has increased as well. Recent work in dataset construction for large language models has centered largely on taking large internet corpora like Common Crawl and employing some method of filtering, using some proxy for quality, to extract a smaller, high-quality training set (Wenzek et al., 2019; Brown et al., 2020; Raffel et al., 2020; Yang et al., 2020). In particular, we focus on shallow classifier-based quality filtering as in Brown et al. (2020) because it provides a simple, continuous, and quantifiable way to adjust the aggressiveness of filtering, and because this reflects the type of classifier used in prior work.

While intuitively it may seem that the more data is discarded, the higher quality the remaining data will be, we find that this is not always the case with shallow classifier-based filtering. Instead, we find that filtering improves downstream task performance up to a point, but then decreases performance again as the filtering becomes too aggressive. We speculate that this decrease in performance is due to Goodhart's law (Goodhart, 1984), and specifically regressional Goodharting (Manheim and Garrabrant, 2019):

Goodhart's Law. Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes. (Goodhart, 1984)

In other words, optimizing a metric that is a proxy for a desired outcome tends to invalidate the proxy. By optimizing too strongly for the classifier's score and discarding too many low-scoring documents, the documents that are kept are consistently biased towards those whose features superficially resemble the high-quality data in a way that satisfies the classifier, rather than towards truly high-quality data.

Figure 1: Average accuracy across all 13 tasks for various filtering ratios using a shallow quality classifier.¹ The amount of data post-filtering is held constant. Although filtering improves performance at first, discarding more data can actually reduce accuracy, due to misalignment between the filtering classifier's objective and text quality.

¹ The average is taken across all task accuracies, with each task weighted equally. The error bars in this plot represent standard error and are computed as se_mean = (1/n) √(Σ_i se_i²), where se_i represents the standard error for each individual task.
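For concreteness, the error-bar aggregation in the footnote above can be sketched as follows (this is not code from the paper; the per-task standard errors used here are purely illustrative placeholders):

import numpy as np

# Hypothetical per-task standard errors; in the paper these would come from
# the individual evaluation tasks, but these numbers are made up.
per_task_se = np.array([0.011, 0.014, 0.009, 0.013])

n = len(per_task_se)
# Standard error of the unweighted mean of n task accuracies:
# se_mean = (1/n) * sqrt(sum of se_i^2).
se_mean = np.sqrt(np.sum(per_task_se ** 2)) / n
print(se_mean)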
2 Related work

The recent proliferation of ever larger language models has led to increasing demands on training data (Radford et al., 2018, 2019; Gokaslan and Cohen, 2019; Rosset, 2019; Shoeybi et al., 2019; Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2020; Brown et al., 2020; Zeng et al., 2021). This data is increasingly derived from internet corpora like Common Crawl (Radford et al., 2019; Ortiz Suárez et al., 2019; Wenzek et al., 2020; Conneau et al., 2020; Brown et al., 2020; Gao et al., 2020; Raffel et al., 2020). However, the quality of raw Common Crawl data is often too low for it to be used directly. To combat this, many existing works use some kind of proxy for quality, like a classifier between known high-quality data and low-quality data (Brown et al., 2020; Gao et al., 2020; Zeng et al., 2021), handcrafted heuristics (Yang et al., 2020; Raffel et al., 2020), or keeping only documents whose perplexity under an existing language model falls in some middle quantile (Wenzek et al., 2020). Brown et al. (2020) in particular filter extremely aggressively using their classifier, discarding about 98.7% of their data.

Previous work has shown that models trained on heuristic-filtered datasets perform better on downstream tasks (Raffel et al., 2020). However, Gao et al. (2020) show that a perplexity-filtered CC-derived dataset actually performs worse than unfiltered CC on certain tasks. Brown et al. (2020) do not provide any detailed analysis, but claim better quality for filtered data as evaluated through loss on held-out sets of "generative text samples."

3 Downstream Evaluation Experiment

To evaluate the effect of different degrees of filtering, we create a series of training sets with a controlled filtering methodology but with different hyperparameter settings to produce varied filtering ratios. We filter using the same method as Brown et al. (2020): a Pareto-distribution-thresholded filtering method with a shallow CommonCrawl-WebText classifier. In this method, rather than using a hard threshold, the threshold τ ∼ Pareto(α) is sampled from a Pareto distribution, such that each document is kept if τ > 1 − score, where α is a hyperparameter that controls the permissivity of the filter (see Table 1). In effect, this relaxes the filter when compared to a hard threshold and allows some low-scoring data to be kept.

α    Fraction Discarded
1    0.4107
2    0.6351
3    0.7610
4    0.8329
5    0.8761
6    0.9026
7    0.9198
8    0.9315

Table 1: Fraction of documents discarded at various settings of α using our classifier.

As none of the data or models used in Brown et al. (2020) has been made public, we instead use the same type of fasttext (Joulin et al., 2017) classifier between unfiltered Common Crawl and OpenWebText2 as used in Gao et al. (2020).
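As a concrete illustration, the following is a minimal sketch of this Pareto-thresholded filtering step, assuming a binary fasttext quality classifier whose positive label marks OpenWebText2-like text. The model path, label name, and example corpus are hypothetical, and this is not the code used to produce our training sets.

import fasttext
import numpy as np

# Hypothetical fasttext model trained to separate unfiltered Common Crawl
# from OpenWebText2; "__label__owt2" is an assumed label name.
quality_model = fasttext.load_model("cc_vs_owt2.bin")

def keep_document(text, alpha, rng):
    # fasttext's predict() rejects newlines, so collapse them first.
    labels, probs = quality_model.predict(text.replace("\n", " "))
    # Probability that the document looks like the high-quality (OWT2) class.
    score = probs[0] if labels[0] == "__label__owt2" else 1.0 - probs[0]
    # Resample the Pareto threshold for every document; larger alpha keeps less data.
    return rng.pareto(alpha) > 1.0 - score

rng = np.random.default_rng(0)
corpus = ["An example Common Crawl document.", "Another example document."]
kept = [doc for doc in corpus if keep_document(doc, alpha=3.0, rng=rng)]

Because the threshold is resampled for each document, some low-scoring documents still survive, which is exactly the relaxation relative to a hard cutoff described above.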
We use GPT-Neo (Black et al., 2021) to train a series of models on each training set and evaluate them on downstream tasks using the EleutherAI LM evaluation harness (Gao et al., 2021). Each model has 1.3 billion parameters, uses the GPT-2 architecture (Radford et al., 2019) with the same model hyperparameters as the GPT-3-XL setting in Brown et al. (2020), and is trained for 25k iterations with a batch size of 256. To ensure that the effect is not confined to any specific task, we evaluate on a wide range of downstream tasks. We use zero-shot prompting with no task-specific fine-tuning, with prompting inspired by Brown et al. (2020) for many tasks.

In total, we evaluate on ANLI Round 3 (Nie et al., 2020), BoolQ (Clark et al., 2019), CommitmentBank (de Marneffe et al., 2019), COPA (Gordon et al., 2012), Hellaswag (Zellers et al., 2019), LAMBADA (Paperno et al., 2016), MathQA (Amini et al., 2019), MultiRC (Khashabi et al., 2018), OpenbookQA (Mihaylov et al., 2018), PiQA (Bisk et al., 2019), PubmedQA (Jin et al., 2019), SciQ (Welbl et al., 2017), and Winogrande (Sakaguchi et al., 2019). Error bars in all evaluation task plots indicate standard error with respect to instances of the evaluation task.

For the training data, we create 40 GB filtered chunks of the Common Crawl data for each value of α ∈ {1, 2, 3, 4, 5, 8}; in other words, different amounts of raw Common Crawl data are consumed for different α to produce the same fixed 40 GB result. For reference, Brown et al. (2020) filter even more aggressively than we do, discarding about 98.7% of their data. The 40 GB size is chosen because it is approximately the size of OpenWebText, which is representative of the amount of data usually used to train models of this size.

Figure 2: Plots of results for all downstream tasks explored in this paper. Higher is better on all metrics except LAMBADA perplexity (first plot in the third row), where lower is better.

3.1 Results

Of the tasks evaluated, several remained near chance or had very high variance, resulting in no clear trend. Of the remainder, an absolute majority exhibited an initial increase in performance followed by a decrease once the fraction of documents discarded surpassed a threshold that varied by task. Additionally, for almost all tasks the most aggressively filtered model was not the best performing. Some tasks, like BoolQ, exhibit little clear trend. Not all tasks have the same optimal α (compare PiQA and LAMBADA), and some tasks, like PubmedQA, show a much more sudden decrease in accuracy. For results on all tasks, see Figure 2.

3.2 Analysis

We hypothesize that this decline in performance is due to misalignment between the classifier objective, intended to be a proxy for quality, and actual document quality. For instance, a classifier trained to distinguish WebText2 from Common Crawl, as in GPT-3, would also exclude domains of text data not found as often in WebText2.

We also hypothesize that the difference in optimal α between tasks arises because the characteristics of the different types of data that help the most with each task are over- or under-discarded to different extents, due to spurious correlations with the quality metric. As such, we do not expect the exact thresholds to transfer to other tasks, classifiers, or datasets. This is an expected consequence of Goodharting, because the degree to which different types of text data correlate with the features learned by the classifier is mostly spurious.

4 Domain Misalignment Experiment

To test the hypothesis that the misalignment of the objective leads to the exclusion of non-OpenWebText2-like data, we train a fasttext classifier to classify between BookCorpus2 (Gao et al., 2020) and OpenWebText (Gokaslan and Cohen, 2019), and compute the mean BookCorpus2 probability of each training set. If the classification model is favoring OpenWebText-like data over generally high-quality data, then as filtering increases in intensity, the proportion of BookCorpus2-like data should decrease, as the data consists increasingly of OpenWebText-like text. Conversely, if the classification model is robustly favoring high-quality text, then as filtering increases in intensity, the proportion of BookCorpus2-like data should increase, as low-quality text looks nothing like BookCorpus2. We also repeat this experiment with Pubmed Abstracts. We chose BookCorpus2 and Pubmed Abstracts because of their similarity in distribution to LAMBADA and PubmedQA respectively, in the hopes of observing a similarity between the task evaluation curves and the data domain curves.
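As a rough sketch of how such a domain-composition measurement could be implemented (this is not the code used for the experiment; the training file name and label strings are hypothetical):

import fasttext
import numpy as np

# domains.train: one document per line, prefixed with "__label__bc2"
# (BookCorpus2) or "__label__owt" (OpenWebText). Both the file name and
# the label names are hypothetical stand-ins.
domain_clf = fasttext.train_supervised(input="domains.train")

def mean_bc2_probability(documents):
    # Average BookCorpus2-like probability over one filtered training set.
    probs = []
    for doc in documents:
        labels, p = domain_clf.predict(doc.replace("\n", " "))
        probs.append(p[0] if labels[0] == "__label__bc2" else 1.0 - p[0])
    return float(np.mean(probs))

# The same measurement is repeated for each filtered training set (one per
# value of alpha), and analogously with a Pubmed Abstracts classifier.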
4.1 Results

As seen in Figure 3, the fraction of BookCorpus2-like data remains mostly constant until around 0.6, after which it declines sharply. A similar pattern is observed with Pubmed Abstracts, albeit with an earlier drop (Figure 4). The BookCorpus2-like data curve's drop precedes the LAMBADA performance drop by about 0.2. Similarly, the Pubmed Abstracts drop also slightly precedes PubmedQA's main drop.

Figure 3: Fraction of documents in filtered Common Crawl classified as BookCorpus2-like by a shallow classifier trained to distinguish OpenWebText and BookCorpus2. Note that this plot has a different x-axis scale from the task evaluation plots.

Figure 4: Fraction of documents in filtered Common Crawl classified as Pubmed Abstracts-like by a shallow classifier trained to distinguish OpenWebText and Pubmed Abstracts. Note that this plot has a different x-axis scale from the task evaluation plots.

4.2 Analysis

The decrease in Pubmed Abstracts-like and BookCorpus2-like data as filtering increases in aggressiveness supports the hypothesis that part of the problem is that text domains not similar to OpenWebText2 are being discarded. Our main hypothesis for why the domain data content starts decreasing before the evaluation metric performance does is that these tasks are sufficiently different in distribution from the respective datasets.

5 Limitations

This work is intended to show that the common assumption that more aggressive data filtering is better is not always true, and thus focuses on one particular classifier used in the real world as an illustrative example. Depending on the type of classifier, the training data used for the classifier, and the downstream task, this effect may not be relevant in certain settings. We leave an exhaustive exploration of the contribution of these various factors to future work.

6 Conclusion

In this paper, we explored the effect on downstream language model performance of filtering training data with a shallow model trained on a proxy for quality. We showed that increasing the aggressiveness of filtering against this signal actually decreases model performance past a certain point, and we speculate that this is due to Goodhart's law, as the misalignment between proxy and true objective becomes more significant with increased optimization pressure. We hope that this work leads to more careful analysis of the effects of filtering in future language modeling work.

Acknowledgements

The author would like to thank TPU Research Cloud for providing the computational resources for training, and CoreWeave for providing the computational resources for data processing and evaluation. The author would also like to thank Stella Biderman, Sid Black, Charles Foster, Eric Hallahan, Kyle McDonell, Jason Phang, and Laria Reynolds for providing feedback on the manuscript.

References
Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2357–2367, Minneapolis, Minnesota. Association for Computational Linguistics.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2019. PIQA: Reasoning about physical commonsense in natural language. arXiv preprint arXiv:1911.11641.

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung, 23(2):107–124.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2021. A framework for few-shot language model evaluation.

Aaron Gokaslan and Vanya Cohen. 2019. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus.

C. A. E. Goodhart. 1984. Problems of Monetary Management: The UK Experience, pages 91–121. Macmillan Education UK, London.
Andrew Gordon, Zornitsa Kozareva, and Melissa Roemmele. 2012. SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 394–398, Montréal, Canada. Association for Computational Linguistics.

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, and Xinghua Lu. 2019. PubMedQA: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431. Association for Computational Linguistics.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262, New Orleans, Louisiana. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

David Manheim and Scott Garrabrant. 2019. Categorizing variants of Goodhart's law. arXiv preprint arXiv:1803.04585.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium. Association for Computational Linguistics.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599.

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019, pages 9–16, Mannheim. Leibniz-Institut für Deutsche Sprache.

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1525–1534, Berlin, Germany. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

C. Rosset. 2019. Turing-NLG: A 17-billion-parameter language model by Microsoft. Microsoft Blog.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. WinoGrande: An adversarial Winograd schema challenge at scale. arXiv preprint arXiv:1907.10641.

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053.

Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. Crowdsourcing multiple choice science questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94–106, Copenhagen, Denmark. Association for Computational Linguistics.

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4003–4012, Marseille, France. European Language Resources Association.

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2019. CCNet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2020. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.

Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li, Ziyan Gong, Yifan Yao, Xinjing Huang, Jun Wang, Jianfeng Yu, Qi Guo, Yue Yu, Yan Zhang, Jin Wang, Hengtao Tao, Dasen Yan, Zexuan Yi, Fang Peng, Fangqing Jiang, Han Zhang, Lingfeng Deng, Yehong Zhang, Zhe Lin, Chao Zhang, Shaojie Zhang, Mingyue Guo, Shanzhi Gu, Gaojun Fan, Yaowei Wang, Xuefeng Jin, Qun Liu, and Yonghong Tian. 2021. PanGu-α: Large-scale autoregressive pretrained Chinese language models with auto-parallel computation. arXiv preprint arXiv:2104.12369.