February 2025

Scaling Pre-training to One Hundred Billion
Data for Vision Language Models
Xiao Wang† , Ibrahim Alabdulmohsin† , Daniel Salz, Zhe Li, Keran Rong and Xiaohua Zhai

arXiv:2502.07617v1 [cs.CV] 11 Feb 2025

† Corresponding Authors: {wangxiao, ibomohsin}@google.com

We provide an empirical investigation of the potential of pre-training vision-language models on an
unprecedented scale: 100 billion examples. We find that model performance tends to saturate at this
scale on many common Western-centric classification and retrieval benchmarks, such as COCO Captions.
Nevertheless, tasks of cultural diversity achieve more substantial gains from the 100-billion scale web
data, thanks to its coverage of long-tail concepts. Furthermore, we analyze the model’s multilinguality
and show gains in low-resource languages as well. In addition, we observe that reducing the size of
the pretraining dataset via quality filters such as CLIP-based filtering, typically used to enhance performance, may
inadvertently reduce the cultural diversity represented even in large-scale datasets. Our results highlight
that while traditional benchmarks may not benefit significantly from scaling noisy, raw web data to 100
billion examples, this data scale is vital for building truly inclusive multimodal systems.

1. Introduction
The progress in vision-language models (VLMs)
has been intrinsically linked to the availability
of large-scale datasets. Larger datasets fuel the
development of more powerful models, which are
capable of understanding and generating complex
relationships between images and text. In turn,
such models have pushed boundaries in tasks like
zero-shot image classification, image captioning
and visual question answering.
This relationship between data scale and model
performance often follows a power law $f(x) = \alpha x^{-c} + \varepsilon$, where $f(x)$ is a model performance
metric such as its error rate and $x$ is the data
size [2, 8, 29, 33, 37, 38, 49, 58, 76]. These “scaling laws,” as they came to be known in the literature, have been used, among other things, to determine
the training data size needed to achieve a specified level of accuracy [9, 18, 26] and to optimize
the model size [4, 34, 38]. They have also been
justified theoretically using space-partitioning arguments [7, 35, 61]. Importantly, a power law
implies that increasing the amount of training
data can yield diminishing, but still worthwhile,
returns in terms of accuracy and capability.
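
To make the diminishing-returns point concrete, the following sketch evaluates a power law of the form above at three data sizes. The parameter values (`ALPHA`, `C`, `EPS`) are hypothetical and chosen only for illustration; they are not fitted values from this paper.

```python
def power_law(x, alpha, c, eps):
    """Predicted error f(x) = alpha * x**(-c) + eps for data size x."""
    return alpha * x ** (-c) + eps

# Hypothetical parameters, chosen only to illustrate the shape of the curve.
ALPHA, C, EPS = 2.0, 0.1, 0.2

data_sizes = [1e9, 1e10, 1e11]
errors = [power_law(x, ALPHA, C, EPS) for x in data_sizes]

# Error reduction obtained from each tenfold increase in data.
gains = [errors[i] - errors[i + 1] for i in range(len(errors) - 1)]

for x, err in zip(data_sizes, errors):
    print(f"x = {x:.0e}  predicted error = {err:.4f}")
print("gain per tenfold increase:", [round(g, 4) for g in gains])
```

Under this functional form, each tenfold increase in data shrinks the reducible part of the error by the constant factor 10^(-c), so every additional decade of data buys a smaller absolute gain than the previous one.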
Driven by these potential benefits, the field has
witnessed a concerted effort towards scaling up
the size of vision-language datasets. Early works

© 2025 Google DeepMind. All rights reserved

focused on web-curated datasets like Conceptual
Captions [60], which provided millions of image-caption pairs for pre-training. Subsequent
work leveraged large-scale web crawling to create even larger datasets. In particular, the Common Crawl project [20]—a repository of publicly
available web data—became a foundational resource for constructing many of these web-scale
datasets. From this foundation emerged datasets
like LAION-400M/2B/5B [59], DataComp [27],
WebLI [15] and Multimodal C4 [80], pushing the
boundaries of dataset size to billions of image-text
pairs, thereby accelerating progress in VLMs. This
is similar to how ImageNet [22], JFT-300M [64]
(a dataset of 300 million images with noisy labels),
and its larger variant JFT-3B [76] previously accelerated
progress in supervised image pre-training.
Despite these advancements, the largest reported datasets to date have plateaued at around
10 billion image-text pairs. This raises the question: what further benefits are unlocked by pushing
the data scale by one order of magnitude to 100
billion unique examples?
To answer this question, we introduce WebLI-100B, a novel dataset containing 100 billion
image-text pairs, representing a tenfold increase over the largest reported vision-language
datasets. To recall, the original WebLI dataset

[Figure 1: the left panel groups metrics into Culture, Multilinguality, Fairness and Western; the right panels plot ImageNet 0-shot (Accuracy), COCO img2txt (Recall@1), DS Geoloc (Accuracy) and Telugu img2txt (Recall@1) against Data Size (1B, 10B, 100B) for the B, L and H models.]

Figure 1 | left: Scaling the data from 10 billion to 100 billion examples enhances cultural diversity
and multilingual capabilities more prominently than other metrics. The numbers represent the
improved accuracy (in absolute terms) when data scale is increased, averaged across all tasks. See
details in Section 4. right: Illustrative examples of the impact of data scale. The leftmost two
are Western-centric metrics, which do not benefit much by scaling the data to 100 billion, while
the rightmost two are illustrative of cultural diversity and multilinguality. The language Telugu, for
example, makes up < 0.04% of the web and benefits a lot from the 100 billion data scale.
contains 10 billion examples and has been instrumental in training state-of-the-art models like
PaliGemma [10, 63] and SigLIP [78], and has influenced the development of other research directions, such as mitigating social biases [3], improving cultural diversity [53], and scaling open-vocabulary object detection [48].
In this work, our primary goal is to provide an
empirical investigation of the impact of this data
scale on a range of downstream tasks and, importantly, to explore aspects beyond traditional performance metrics. For instance, while our experiments demonstrate that the 100 billion scale leads
to only tiny improvements on established benchmarks,
we reveal its significant impact on less-explored
areas, particularly those related to cultural diversity and multilinguality.
For example, when applied to geo-localization
tasks based on Dollar Street [57]—a metric for
evaluating cultural diversity—ViT-L/16 trained
on a single epoch of 100 billion data achieves an
accuracy of 41.7%. By contrast, the same model
trained on ten epochs of 10 billion data achieves
an accuracy of only 35.9%, despite both models
using the same amount of training compute. We
attribute these gains, in part, to the dataset’s ability to capture a wider range of long-tail cultural
concepts that require a substantial data size to
become salient. Furthermore, data scaling also

enhances the multilinguality of trained models,
leading to an improvement in low-resource languages. Figure 1 summarizes the improvements
in cultural diversity and multilinguality achieved
through data scaling.

Statement of Contribution. Our goal in this
paper is to answer the following question: should
one invest in scaling up the size of the pretraining
dataset to 100 billion examples? We make the
following contributions:
• We provide an empirical investigation of the
potential of pre-training VLMs on a scale of
100 billion unique examples. To the best
of our knowledge, studying the impact of
this data scale for VLMs has never been conducted before in the literature.
• We demonstrate that a scale of 100 billion
image-text pairs is beneficial for VLMs in areas beyond traditional benchmarks, such as
cultural diversity, multilinguality, and reducing performance disparity across subgroups.
Hence, this data scale is vital for building
truly inclusive multimodal systems.
• We investigate the impact of applying quality
filters that reduce the size of the dataset, such
as those based on CLIP. While such filters
are often employed to improve overall data


quality, we find that they can inadvertently
reduce the representation of certain cultural
contexts, thereby limiting the diversity of
the dataset, even when the original dataset
contains 100 billion examples.

2. Related Work
Data Scaling. The study of scaling laws in large
language models (LLMs) has become a critical
area of research in NLP. Hestness et al. [33] and
Kaplan et al. [38] were among the first to systematically explore the relationship among model
size, dataset size, and compute, demonstrating
predictable power-law scaling of performance.
Henighan et al. [32] further emphasized the crucial role of data, showing that substantial performance gains can be achieved by increasing the
size and quality of the training dataset, even with
fixed model size. DeepMind’s Chinchilla [34] provided compelling evidence for this data-centric
approach, demonstrating that smaller models
trained on much larger datasets can achieve comparable or superior performance to larger models,
given the same computational budget. This work
has shifted the focus of LLM development towards
optimizing the scale of data.
In computer vision, early works, such as ImageNet [22], demonstrated the profound impact
of dataset size and diversity on model generalization. Subsequent efforts like JFT-300M [64]
emphasized the importance of large-scale and
high-quality datasets for training state-of-the-art
vision models. Zhai et al. [76] further explored
scaling behavior in Vision Transformers [24] using the JFT-3B dataset, showing that scaling both
data and model size simultaneously leads to improved generalization.
The pivotal role of data scaling is equally applicable to vision-language modeling, as highlighted
in Cherti et al. [17]. This has led to a substantial increase in the development of image-text
datasets over the last ten years. Early datasets,
such as COCO Captions [14] and Flickr30k [73],
were created to enable tasks like image captioning
and visual question answering with high-quality
annotations. However, their limited size, due to
the cost of human annotation, hindered further

scaling of the datasets. To address this, Conceptual Captions [60] started to filter image-text
pairs from the web based on heuristic rules, leading to millions of image-caption pairs. Continuing in this direction, larger image-text datasets
have been created from web sources, using increasingly complex filtering techniques [23, 25,
27]. These datasets, ranging from hundreds of
millions to several billion image-text pairs, have
enabled the training of powerful vision-language
models like CLIP [54] and ALIGN [36], which
have demonstrated impressive performance on
a wide range of vision-language tasks. Notably,
LAION-5B [59] and WebLI [15] stand out as the
largest publicly and privately available image-text
datasets, with 5 billion and 10 billion multilingual
image-text pairs respectively.
However, the rapidly growing web contains
vastly more data. The impact of scaling to much
larger datasets, such as 100 billion samples, remains largely unknown.

Vision-Language Pre-training. The field of
large vision-language models is advancing quickly,
building upon remarkable progress in both computer vision and natural language processing. A
prevalent and highly effective strategy is to learn
visual representations and language modeling independently, followed by joint pre-training of the
vision-language model using high-quality multimodal data.
Since the advent of CLIP [54], contrastive
learning on large, noisy web datasets has become the dominant approach for acquiring powerful visual representations [13]. This weakly
supervised paradigm surpasses traditional supervised learning methods [41, 62], primarily
due to the large scale and high diversity of web
data [36, 52, 74, 75]. An alternative approach
gaining traction involves learning visual features
from web data using generative methods [66, 68],
which predict paired text for given images. While
vision models trained in this manner exhibit superior transferability to generative language models,
the high computational cost limits its widespread
adoption.
Despite the acquired zero-shot capabilities,


which can be directly applied to tasks such as
zero-shot classification [22] and image-text retrieval [14, 73], the strong visual representations
learned by contrastively trained models often lead
to their utilization as image encoders. This is
often leveraged in vision-language tasks by integrating visual tokens with language tokens,
enabling LLMs to process multimodal information [5, 10, 15, 16, 45, 46]. Following this approach, PaLI-3 [16] has demonstrated that vision
models trained on large-scale web data outperform those trained on weakly annotated images of
a similar scale, which further underscores the importance of the data diversity inherently present
in the web corpus.

Inclusive Models. Recent studies have highlighted that popular techniques employed to enhance the performance of vision-language models, such as English-language-based filtering,
may inadvertently diminish cultural understanding [6, 30, 50, 53, 56]. Hence, we also evaluate cultural diversity in this work, as outlined
in Pouget et al. [53], which falls into two categories.
The first category, geo-localization, involves
predicting the country or region of origin for
an image using few-shot classification. The second category utilizes zero-shot classification on
datasets curated from various geographical regions. Prominent examples within this category
include Dollar Street [57], GeoDE [55], and
Google Landmarks Dataset v2 (GLDv2) [69]. Dollar Street comprises 38K images depicting household items from 63 countries. GeoDE features
62K manually annotated images collected from
diverse geographic locations. Finally, GLDv2 contains 1,542 images representing 884 landmarks
across 84 countries, enabling the assessment of
model performance on recognizing culturally important locations. In our evaluations, we employ
all three aforementioned datasets. For the zero-shot evaluation on Dollar Street, we adhere to
the methodology used in Rojas et al. [57], mapping 96 specific topics within the dataset to corresponding ImageNet classes. This mapping results in a curated subset of 21K images, which we
utilize for our analysis. These geographically diverse
benchmarks, employed collectively, provide
a comprehensive framework for evaluating the
impact of performance optimization techniques
on cultural understanding within vision-language
models.

3. Experimental Setup
3.1. Pre-training Datasets
We describe the dataset splits we use in the pretraining.

Raw Datasets. To assess the performance of
vision-language models on large-scale image-text
data, we construct a dataset with 100 billion
image-text pairs from the web, inspired by the
work of Chen et al. [15], Jia et al. [36], Schuhmann et al. [59], and Zhai et al. [77]. We refer to
this as WebLI-100B, and refer to its subsets with
1 billion and 10 billion examples as 1B and 10B,
respectively. The 1B and 10B datasets are created by randomly sampling 1% and 10%, respectively, from the 100 billion dataset. In this work,
we apply only essential data filters, such as removing harmful images and personally identifiable information (PII). This approach ensures the
dataset remains as multilingual and diverse as
possible. We utilize both the alt-text and page
title associated with each image as the paired
text. To ensure fair evaluations, we remove near-duplicate images across more than 90 common
vision-language tasks from our dataset.
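
As an illustration of how 1% and 10% subsets can be drawn reproducibly from a very large corpus, the sketch below assigns each example to one of 100 buckets by hashing an identifier. This hash-based scheme is an assumption made for illustration: the text only states that the subsets are sampled randomly, and the names `bucket`, `in_subset` and the `example-<i>` identifiers are hypothetical. A by-product of this particular scheme, not claimed by the text, is that the 1% subset is contained in the 10% subset.

```python
import hashlib

def bucket(example_id, num_buckets=100):
    """Map an example identifier to a stable pseudo-random bucket in [0, num_buckets)."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

def in_subset(example_id, percent):
    """Keep an example if its bucket falls among the first `percent` of 100 buckets."""
    return bucket(example_id) < percent

# Hypothetical identifiers standing in for image-text pairs.
ids = [f"example-{i}" for i in range(100_000)]
subset_1pct = [i for i in ids if in_subset(i, 1)]
subset_10pct = [i for i in ids if in_subset(i, 10)]

print(len(subset_1pct) / len(ids), len(subset_10pct) / len(ids))
```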

Quality-filtered Datasets. To examine the impact of scaling on quality-filtered data, we adopt
the common approach of using the CLIP-L/14
model [54] as a filter, retaining a high-quality
dataset with 5 billion pairs of images and English
alt-text. To further solidify our results, we train a
VLM on the web data to classify image-text pairs
as aligned or misaligned, and tune its threshold to
retain another filtered dataset of the same size.
Unless otherwise noted, we use the language of
web pages¹ for multilingual experiments, thereby
¹ The “content-language” meta tag in the head of an HTML document.


Table 1 | The attention map visualization of the ViT-L/16 models trained on different scales of data.
Images are selected to represent cultures in Western-centric countries and countries where low-resource languages are spoken.
Columns: Concept, Image, 1B Data, 10B Data, 100B Data
Rows: Igorot Dance (Igorot); Igloo (Inuit); Bison (Yellowstone)
[Images and attention maps are not recoverable from the extracted text.]

avoiding potential inaccuracies from language
detection on the noisy web text.
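
The quality-filtering step described above amounts to scoring each image-text pair and keeping those above a threshold tuned to hit a target dataset size. The sketch below shows that thresholding logic only. The embeddings are random toy vectors rather than CLIP outputs, the helper names are hypothetical, and the keep fraction of 0.05 is an assumption that mirrors retaining 5 billion pairs out of 100 billion.

```python
import math
import random

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def threshold_for_fraction(scores, keep_fraction):
    """Score of the last example kept when retaining the top keep_fraction of scores."""
    keep = max(1, round(len(scores) * keep_fraction))
    return sorted(scores, reverse=True)[keep - 1]

# Toy stand-ins for image and text embeddings (random vectors, not CLIP outputs).
rng = random.Random(0)
pairs = [
    ([rng.gauss(0, 1) for _ in range(8)], [rng.gauss(0, 1) for _ in range(8)])
    for _ in range(1000)
]
scores = [cosine_similarity(img, txt) for img, txt in pairs]

KEEP_FRACTION = 0.05  # assumption: 5 billion retained out of 100 billion
threshold = threshold_for_fraction(scores, KEEP_FRACTION)
kept = [pair for pair, s in zip(pairs, scores) if s >= threshold]

print(f"threshold = {threshold:.3f}, kept {len(kept)} of {len(pairs)} pairs")
```

Choosing the threshold from a target fraction, rather than fixing it in advance, is what allows two different scoring models to produce filtered datasets of the same size.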

Language-rebalanced Datasets. In the language rebalancing experiments in Section 5.2, we
adjust the mixing ratio of the low-resource languages used in the Crossmodal-3600 [65] benchmark. These low-resource languages are Bengali (bn), Filipino (fil), Hindi (hi), Hebrew (iw),
Maori (mi), Swahili (sw), and Telugu (te)², ranging from 0.001% to 0.267% in our dataset (Appendix F). In model training, we upsample each
of them to 1%, with the remaining 93% consisting
of the original data.
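
One way to read the rebalancing recipe is as a sampling mixture with eight sources: seven language-specific pools at 1% each and the original data at 93%. The sketch below implements that reading; the mixture interpretation, the function names and the use of `random.choices` are assumptions for illustration rather than the training pipeline itself.

```python
import random

LOW_RESOURCE = ["bn", "fil", "hi", "iw", "mi", "sw", "te"]
UPSAMPLED_SHARE = 0.01                                       # each language upsampled to 1%
ORIGINAL_SHARE = 1.0 - UPSAMPLED_SHARE * len(LOW_RESOURCE)   # remaining 93%

SOURCES = LOW_RESOURCE + ["original"]
WEIGHTS = [UPSAMPLED_SHARE] * len(LOW_RESOURCE) + [ORIGINAL_SHARE]

def sample_sources(num_examples, seed=0):
    """Draw the pool that each training example is taken from."""
    rng = random.Random(seed)
    return rng.choices(SOURCES, weights=WEIGHTS, k=num_examples)

def effective_share(original_share):
    """Share of a language in the mixture, counting its presence in the original data."""
    return UPSAMPLED_SHARE + ORIGINAL_SHARE * original_share

draws = sample_sources(100_000)
print({source: draws.count(source) / len(draws) for source in SOURCES})
print(effective_share(0.00001), effective_share(0.00267))
```

Because the original data already contains these languages, their effective share under this reading is slightly above 1%: roughly 1.001% for a language at 0.001% and roughly 1.248% for one at 0.267%.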
3.2. Contrastive Vision-Language Pretraining
To study the impact of data scale on model performance, we train SigLIP [78] models using three
different dataset sizes: 1 billion, 10 billion and
100 billion. We also vary the model size using ViT-B/16, ViT-L/16, and ViT-H/14 architectures for
both image and text encoders. During contrastive
training, inspired by Zhai et al. [76], we utilize a
large batch size of 32K and an inverse square root
² Cusco Quechua (quz) is excluded from our experiments because it is not supported by our language detection method.

learning rate schedule with 200 million warmup
and cooldown examples. The learning rate and
weight decay are set to 0.001 and 0.0001 respectively. In the preprocessing stage, images are resized to a resolution of 224x224 pixels, and texts
are tokenized using the multilingual mt5 [72]
tokenizer with a maximum sequence length of 64
tokens.
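
The schedule described above can be sketched as a linear warmup, an inverse-square-root decay, and a linear cooldown. Several details below are assumptions not stated in the text: "32K" is read as 32,768, the 200 million examples are applied to warmup and cooldown separately, the decay timescale is set equal to the warmup length, and the cooldown decays linearly to zero.

```python
import math

BATCH_SIZE = 32 * 1024              # assumption: "32K" means 32,768
BASE_LR = 1e-3
WARMUP_EXAMPLES = 200_000_000
COOLDOWN_EXAMPLES = 200_000_000     # assumption: same budget as warmup
TOTAL_EXAMPLES = 100_000_000_000

def to_steps(examples):
    """Convert a number of seen examples into optimizer steps."""
    return examples // BATCH_SIZE

WARMUP_STEPS = to_steps(WARMUP_EXAMPLES)
COOLDOWN_STEPS = to_steps(COOLDOWN_EXAMPLES)
TOTAL_STEPS = to_steps(TOTAL_EXAMPLES)

def learning_rate(step, total_steps=TOTAL_STEPS, warmup_steps=WARMUP_STEPS,
                  cooldown_steps=COOLDOWN_STEPS, base_lr=BASE_LR):
    """Linear warmup, inverse-square-root decay, then linear cooldown to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    lr = base_lr * math.sqrt(warmup_steps / step)
    cooldown_start = total_steps - cooldown_steps
    if step > cooldown_start:
        lr *= max(0.0, (total_steps - step) / cooldown_steps)
    return lr

for s in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(s, learning_rate(s))
```

A practical appeal of an inverse-square-root decay is that it does not depend on the total number of steps, so a cooldown can be started from any intermediate point of a single long run.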
All models are trained on a maximum of 100
billion examples; e.g. a maximum of 100 epochs
when using 1B examples. We cool down the models at various training steps where they have seen
3, 7, 10, 17, 26, 33, 49, 66, and 100 billion examples, and evaluate them after the cool-downs. Unless otherwise specified, we report results using
the checkpoints where models have been trained
on 100 billion examples. All models are compared
on a compute-matched regime.
3.3. Evaluations
The model’s capabilities are evaluated across a
diverse range of benchmarks, spanning from traditional Western-centric tasks to those measuring
inclusivity.
Western-centric. Our first set of evaluations
uses diverse, well-established benchmarks.


Table 2 | Evaluations and scaling laws on Western-centric benchmarks, where scaling from 10B to
100B examples shows limited benefits.

                       Value @ 100B ex         Scaling Laws: exponent  Scaling Laws: limit
Model  Metric (err%)   1B      10B     100B    1B      10B     100B    1B      10B     100B

Zero-shot classification
B      ImageNet        41.2    39.4    39.0    -0.58   -0.97   -0.65   40.1    38.5    37.9
B      CIFAR100        36.6    35.9    36.8    -0.26   -0.23   -0.24   33.8    32.5    33.7
B      Pet             25.4    23.7    22.3    -0.43   -0.45   -0.37   22.3    21.7    18.4
L      ImageNet        31.2    29.7    28.5    -0.92   -0.91   -0.82   30.7    29.0    27.1
L      CIFAR100        25.0    23.8    23.4    -0.26   -0.32   -0.43   22.7    20.7    21.1
L      Pet             14.4    12.5    9.5     -0.61   -0.57   -0.51   12.3    9.6     7.0
H      ImageNet        29.6    25.6    24.9    -0.36   -0.64   -0.52   26.7    24.5    23.3
H      CIFAR100        23.5    19.8    21.4    -0.25   -0.36   -0.29   20.6    18.0    17.6
H      Pet             10.3    7.5     7.2     -0.45   -0.42   -0.50   8.1     5.3     4.6

Retrieval @1
B      COCO I2T@1      56.5    51.6    53.4    -0.24   -0.49   -0.30   52.4    49.9    50.7
B      COCO T2I@1      70.9    68.8    70.0    -0.34   -0.39   -0.69   69.6    67.1    69.5
B      Flickr I2T@1    24.2    21.2    21.1    -0.24   -0.34   -0.23   21.5    18.1    17.0
B      Flickr T2I@1    43.1    40.3    40.4    -0.32   -0.42   -0.30   40.9    37.5    36.7
L      COCO I2T@1      49.7    47.2    45.3    -0.24   -0.41   -0.30   45.8    44.7    42.9
L      COCO T2I@1      68.2    64.3    62.5    -0.19   -0.42   -0.41   64.2    62.6    60.5
L      Flickr I2T@1    20.4    15.5    16.6    -0.21   -0.45   -0.21   16.5    14.1    13.4
L      Flickr T2I@1    39.9    32.3    32.5    -0.10   -0.42   -0.42   34.6    30.7    30.7
H      COCO I2T@1      48.6    42.0    42.5    -0.21   -0.62   -0.47   44.6    40.3    40.6
H      COCO T2I@1      64.9    60.3    59.3    -0.30   -0.55   -0.43   62.8    58.9    57.3
H      Flickr I2T@1    16.8    13.5    13.9    -0.23   -0.40   -0.23   12.2    11.4    11.3
H      Flickr T2I@1    34.3    28.5    28.0    -0.23   -0.56   -0.46   29.6    26.8    25.9

10-shot
B      ImageNet        46.6    45.6    44.7    -0.82   -0.61   -0.49   46.2    44.4    43.3
B      Birds           53.8    53.5    53.9    -0.34   -0.40   -0.51   51.5    51.6    52.8
B      Caltech         8.4     8.3     8.2     -0.30   -0.24   -0.23   7.1     7.2     6.8
B      Cars            18.3    16.8    17.6    -0.63   -0.68   -0.60   17.1    15.5    16.3
B      CIFAR100        38.7    38.6    39.0    -0.19   -0.22   -0.20   35.2    34.9    35.9
B      Colorectal      26.5    29.2    27.0    -0.02   -0.06   -0.16   20.2    22.6    24.4
B      Pet             22.9    23.2    22.1    -1.77   -0.62   -0.77   21.6    21.3    20.6
B      DTD             29.7    30.9    30.9    -0.28   -0.24   -0.19   27.9    28.3    27.2
L      ImageNet        35.1    35.0    33.7    -0.67   -0.68   -0.63   34.1    34.0    32.5
L      Birds           44.0    45.3    44.3    -0.51   -0.43   -0.51   42.1    43.2    42.7
L      Caltech         6.4     7.4     7.5     -0.43   -0.17   -0.18   5.9     4.8     4.8
L      Cars            11.1    11.3    11.5    -0.54   -0.49   -0.41   10.1    9.7     9.9
L      CIFAR100        27.5    26.7    25.5    -0.24   -0.29   -0.41   24.0    23.7    22.9
L      Colorectal      24.0    23.5    22.6    -0.18   -0.20   -0.27   18.8    20.2    20.5
L      Pet             12.3    12.5    11.8    -0.70   -0.65   -0.53   11.3    11.4    10.3
L      DTD             28.5    27.1    27.9    -0.22   -0.25   -0.23   25.2    25.1    25.5
H      ImageNet        32.4    29.8    29.3    -0.41   -0.73   -0.79   30.3    29.0    28.3
H      Birds           41.6    39.1    36.3    -0.67   -0.52   -0.47   40.6    37.4    33.9
H      Caltech         5.7     6.0     8.9     -0.21   -0.08   -0.11   4.3     3.7     4.6
H      Cars            11.3    10.3    9.6     -0.27   -0.88   -0.44   9.1     10.1    8.3
H      CIFAR100        25.8    23.8    24.2    -0.22   -0.25   -0.24   21.4    21.1    19.7
H      Colorectal      25.2    26.2    25.9    -0.22   -0.20   -0.15   19.7    17.9    20.7
H      Pet             10.8    9.1     8.7     -0.92   -0.48   -0.46   10.3    7.6     6.5
H      DTD             29.2    26.1    26.8    -0.16   -0.23   -0.23   25.0    23.8    24.8

For zero-shot classification, we employ ImageNet [22], CIFAR-100 [43], and Oxford-IIIT
Pet [51] datasets. Additionally, for 10-shot
evaluations, we use Caltech-UCSD Birds [67],
Caltech 101 [44], Cars196 [42], Colorectal

Histology [40], and Describable Textures
Dataset (DTD) [19] benchmarks to assess the
representation capabilities of vision models. We
also conduct zero-shot retrieval evaluations on
COCO Captions [14] and Flickr30k [73], in both


Table 3 | Evaluations and scaling laws on culture diversity benchmarks, where scaling from 10B to
100B examples shows larger benefits.

                       Value @ 100B ex         Scaling Laws: exponent  Scaling Laws: limit
Model  Metric (err %)  1B      10B     100B    1B      10B     100B    1B      10B     100B

10-shot Geolocalization
B      Dollar Street   77.7    75.8    72.1    -0.38   -0.36   -0.37   76.3    73.7    70.2
B      GeoDE-Country   72.8    71.5    71.4    -0.35   -0.31   -0.37   70.8    69.6    68.9
B      GeoDE-Region    61.1    60.8    59.2    -0.26   -0.22   -0.29   58.8    57.0    57.3
L      Dollar Street   63.6    64.1    58.3    -1.09   -0.38   -0.94   63.2    60.1    57.5
L      GeoDE-Country   61.9    62.3    57.8    -0.40   -0.30   -1.11   58.8    58.0    56.6
L      GeoDE-Region    54.2    53.6    48.3    -0.15   -0.16   -0.39   49.9    46.9    46.3
H      Dollar Street   64.6    59.1    53.7    -0.30   -0.56   -0.64   61.0    56.4    52.5
H      GeoDE-Country   56.9    50.2    47.6    -0.23   -0.78   -0.62   52.2    49.4    46.1
H      GeoDE-Region    54.6    47.6    44.7    0.00    -0.38   -0.31   50.1    45.3    41.0

Zero-shot classification
B      Dollar Street   52.0    51.9    51.6    -0.38   -0.25   -0.28   50.4    49.7    49.7
B      GeoDE           7.8     8.3     8.7     -0.24   -0.26   -0.25   6.1     6.7     5.4
B      GLDv2           65.0    61.0    59.4    -0.46   -0.72   -0.51   61.6    59.3    56.8
L      Dollar Street   50.2    48.1    49.0    -0.22   -0.35   -0.17   46.9    46.2    46.2
L      GeoDE           6.0     5.9     4.9     -0.29   -0.17   -0.25   4.7     4.3     3.3
L      GLDv2           50.4    46.4    45.7    -0.53   -0.93   -0.89   48.5    44.8    44.1
H      Dollar Street   50.0    48.6    47.4    -0.15   -0.13   -0.20   43.9    44.2    44.1
H      GeoDE           6.0     4.9     4.8     -0.19   -0.22   -0.24   3.3     3.3     3.5
H      GLDv2           48.1    40.1    38.8    -0.52   -1.34   -0.80   46.0    39.0    36.8

image-to-text and text-to-image directions.

Cultural Diversity. Besides the above metrics,
we also incorporate a range of benchmarks aimed
at evaluating cultural diversity, following the recommendations in [53]. Specifically, we include
zero-shot classification using Dollar Street [57],
GeoDE [55], and Google Landmarks Dataset v2
(GLDv2) [69]. See Section 2 for a brief description of each dataset. We also include 10-shot
geolocalization using Dollar Street and GeoDE.

Multilinguality. We evaluate the model’s
multilinguality using the Crossmodal-3600
dataset [65], a geographically diverse set of
3600 images with human-generated captions in
36 languages. We assess the model’s zero-shot
retrieval in both image-to-text and text-to-image
directions for each language. In addition to
per-language results, we also present average
scores for low-resource languages (Bengali,
Filipino, Hindi, Hebrew, Maori, Swahili, and
Telugu) and high-resource languages (others).
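
For reference, Recall@1 in the two retrieval directions can be computed from a single image-by-caption similarity matrix, as sketched below. The sketch assumes exactly one matching caption per image (caption i matches image i), which is a simplification; the toy similarity values are hypothetical.

```python
def recall_at_1(similarity):
    """Fraction of queries whose top-scoring candidate is the matching one.

    similarity[i][j] is the score between query i and candidate j, and the
    correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=lambda j: row[j])
        if best == i:
            hits += 1
    return hits / len(similarity)

def transpose(matrix):
    """Swap queries and candidates."""
    return [list(column) for column in zip(*matrix)]

# Hypothetical similarities: rows are images, columns are captions.
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.4],
    [0.5, 0.7, 0.6],
]

image_to_text = recall_at_1(sim)
text_to_image = recall_at_1(transpose(sim))
print(image_to_text, text_to_image)
```

The error rates reported in the tables correspond to 100 × (1 − Recall@1) under this definition.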


Fairness. In addition, we also evaluate the presence of societal biases in the trained model. We
report on representation bias (RB) and association bias (AB) between gender and occupation,
as defined in Alabdulmohsin et al. [3]. These
measure unwanted associations w.r.t. the gender
attribute using 1st and 2nd order statistics. Also,
we report performance disparity by income in
Dollar Street zero-shot accuracy and by region in
GeoDE zero-shot accuracy.

Transfer to Generative Models. Finally, to assess how well our contrastively trained vision
models transfer to generative vision-language
tasks, we utilize the compact and versatile PaliGemma model [10].
We initialize
PaliGemma’s vision component with our contrastively trained models and pretrain it on 50 million seen examples, following its stage-1 recipe
at 224x224 resolution. During the pre-training,
we explore two common transfer settings: freezing [15, 46, 79] and unfreezing [10, 16, 63, 71]
the vision model. We then use PaliGemma’s default configuration to finetune on a variety of
downstream tasks, covering image captioning,


visual question answering, and segmentation,
which require the understanding of semantics,
OCR, multilinguality, and remote sensing.

4. Results
4.1. Established Benchmarks
We begin by evaluating all vision-language models on established benchmarks, based on ImageNet and COCO Captions, among other datasets.
As revealed in Table 2, increasing the dataset size
from 10 billion to 100 billion examples does not
improve performance substantially. This is statistically supported by Wilcoxon’s signed rank
test [70], which gives a 𝑝-value of 0.9, indicating
that differences are not significant.
In addition, we also fit data scaling laws for
every combination of model and dataset following the recipe proposed in Alabdulmohsin et al.
[2]. This allows us to evaluate whether or not the
performance gap is expected to increase or decrease in the infinite-compute regime. We report
the resulting scaling exponents and asymptotic
performance limits in the tables. Again, we do
not observe significant differences at the 95%
confidence level (𝑝-value of 0.09).

4.2. Cultural Diversity
Unlike the Western-oriented metrics reported in
Section 4.1, cultural diversity metrics present an
entirely different picture. We observe notable
gains when scaling the size of the dataset from
10 billion to 100 billion examples in Table 3. For
example, scaling training data from 10 billion to
100 billion examples yields substantial gains on
the Dollar Street 10-shot classification task, where
ViT-L and ViT-H see absolute improvements of
5.8% and 5.4%, respectively. These gains outperform the typical improvements (less than 1%)
observed on Western-oriented 10-shot metrics by
a large margin. Using Wilcoxon’s signed rank test,
we obtain a 𝑝-value of 0.002, indicating statistically significant evidence at the 99% confidence
level.
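
The sketch below shows how such a paired comparison can be carried out, using only the nine 10-shot geolocalization error rates read from Table 3 for the 10B and 100B settings. It implements an exact two-sided Wilcoxon signed-rank test in plain Python as an illustration; because it covers only a subset of the cultural diversity metrics, its p-value is not expected to reproduce the one quoted above.

```python
from itertools import product

def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Pairs with zero difference are dropped and tied absolute differences
    share average ranks. The null distribution is enumerated over all 2**n
    sign patterns, so this is only practical for a small number of pairs.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average rank of the tie group
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    statistic = min(w_plus, total - w_plus)
    extreme = 0
    for signs in product((False, True), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= statistic:
            extreme += 1
    return statistic, extreme / 2 ** n

# 10-shot geolocalization error rates (%) as read from Table 3, ordered
# Dollar Street, GeoDE-Country, GeoDE-Region for ViT-B, ViT-L, ViT-H.
err_10b = [75.8, 71.5, 60.8, 64.1, 62.3, 53.6, 59.1, 50.2, 47.6]
err_100b = [72.1, 71.4, 59.2, 58.3, 57.8, 48.3, 53.7, 47.6, 44.7]

improvements = [round(a - b, 1) for a, b in zip(err_10b, err_100b)]
statistic, p_value = wilcoxon_signed_rank(err_10b, err_100b)
print(improvements, statistic, p_value)
```

The entries of `improvements` for Dollar Street with ViT-L and ViT-H are the 5.8 and 5.4 point gains quoted in the text.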

4.3. Multilinguality
Our multilingual benchmark, Crossmodal-3600
zero-shot retrieval [65], shows a disparity in performance gains: low-resource languages benefit
more from the 100 billion scale than the high-resource ones. The disparity, illustrated in Figure 3, not only exists across all model sizes
but also widens as the models become larger. Detailed results for each language can be found in
Appendix B.
4.4. Fairness
For fairness, we report on the three metrics discussed in
Section 3.3.
Representation Bias. The first metric is representation bias (RB), with results detailed in
Table 4. We observe that models trained on unbalanced web data have a significantly higher
preference to associate a randomly chosen image
from ImageNet [22] with the label “Male” over
the label “Female.”
In fact, this occurs nearly 85% of the time.
Training on 100B examples does not mitigate
this effect. This finding aligns with previous research highlighting the necessity of bias mitigation strategies, such as data balancing [3], to
address inherent biases in web-scale datasets.
Model   1B      10B     100B
B       83.2    84.5    85.2
L       88.2    86.4    85.5
H       86.8    85.0    86.6

Table 4 | Representation bias w.r.t. gender (see
Section 4). Here, values [%] indicate how often
the model prefers to associate a random image
with the label “Male” over “Female”.
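
A minimal sketch of how such a preference rate can be computed is given below. It assumes that, for every image, the model provides one score for the label “Male” and one for the label “Female”; the function name and the toy scores are hypothetical, and the precise definition used in the paper follows Alabdulmohsin et al. [3].

```python
def representation_bias(male_scores, female_scores):
    """Percentage of images whose score for "Male" exceeds the score for "Female"."""
    if len(male_scores) != len(female_scores) or not male_scores:
        raise ValueError("expected two non-empty score lists of equal length")
    wins = sum(m > f for m, f in zip(male_scores, female_scores))
    return 100.0 * wins / len(male_scores)

# Hypothetical image-text scores for four images.
male = [0.30, 0.20, 0.50, 0.10]
female = [0.10, 0.40, 0.20, 0.05]
print(representation_bias(male, female))
```

A value of 50 would indicate no preference between the two labels.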

Association Bias. Second, Figure 2 shows the
association bias in SigLIP-H/14 between gender
and occupation as we scale the data from 10 to
100 billion examples. Specifically, we plot the
probability that the model would prefer a particular occupation label, such as “secretary” over

[Figure 2: nine panels, one per combination of Model (B, L, H) and Data (1B, 10B, 100B). Each panel plots Gender (Male, Female) against Occupation (librarian, nurse, housekeeper, receptionist, secretary).]

Figure 2 | Association bias between gender and occupation, evaluated in scaled models and data.
[Figure 3: Average XM3600 Retrieval (err %) against Data Scale (1B, 10B, 100B), with lines distinguished by Lang Resource (Low, High) and ViT Size (B/16, L/16, H/16). Δ annotations shown in the plot: 1.11, 2.14, 2.76, -0.11, 1.32, 1.29.]

Figure 3 | Scaling up to 100B examples leads to
more notable improvements in low-resource languages. Δ denotes the improved accuracy when
scaling from 10B examples to 100B.

another label, such as “manager” when images
correspond to males or females. In this evaluation, we use the FairFace [39] dataset. The labels we compare are: “librarian” vs. “scientist”,
“nurse” vs. “doctor”, “housekeeper” vs. “homeowner”, “receptionist” vs. “executive” and “secretary” vs. “manager”. Again, we do not see a
reduction in association bias by simply increasing
the size of the training data.

Performance Disparity. Finally, one common
definition of fairness in machine learning is
maintaining similar performance across different
groups. See, for instance, Dehghani et al. [21]
and the related notions of “Equality of Opportunity” and “Equalized Odds” [31]. Table 5 shows
that scaling the data to 100 billion examples tends to reduce performance disparity, which is consistent
with the improvement in cultural diversity.
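
The disparity measure in Table 5 is described as the maximum gap across subgroups. A minimal sketch of that computation, using the Dollar Street per-income accuracies reported in Table 5 for the B model at the 10B data scale (the function name `max_gap_disparity` is our own):

```python
def max_gap_disparity(performance_by_group):
    """Largest performance gap between any two subgroups."""
    values = list(performance_by_group.values())
    return max(values) - min(values)


# Dollar Street 0-shot results by income level (Table 5, model B, 10B data).
dollar_street_b_10b = {
    "0-200": 31.6,
    "200-685": 44.0,
    "685-1998": 55.4,
    ">1998": 61.5,
}
gap = max_gap_disparity(dollar_street_b_10b)
print(f"{gap:.1f}")  # 29.9
```

Because the per-subgroup numbers in the table are rounded, recomputing the gap from them can differ from the reported disparity by about 0.1 in some rows.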

4.5. Transfer To Generative Models
We use PaliGemma [10] with both frozen and unfrozen vision component to assess the transferability of our vision models, which were contrastively
pre-trained on datasets of different scales. In
Table 6, when taking the noise level into consideration, we do not observe consistent performance
gains across downstream tasks as we scale the
pre-training dataset. More details can be found
in Appendix C.


Table 5 | Performance disparity results for various SigLIP models pretrained on 100 billion seen
examples of 1B, 10B, and 100B datasets. Here, disparity corresponds to the maximum gap across
subgroups in Dollar Street (by income level) and GeoDE (by geographic region). Pretraining on 100B
examples tends to improve disparity overall.
0-shot Dollar Street, performance per subgroup
Model  Data Scale   0-200  200-685  685-1998  >1998   Disparity
B      1B           29.4   43.9     56.5      62.0    32.5
B      10B          31.6   44.0     55.4      61.5    29.9
B      100B         32.0   44.3     56.3      61.0    29.0
L      1B           33.7   44.7     57.3      63.4    29.7
L      10B          35.7   47.8     58.7      65.5    29.8
L      100B         33.7   46.6     59.5      64.1    30.4
H      1B           32.3   44.9     58.4      64.5    32.2
H      10B          33.9   46.3     58.6      66.9    33.0
H      100B         34.1   48.2     62.2      66.1    32.1

0-shot GeoDE, performance per subgroup
Model  Data Scale   Africa  Americas  East-Asia  Europe  South-East Asia  West Asia  Disparity
B      1B           89.4    92.1      91.8       94.1    92.5             93.4       4.7
B      10B          88.4    91.8      91.4       94.0    92.2             93.0       5.5
B      100B         88.8    91.4      91.0       93.3    91.7             92.2       4.4
L      1B           92.0    94.0      94.0       95.2    94.2             94.9       3.2
L      10B          91.8    94.4      94.0       95.8    94.2             94.7       4.0
L      100B         93.5    95.1      95.4       96.2    95.0             95.8       2.8
H      1B           91.5    94.4      94.7       95.2    94.1             94.5       3.6
H      10B          93.4    95.4      95.0       96.5    95.1             95.6       3.0
H      100B         93.6    95.1      95.3       96.3    95.2             95.8       2.7

Data   Semantics  OCR    Multiling  RS     Avg
1B     76.0       66.8   67.0       92.3   73.6
10B    75.4       65.2   66.3       91.9   72.7
100B   76.4       67.0   66.9       92.1   73.9

1B     77.1       69.5   66.9       92.0   75.1
10B    76.4       66.9   66.0       91.8   73.7
100B   77.2       70.0   67.0       91.8   75.3

Table 6 | PaliGemma transfer results of ViT-L/16 models pretrained on 1B, 10B, and 100B examples, with frozen (top) and unfrozen (bottom) vision components. Results are aggregated.

5. Analysis

5.1. Data Quality Filtering
Raw web data is often too noisy for training effective vision-language models. To address this,
a common strategy is to use a data filter model
to remove less relevant image-text pairs. In this
work, we utilize the CLIP-L/14 model to filter
the raw data and retain 5 billion high-quality
English image-text pairs. For comparison, we
also train a classifier model on the raw web data and use it as a filter,
resulting in a filtered dataset of the same size.
Additionally, we sample an English subset of the
same size from the raw data to serve as a baseline.
We train ViT-L models on the three datasets and
present the results in Figure 4 and Appendix D.
The CLIP filter excels in Western-centric tasks,
consistent with data-centric research showing
that effective data filtering enhances model performance [1, 12, 25, 47]. However, all filtered
datasets underperform in other tasks, particularly
those involving cultural diversity. This illustrates
a key drawback of data filtering: it can inadvertently introduce biases into the filtered dataset,
in agreement with prior works [11, 28, 53].
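
The kind of CLIP-based filtering discussed in Section 5.1 can be sketched as scoring each image-text pair by the cosine similarity of its two embeddings and keeping the best-scoring pairs. The snippet below is only an illustration under our own assumptions: the embeddings are random stand-ins for the outputs of a filter model, and keeping a fixed top fraction (rather than thresholding the score) is a choice made here for simplicity, not a detail stated in this paper.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def filter_pairs(image_embs, text_embs, keep_fraction):
    """Return indices of the highest-scoring pairs, plus all scores."""
    scores = [cosine(i, t) for i, t in zip(image_embs, text_embs)]
    n_keep = max(1, round(keep_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return sorted(ranked[:n_keep]), scores


# Random stand-in embeddings (hypothetical; a real filter would use
# image and text embeddings produced by the filter model).
random.seed(0)
dim, n_pairs = 16, 200
image_embs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_pairs)]
text_embs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_pairs)]

kept, scores = filter_pairs(image_embs, text_embs, keep_fraction=0.25)
print(len(kept))  # 50
```

A filter of this form keeps whatever the scoring model rates highly, so any preference the scoring model has for particular languages or visual contexts carries over to the retained subset.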

[Figure 4: three panels, Average Western (left), Average Culture Diversity (middle), and Average Fairness (right); y-axis: err %; x-axis: Examples Seen (billion); curves: Baseline (en), CLIP, Classifier.]
Figure 4 | Quality filtering can hinder cultural diversity (middle) and fairness (right), even when it
benefits Western-centric (left) tasks. This observation holds for both the widely-used CLIP filter and a
classifier filter trained on web data.

5.2. Language Rebalancing
The low-resource languages collectively represent only 0.5% of our raw data, which prevents
the model from sufficiently learning the concepts that exist in these languages or areas. To address this,
we upsample each low-resource language to a
fixed 1% representation. This rebalancing, visualized in Figure 5, improves model performance on
the low-resource language benchmark. Performance on the high-resource languages decreases slightly but remains comparable (this also applies to other English-only zero-shot
retrieval tasks), so the net result is an overall improvement on the entire multilingual benchmark.
Additionally, we observe a mild improvement in
cultural diversity tasks, while other tasks show
slightly worse results, potentially due to the reduction in Western-centric examples, as most evaluations are based on the English language. Full
evaluation results can be found in Appendix E.
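
The rebalancing step can be illustrated with a small sketch that raises each low-resource language to a fixed 1% share and rescales the remaining languages proportionally so that the shares still sum to one. The language codes and raw shares below are hypothetical, and the proportional rescaling of the other languages is our assumption; the paper only states that each low-resource language is upsampled to 1%.

```python
def rebalance(shares, low_resource, target=0.01):
    """Set each low-resource language to `target`; rescale the rest."""
    boosted = {lang: target for lang in low_resource}
    remaining = 1.0 - sum(boosted.values())
    others = {l: s for l, s in shares.items() if l not in low_resource}
    scale = remaining / sum(others.values())
    rebalanced = {l: s * scale for l, s in others.items()}
    rebalanced.update(boosted)
    return rebalanced


# Hypothetical raw shares; the low-resource languages sum to 0.5%.
raw_shares = {"en": 0.60, "es": 0.395, "bn": 0.002, "sw": 0.001, "te": 0.002}
new_shares = rebalance(raw_shares, low_resource=["bn", "sw", "te"])

for lang in raw_shares:
    factor = new_shares[lang] / raw_shares[lang]
    print(f"{lang}: {raw_shares[lang]:.3f} -> {new_shares[lang]:.3f} (x{factor:.2f})")
```

In this toy setting the high-resource shares shrink only slightly, while each low-resource language is sampled several times more often than in the raw data.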
5.3. Qualitative Examples
We visualize the attention maps from the vision
models trained on different scales of data in Table 1. Models trained on larger data tend to
have more focused attention on semantically relevant regions. For example, in the “Igorot Dance”
image, the 100B-trained model captures finer details, such as intricate patterns on traditional decorations and culturally significant objects. In the
“Igloo” image, the 100B-trained model accurately
focuses on the igloo’s structural details (its dome
shape), unlike other models which are distracted
by background elements like mountains and ice.
Beyond low-resource concepts, 100B data can
also improve performance on common concepts.
As shown in the “Bison” image, models trained on
larger datasets more precisely capture the bison,
rather than the surrounding landscape. More
visualized examples can be found in Table 7.

6. Discussion
Data Filtering. Data filtering is a common technique used to improve data quality in vision-language pre-training. As demonstrated in Section 5.1, the CLIP filter notably improves the model’s
performance on the traditional tasks. Given the
noted impact of filtering on cultural diversity in
our experiments, we focus on the impact of scaling raw, unfiltered data, and leave the improvement of data quality at the 100 billion scale for
future work. We encourage the community to
conduct further research into new data filtering
techniques that preserve cultural diversity, as well
as novel training architectures or methods that
improve model inclusivity without requiring additional training data.
Limitations. The benchmarks used in this paper
to evaluate VLM inclusivity are necessarily limited, since inclusivity is a broad societal concept
that should not be reduced to a handful of metrics.
For instance, while we utilize Crossmodal-3600
in a zero-shot setting to assess multilinguality, it
only covers 36 languages.

7. Conclusion
In this paper, we investigate the impact of scaling
image-text data up to 100 billion unique examples on vision-language pre-training. We demonstrate that a scale of 100 billion image-text pairs
is beneficial for vision-language models in areas
beyond traditional Western-centric benchmarks,
such as cultural diversity, multilinguality, and reducing performance disparity across subgroups.
Hence, this data scale remains fundamentally important for the development of truly inclusive multimodal systems. We also investigate the impact
of applying quality filters, such as those based on
CLIP, to large-scale image-text datasets. These filters, though often beneficial for traditional tasks,
can negatively impact data diversity by reducing
the representation of certain cultural contexts.
Overall, our results highlight the importance of
data scale for VLMs. While traditional benchmarks may not benefit significantly from the scaling of noisy, raw web data to 100 billion, this
data scale remains crucial for training inclusive
vision-language models.

Acknowledgments
We thank Daniel Keysers and Jeremiah Harmse
for their insightful reviews and suggestions;
Matthias Minderer for valuable discussions and
experiments on scaling open-vocabulary detection; Lucas Beyer for input on multilingual rebalancing; and Google DeepMind at large for
providing a supportive research environment.

References
[1] A. Abbas, K. Tirumala, D. Simig, S. Ganguli, and A. S. Morcos. Semdedup: Data-efficient learning at web-scale through
semantic deduplication. arXiv preprint
arXiv:2303.09540, 2023.
[2] I. Alabdulmohsin, B. Neyshabur, and X. Zhai.
Revisiting neural scaling laws in language
and vision. In NeurIPS, 2022.
[3] I. Alabdulmohsin, X. Wang, A. Steiner,
P. Goyal, A. D’Amour, and X. Zhai. Clip
the bias: How useful is balancing data in
multimodal learning? In ICLR, 2024.
[4] I. Alabdulmohsin, X. Zhai, A. Kolesnikov,
and L. Beyer. Getting ViT in shape: Scaling laws for compute-optimal model design.
2024.
[5] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech,
I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning.
Advances in neural information processing
systems, 35:23716–23736, 2022.
[6] A. Ananthram, E. Stengel-Eskin, C. Vondrick, M. Bansal, and K. McKeown. See it
from my perspective: Diagnosing the western cultural bias of large vision-language
models in image understanding. arXiv
preprint arXiv:2406.11665, 2024.
[7] Y. Bahri, E. Dyer, J. Kaplan, J. Lee, and
U. Sharma. Explaining neural scaling laws.
arXiv preprint arXiv:2102.06701, 2021.
[8] Y. Bansal, B. Ghorbani, A. Garg, B. Zhang,
M. Krikun, C. Cherry, B. Neyshabur, and
O. Firat. Data scaling laws in NMT: The effect of noise and architecture. arXiv preprint
arXiv:2202.01994, 2022.
[9] C. Beleites, U. Neugebauer, T. Bocklitz,
C. Krafft, and J. Popp. Sample size planning
for classification models. Analytica chimica
acta, 760:25–33, 2013.
[10] L. Beyer, A. Steiner, A. S. Pinto,
A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen,
E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint
arXiv:2407.07726, 2024.
[11] A. Birhane, V. U. Prabhu, and E. Kahembwe.
Multimodal datasets: misogyny, pornography, and malignant stereotypes. arXiv
preprint arXiv:2110.01963, 2021.
[12] L. Cao, B. Zhang, C. Chen, Y. Yang, X. Du,
W. Zhang, Z. Lu, and Y. Zheng. Less is
more: Removing text-regions improves clip
training efficiency and robustness. arXiv
preprint arXiv:2305.05095, 2023.
[13] T. Chen, S. Kornblith, M. Norouzi, and
G. Hinton. A simple framework for contrastive learning of visual representations.
In International conference on machine learning, pages 1597–1607. PMLR, 2020.


[14] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint
arXiv:1504.00325, 2015.
[15] X. Chen, X. Wang, S. Changpinyo, A. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer,
et al. Pali: A jointly-scaled multilingual
language-image model. arXiv preprint
arXiv:2209.06794, 2022.
[16] X. Chen, X. Wang, L. Beyer, A. Kolesnikov,
J. Wu, P. Voigtlaender, B. Mustafa, S. Goodman, I. Alabdulmohsin, P. Padlewski,
et al. Pali-3 vision language models:
Smaller, faster, stronger. arXiv preprint
arXiv:2310.09199, 2023.
[17] M. Cherti, R. Beaumont, R. Wightman,
M. Wortsman, G. Ilharco, C. Gordon,
C. Schuhmann, L. Schmidt, and J. Jitsev.
Reproducible scaling laws for contrastive
language-image learning. In Proceedings of
the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 2818–2829,
2023.
[18] J. Cho, K. Lee, E. Shin, G. Choy, and S. Do.
How much data is needed to train a medical image deep learning system to achieve
necessary high accuracy? arXiv preprint
arXiv:1511.06348, 2015.
[19] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures
in the wild. In Proceedings of the IEEE Conf.
on Computer Vision and Pattern Recognition
(CVPR), 2014.
[20] C. Crawl. Common crawl dataset, 2021.
URL https://commoncrawl.org/.
[21] M. Dehghani, J. Djolonga, B. Mustafa,
P. Padlewski, J. Heek, J. Gilmer, A. P. Steiner,
M. Caron, R. Geirhos, I. Alabdulmohsin,
et al. Scaling vision transformers to 22 billion parameters. In ICML, 2023.
[22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li,
and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

[23] H. Dong, Z. Kang, W. Yin, X. Liang, C. Feng,
and J. Ran. Scalable vision language model
training via high quality data curation.
arXiv preprint arXiv:2501.05952, 2025.
[24] A. Dosovitskiy, L. Beyer, A. Kolesnikov,
D. Weissenborn, X. Zhai, T. Unterthiner,
M. Dehghani, M. Minderer, G. Heigold,
S. Gelly, et al. An image is worth 16x16
words: Transformers for image recognition
at scale. ICLR, 2020.
[25] A. Fang, A. M. Jose, A. Jain, L. Schmidt,
A. Toshev, and V. Shankar. Data filtering
networks. arXiv preprint arXiv:2309.17425,
2023.
[26] R. L. Figueroa, Q. Zeng-Treitler, S. Kandula,
and L. H. Ngo. Predicting sample size required for classification performance. BMC
medical informatics and decision making, 12
(1):1–10, 2012.
[27] S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase,
G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, E. Orgad, R. Entezari, G. Daras, S. Pratt, V. Ramanujan, Y. Bitton, K. Marathe, S. Mussmann,
R. Vencu, M. Cherti, R. Krishna, P. W. W.
Koh, O. Saukh, A. J. Ratner, S. Song,
H. Hajishirzi, A. Farhadi, R. Beaumont,
S. Oh, A. Dimakis, J. Jitsev, Y. Carmon,
V. Shankar, and L. Schmidt. Datacomp: In
search of the next generation of multimodal
datasets. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine,
editors, Advances in Neural Information Processing Systems, volume 36, pages 27092–
27112. Curran Associates, Inc., 2023.
[28] N. Garcia, Y. Hirota, Y. Wu, and
Y. Nakashima. Uncurated image-text
datasets: Shedding light on demographic
bias. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern
Recognition, pages 6957–6966, 2023.
[29] B. Ghorbani, O. Firat, M. Freitag, A. Bapna,
M. Krikun, X. Garcia, C. Chelba, and
C. Cherry. Scaling laws for neural
machine translation. arXiv preprint
arXiv:2109.07740, 2021.


[30] P. Goyal, Q. Duval, I. Seessel, M. Caron,
I. Misra, L. Sagun, A. Joulin, and P. Bojanowski. Vision models are more robust
and fair when pretrained on uncurated images without supervision. arXiv preprint
arXiv:2202.08360, 2022.
[31] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning,
2016. URL https://arxiv.org/abs/1610.02413.

[32] T. Henighan, J. Kaplan, M. Katz, M. Chen,
C. Hesse, J. Jackson, H. Jun, T. B. Brown,
P. Dhariwal, S. Gray, et al. Scaling laws for
autoregressive generative modeling. arXiv
preprint arXiv:2010.14701, 2020.
[33] J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M. Patwary,
M. Ali, Y. Yang, and Y. Zhou. Deep learning scaling is predictable, empirically. arXiv
preprint arXiv:1712.00409, 2017.
[34] J. Hoffmann, S. Borgeaud, A. Mensch,
E. Buchatskaya, T. Cai, E. Rutherford,
D. d. L. Casas, L. A. Hendricks, J. Welbl,
A. Clark, et al. Training compute-optimal
large language models. In NeurIPS, 2022.
[35] M. Hutter. Learning curve theory. arXiv
preprint arXiv:2102.04074, 2021.
[36] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh,
H. Pham, Q. Le, Y.-H. Sung, Z. Li, and
T. Duerig. Scaling up visual and vision-language representation learning with
noisy text supervision. In International conference on machine learning, pages 4904–
4916. PMLR, 2021.
[37] M. Johnson, P. Anderson, M. Dras, and
M. Steedman. Predicting accuracy on large
datasets from smaller pilot data. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages
450–455, Melbourne, Australia, 2018. Association for Computational Linguistics. doi:
10.18653/v1/P18-2072. URL https://aclanthology.org/P18-2072.
[38] J. Kaplan, S. McCandlish, T. Henighan, T. B.
Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws
for neural language models. arXiv preprint
arXiv:2001.08361, 2020.
[39] K. Karkkainen and J. Joo. Fairface: Face
attribute dataset for balanced race, gender,
and age for bias measurement and mitigation. In Proceedings of the IEEE/CVF Winter
Conference on Applications of Computer Vision, pages 1548–1558, 2021.
[40] J. N. Kather, C.-A. Weis, F. Bianconi, S. M.
Melchers, L. R. Schad, T. Gaiser, A. Marx,
and F. G. Zöllner. Multi-class texture analysis in colorectal cancer histology. Scientific
reports, 6:27988, 2016.
[41] A. Kolesnikov, L. Beyer, X. Zhai,
J. Puigcerver, J. Yung, S. Gelly, and
N. Houlsby. Big transfer (BiT): General
visual representation learning. In ECCV,
pages 491–507, 2020.
[42] J. Krause, M. Stark, J. Deng, and L. Fei-Fei.
3d object representations for fine-grained
categorization. In 4th International IEEE
Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
[43] A. Krizhevsky, G. Hinton, et al. Learning
multiple layers of features from tiny images.
2009.
[44] F.-F. Li, M. Andreeto, M. Ranzato, and
P. Perona. Caltech 101, Apr 2022.
[45] J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2:
Bootstrapping language-image pre-training
with frozen image encoders and large language models. In International conference
on machine learning, pages 19730–19742.
PMLR, 2023.
[46] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual
instruction tuning. Advances in neural information processing systems, 36, 2024.
[47] P. Maini, S. Goyal, Z. C. Lipton, J. Z. Kolter,
and A. Raghunathan. T-mars: Improving visual representations by circumventing text feature learning. arXiv preprint
arXiv:2307.03132, 2023.


[48] M. Minderer, A. Gritsenko, and N. Houlsby.
Scaling open-vocabulary object detection.
Advances in Neural Information Processing
Systems, 36, 2024.

[49] S. Mukherjee, P. Tamayo, S. Rogers,
R. Rifkin, A. Engle, C. Campbell, T. R.
Golub, and J. P. Mesirov. Estimating dataset
size requirements for classifying dna microarray data. Journal of computational biology, 10(2):119–142, 2003.
[50] T. Nguyen, M. Wallingford, S. Santy, W.-C. Ma, S. Oh, L. Schmidt, P. W. Koh,
and R. Krishna. Multilingual diversity
improves vision-language representations,
2024. URL https://arxiv.org/abs/2405.16915.
[51] O. M. Parkhi, A. Vedaldi, A. Zisserman, and
C. V. Jawahar. Cats and dogs. In 2012 IEEE
Conference on Computer Vision and Pattern
Recognition, pages 3498–3505, 2012. doi:
10.1109/CVPR.2012.6248092.
[52] H. Pham, Z. Dai, G. Ghiasi, K. Kawaguchi,
H. Liu, A. W. Yu, J. Yu, Y.-T. Chen, M.-T.
Luong, Y. Wu, et al. Combined scaling for
zero-shot transfer learning. Neurocomputing, 555:126658, 2023.
[53] A. Pouget, L. Beyer, E. Bugliarello, X. Wang,
A. P. Steiner, X. Zhai, and I. Alabdulmohsin.
No filter: Cultural and socioeconomic diversity in contrastive vision-language models.
In NeurIPS, 2024.
[54] A. Radford, J. W. Kim, C. Hallacy,
A. Ramesh, G. Goh, S. Agarwal, G. Sastry,
A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International
conference on machine learning, pages 8748–
8763. PMLR, 2021.
[55] V. V. Ramaswamy, S. Y. Lin, D. Zhao, A. Adcock, L. van der Maaten, D. Ghadiyaram,
and O. Russakovsky. Geode: a geographically diverse evaluation dataset for object
recognition. Advances in Neural Information
Processing Systems, 36, 2024.
[56] M. Richards, P. Kirichenko, D. Bouchacourt,
and M. Ibrahim. Does progress on object
recognition benchmarks improve real-world
generalization? In ICLR, 2024.
[57] W. A. G. Rojas, S. Diamos, K. R. Kini, D. Kanter, V. J. Reddi, and C. Coleman. The dollar
street dataset: Images representing the geographic and socioeconomic diversity of the
world. In Thirty-sixth Conference on Neural
Information Processing Systems Datasets and
Benchmarks Track, 2022.

[58] J. S. Rosenfeld, A. Rosenfeld, Y. Belinkov,
and N. Shavit. A constructive prediction of
the generalization error across scales. arXiv
preprint arXiv:1909.12673, 2019.
[59] C. Schuhmann, R. Beaumont, R. Vencu,
C. Gordon, R. Wightman, M. Cherti,
T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al. Laion-5b: An open large-scale
dataset for training next generation image-text models. Advances in Neural Information
Processing Systems, 35:25278–25294, 2022.
[60] P. Sharma, N. Ding, S. Goodman, and
R. Soricut. Conceptual captions: A cleaned,
hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of
ACL, 2018.
[61] U. Sharma and J. Kaplan. Scaling laws from
the data manifold dimension. JMLR, 23(9):
1–34, 2022.
[62] A. Steiner, A. Kolesnikov, X. Zhai, R. Wightman, J. Uszkoreit, and L. Beyer. How to
train your vit? data, augmentation, and
regularization in vision transformers. arXiv
preprint arXiv:2106.10270, 2021.
[63] A. Steiner, A. S. Pinto, M. Tschannen, D. Keysers, X. Wang, Y. Bitton, A. Gritsenko,
M. Minderer, A. Sherbondy, S. Long, S. Qin,
R. Ingle, E. Bugliarello, S. Kazemzadeh,
T. Mesnard, I. Alabdulmohsin, L. Beyer, and
X. Zhai. Paligemma 2: A family of versatile vlms for transfer, 2024. URL https://arxiv.org/abs/2412.03555.


[64] C. Sun, A. Shrivastava, S. Singh, and
A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In
Proceedings of the IEEE international conference on computer vision, pages 843–852,
2017.

[65] A. V. Thapliyal, J. Pont-Tuset, X. Chen, and
R. Soricut. Crossmodal-3600: A massively
multilingual multimodal evaluation dataset.
arXiv preprint arXiv:2205.12522, 2022.
[66] M. Tschannen, M. Kumar, A. Steiner, X. Zhai,
N. Houlsby, and L. Beyer. Image captioners
are scalable vision learners too. Advances in
Neural Information Processing Systems, 36,
2024.
[67] C. Wah, S. Branson, P. Welinder, P. Perona,
and S. Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
[68] B. Wan, M. Tschannen, Y. Xian, F. Pavetic,
I. Alabdulmohsin, X. Wang, A. S. Pinto,
A. Steiner, L. Beyer, and X. Zhai. Locca:
Visual pretraining with location-aware captioners. arXiv preprint arXiv:2403.19596,
2024.
[69] T. Weyand, A. Araujo, B. Cao, and J. Sim.
Google landmarks dataset v2-a large-scale
benchmark for instance-level recognition
and retrieval. In Proceedings of the IEEE/CVF
conference on computer vision and pattern
recognition, pages 2575–2584, 2020.
[70] F. Wilcoxon. Individual comparisons by
ranking methods. In Breakthroughs in statistics: Methodology and distribution, pages
196–202. Springer, 1992.
[71] B. Xiao, H. Wu, W. Xu, X. Dai, H. Hu, Y. Lu,
M. Zeng, C. Liu, and L. Yuan. Florence-2: Advancing a unified representation for
a variety of vision tasks. In Proceedings of
the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 4818–4829,
2024.
[72] L. Xue. mt5: A massively multilingual
pre-trained text-to-text transformer. arXiv
preprint arXiv:2010.11934, 2020.
[73] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual
denotations: New similarity metrics for semantic inference over event descriptions.
Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
[74] J. Yu, Z. Wang, V. Vasudevan, L. Yeung,
M. Seyedhosseini, and Y. Wu. Coca:
Contrastive captioners are image-text
foundation models. arXiv preprint
arXiv:2205.01917, 2022.
[75] L. Yuan, D. Chen, Y.-L. Chen, N. Codella,
X. Dai, J. Gao, H. Hu, X. Huang, B. Li,
C. Li, et al. Florence: A new foundation
model for computer vision. arXiv preprint
arXiv:2111.11432, 2021.
[76] X. Zhai, A. Kolesnikov, N. Houlsby, and
L. Beyer. Scaling vision transformers. In
CVPR, 2022.
[77] X. Zhai, X. Wang, B. Mustafa, A. Steiner,
D. Keysers, A. Kolesnikov, and L. Beyer. Lit:
Zero-shot transfer with locked-image text
tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18123–18133, 2022.
[78] X. Zhai, B. Mustafa, A. Kolesnikov, and
L. Beyer. Sigmoid loss for language image
pre-training, 2023. URL https://arxiv.org/abs/2303.15343.
[79] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced
large language models. arXiv preprint
arXiv:2304.10592, 2023.
[80] W. Zhu, J. Hessel, A. Awadalla, S. Y. Gadre,
J. Dodge, A. Fang, Y. Yu, L. Schmidt, W. Y.
Wang, and Y. Choi. Multimodal c4: An open,
billion-scale corpus of images interleaved
with text. Advances in Neural Information
Processing Systems, 36, 2024.


A. Qualitative Examples
Table 7 | The attention map visualization of the ViT-L/16 models trained on different scales of
data. Images are selected to represent cultures in Western-centric countries and countries where
low-resource languages are spoken.
Concept   Image   1B   10B   100B
Street (New York) 3
Pub (London) 4
Bison (Yellowstone) 5
Igorot Dance (Igorot) 6
Kathputli Kala Chitra (Hindi) 7
Igloo (Inuit) 8

3 By Terabass, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=134418052
4 By Ricardalovesmonuments - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=122810839
5 Source: Yellowstone National Park, https://www.yellowstonenationalparklodges.com/connect/yellowstone-hotspot/yellowstone-where-the-bison-roam/
6 Source: Itogon, https://itogon.wordpress.com/2012/04/26/book-goes-to-heart-of-igorot-people/
7 Source: The Better India, https://thebetterindia.com/57220/journey-indian-handicraft-landscape/
8 Source: https://commons.wikimedia.org/w/index.php?curid=3648025


Pohela Boishakh (Bengali) 9

9 Source: EyeNews, https://www.eyenews.news/english/Today-is-Pahela-Baishakh-the-first-day-of-Bengal-1430/757


B. Evaluations of Data Scaling
Table 8 | Detailed evaluation results of ViT-B/L/H models on 1/10/100 billion scale datasets. All
metrics are measured by error rate, with the exception of “Representation Bias”, which is measured
by disparity, where lower values are better.
                                        ViT-B/16             ViT-L/16             ViT-H/16
Metric                                  1B     10B    100B   1B     10B    100B   1B     10B    100B

Category: Western
ImageNet 0-shot Classification          41.21  39.35  39.04  31.23  29.70  28.49  29.60  25.60  24.90
Cifar100 0-shot Classification          36.62  35.87  36.80  25.02  23.75  23.36  23.49  19.79  21.42
Pet 0-shot Classification               25.40  23.71  22.27  14.36  12.46   9.46  10.33   7.47   7.17
ImageNet 10-shot Classification         46.65  45.63  44.74  35.11  34.95  33.71  32.44  29.76  29.34
Cifar100 10-shot Classification         38.73  38.63  39.02  27.50  26.70  25.49  25.76  23.79  24.21
Pet 10-shot Classification              22.95  23.19  22.08  12.32  12.48  11.80  10.85   9.13   8.67
Bird 10-shot Classification             53.80  53.47  53.90  44.05  45.25  44.29  41.65  39.13  36.31
Caltech 10-shot Classification           8.37   8.33   8.23   6.41   7.40   7.53   5.70   6.02   8.93
Cars 10-shot Classification             18.29  16.79  17.60  11.14  11.33  11.47  11.32  10.30   9.60
Colorectal 10-shot Classification       26.53  29.23  27.00  24.00  23.53  22.57  25.17  26.17  25.87
DTD 10-shot Classification              29.73  30.85  30.90  28.46  27.07  27.93  29.20  26.12  26.76
COCO Image-Text 0-shot Retrieval        56.46  51.62  53.44  49.70  47.18  45.28  48.62  42.04  42.48
COCO Text-Image 0-shot Retrieval        70.90  68.84  70.01  68.16  64.32  62.51  64.86  60.32  59.29
Flickr Image-Text 0-shot Retrieval      24.20  21.20  21.10  20.40  15.50  16.60  16.80  13.50  13.90
Flickr Text-Image 0-shot Retrieval      43.12  40.26  40.42  39.94  32.32  32.52  34.26  28.46  28.00

Category: Culture
Dollar Street 0-shot Classification     52.04  51.88  51.60  50.23  48.10  49.03  50.00  48.58  47.35
Dollar Street 10-shot Classification    77.69  75.81  72.12  63.56  64.09  58.29  64.60  59.10  53.69
GeoDE 0-shot Classification              7.85   8.27   8.65   6.01   5.90   4.88   5.99   4.87   4.81
GeoDE/country 10-shot Classification    72.75  71.47  71.36  61.94  62.31  57.85  56.94  50.22  47.55
GeoDE/region 10-shot Classification     61.09  60.80  59.18  54.21  53.59  48.29  54.56  47.63  44.68
GLDv2 0-shot Classification             65.05  60.96  59.40  50.39  46.37  45.72  48.05  40.08  38.78

Category: Fairness
Representation Bias                     33.15  34.54  35.21  38.18  36.35  35.51  36.76  35.01  36.61
Income 0-200 Classification             70.57  68.43  67.97  66.30  64.35  66.30  67.69  66.11  65.92
Income 200-685 Classification           56.07  55.98  55.70  55.33  52.18  53.38  55.14  53.66  51.81
Income 685-1998 Classification          43.45  44.57  43.73  42.71  41.32  40.48  41.60  41.41  37.79
Income >1998 Classification             38.05  38.51  38.98  36.56  34.51  35.91  35.53  33.12  33.86
GeoDE: Africa                           10.58  11.56  11.15   7.99   8.24   6.55   8.46   6.56   6.40
GeoDE: Americas                          7.94   8.16   8.58   6.03   5.57   4.92   5.60   4.57   4.86
GeoDE: EastAsia                          8.15   8.57   8.99   5.98   5.96   4.56   5.30   5.01   4.68
GeoDE: Europe                            5.92   6.02   6.75   4.81   4.20   3.75   4.83   3.53   3.75
GeoDE: SouthEastAsia                     7.51   7.81   8.26   5.78   5.78   5.02   5.86   4.89   4.76
GeoDE: WestAsia                          6.57   7.01   7.85   5.11   5.30   4.19   5.50   4.42   4.19

Category: Multiling
XM3600 Image-Text: Arabic               61.78  53.42  53.36  53.58  45.00  44.56  52.25  41.64  41.00
XM3600 Image-Text: Bengali              95.69  80.64  77.06  90.81  66.36  63.75  88.17  61.22  56.69
XM3600 Image-Text: Czech                60.78  51.89  50.83  52.31  43.81  42.22  49.94  40.11  39.44
XM3600 Image-Text: Danish               55.58  45.39  45.75  45.08  35.06  31.00  43.03  29.92  28.75
XM3600 Image-Text: German               39.47  31.53  31.78  30.61  24.28  24.03  29.17  22.75  21.89
XM3600 Image-Text: Greek                74.36  63.00  61.86  67.86  53.64  50.14  65.67  49.50  47.33
XM3600 Image-Text: English              56.53  55.03  55.50  54.14  52.42  51.67  53.22  51.42  49.64
XM3600 Image-Text: Spanish              49.17  42.94  44.22  41.56  38.44  35.81  40.03  33.89  34.28
XM3600 Image-Text: Persian              58.94  51.17  51.58  49.64  38.97  40.17  46.61  33.72  34.06
XM3600 Image-Text: Finnish              70.64  53.83  53.61  59.25  42.67  39.06  57.39  34.83  32.86
XM3600 Image-Text: Filipino             87.86  82.06  81.92  82.72  72.86  71.36  81.31  66.14  63.03
XM3600 Image-Text: French               47.08  38.92  39.06  39.08  31.78  29.92  36.58  28.53  28.19
XM3600 Image-Text: Hindi                83.53  74.78  72.39  77.67  65.67  63.47  76.92  62.33  60.64
XM3600 Image-Text: Croatian             64.53  53.28  51.33  53.08  37.94  35.78  47.81  32.44  30.44
XM3600 Image-Text: Hungarian            64.50  49.06  47.53  53.81  38.64  34.42  51.22  32.67  30.36


XM3600 Image-Text: Indonesian
XM3600 Image-Text: Italian
XM3600 Image-Text: Hebrew
XM3600 Image-Text: Japanese
XM3600 Image-Text: Korean
XM3600 Image-Text: Maori
XM3600 Image-Text: Dutch
XM3600 Image-Text: Norwegian
XM3600 Image-Text: Polish
XM3600 Image-Text: Portuguese
XM3600 Image-Text: Quechua
XM3600 Image-Text: Romanian
XM3600 Image-Text: Russian
XM3600 Image-Text: Swedish
XM3600 Image-Text: Swahili
XM3600 Image-Text: Telugu
XM3600 Image-Text: Thai
XM3600 Image-Text: Turkish
XM3600 Image-Text: Ukrainian
XM3600 Image-Text: Vietnamese
XM3600 Image-Text: Chinese
XM3600 Text-Image: Arabic
XM3600 Text-Image: Bengali
XM3600 Text-Image: Czech
XM3600 Text-Image: Danish
XM3600 Text-Image: German
XM3600 Text-Image: Greek
XM3600 Text-Image: English
XM3600 Text-Image: Spanish
XM3600 Text-Image: Persian
XM3600 Text-Image: Finnish
XM3600 Text-Image: Filipino
XM3600 Text-Image: French
XM3600 Text-Image: Hindi
XM3600 Text-Image: Croatian
XM3600 Text-Image: Hungarian
XM3600 Text-Image: Indonesian
XM3600 Text-Image: Italian
XM3600 Text-Image: Hebrew
XM3600 Text-Image: Japanese
XM3600 Text-Image: Korean
XM3600 Text-Image: Maori
XM3600 Text-Image: Dutch
XM3600 Text-Image: Norwegian
XM3600 Text-Image: Polish
XM3600 Text-Image: Portuguese
XM3600 Text-Image: Quechua
XM3600 Text-Image: Romanian
XM3600 Text-Image: Russian
XM3600 Text-Image: Swedish
XM3600 Text-Image: Swahili
XM3600 Text-Image: Telugu
XM3600 Text-Image: Thai
XM3600 Text-Image: Turkish
XM3600 Text-Image: Ukrainian
XM3600 Text-Image: Vietnamese
XM3600 Text-Image: Chinese
Avg Western 0-shot Classification
Avg Western 10-shot Classification
Avg Western 0-shot Retrieval

Multiling

Western

44.81
48.58
67.06
67.36
58.64
99.61
53.97
56.56
53.97
51.03
95.53
64.56
51.56
54.03
92.14
98.06
79.33
60.33
62.39
54.31
63.92
73.77
97.19
71.81
68.23
55.15
82.61
62.32
57.35
71.80
81.00
93.60
56.70
91.01
75.52
74.24
60.08
57.90
76.50
76.74
70.82
99.78
63.50
70.36
63.73
62.16
98.46
74.48
61.65
66.11
96.30
98.76
86.81
72.31
75.01
70.38
73.98
34.41
30.63
48.67

38.14
41.00
50.28
55.67
49.61
99.50
47.47
46.78
44.89
44.19
94.08
51.39
42.36
44.25
88.17
87.08
68.67
50.03
52.25
45.33
51.08
67.79
89.25
64.49
59.97
47.80
75.69
59.41
52.74
65.18
70.80
90.28
50.23
86.55
67.53
63.83
52.90
51.51
64.76
69.20
64.88
99.78
59.25
63.58
57.39
57.16
97.94
65.48
53.83
59.05
94.01
92.69
80.38
65.24
66.08
64.82
64.78
32.98
30.77
45.48

37.08
40.86
49.86
55.42
49.53
99.42
48.78
47.89
44.22
44.39
93.89
52.03
42.28
45.69
88.72
80.53
67.47
50.06
49.78
45.22
51.19
68.49
89.53
65.48
61.73
49.18
75.71
60.78
55.49
65.58
68.28
91.07
50.57
86.09
66.85
63.53
53.96
52.08
62.76
68.99
67.23
99.78
59.05
63.44
57.71
57.93
97.85
65.11
54.17
60.50
94.73
90.40
79.47
65.17
65.35
64.64
64.96
32.70
30.43
46.24

35.83
38.42
56.75
59.00
50.75
99.58
47.11
45.33
45.97
43.33
94.64
52.19
42.78
44.50
89.94
96.08
72.61
52.78
55.19
43.19
53.67
67.49
95.17
65.52
60.01
45.85
77.96
58.97
52.64
62.65
72.96
90.89
48.33
87.43
66.68
66.49
50.28
47.96
69.11
69.56
64.52
99.73
57.41
61.54
56.06
54.54
97.88
65.20
53.47
58.78
94.55
97.76
81.83
65.21
68.84
61.84
64.87
23.54
23.62
44.55

28.47
33.33
39.44
45.42
40.33
99.22
41.14
36.11
35.50
36.03
93.53
38.31
35.14
34.94
81.33
76.67
59.47
40.72
41.25
34.00
42.47
59.74
79.72
58.57
51.18
39.88
69.11
57.57
49.06
55.06
59.11
83.98
43.31
81.38
54.42
53.73
44.05
42.80
56.25
62.34
56.76
99.56
52.02
53.81
47.92
49.48
98.14
54.05
47.58
50.72
90.09
87.47
74.60
55.12
57.74
54.00
59.03
21.97
23.59
39.83

28.53
30.97
35.72
44.97
38.31
99.25
38.39
34.28
34.11
34.56
93.92
35.39
33.22
34.78
79.47
69.69
58.86
39.72
37.83
32.44
42.50
59.86
77.31
58.18
49.50
39.75
67.35
56.32
48.31
56.09
56.24
83.70
42.10
80.01
54.22
50.75
43.97
42.60
54.14
58.44
56.51
99.62
51.48
52.99
47.09
48.72
98.04
52.41
45.36
51.82
89.57
83.03
73.67
56.70
55.32
53.39
57.33
20.44
23.10
39.23

33.39
36.47
52.03
58.47
46.81
99.31
44.56
43.39
41.75
41.14
94.58
47.92
41.19
40.69
88.92
96.36
71.25
48.56
52.75
40.75
54.17
65.87
94.22
63.59
56.72
43.80
75.68
58.15
51.24
59.79
70.79
89.55
47.52
87.71
63.21
64.26
49.27
48.03
65.88
69.16
61.52
99.75
55.49
60.04
53.28
52.44
98.18
61.69
51.60
55.34
93.85
98.18
82.21
62.35
66.07
58.46
65.25
21.14
22.76
41.13

24.86
29.64
33.86
42.22
35.39
98.92
38.06
31.81
33.00
32.69
93.06
32.36
31.97
31.14
76.86
73.08
56.86
36.56
36.94
29.06
40.53
56.22
76.36
55.79
46.53
36.56
65.45
56.40
47.27
52.93
51.07
80.61
40.48
79.21
50.71
48.31
41.45
40.62
51.49
57.06
53.57
99.67
49.88
49.16
45.05
47.48
98.28
48.77
43.58
47.66
87.47
84.44
73.31
53.59
54.18
50.29
56.15
17.62
21.30
36.08


25.33
28.89
30.81
37.94
35.08
99.17
37.44
30.19
31.06
32.28
92.78
30.11
30.31
30.78
74.14
65.31
52.78
34.94
33.25
29.08
38.42
54.91
72.42
55.07
45.46
36.99
64.10
55.82
46.62
49.69
49.35
77.92
39.96
78.22
48.53
45.72
40.81
40.34
49.99
54.78
52.67
99.51
49.10
48.20
44.49
46.64
98.26
47.09
43.08
47.93
85.67
79.57
69.67
52.19
50.84
48.76
56.68
17.83
21.21
35.92


Avg Western Classification          31.66  31.37  31.05  23.60  23.15  22.37  22.32  20.30  20.29

Culture
Avg Dollar Street Classification    64.87  63.85  61.86  56.89  56.09  53.66  57.30  53.84  50.52
Avg GeoDE Classification            47.23  46.85  46.39  40.72  40.60  37.01  39.16  34.24  32.35

Fairness
Avg Income Classification           52.03  51.87  51.59  50.22  48.09  49.02  49.99  48.57  47.35
Avg Geographic Classification        7.78   8.19   8.59   5.95   5.84   4.83   5.92   4.83   4.77
Avg Demography Classification       25.24  24.43  27.49  24.91  26.47  25.50  25.50  25.13  27.22

Multiling
Avg Multiling: Low-Resource Lang    91.22  84.27  83.16  87.73  77.14  75.01  86.58  73.69  70.93
Avg Multiling: High-Resource Lang   63.66  55.42  55.53  55.54  46.75  45.43  53.38  43.11  41.81

Average Western-centric             36.20  35.13  35.10  29.19  27.60  26.87  27.34  24.51  24.46
Average Cultural Diversity          56.08  54.87  53.72  47.72  46.72  44.01  46.69  41.75  39.48
Average Fairness                    25.44  25.46  26.08  23.87  23.36  23.01  23.88  22.80  22.70
Average Multilinguality             65.23  56.09  55.61  57.52  47.23  45.40  55.38  43.33  41.63


C. Evaluations of Transferability to Generative Models
The downstream tasks in Table 9 are categorized into the following groups, whose averages are reported in Table 6:
1. Semantics: “COCOcap”, “NoCaps”, “COCO-35L (en)”, “XM3600 (en)”, “OKVQA”, “AOKVQA-MC (val)”, “AOKVQA-DA (val)”, “GQA”, “NLVR2”, “MARVL (avg5)”, “VizWizVQA (val)”, “TallyQA (simple)”, “TallyQA (complex)”, “CountBenchQA”, “RefCOCO (testA)”, “RefCOCO (testB)”, “RefCOCO+ (testA)”, “RefCOCO+ (testB)”, “RefCOCOg (test)”
2. OCR: “DocVQA (val)”, “OCR-VQA”, “ChartQA (avg)”, “ChartQA (human)”, “ChartQA (aug)”, “SciCap”, “AI2D”, “ScienceQA”, “InfoVQA (val)”, “TextCaps”, “TextVQA (val)”, “ST-VQA (val)”, “Screen2Words”, “WidgetCap”
3. Multilinguality: “xGQA (avg8)”, “XM3600 (avg36)”, “COCO-35L (avg35)”
4. Remote Sensing: “RSVQA-lr”, “RSVQA-hr (test)”, “RSVQA-hr (test2)”
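
The grouping above can be sketched as a small lookup plus an average. The snippet below is only an illustration: it assumes each group score is the unweighted mean of its member tasks, and the names `GROUPS`, `SCORES_FROZEN_1B` and `group_averages` are hypothetical. Under that assumption, the Frozen ViT, 1B Data column of Table 9 gives back the reported “Avg Multilinguality” (66.9) and “Avg Remote Sensing” (92.0).

```python
from statistics import mean

# Two of the four task groups listed above (the smaller ones, for brevity).
GROUPS = {
    "Multilinguality": ["xGQA (avg8)", "XM3600 (avg36)", "COCO-35L (avg35)"],
    "Remote Sensing": ["RSVQA-lr", "RSVQA-hr (test)", "RSVQA-hr (test2)"],
}

# Per-task scores copied from Table 9 (Frozen ViT, 1B Data).
SCORES_FROZEN_1B = {
    "xGQA (avg8)": 55.2,
    "XM3600 (avg36)": 37.9,
    "COCO-35L (avg35)": 107.6,
    "RSVQA-lr": 93.0,
    "RSVQA-hr (test)": 92.5,
    "RSVQA-hr (test2)": 90.4,
}


def group_averages(scores, groups):
    """Unweighted mean per group, rounded to one decimal (assumed aggregation rule)."""
    return {
        name: round(mean(scores[task] for task in tasks), 1)
        for name, tasks in groups.items()
    }


averages = group_averages(SCORES_FROZEN_1B, GROUPS)
print(averages)
```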

Table 9 | Detailed evaluation results of the transferability of contrastively trained vision models
(ViT-L/16) to generative vision-language models (PaliGemma), with both frozen and unfrozen setups.
Task-specific numbers are reported for vision models trained on 1 billion, 10 billion, and 100 billion
raw examples, respectively, using PaliGemma’s default fine-tuning configuration.
                     Frozen ViT                       Unfrozen ViT
Metric               1B Data   10B Data   100B Data   1B Data   10B Data   100B Data
COCOcap              134.6     132.9      134.4       135.0     132.1      134.0
NoCaps               114.1     110.5      112.8       113.4     111.4      113.3
COCO-35L (avg35)     107.6     105.9      108.0       107.7     106.8      107.8
COCO-35L (avg34)     106.9     105.2      107.3       107.0     106.0      107.1
COCO-35L (en)        130.6     130.4      133.4       132.4     132.5      133.4
XM3600 (en)          75.5      74.9       75.2        75.3      75.4       76.0
XM3600 (avg36)       37.9      36.9       38.0        37.7      37.5       38.0
Screen2Words         108.9     107.5      109.9       105.0     105.3      105.5
TextCaps             86.5      79.3       93.2        87.6      81.8       83.8
SciCap               149.7     146.9      150.0       146.1     144.6      147.1
WidgetCap            120.1     109.6      117.9       113.3     108.4      114.9
VQAv2 (minival)      79.4      78.8       79.8        79.2      78.6       78.6
OKVQA                60.4      59.6       59.7        59.6      59.7       59.9
AOKVQA-MC (val)      74.2      72.7       73.0        73.0      72.7       74.2
AOKVQA-DA (val)      58.5      56.8       57.3        59.1      57.7       57.9
GQA                  63.4      63.5       63.6        63.8      63.0       63.5
NLVR2                87.5      86.7       87.2        86.4      86.4       87.0
MARVL (avg5)         76.7      76.2       76.6        76.3      76.8       77.0
AI2D                 69.8      70.0       70.6        68.2      68.5       68.6
ScienceQA            95.4      94.9       94.4        94.5      92.9       94.7
RSVQA-lr             93.0      92.4       92.3        93.6      92.8       93.0
RSVQA-hr (test)      92.5      92.5       92.7        92.6      92.6       92.6
RSVQA-hr (test2)     90.4      90.4       90.5        90.5      90.4       90.6
ChartQA (avg)        45.1      43.6       45.0        41.4      40.3       42.5
ChartQA (human)      31.8      31.8       32.6        29.8      28.3       30.5
ChartQA (aug)        58.5      55.4       57.4        53.0      52.3       54.5
VizWizVQA (val)      72.3      71.2       72.8        72.0      71.6       71.9
TallyQA (simple)     76.6      75.7       75.9        76.6      75.7       76.9
TallyQA (complex)    65.0      65         65.5        65.4      64.5       65.3
CountBenchQA         68.2      69.0       67.3        60.6      61.2       63.7
OCR-VQA              68.3      67.5       68.2        66.9      66.0       67.1
TextVQA (val)        44.5      41.4       44.7        41.2      40.4       41.2
DocVQA (val)         25.0      23.5       25.8        23.4      21.7       23.1
InfoVQA (val)        22.3      22.2       23          21.4      22.0       22.1
ST-VQA (val)         46.6      42.8       46.7        43.5      40.1       43.2
xGQA (avg8)          55.2      55.2       55          55.6      54.5       54.8
xGQA (avg7)          54.1      54.0       53.8        54.5      53.3       53.6
RefCOCO (testA)      67.4      67.5       67.9        64.5      64.2       65.1
RefCOCO (testB)      62.7      62.0       63.8        60.2      59.6       60.9
RefCOCO+ (testA)     63        62.7       63.5        60.2      59.9       60.3
RefCOCO+ (testB)     55.6      54.9       56.2        53.2      52.5       53.3
RefCOCOg (test)      59.1      58.9       60          56.5      56.1       57.2
Avg Semantics        77.1      76.4       77.2        76.0      75.4       76.4
Avg OCR              69.5      66.9       70.0        66.8      65.2       67.0
Avg Multilinguality  66.9      66.0       67.0        67.0      66.3       66.9
Avg Remote Sensing   92.0      91.8       91.8        92.3      91.9       92.1
Avg                  75.1      73.7       75.3        73.6      72.7       73.9


D. Evaluations of Data Quality Filtering
Table 10 | Detailed evaluation results of data quality filtering on ViT-L/16 models. All evaluations are
conducted on datasets of 5 billion image-text pairs and across different numbers of seen examples. All
metrics are measured by error rate, with the exception of “Representation Bias”, which is measured
by disparity.
Metric

Filter

1B

5B

10B

20B

30B

ImageNet 0-shot Classification

Baseline (en)
CLIP filtered
Other filtered

34.67
31.18
34.50

28.17
26.76
29.52

26.68
25.14
28.13

26.15
24.39
26.70

24.32
23.90
26.45

Cifar100 0-shot Classification

Baseline (en)
CLIP filtered
Other filtered

33.05
31.69
36.07

26.08
26.96
35.27

24.37
25.37
29.95

24.52
24.68
32.58

23.99
25.76
30.78

Pet 0-shot Classification

Baseline (en)
CLIP filtered
Other filtered

17.25
13.68
14.04

11.99
10.49
9.62

11.69
8.78
8.99

9.13
8.59
7.28

8.72
8.23
6.62

ImageNet 10-shot Classification

Baseline (en)
CLIP filtered
Other filtered

42.41
38.57
38.32

35.25
32.53
32.32

33.17
30.60
30.42

33.17
29.20
29.05

30.68
28.72
28.46

Cifar100 10-shot Classification

Baseline (en)
CLIP filtered
Other filtered

36.61
32.83
35.30

30.02
28.44
35.56

27.39
28.04
31.18

27.23
26.20
32.26

26.82
27.40
31.79

Pet 10-shot Classification

Baseline (en)
CLIP filtered
Other filtered

22.95
17.31
14.15

16.93
11.72
10.38

15.32
10.44
9.08

15.26
8.97
7.63

11.72
8.83
7.52

Bird 10-shot Classification

Baseline (en)
CLIP filtered
Other filtered

41.18
32.38
34.57

31.69
25.20
27.01

29.91
23.85
26.30

29.60
22.21
24.65

27.37
21.95
23.73

Caltech 10-shot Classification

Baseline (en)
CLIP filtered
Other filtered

10.45
11.18
8.97

9.94
10.68
9.25

9.34
10.44
9.01

9.63
10.50
8.30

9.60
10.50
9.06

Cars 10-shot Classification

Baseline (en)
CLIP filtered
Other filtered

16.47
13.07
16.84

11.03
9.70
13.07

10.16
8.89
12.52

10.05
7.75
11.30

8.94
8.01
11.30

Colorectal Histology 10-shot Classification

Baseline (en)
CLIP filtered
Other filtered

27.80
25.97
24.53

27.17
22.90
24.70

24.77
20.80
25.47

27.03
24.23
27.10

25.33
27.13
26.53

DTD 10-shot Classification

Baseline (en)
CLIP filtered
Other filtered

31.12
29.20
28.09

26.91
25.69
26.81

26.33
25.37
24.73

26.97
23.51
24.52

26.86
23.72
23.56

COCO Image-Text 0-shot Retrieval

Baseline (en)
CLIP filtered
Other filtered

46.80
41.06
42.92

40.28
36.04
38.32

39.30
36.48
36.80

39.18
34.84
35.96

37.04
34.02
36.24

COCO Text-Image 0-shot Retrieval

Baseline (en)
CLIP filtered
Other filtered

62.26
59.11
60.53

56.78
55.27
56.01

54.78
54.45
54.60

55.22
53.12
53.23

53.20
53.03
53.27

Flickr Image-Text 0-shot Retrieval

Baseline (en)
CLIP filtered
Other filtered

16.70
14.80
16.70

11.30
9.90
13.80

11.30
9.70
12.60

11.30
9.60
13.10

10.90
8.90
12.00

Flickr Text-Image 0-shot Retrieval

Baseline (en)
CLIP filtered
Other filtered

32.26
29.52
32.84

24.78
24.98
27.18

24.74
23.34
26.48

24.90
22.12
24.82

22.66
22.02
24.32

Dollar Street 0-shot Classification

Baseline (en)
CLIP filtered
Other filtered

54.67
53.71
50.23

50.44
52.58
47.63

49.81
51.88
47.86

49.98
50.63
47.45

49.37
51.44
47.08

Dollar Street 10-shot Classification

Baseline (en)
CLIP filtered
Other filtered

84.87
88.86
90.16

79.27
84.59
89.46

77.18
84.73
87.91

76.21
82.80
88.72

72.54
82.80
87.77

GeoDE 0-shot Classification

Baseline (en)
CLIP filtered
Other filtered

8.98
9.64
9.50

6.48
8.54
7.69

6.43
8.02
7.50

6.26
7.42
7.50

6.23
7.22
7.53

GeoDE (country) 10-shot Classification

Baseline (en)
CLIP filtered
Other filtered

84.29
85.82
91.37

77.28
81.98
89.52

73.22
80.11
88.30

73.37
78.08
87.65

68.85
78.24
86.76

GeoDE (region) 10-shot Classification

Baseline (en)
CLIP filtered
Other filtered

66.67
70.68
75.82

61.66
68.16
72.39

57.71
66.99
72.95

58.77
64.81
72.13

55.78
63.68
71.27

GLDv2 0-shot Classification

Baseline (en)
CLIP filtered
Other filtered

65.50
61.15
80.87

53.18
52.46
74.06

50.13
49.55
72.37

49.48
47.41
72.37

44.16
46.37
70.17

Representation Bias

Baseline (en)
CLIP filtered
Other filtered

33.89
11.46
39.31

28.22
19.14
36.44

36.00
20.03
39.01

33.52
26.57
40.57

30.96
14.05
35.51

Income 0-200 Classification

Baseline (en)
CLIP filtered
Other filtered

71.31
69.36
69.36

67.22
69.36
67.97

68.34
68.71
65.65

67.50
66.67
66.11

67.04
67.87
66.67

Income 200-285 Classification

Baseline (en)
CLIP filtered
Other filtered

60.15
58.48
54.22

55.33
57.46
50.88

54.87
56.63
52.64

54.49
54.59
51.16

55.33
56.63
51.16

Income 285-685 Classification

Baseline (en)
CLIP filtered
Other filtered

46.61
46.43
40.95

42.99
44.75
39.09

41.04
44.20
39.37

42.43
42.90
39.37

40.76
43.45
37.70

Income >1998 Classification

Baseline (en)
CLIP filtered
Other filtered

40.56
40.56
36.37

36.19
38.70
32.56

34.98
37.95
33.77

35.44
38.33
33.12

34.33
37.77
32.74

Africa

Baseline (en)
CLIP filtered
Other filtered

11.51
11.00
12.04

8.19
9.74
9.97

7.88
9.37
9.51

7.72
9.28
9.85

7.85
8.44
9.88

Americas

Baseline (en)
CLIP filtered
Other filtered

8.59
9.57
9.63

6.74
8.60
7.68

6.15
8.30
7.32

6.37
7.29
7.53

6.27
7.16
7.48

EastAsia

Baseline (en)
CLIP filtered
Other filtered

9.90
10.45
10.52

7.10
9.34
8.92

7.37
8.88
8.63

7.29
7.72
8.21

6.71
7.67
8.48

Europe

Baseline (en)
CLIP filtered
Other filtered

6.75
7.71
7.29

4.82
6.89
5.62

5.29
6.52
5.57

5.01
5.52
5.45

5.17
6.01
5.51

SouthEastAsia

Baseline (en)
CLIP filtered
Other filtered

8.69
9.74
8.89

6.23
8.47
7.28

6.00
7.40
7.47

5.77
7.74
7.16

6.01
7.32
7.11

WestAsia

Baseline (en)
CLIP filtered
Other filtered

8.14
9.24
8.32

5.61
8.16
6.34

5.64
7.59
6.17

5.17
6.75
6.47

5.08
6.52
6.35

Perceived Gender

Baseline (en)
CLIP filtered
Other filtered

8.41
8.43
13.08

6.42
7.63
10.64

5.78
8.08
11.13

5.98
7.56
11.02

5.64
6.35
9.55

Perceived Race

Baseline (en)
CLIP filtered
Other filtered

37.87
33.08
53.52

44.74
40.63
52.46

43.93
38.98
53.21

48.30
41.89
52.83

44.95
43.04
56.43

Average Western 0-shot Classification

Baseline (en)
CLIP filtered
Other filtered

28.33
25.52
28.20

22.08
21.40
24.81

20.91
19.76
22.36

19.93
19.22
22.18

19.01
19.30
21.28

Average Western 10-shot Classification

Baseline (en)
CLIP filtered
Other filtered

28.62
25.06
25.10

23.62
20.86
22.39

22.05
19.80
21.09

22.37
19.07
20.60

20.92
19.53
20.25

Average Western 0-shot Retrieval

Baseline (en)
CLIP filtered
Other filtered

39.50
36.12
38.25

33.29
31.55
33.83

32.53
30.99
32.62

32.65
29.92
31.78

30.95
29.49
31.46

Average Western Classification

Baseline (en)
CLIP filtered
Other filtered

28.54
25.19
25.94

23.20
21.01
23.05

21.74
19.79
21.43

21.70
19.11
21.03

20.40
19.47
20.53

Average Dollar Street Classification

Baseline (en)
CLIP filtered
Other filtered

69.77
71.29
70.19

64.86
68.58
68.55

63.50
68.30
67.89

63.09
66.71
68.08

60.96
67.12
67.42

Average GeoDE Classification

Baseline (en)
CLIP filtered
Other filtered

53.32
55.38
58.90

48.48
52.89
56.54

45.79
51.71
56.25

46.13
50.10
55.76

43.62
49.71
55.18

Average Income Classification

Baseline (en)
CLIP filtered
Other filtered

54.66
53.71
50.22

50.43
52.57
47.62

49.81
51.87
47.86

49.97
50.62
47.44

49.36
51.43
47.07

Average Geographic Classification

Baseline (en)
CLIP filtered
Other filtered

8.93
9.62
9.45

6.44
8.53
7.63

6.39
8.01
7.44

6.22
7.39
7.45

6.18
7.19
7.47

Average Demography Classification

Baseline (en)
CLIP filtered
Other filtered

23.14
20.76
33.30

25.58
24.13
31.55

24.86
23.53
32.17

27.14
24.72
31.93

25.30
24.70
32.99

Average Western-centric

Baseline (en)
CLIP filtered
Other filtered

31.47
28.10
29.22

25.89
23.82
25.92

24.62
22.78
24.42

24.62
21.99
23.90

23.21
22.14
23.44

Average Cultural Diversity

Baseline (en)
CLIP filtered
Other filtered

60.83
61.64
66.33

54.72
58.05
63.46

52.41
56.88
62.82

52.34
55.19
62.64

49.49
54.96
61.76

Average Fairness

Baseline (en)
CLIP filtered
Other filtered

26.54
26.17
27.02

24.30
25.81
24.95

23.94
25.22
25.04

24.29
24.69
24.86

23.76
24.85
24.92
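
As a reading aid for the aggregate rows of Table 10, the following sketch computes how the CLIP filter shifts the error rate relative to the unfiltered baseline at each number of seen examples. The values are copied from the “Average Western-centric” and “Average Cultural Diversity” rows; the names `WESTERN`, `CULTURAL` and `clip_minus_baseline` are illustrative and not part of the paper.

```python
SEEN_EXAMPLES = ["1B", "5B", "10B", "20B", "30B"]

# Error rates (%) from Table 10, one entry per number of seen examples.
WESTERN = {
    "Baseline (en)": [31.47, 25.89, 24.62, 24.62, 23.21],
    "CLIP filtered": [28.10, 23.82, 22.78, 21.99, 22.14],
}
CULTURAL = {
    "Baseline (en)": [60.83, 54.72, 52.41, 52.34, 49.49],
    "CLIP filtered": [61.64, 58.05, 56.88, 55.19, 54.96],
}


def clip_minus_baseline(rows):
    """Error-rate change caused by CLIP filtering (negative means lower error)."""
    return [
        round(clip - base, 2)
        for base, clip in zip(rows["Baseline (en)"], rows["CLIP filtered"])
    ]


western_delta = clip_minus_baseline(WESTERN)
cultural_delta = clip_minus_baseline(CULTURAL)

for scale, w, c in zip(SEEN_EXAMPLES, western_delta, cultural_delta):
    print(f"{scale:>4}: Western-centric {w:+.2f}, Cultural Diversity {c:+.2f}")
```

On these rows, every Western-centric difference is negative and every Cultural Diversity difference is positive, so the CLIP filter lowers error on the Western-centric aggregate while raising it on the cultural diversity aggregate.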


E. Evaluations of Language Rebalancing
[Figure 5: six panels (Average Multilingual: Low-Resource Lang, Average Multilingual: High-Resource Lang, Average Multilingual, Average Fairness, Average Western, Average Culture Diversity), each plotting err % against Data Scale (1B, 10B, 100B) before and after language rebalancing.]


Figure 5 | Rebalancing low-resource languages leads to significant improvements on the corresponding
benchmarks and slight improvements on aggregated multilingual/cultural diversity tasks. However,
other tasks may experience decreased performance due to fewer Western-centric examples.
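
The trade-off described in the caption can be checked against the aggregate rows of Table 11. The sketch below is illustrative only (the names `TABLE_11` and `change` are hypothetical): it subtracts the “Before” error rate from the “After” error rate at each data scale, so a negative value means rebalancing helped.

```python
DATA_SCALES = ["1B", "10B", "100B"]

# (before, after) error rates (%) copied from the aggregate rows of Table 11.
TABLE_11 = {
    "Average Multilingual: Low-Resource Lang": [(87.73, 78.82), (77.14, 72.04), (75.01, 70.10)],
    "Average Multilinguality": [(57.52, 56.64), (47.23, 46.43), (45.40, 44.61)],
    "Average Western-centric": [(29.19, 29.69), (27.60, 28.54), (26.87, 27.55)],
}


def change(pairs):
    """After minus before, per data scale (negative means lower error)."""
    return [round(after - before, 2) for before, after in pairs]


changes = {metric: change(pairs) for metric, pairs in TABLE_11.items()}

for metric, deltas in changes.items():
    row = ", ".join(f"{s}: {d:+.2f}" for s, d in zip(DATA_SCALES, deltas))
    print(f"{metric}: {row}")
```

The low-resource aggregate drops by several points at every scale, the overall multilingual aggregate drops by less than one point, and the Western-centric aggregate rises by less than one point.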
Table 11 | Detailed evaluation results of the rebalancing of low-resource languages on ViT-L/16 models
and datasets of 1/10/100 billion scales, with 100 billion examples seen in training. All metrics are
measured by error rate, with the exception of “Representation Bias”, which is measured by disparity.
1B Data

10B Data

100B Data

Metric

Before

After

Before

After

Before

After

ImageNet 0-shot Classification
Cifar100 0-shot Classification
Pet 0-shot Classification
ImageNet 10-shot Classification
Cifar100 10-shot Classification
Pet 10-shot Classification
Bird 10-shot Classification
Caltech 10-shot Classification
Cars 10-shot Classification
Colorectal Histology 10-shot Classification
DTD 10-shot Classification
COCO Image-Text 0-shot Retrieval
COCO Text-Image 0-shot Retrieval
Flickr Image-Text 0-shot Retrieval
Flickr Text-Image 0-shot Retrieval
Dollar Street 0-shot Classification
Dollar Street 10-shot Classification
GeoDE 0-shot Classification
GeoDE (country) 10-shot Classification
GeoDE (region) 10-shot Classification
GLDv2 0-shot Classification
Representation Bias
Income 0-200 Classification

31.23
25.02
14.36
35.11
27.50
12.32
44.05
6.41
11.14
24.00
28.46
49.70
68.16
20.40
39.94
50.23
63.56
6.01
61.94
54.21
50.39
38.18
66.30

31.39
24.96
13.00
34.94
27.82
13.71
42.75
8.09
11.34
25.50
29.31
52.92
67.50
24.30
37.88
51.16
65.04
6.03
59.79
53.99
51.82
35.21
67.32

29.70
23.75
12.46
34.95
26.70
12.48
45.25
7.40
11.33
23.53
27.07
47.18
64.32
15.50
32.32
48.10
64.09
5.90
62.31
53.59
46.37
36.35
64.35

30.47
24.04
12.05
34.99
26.50
15.59
45.29
8.97
11.54
24.43
27.39
50.28
63.60
20.30
32.64
49.42
65.51
5.97
60.52
53.30
47.73
32.61
65.83

28.49
23.36
9.46
33.71
25.49
11.80
44.29
7.53
11.47
22.57
27.93
45.28
62.51
16.60
32.52
49.03
58.29
4.88
57.85
48.29
45.72
35.51
66.30

28.80
23.51
11.23
33.89
25.05
13.46
42.89
8.35
11.21
28.00
29.04
45.90
62.16
16.40
33.30
49.23
59.42
5.42
53.34
48.05
44.29
32.74
65.37


Income 200-285 Classification
Income 285-685 Classification
Income >1998 Classification
Africa
Americas
EastAsia
Europe
SouthEastAsia
WestAsia
Perceived Gender
Perceived Race
Crossmodal-3600 Image-Text Retrieval: Arabic
Crossmodal-3600 Image-Text Retrieval: Bengali
Crossmodal-3600 Image-Text Retrieval: Czech
Crossmodal-3600 Image-Text Retrieval: Danish
Crossmodal-3600 Image-Text Retrieval: German
Crossmodal-3600 Image-Text Retrieval: Greek
Crossmodal-3600 Image-Text Retrieval: English
Crossmodal-3600 Image-Text Retrieval: Spanish
Crossmodal-3600 Image-Text Retrieval: Persian
Crossmodal-3600 Image-Text Retrieval: Finnish
Crossmodal-3600 Image-Text Retrieval: Filipino
Crossmodal-3600 Image-Text Retrieval: French
Crossmodal-3600 Image-Text Retrieval: Hindi
Crossmodal-3600 Image-Text Retrieval: Croatian
Crossmodal-3600 Image-Text Retrieval: Hungarian
Crossmodal-3600 Image-Text Retrieval: Indonesian
Crossmodal-3600 Image-Text Retrieval: Italian
Crossmodal-3600 Image-Text Retrieval: Hebrew
Crossmodal-3600 Image-Text Retrieval: Japanese
Crossmodal-3600 Image-Text Retrieval: Korean
Crossmodal-3600 Image-Text Retrieval: Maori
Crossmodal-3600 Image-Text Retrieval: Dutch
Crossmodal-3600 Image-Text Retrieval: Norwegian
Crossmodal-3600 Image-Text Retrieval: Polish
Crossmodal-3600 Image-Text Retrieval: Portuguese
Crossmodal-3600 Image-Text Retrieval: Quechua
Crossmodal-3600 Image-Text Retrieval: Romanian
Crossmodal-3600 Image-Text Retrieval: Russian
Crossmodal-3600 Image-Text Retrieval: Swedish
Crossmodal-3600 Image-Text Retrieval: Swahili
Crossmodal-3600 Image-Text Retrieval: Telugu
Crossmodal-3600 Image-Text Retrieval: Thai
Crossmodal-3600 Image-Text Retrieval: Turkish
Crossmodal-3600 Image-Text Retrieval: Ukrainian
Crossmodal-3600 Image-Text Retrieval: Vietnamese
Crossmodal-3600 Image-Text Retrieval: Chinese
Crossmodal-3600 Text-Image Retrieval: Arabic
Crossmodal-3600 Text-Image Retrieval: Bengali
Crossmodal-3600 Text-Image Retrieval: Czech
Crossmodal-3600 Text-Image Retrieval: Danish
Crossmodal-3600 Text-Image Retrieval: German
Crossmodal-3600 Text-Image Retrieval: Greek
Crossmodal-3600 Text-Image Retrieval: English
Crossmodal-3600 Text-Image Retrieval: Spanish
Crossmodal-3600 Text-Image Retrieval: Persian
Crossmodal-3600 Text-Image Retrieval: Finnish
Crossmodal-3600 Text-Image Retrieval: Filipino
Crossmodal-3600 Text-Image Retrieval: French
Crossmodal-3600 Text-Image Retrieval: Hindi

55.33
42.71
36.56
7.99
6.03
5.98
4.81
5.78
5.11
5.25
44.57
53.58
90.81
52.31
45.08
30.61
67.86
54.14
41.56
49.64
59.25
82.72
39.08
77.67
53.08
53.81
35.83
38.42
56.75
59.00
50.75
99.58
47.11
45.33
45.97
43.33
94.64
52.19
42.78
44.50
89.94
96.08
72.61
52.78
55.19
43.19
53.67
67.49
95.17
65.52
60.01
45.85
77.96
58.97
52.64
62.65
72.96
90.89
48.33
87.43

54.22
44.75
38.33
8.34
5.51
6.07
4.41
6.21
5.30
5.27
49.02
56.44
76.03
52.81
45.22
32.00
70.17
54.58
43.50
55.33
60.11
72.56
39.72
71.67
53.72
54.61
37.47
40.69
47.75
61.58
53.06
97.94
48.06
46.81
45.81
42.53
94.97
52.72
45.00
46.19
75.06
81.00
74.72
54.94
57.33
42.22
54.81
65.43
83.83
65.19
59.93
47.48
75.46
56.93
52.79
63.27
72.06
83.32
49.81
83.45

52.18
41.32
34.51
8.24
5.57
5.96
4.20
5.78
5.30
6.06
46.88
45.00
66.36
43.81
35.06
24.28
53.64
52.42
38.44
38.97
42.67
72.86
31.78
65.67
37.94
38.64
28.47
33.33
39.44
45.42
40.33
99.22
41.14
36.11
35.50
36.03
93.53
38.31
35.14
34.94
81.33
76.67
59.47
40.72
41.25
34.00
42.47
59.74
79.72
58.57
51.18
39.88
69.11
57.57
49.06
55.06
59.11
83.98
43.31
81.38

53.48
42.80
35.53
7.81
5.84
5.90
4.23
6.15
5.67
5.96
45.89
45.89
63.53
43.36
34.81
24.36
53.42
51.58
38.00
41.97
42.42
62.72
31.47
65.44
38.86
37.81
30.94
33.50
37.39
45.78
40.00
95.00
41.42
36.72
35.61
38.33
93.83
38.06
35.11
36.06
67.64
67.78
60.50
41.25
40.97
35.22
44.67
59.02
75.56
59.19
52.77
40.72
69.24
57.52
49.90
55.54
58.61
78.41
44.62
80.96

53.38
40.48
35.91
6.55
4.92
4.56
3.75
5.02
4.19
4.97
46.04
44.56
63.75
42.22
31.00
24.03
50.14
51.67
35.81
40.17
39.06
71.36
29.92
63.47
35.78
34.42
28.53
30.97
35.72
44.97
38.31
99.25
38.39
34.28
34.11
34.56
93.92
35.39
33.22
34.78
79.47
69.69
58.86
39.72
37.83
32.44
42.50
59.86
77.31
58.18
49.50
39.75
67.35
56.32
48.31
56.09
56.24
83.70
42.10
80.01

53.20
40.76
37.58
7.46
5.20
5.27
4.00
5.50
4.79
5.03
47.35
44.78
61.47
41.61
32.53
23.11
51.94
50.89
35.89
38.11
40.28
60.22
29.61
63.53
35.64
34.78
28.42
31.03
34.19
46.69
38.58
96.08
39.94
34.47
34.33
34.11
93.42
34.86
33.42
34.19
65.81
66.33
59.92
39.89
39.19
32.86
43.97
59.70
73.33
57.56
49.74
39.50
68.25
56.51
48.76
54.64
56.42
74.94
42.34
79.22


Crossmodal-3600 Text-Image Retrieval: Croatian
Crossmodal-3600 Text-Image Retrieval: Hungarian
Crossmodal-3600 Text-Image Retrieval: Indonesian
Crossmodal-3600 Text-Image Retrieval: Italian
Crossmodal-3600 Text-Image Retrieval: Hebrew
Crossmodal-3600 Text-Image Retrieval: Japanese
Crossmodal-3600 Text-Image Retrieval: Korean
Crossmodal-3600 Text-Image Retrieval: Maori
Crossmodal-3600 Text-Image Retrieval: Dutch
Crossmodal-3600 Text-Image Retrieval: Norwegian
Crossmodal-3600 Text-Image Retrieval: Polish
Crossmodal-3600 Text-Image Retrieval: Portuguese
Crossmodal-3600 Text-Image Retrieval: Quechua
Crossmodal-3600 Text-Image Retrieval: Romanian
Crossmodal-3600 Text-Image Retrieval: Russian
Crossmodal-3600 Text-Image Retrieval: Swedish
Crossmodal-3600 Text-Image Retrieval: Swahili
Crossmodal-3600 Text-Image Retrieval: Telugu
Crossmodal-3600 Text-Image Retrieval: Thai
Crossmodal-3600 Text-Image Retrieval: Turkish
Crossmodal-3600 Text-Image Retrieval: Ukrainian
Crossmodal-3600 Text-Image Retrieval: Vietnamese
Crossmodal-3600 Text-Image Retrieval: Chinese
Average Western 0-shot Classification
Average Western 10-shot Classification
Average Western 0-shot Retrieval
Average Western Classification
Average Dollar Street Classification
Average GeoDE Classification
Average Income Classification
Average Geographic Classification
Average Demography Classification
Average Multilingual: Low-Resource Lang
Average Multilingual: High-Resource Lang
Average Western-centric
Average Cultural Diversity
Average Fairness
Average Multilinguality

66.68
66.49
50.28
47.96
69.11
69.56
64.52
99.73
57.41
61.54
56.06
54.54
97.88
65.20
53.47
58.78
94.55
97.76
81.83
65.21
68.84
61.84
64.87
23.54
23.62
44.55
23.60
56.89
40.72
50.22
5.95
24.91
87.73
55.54
29.19
47.72
23.87
57.52

65.73
66.66
49.62
49.51
60.25
71.62
64.72
97.92
58.78
61.46
56.43
54.07
97.89
65.55
53.75
59.12
84.91
87.85
80.83
64.41
68.01
61.28
65.56
23.12
24.18
45.65
23.89
58.10
39.94
51.15
5.97
27.14
78.82
56.21
29.69
47.97
24.56
56.64

54.42
53.73
44.05
42.80
56.25
62.34
56.76
99.56
52.02
53.81
47.92
49.48
98.14
54.05
47.58
50.72
90.09
87.47
74.60
55.12
57.74
54.00
59.03
21.97
23.59
39.83
23.15
56.09
40.60
48.09
5.84
26.47
77.14
46.75
27.60
46.72
23.36
47.23

56.10
54.57
44.58
45.41
55.62
63.34
57.83
96.30
53.88
54.35
49.96
51.03
98.03
54.79
48.43
52.50
80.20
82.04
75.72
58.01
59.49
55.01
61.21
22.18
24.34
41.70
23.75
57.46
39.93
49.41
5.93
25.93
72.04
47.53
28.54
47.07
23.76
46.43

54.22
50.75
43.97
42.60
54.14
58.44
56.51
99.62
51.48
52.99
47.09
48.72
98.04
52.41
45.36
51.82
89.57
83.03
73.67
56.70
55.32
53.39
57.33
20.44
23.10
39.23
22.37
53.66
37.01
49.02
4.83
25.50
75.01
45.43
26.87
44.01
23.01
45.40

53.60
51.16
44.30
42.66
51.65
61.42
57.58
96.19
51.82
53.50
47.16
48.34
97.88
51.93
46.83
50.97
78.20
80.15
75.03
56.82
57.30
53.51
59.49
21.18
23.99
39.44
23.22
54.33
35.60
49.23
5.37
26.19
70.10
45.75
27.55
43.29
23.46
44.61


F. Distribution of Languages
We reuse the 35 languages¹⁰ reported in the Crossmodal-3600 benchmark [65] for multilingual experiments.
Table 12 | Distribution of the 35 languages used in multilingual evaluations.
Language            Type            Pages (%)
Maori               Low-resource    0.001
Telugu              Low-resource    0.036
Swahili             Low-resource    0.046
Filipino            Low-resource    0.111
Bengali             Low-resource    0.113
Hebrew              Low-resource    0.240
Hindi               Low-resource    0.267
Croatian            High-resource   0.284
Norwegian           High-resource   0.290
Finnish             High-resource   0.296
Danish              High-resource   0.370
Hungarian           High-resource   0.378
Ukrainian           High-resource   0.476
Romanian            High-resource   0.489
Greek               High-resource   0.560
Swedish             High-resource   0.660
Czech               High-resource   0.727
Persian             High-resource   0.881
Thai                High-resource   1.167
Dutch               High-resource   1.173
Arabic              High-resource   1.258
Vietnamese          High-resource   1.337
Turkish             High-resource   1.554
Polish              High-resource   1.825
Italian             High-resource   1.964
Korean              High-resource   2.519
Portuguese          High-resource   3.054
Indonesian          High-resource   3.181
French              High-resource   3.354
Chinese             High-resource   3.544
German              High-resource   3.869
Russian             High-resource   6.981
Spanish             High-resource   8.214
Japanese            High-resource   8.752
English             High-resource   35.353
Low-resource All    Low-resource    0.814
High-resource All   High-resource   94.510

¹⁰ “Quechua” is excluded as it is not supported by the language detection method we used.
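
The two aggregate rows of Table 12 appear to be plain sums of the per-language shares. The following sketch re-derives them under that assumption; the dictionary names are illustrative, and the values are copied from the table.

```python
# Share of pages (%) per language, copied from Table 12.
LOW_RESOURCE = {
    "Maori": 0.001, "Telugu": 0.036, "Swahili": 0.046, "Filipino": 0.111,
    "Bengali": 0.113, "Hebrew": 0.240, "Hindi": 0.267,
}
HIGH_RESOURCE = {
    "Croatian": 0.284, "Norwegian": 0.290, "Finnish": 0.296, "Danish": 0.370,
    "Hungarian": 0.378, "Ukrainian": 0.476, "Romanian": 0.489, "Greek": 0.560,
    "Swedish": 0.660, "Czech": 0.727, "Persian": 0.881, "Thai": 1.167,
    "Dutch": 1.173, "Arabic": 1.258, "Vietnamese": 1.337, "Turkish": 1.554,
    "Polish": 1.825, "Italian": 1.964, "Korean": 2.519, "Portuguese": 3.054,
    "Indonesian": 3.181, "French": 3.354, "Chinese": 3.544, "German": 3.869,
    "Russian": 6.981, "Spanish": 8.214, "Japanese": 8.752, "English": 35.353,
}

# Sum each group and round to the three decimals used in the table.
totals = {
    "Low-resource All": round(sum(LOW_RESOURCE.values()), 3),
    "High-resource All": round(sum(HIGH_RESOURCE.values()), 3),
}
print(len(LOW_RESOURCE) + len(HIGH_RESOURCE), totals)
```

The sums match the reported 0.814 and 94.510, and the two groups together cover the 35 languages used in the evaluation.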


[Figure 6: per-language shares from Table 12, with languages ordered from Maori to English along the horizontal axis, each marked “L” or “H”, and a vertical axis running from 0% to 40%.]

Figure 6 | Visualization of the language distribution, where “L” and “H” denote low-resource and
high-resource language respectively.
