February 2025

Scaling Pre-training to One Hundred Billion Data for Vision Language Models

Xiao Wang†, Ibrahim Alabdulmohsin†, Daniel Salz, Zhe Li, Keran Rong and Xiaohua Zhai

† Corresponding Authors: {wangxiao, ibomohsin}@google.com

arXiv:2502.07617v1 [cs.CV] 11 Feb 2025

We provide an empirical investigation of the potential of pre-training vision-language models on an unprecedented scale: 100 billion examples. We find that model performance tends to saturate at this scale on many common Western-centric classification and retrieval benchmarks, such as COCO Captions. Nevertheless, tasks of cultural diversity achieve more substantial gains from the 100-billion scale web data, thanks to its coverage of long-tail concepts. Furthermore, we analyze the model’s multilinguality and show gains in low-resource languages as well. In addition, we observe that reducing the size of the pretraining dataset via quality filters such as CLIP-based filtering, typically used to enhance performance, may inadvertently reduce the cultural diversity represented even in large-scale datasets. Our results highlight that while traditional benchmarks may not benefit significantly from scaling noisy, raw web data to 100 billion examples, this data scale is vital for building truly inclusive multimodal systems.

1. Introduction

The progress in vision-language models (VLMs) has been intrinsically linked to the availability of large-scale datasets. Larger datasets fuel the development of more powerful models, which are capable of understanding and generating complex relationships between images and text. In turn, such models have pushed boundaries in tasks like zero-shot image classification, image captioning and visual question answering. This relationship between data scale and model performance often follows a power law $f(x) = \alpha x^{-c} + \varepsilon$, where $f(x)$ is a model performance metric such as its error rate and $x$ is the data size [2, 8, 29, 33, 37, 38, 49, 58, 76].
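To make the power law above concrete, the following is a minimal sketch in Python. The parameter values (`ALPHA`, `C`, `EPS`) are hypothetical and chosen only for illustration; they are not taken from the paper. The helper `fit_power_law` is likewise an assumption: it is a plain log-log least-squares fit with a known irreducible error, not the fitting recipe of Alabdulmohsin et al. [2] that is used later in the paper.

```python
import math


def power_law(x, alpha, c, eps):
    """Error predicted by f(x) = alpha * x**(-c) + eps for data size x."""
    return alpha * x ** (-c) + eps


def fit_power_law(xs, ys, eps):
    """Fit (alpha, c) by least squares in log-log space.

    Assumes the irreducible error eps is known, so that
    log(y - eps) = log(alpha) - c * log(x) is linear in log(x).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y - eps) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    var = sum((a - mean_x) ** 2 for a in log_x)
    slope = cov / var
    alpha = math.exp(mean_y - slope * mean_x)
    return alpha, -slope


# Hypothetical parameters, for illustration only (not values from the paper).
ALPHA, C, EPS = 200.0, 0.3, 0.2

sizes = [1e9, 1e10, 1e11]  # 1B, 10B and 100B examples
errors = [power_law(x, ALPHA, C, EPS) for x in sizes]

# Each tenfold increase in data shrinks the reducible error by a factor 10**(-C),
# so the second jump (10B -> 100B) buys less than the first (1B -> 10B).
gain_1b_to_10b = errors[0] - errors[1]
gain_10b_to_100b = errors[1] - errors[2]
```

Under these assumed parameters, the gain from 10B to 100B is smaller than the gain from 1B to 10B by exactly the factor `10 ** -C`, which is the sense in which a power law gives diminishing returns.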
These “scaling laws,” as they came to be known in the literature, have been used, among other purposes, to determine the training data size needed to achieve a specified level of accuracy [9, 18, 26] and to optimize the model size [4, 34, 38]. They have also been justified theoretically using space-partitioning arguments [7, 35, 61]. Importantly, a power law implies that increasing the amount of training data can yield diminishing, but still worthwhile, returns in terms of accuracy and capability.

© 2025 Google DeepMind. All rights reserved

Driven by these potential benefits, the field has witnessed a concerted effort towards scaling up the size of vision-language datasets. Early works focused on web-curated datasets like Conceptual Captions [60], which provided millions of image-caption pairs for pre-training. Subsequent work leveraged large-scale web crawling to create even larger datasets. In particular, the Common Crawl project [20], a repository of publicly available web data, became a foundational resource for constructing many of these web-scale datasets. From this foundation emerged datasets like LAION-400M/2B/5B [59], DataComp [27], WebLI [15] and Multimodal C4 [80], pushing the boundaries of dataset size to billions of image-text pairs, thereby accelerating progress in VLMs. This is similar to how ImageNet [22], JFT-300M [64] (a dataset of 300 million images with noisy labels) and its larger variant JFT-3B [76] previously accelerated progress in supervised image pre-training.

Despite these advancements, the largest reported datasets to date have plateaued at around 10 billion image-text pairs. This raises the question: what further benefits are unlocked by pushing the data scale by one order of magnitude to 100 billion unique examples? To answer this question, we introduce WebLI-100B, a novel dataset containing 100 billion image-text pairs, representing a tenfold increase over the largest reported vision-language datasets.
To recall, the original WebLI dataset contains 10 billion examples and has been instrumental in training state-of-the-art models like PaliGemma [10, 63] and SigLIP [78], and influenced the development of other research directions, such as mitigating social biases [3], improving cultural diversity [53], and scaling open-vocabulary object detection [48].

[Figure 1: left panel compares Culture, Western, Fairness and Multilinguality metrics; right panels plot ImageNet 0-shot accuracy, COCO img2txt Recall@1, Dollar Street geo-localization accuracy and Telugu img2txt Recall@1 against data size (1B, 10B, 100B) for models B, L and H.]
Figure 1 | Left: Scaling the data from 10 billion to 100 billion examples enhances cultural diversity and multilingual capabilities more prominently than other metrics. The numbers represent the improved accuracy (in absolute terms) when data scale is increased, averaged across all tasks. See details in Section 4. Right: Illustrative examples of the impact of data scale. The leftmost two are Western-centric metrics, which do not benefit much by scaling the data to 100 billion, while the rightmost two are illustrative of cultural diversity and multilinguality. The language Telugu, for example, makes up < 0.04% of the web and benefits a lot from the 100 billion data scale.

In this work, our primary goal is to provide an empirical investigation of the impact of this data scale on a range of downstream tasks and, importantly, to explore aspects beyond traditional performance metrics. For instance, while our experiments demonstrate that the 100 billion scale can lead to tiny improvements on established benchmarks, we reveal its significant impact on less-explored areas, particularly those related to cultural diversity and multilinguality.
For example, when applied to geo-localization tasks based on Dollar Street [57], a metric for evaluating cultural diversity, ViT-L/16 trained on a single epoch of 100 billion data achieves an accuracy of 41.7%. By contrast, the same model trained on ten epochs of 10 billion data achieves an accuracy of only 35.9%, despite both models using the same amount of training compute. We attribute these gains, in part, to the dataset’s ability to capture a wider range of long-tail cultural concepts that require a substantial data size to become salient. Furthermore, data scaling also enhances the multilinguality of trained models, leading to an improvement in low-resource languages. Figure 1 summarizes the improvements in cultural diversity and multilinguality achieved through data scaling.

Statement of Contribution. Our goal in this paper is to answer the following question: should one invest in scaling up the size of the pretraining dataset to 100 billion examples? We make the following contributions:

• We provide an empirical investigation of the potential of pre-training VLMs on a scale of 100 billion unique examples. To the best of our knowledge, the impact of this data scale on VLMs has not been studied before in the literature.
• We demonstrate that a scale of 100 billion image-text pairs is beneficial for VLMs in areas beyond traditional benchmarks, such as cultural diversity, multilinguality, and reducing performance disparity across subgroups. Hence, this data scale is vital for building truly inclusive multimodal systems.
• We investigate the impact of applying quality filters that reduce the size of the dataset, such as those based on CLIP. While such filters are often employed to improve overall data quality, we find that they can inadvertently reduce the representation of certain cultural contexts, thereby limiting the diversity of the dataset, even when the original dataset contains 100 billion examples.

2. Related Work

Data Scaling. The study of scaling laws in large language models (LLMs) has become a critical area of research in NLP. Hestness et al. [33] and Kaplan et al. [38] were among the first to systematically explore the relationship among model size, dataset size, and compute, demonstrating predictable power-law scaling of performance. Henighan et al. [32] further emphasized the crucial role of data, showing that substantial performance gains can be achieved by increasing the size and quality of the training dataset, even with fixed model size. DeepMind’s Chinchilla [34] provided compelling evidence for this data-centric approach, demonstrating that smaller models trained on much larger datasets can achieve comparable or superior performance to larger models, given the same computational budget. This work has shifted the focus of LLM development towards optimizing the scale of data.

In computer vision, early works, such as ImageNet [22], demonstrated the profound impact of dataset size and diversity on model generalization. Subsequent efforts like JFT-300M [64] emphasized the importance of large-scale and high-quality datasets for training state-of-the-art vision models. Zhai et al. [76] further explored scaling behavior in Vision Transformers [24] using the JFT-3B dataset, showing that scaling both data and model size simultaneously leads to improved generalization.

The pivotal role of data scaling is equally applicable to vision-language modeling, as highlighted in Cherti et al. [17]. This has led to a substantial increase in the development of image-text datasets over the last ten years.
Early datasets, such as COCO Captions [14] and Flickr30k [73], were created to enable tasks like image captioning and visual question answering with high-quality annotations. However, their limited size, due to the cost of human annotation, hindered further scaling. To address this, Conceptual Captions [60] started to filter image-text pairs from the web based on heuristic rules, leading to millions of image-caption pairs. Continuing in this direction, larger image-text datasets have been created from web sources, using increasingly complex filtering techniques [23, 25, 27]. These datasets, ranging from hundreds of millions to several billion image-text pairs, have enabled the training of powerful vision-language models like CLIP [54] and ALIGN [36], which have demonstrated impressive performance on a wide range of vision-language tasks. Notably, LAION-5B [59] and WebLI [15] stand out as the largest publicly and privately available image-text datasets, with 5 billion and 10 billion multilingual image-text pairs respectively. However, the rapidly growing web contains vastly more data. The impact of scaling to much larger datasets, such as 100 billion samples, remains largely unknown.

Vision-Language Pre-training. The field of large vision-language models is advancing quickly, building upon remarkable progress in both computer vision and natural language processing. A prevalent and highly effective strategy is to learn visual representations and language modeling independently, followed by joint pre-training of the vision-language model using high-quality multimodal data. Since the advent of CLIP [54], contrastive learning on large, noisy web datasets has become the dominant approach for acquiring powerful visual representations [13]. This weakly supervised paradigm surpasses traditional supervised learning methods [41, 62], primarily due to the large scale and high diversity of web data [36, 52, 74, 75].
An alternative approach gaining traction involves learning visual features from web data using generative methods [66, 68], which predict paired text for given images. While vision models trained in this manner exhibit superior transferability to generative language models, their high computational cost limits widespread adoption. Beyond their zero-shot capabilities, which can be directly applied to tasks such as zero-shot classification [22] and image-text retrieval [14, 73], contrastively trained models learn strong visual representations and are therefore often used as image encoders. This is often leveraged in vision-language tasks by integrating visual tokens with language tokens, enabling LLMs to process multimodal information [5, 10, 15, 16, 45, 46]. Following this approach, PaLI-3 [16] has demonstrated that vision models trained on large-scale web data outperform those trained on weakly annotated images of a similar scale, which further underscores the importance of the data diversity inherently present in the web corpus.

Inclusive Models. Recent studies have highlighted that popular techniques employed to enhance the performance of vision-language models, such as English-language-based filtering, may inadvertently diminish cultural understanding [6, 30, 50, 53, 56]. Hence, we also evaluate cultural diversity in this work, following Pouget et al. [53]. These evaluations fall into two categories. The first category, geo-localization, involves predicting the country or region of origin for an image using few-shot classification. The second category utilizes zero-shot classification on datasets curated from various geographical regions. Prominent examples within this category include Dollar Street [57], GeoDE [55], and Google Landmarks Dataset v2 (GLDv2) [69]. Dollar Street comprises 38K images depicting household items from 63 countries.
GeoDE features 62K manually annotated images collected from diverse geographic locations. Finally, GLDv2 contains 1,542 images representing 884 landmarks across 84 countries, enabling the assessment of model performance on recognizing culturally important locations. In our evaluations, we employ all three aforementioned datasets. For the zero-shot evaluation on Dollar Street, we adhere to the methodology used in Rojas et al. [57], mapping 96 specific topics within the dataset to corresponding ImageNet classes. This mapping results in a curated subset of 21K images, which we utilize for our analysis. These geographically diverse benchmarks, employed collectively, provide a comprehensive framework for evaluating the impact of performance optimization techniques on cultural understanding within vision-language models.

3. Experimental Setup

3.1. Pre-training Datasets

We describe the dataset splits we use in the pretraining.

Raw Datasets. To assess the performance of vision-language models on large-scale image-text data, we construct a dataset with 100 billion image-text pairs from the web, inspired by the work of Chen et al. [15], Jia et al. [36], Schuhmann et al. [59], Zhai et al. [77]. We refer to this as WebLI-100B, and refer to its subsets with 1 billion and 10 billion examples as 1B and 10B, respectively. The 1B and 10B datasets are created by randomly sampling 1% and 10%, respectively, from the 100 billion dataset. In this work, we apply only essential data filters, such as removing harmful images and personally identifiable information (PII). This approach ensures the dataset remains as multilingual and diverse as possible. We utilize both the alt-text and page title associated with each image as the paired text. To ensure fair evaluations, we remove near-duplicate images across more than 90 common vision-language tasks from our dataset.

Quality-filtered Datasets.
To examine the impact of scaling on quality-filtered data, we adopt the common approach of using the CLIP-L/14 model [54] as a filter, retaining a high-quality dataset with 5 billion pairs of images and English alt-text. To further solidify our results, we train a VLM on the web data to classify image-text pairs as aligned or misaligned, and tune its threshold to retain another filtered dataset of the same size. Unless otherwise noted, we use the language of web pages¹ for multilingual experiments, thereby avoiding potential inaccuracies from language detection on the noisy web text.

¹ The “content-language” meta tag in the head of an HTML document.

Table 1 | The attention map visualization of the ViT-L/16 models trained on different scales of data. Images are selected to represent cultures in Western-centric countries and countries where low-resource languages are spoken.
[Image table: columns Concept, Image, 1B Data, 10B Data, 100B Data; rows Igorot Dance (Igorot), Igloo (Inuit), Bison (Yellowstone).]

Language-rebalanced Datasets. In the language rebalancing experiments in Section 5.2, we adjust the mixing ratio of the low-resource languages used in the Crossmodal-3600 [65] benchmark. These low-resource languages are Bengali (bn), Filipino (fil), Hindi (hi), Hebrew (iw), Maori (mi), Swahili (sw), and Telugu (te)², which range from 0.001% to 0.267% of our dataset (Appendix F). In model training, we upsample each of them to 1%, with the remaining 93% comprising the original data.

² Cusco Quechua (quz) is excluded from our experiments because it is not supported by our language detection method.

3.2. Contrastive Vision-Language Pretraining

To study the impact of data scale on model performance, we train SigLIP [78] models using three different dataset sizes: 1 billion, 10 billion and 100 billion. We also vary the model size using ViT-B/16, ViT-L/16, and ViT-H/14 architectures for both image and text encoders. During contrastive training, inspired by Zhai et al. [76], we utilize a large batch size of 32K and an inverse square root learning rate schedule with 200 million warmup and cooldown examples. The learning rate and weight decay are set to 0.001 and 0.0001 respectively. In the preprocessing stage, images are resized to a resolution of 224x224 pixels, and texts are tokenized using the multilingual mT5 [72] tokenizer with a maximum sequence length of 64 tokens.

All models are trained on a maximum of 100 billion examples; e.g., a maximum of 100 epochs when using 1B examples. We cool down the models at various training steps where they have seen 3, 7, 10, 17, 26, 33, 49, 66, and 100 billion examples, and evaluate them after the cool-downs. Unless otherwise specified, we report results using the checkpoints where models have been trained on 100 billion examples. All models are compared in a compute-matched regime.

3.3. Evaluations

The model’s capabilities are evaluated across a diverse range of benchmarks, spanning from traditional Western-centric tasks to those measuring inclusivity.

Western-centric. Our first set of evaluations uses diverse, well-established benchmarks.

Table 2 | Evaluations and scaling laws on Western-centric benchmarks, where scaling from 10B to 100B examples shows limited benefits.
Model  Metric (err %)  Value @ 100B ex       Scaling-law exponent     Scaling-law limit
                       1B     10B    100B    1B      10B     100B     1B     10B    100B

Zero-shot classification
B      ImageNet        41.2   39.4   39.0    -0.58   -0.97   -0.65    40.1   38.5   37.9
B      CIFAR100        36.6   35.9   36.8    -0.26   -0.23   -0.24    33.8   32.5   33.7
B      Pet             25.4   23.7   22.3    -0.43   -0.45   -0.37    22.3   21.7   18.4
L      ImageNet        31.2   29.7   28.5    -0.92   -0.91   -0.82    30.7   29.0   27.1
L      CIFAR100        25.0   23.8   23.4    -0.26   -0.32   -0.43    22.7   20.7   21.1
L      Pet             14.4   12.5    9.5    -0.61   -0.57   -0.51    12.3    9.6    7.0
H      ImageNet        29.6   25.6   24.9    -0.36   -0.64   -0.52    26.7   24.5   23.3
H      CIFAR100        23.5   19.8   21.4    -0.25   -0.36   -0.29    20.6   18.0   17.6
H      Pet             10.3    7.5    7.2    -0.45   -0.42   -0.50     8.1    5.3    4.6

Retrieval @1
B      COCO I2T@1      56.5   51.6   53.4    -0.24   -0.49   -0.30    52.4   49.9   50.7
B      COCO T2I@1      70.9   68.8   70.0    -0.34   -0.39   -0.69    69.6   67.1   69.5
B      Flickr I2T@1    24.2   21.2   21.1    -0.24   -0.34   -0.23    21.5   18.1   17.0
B      Flickr T2I@1    43.1   40.3   40.4    -0.32   -0.42   -0.30    40.9   37.5   36.7
L      COCO I2T@1      49.7   47.2   45.3    -0.24   -0.41   -0.30    45.8   44.7   42.9
L      COCO T2I@1      68.2   64.3   62.5    -0.19   -0.42   -0.41    64.2   62.6   60.5
L      Flickr I2T@1    20.4   15.5   16.6    -0.21   -0.45   -0.21    16.5   14.1   13.4
L      Flickr T2I@1    39.9   32.3   32.5    -0.10   -0.42   -0.42    34.6   30.7   30.7
H      COCO I2T@1      48.6   42.0   42.5    -0.21   -0.62   -0.47    44.6   40.3   40.6
H      COCO T2I@1      64.9   60.3   59.3    -0.30   -0.55   -0.43    62.8   58.9   57.3
H      Flickr I2T@1    16.8   13.5   13.9    -0.23   -0.40   -0.23    12.2   11.4   11.3
H      Flickr T2I@1    34.3   28.5   28.0    -0.23   -0.56   -0.46    29.6   26.8   25.9

10-shot
B      ImageNet        46.6   45.6   44.7    -0.82   -0.61   -0.49    46.2   44.4   43.3
B      Birds           53.8   53.5   53.9    -0.34   -0.40   -0.51    51.5   51.6   52.8
B      Caltech          8.4    8.3    8.2    -0.30   -0.24   -0.23     7.1    7.2    6.8
B      Cars            18.3   16.8   17.6    -0.63   -0.68   -0.60    17.1   15.5   16.3
B      CIFAR100        38.7   38.6   39.0    -0.19   -0.22   -0.20    35.2   34.9   35.9
B      Colorectal      26.5   29.2   27.0    -0.02   -0.06   -0.16    20.2   22.6   24.4
B      Pet             22.9   23.2   22.1    -1.77   -0.62   -0.77    21.6   21.3   20.6
B      DTD             29.7   30.9   30.9    -0.28   -0.24   -0.19    27.9   28.3   27.2
L      ImageNet        35.1   35.0   33.7    -0.67   -0.68   -0.63    34.1   34.0   32.5
L      Birds           44.0   45.3   44.3    -0.51   -0.43   -0.51    42.1   43.2   42.7
L      Caltech          6.4    7.4    7.5    -0.43   -0.17   -0.18     5.9    4.8    4.8
L      Cars            11.1   11.3   11.5    -0.54   -0.49   -0.41    10.1    9.7    9.9
L      CIFAR100        27.5   26.7   25.5    -0.24   -0.29   -0.41    24.0   23.7   22.9
L      Colorectal      24.0   23.5   22.6    -0.18   -0.20   -0.27    18.8   20.2   20.5
L      Pet             12.3   12.5   11.8    -0.70   -0.65   -0.53    11.3   11.4   10.3
L      DTD             28.5   27.1   27.9    -0.22   -0.25   -0.23    25.2   25.1   25.5
H      ImageNet        32.4   29.8   29.3    -0.41   -0.73   -0.79    30.3   29.0   28.3
H      Birds           41.6   39.1   36.3    -0.67   -0.52   -0.47    40.6   37.4   33.9
H      Caltech          5.7    6.0    8.9    -0.21   -0.08   -0.11     4.3    3.7    4.6
H      Cars            11.3   10.3    9.6    -0.27   -0.88   -0.44     9.1   10.1    8.3
H      CIFAR100        25.8   23.8   24.2    -0.22   -0.25   -0.24    21.4   21.1   19.7
H      Colorectal      25.2   26.2   25.9    -0.22   -0.20   -0.15    19.7   17.9   20.7
H      Pet             10.8    9.1    8.7    -0.92   -0.48   -0.46    10.3    7.6    6.5
H      DTD             29.2   26.1   26.8    -0.16   -0.23   -0.23    25.0   23.8   24.8

For zero-shot classification, we employ ImageNet [22], CIFAR-100 [43], and Oxford-IIIT Pet [51] datasets. Additionally, for 10-shot evaluations, we use Caltech-UCSD Birds [67], Caltech 101 [44], Cars196 [42], Colorectal Histology [40], and Describable Textures Dataset (DTD) [19] benchmarks to assess the representation capabilities of vision models. We also conduct zero-shot retrieval evaluations on COCO Captions [14] and Flickr30k [73], in both image-to-text and text-to-image directions.

Cultural Diversity. Besides the above metrics, we also incorporate a range of benchmarks aimed at evaluating cultural diversity, following the recommendations in [53]. Specifically, we include zero-shot classification using Dollar Street [57], GeoDE [55], and Google Landmarks Dataset v2 (GLDv2) [69]. See Section 2 for a brief description of each dataset. We also include 10-shot geolocalization using Dollar Street and GeoDE.

Table 3 | Evaluations and scaling laws on culture diversity benchmarks, where scaling from 10B to 100B examples shows larger benefits.

Model  Metric (err %)  Value @ 100B ex       Scaling-law exponent     Scaling-law limit
                       1B     10B    100B    1B      10B     100B     1B     10B    100B

10-shot Geolocalization
B      Dollar Street   77.7   75.8   72.1    -0.38   -0.36   -0.37    76.3   73.7   70.2
B      GeoDE-Country   72.8   71.5   71.4    -0.35   -0.31   -0.37    70.8   69.6   68.9
B      GeoDE-Region    61.1   60.8   59.2    -0.26   -0.22   -0.29    58.8   57.0   57.3
L      Dollar Street   63.6   64.1   58.3    -1.09   -0.38   -0.94    63.2   60.1   57.5
L      GeoDE-Country   61.9   62.3   57.8    -0.40   -0.30   -1.11    58.8   58.0   56.6
L      GeoDE-Region    54.2   53.6   48.3    -0.15   -0.16   -0.39    49.9   46.9   46.3
H      Dollar Street   64.6   59.1   53.7    -0.30   -0.56   -0.64    61.0   56.4   52.5
H      GeoDE-Country   56.9   50.2   47.6    -0.23   -0.78   -0.62    52.2   49.4   46.1
H      GeoDE-Region    54.6   47.6   44.7     0.00   -0.38   -0.31    50.1   45.3   41.0

Zero-shot classification
B      Dollar Street   52.0   51.9   51.6    -0.38   -0.25   -0.28    50.4   49.7   49.7
B      GeoDE            7.8    8.3    8.7    -0.24   -0.26   -0.25     6.1    6.7    5.4
B      GLDv2           65.0   61.0   59.4    -0.46   -0.72   -0.51    61.6   59.3   56.8
L      Dollar Street   50.2   48.1   49.0    -0.22   -0.35   -0.17    46.9   46.2   46.2
L      GeoDE            6.0    5.9    4.9    -0.29   -0.17   -0.25     4.7    4.3    3.3
L      GLDv2           50.4   46.4   45.7    -0.53   -0.93   -0.89    48.5   44.8   44.1
H      Dollar Street   50.0   48.6   47.4    -0.15   -0.13   -0.20    43.9   44.2   44.1
H      GeoDE            6.0    4.9    4.8    -0.19   -0.22   -0.24     3.3    3.3    3.5
H      GLDv2           48.1   40.1   38.8    -0.52   -1.34   -0.80    46.0   39.0   36.8

Multilinguality. We evaluate the model’s multilinguality using the Crossmodal-3600 dataset [65], a geographically diverse set of 3600 images with human-generated captions in 36 languages. We assess the model’s zero-shot retrieval in both image-to-text and text-to-image directions for each language.
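As an illustration of the retrieval metric, the sketch below computes Recall@1 in both directions from an image-text similarity matrix. It is a simplification under stated assumptions: each image is assumed to have exactly one matching caption (the one with the same index), and the similarity scores are made-up toy values rather than model outputs. The function names `recall_at_1` and `transpose` are hypothetical helpers, not part of any evaluation library.

```python
def recall_at_1(similarity):
    """Fraction of queries whose top-scoring candidate is the correct one.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i has index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            hits += 1
    return hits / len(similarity)


def transpose(matrix):
    """Swap the roles of queries and candidates."""
    return [list(col) for col in zip(*matrix)]


# Toy scores: rows are images, columns are captions (illustrative values only).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.4],
    [0.7, 0.1, 0.6],  # image 2 scores caption 0 highest, so it is a miss
]

img2txt = recall_at_1(sim)             # images as queries
txt2img = recall_at_1(transpose(sim))  # captions as queries

# The tables report error rates, i.e. 100 * (1 - recall).
img2txt_err = 100.0 * (1.0 - img2txt)
```

The two directions can differ on the same matrix: here the third image retrieves the wrong caption, while every caption still retrieves its own image.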
In addition to per-language results, we also present average scores for low-resource languages (Bengali, Filipino, Hindi, Hebrew, Maori, Swahili, and Telugu) and high-resource languages (others).

Fairness. We also evaluate the presence of societal biases in the trained model. We report on representation bias (RB) and association bias (AB) between gender and occupation, as defined in Alabdulmohsin et al. [3]. These measure unwanted associations w.r.t. the gender attribute using 1st and 2nd order statistics. Also, we report performance disparity by income in Dollar Street zero-shot accuracy and by region in GeoDE zero-shot accuracy.

Transfer to Generative Models. Finally, to assess how well our contrastively trained vision models transfer to generative vision-language tasks, we utilize the compact and versatile PaliGemma model [10]. We initialize PaliGemma’s vision component with our contrastively trained models and pretrain it on 50 million seen examples, following its stage-1 recipe at 224x224 resolution. During the pre-training, we explore two common transfer settings: freezing [15, 46, 79] and unfreezing [10, 16, 63, 71] the vision model. We then use PaliGemma’s default configuration to finetune on a variety of downstream tasks, covering image captioning, visual question answering, and segmentation, which require the understanding of semantics, OCR, multilinguality, and remote sensing.

4. Results

4.1. Established Benchmarks

We begin by evaluating all vision-language models on established benchmarks, based on ImageNet and COCO Captions, among other datasets. As revealed in Table 2, increasing the dataset size from 10 billion to 100 billion examples does not improve performance substantially. This is statistically supported by Wilcoxon’s signed rank test [70], which gives a p-value of 0.9, indicating that differences are not significant.
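For readers unfamiliar with the test, the following is a minimal sketch of an exact two-sided Wilcoxon signed-rank test on paired error rates, written in plain Python. The paired values are hypothetical and only illustrate the mechanics; they are not the numbers behind the reported p-values, and the text does not specify exactly which metric pairs entered the test. The function name `wilcoxon_signed_rank` is an assumption, and the exact enumeration is practical only for a small number of pairs.

```python
from itertools import product


def wilcoxon_signed_rank(xs, ys):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Zero differences are dropped and tied absolute differences share
    their average rank. The p-value enumerates all 2**n sign patterns,
    so this sketch is meant for small n only.
    """
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    stat = min(w_plus, total - w_plus)
    # Count sign patterns at least as extreme as the observed statistic.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= stat:
            extreme += 1
    return stat, extreme / 2 ** n


# Hypothetical paired error rates (err %) on five benchmarks, illustration only.
err_10b = [30.0, 25.0, 41.0, 18.0, 52.0]
err_100b = [29.0, 27.0, 38.0, 14.0, 47.0]

stat, p_value = wilcoxon_signed_rank(err_10b, err_100b)
```

A large p-value means the signed ranks of the paired differences are compatible with no systematic shift between the two data scales; a small one indicates a consistent direction of change.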
In addition, we also fit data scaling laws for every combination of model and dataset following the recipe proposed in Alabdulmohsin et al. [2]. This allows us to evaluate whether the performance gap is expected to increase or decrease in the infinite-compute regime. We report the resulting scaling exponents and asymptotic performance limits in the tables. Again, we do not observe significant differences at the 95% confidence level (p-value of 0.09).

4.2. Cultural Diversity

Unlike the Western-oriented metrics reported in Section 4.1, cultural diversity metrics present an entirely different picture. We observe notable gains when scaling the size of the dataset from 10 billion to 100 billion examples in Table 3. For example, scaling training data from 10 billion to 100 billion examples yields substantial gains on the Dollar Street 10-shot classification task, where ViT-L and ViT-H see absolute improvements of 5.8% and 5.4%, respectively. These gains outperform the typical improvements (less than 1%) observed on Western-oriented 10-shot metrics by a large margin. Using Wilcoxon’s signed rank test, we obtain a p-value of 0.002, indicating statistically significant evidence at the 99% confidence level.

4.3. Multilinguality

Our multilingual benchmark, Crossmodal-3600 zero-shot retrieval [65], shows a disparity in performance gains: low-resource languages benefit more from the 100 billion scale than the high-resource ones. The disparity, illustrated in Figure 3, not only exists in all model sizes but also widens as the models become larger. Detailed results for each language can be found in Appendix B.

4.4. Fairness

For fairness, we report on the three metrics discussed in Section 3.3.

Representation Bias. The first metric is representation bias (RB), with results detailed in Table 4.
We observe that models trained on unbalanced web data have a significantly higher preference to associate a randomly chosen image from ImageNet [22] with the label “Male” over the label “Female.” In fact, this occurs nearly 85% of the time. Training on 100B examples does not mitigate this effect. This finding aligns with previous research highlighting the necessity of bias mitigation strategies, such as data balancing [3], to address inherent biases in web-scale datasets. Model 1B 10B 100B B L H 83.2 88.2 86.8 84.5 86.4 85.0 85.2 85.5 86.6 Table 4 | Representation bias w.r.t. gender (see Section 4). Here, values [%] indicate how often the model prefers to associate a random image with the label “Male” over “Female”. Association Bias. Second, Figure 2 shows the association bias in SigLIP-H/14 between gender and occupation as we scale the data from 10 to 100 billion examples. Specifically, we plot the probability that the model would prefer a particular occupation label, such as “secretary” over 8 Scaling Pre-training to One Hundred Billion Data for Vision Language Models Occupation 0.66 0.4 0.2 secretary receptionist nurse 0.66 0.66 0.94 0.97 0.13 0.11 0.25 0.56 0.8 0.4 secretary receptionist 0.2 Occupation 0.65 0.069 0.16 0.25 0.63 receptionist secretary 0.2 nurse 0.94 librarian 0.25 0.4 0.88 housekeeper 0.088 Occupation 0.28 Female 0.8 Gender 0.12 secretary 0.66 0.63 Male 1 receptionist Female Occupation 0.88 0.8 Model = H, Data = 100B nurse 0.2 0.73 0.68 Occupation nurse 0.2 0.6 librarian 0.29 0.4 0.61 0.076 0.038 librarian 0.67 Occupation 0.94 0.98 0.6 housekeeper 0.18 Female 0.086 0.4 0.95 Gender 0.8 Male 0.33 housekeeper 0.47 0.6 Gender 0.058 0.8 Male 0.75 secretary librarian 0.92 receptionist 0.45 0.71 0.97 Model = L, Data = 100B nurse 0.42 housekeeper Female Gender Male 0.86 0.8 secretary 0.85 Model = B, Data = 100B 0.89 0.63 0.38 Model = H, Data = 10B 0.6 housekeeper Occupation 0.86 receptionist 0.2 0.93 nurse 0.4 Female 0.6 librarian 0.2 0.8 Gender 0.66 
Male 0.14 receptionist 0.057 0.65 0.93 0.21 0.6 Model = L, Data = 10B secretary 0.48 nurse housekeeper 0.15 librarian Female Gender Male 0.66 0.56 0.2 0.96 librarian 0.79 0.4 Female 0.66 0.6 Gender 0.62 0.0026 0.025 Model = B, Data = 10B 0.89 0.97 housekeeper Occupation 0.95 0.8 Male secretary 0.2 0.22 secretary 0.58 0.016 receptionist 0.027 0.4 0.85 housekeeper 0.11 receptionist 0.41 librarian 0.86 nurse 0.6 nurse 0.96 Model = H, Data = 1B librarian 0.45 Female 0.62 Gender 0.78 Model = L, Data = 1B 0.8 Male 0.97 housekeeper Male Gender Female Model = B, Data = 1B 0.46 0.65 0.88 0.96 0.8 0.6 0.4 0.2 Occupation Figure 2 | Association bias between gender and occupation, evaluated in scaled models and data. err % Average XM3600 Retrieval 90 1.11 80 2.14 2.76 70 60 Lang Resource Low High -0.11 50 40 ViT Size B/16 L/16 H/16 1.32 1.29 1B 10B Data Scale 100B Figure 3 | Scaling up to 100B examples leads to more notable improvements in low-resource languages. Δ denotes the improved accuracy when scaling from 10B examples to 100B. another label, such as “manager” when images correspond to males or females. In this evaluation, we use the Fairface [39] dataset. The labels we compare are: “librarian” vs. “scientist”, “nurse” vs. “doctor”, “housekeeper” vs. “homeowner”, “receptionist” vs. “executive” and “secretary” vs. “manager”. Again, we do not see a reduction in association bias by simply increasing the size of the training data. Performance Disparity. Finally, one common definition of fairness in machine learning is maintaining similar performance across different groups. See, for instance, Dehghani et al. [21] and the related notions of “Equality of Opportunity” and “Equalized Odds” [31]. Table 5 show that scaling the data to 100 billion examples improves performance disparity, which is consistent with the improvement in cultural diversity. 4.5. 
Transfer To Generative Models We use PaliGemma [10] with both frozen and unfrozen vision component to assess the transferability of our vision models, which were contrastively pre-trained on datasets of different scales. In Table 6, when taking the noise level into consideration, we do not observe consistent performance gains across downstream tasks as we scale the pre-training dataset. More details can be found in Appendix C. 9 Scaling Pre-training to One Hundred Billion Data for Vision Language Models Table 5 | Performance disparity results for various SigLIP models pretrained on 100 billion seen examples of 1B, 10B, and 100B datasets. Here, disparity corresponds to the maximum gap across subgroups in Dollar Street (by income level) and GeoDE (by geographic region). Pretraining on 100B examples tends to improve disparity overall. Model Data Scale Performance per Subgroup 0-shot Dollar Street 200-685 685-1998 0-200 Disparity >1998 B B B 1B 10B 100B 29.4 31.6 32.0 43.9 44.0 44.3 56.5 55.4 56.3 62.0 61.5 61.0 32.5 29.9 29.0 L L L 1B 10B 100B 33.7 35.7 33.7 44.7 47.8 46.6 57.3 58.7 59.5 63.4 65.5 64.1 29.7 29.8 30.4 H H H 1B 10B 100B 32.3 33.9 34.1 44.9 46.3 48.2 58.4 58.6 62.2 64.5 66.9 66.1 32.2 33.0 32.1 0-shot GeoDE Americas East-Asia Africa Europe South-East Asia West Asia B B B 1B 10B 100B 89.4 88.4 88.8 92.1 91.8 91.4 91.8 91.4 91.0 94.1 94.0 93.3 92.5 92.2 91.7 93.4 93.0 92.2 4.7 5.5 4.4 L L L 1B 10B 100B 92.0 91.8 93.5 94.0 94.4 95.1 94.0 94.0 95.4 95.2 95.8 96.2 94.2 94.2 95.0 94.9 94.7 95.8 3.2 4.0 2.8 H H H 1B 10B 100B 91.5 93.4 93.6 94.4 95.4 95.1 94.7 95.0 95.3 95.2 96.5 96.3 94.1 95.1 95.2 94.5 95.6 95.8 3.6 3.0 2.7 Data Semantics OCR Multiling RS Avg 1B 10B 100B 1B 10B 100B 76.0 75.4 76.4 77.1 76.4 77.2 66.8 65.2 67.0 69.5 66.9 70.0 67.0 66.3 66.9 66.9 66.0 67.0 92.3 91.9 92.1 92.0 91.8 91.8 73.6 72.7 73.9 75.1 73.7 75.3 Table 6 | The PaliGemma transfer results of ViTL/16 models pretrained on 10B and 100B examples, with both frozen (top) and unfrozen 
(bottom) vision components. Results are aggregated.

5. Analysis

5.1. Data Quality Filtering

Raw web data is often too noisy for training effective vision-language models. To address this, a common strategy is to use a data filter model to remove less relevant image-text pairs. In this work, we utilize the CLIP-L/14 model to filter the raw data and retain 5 billion high-quality English image-text pairs. For comparison, we also train a classifier model on the raw web data and use it as a filter, resulting in a filtered dataset of the same size. Additionally, we sample an English subset of the same size from the raw data to serve as a baseline. We train ViT-L models on the three datasets and present the results in Figure 4 and Appendix D. The CLIP filter excels in Western-centric tasks, consistent with data-centric research showing that effective data filtering enhances model performance [1, 12, 25, 47]. However, all filtered datasets underperform in other tasks, particularly those involving cultural diversity. This illustrates a key drawback of data filtering: it can inadvertently introduce biases into the filtered dataset, in agreement with prior works [11, 28, 53].

[Figure 4: three line plots of error rate (err %) against examples seen (billion, 0 to 30), with panels Average Western, Average Culture Diversity, and Average Fairness, and curves Baseline (en), CLIP, and Classifier.]

Figure 4 | Quality filtering can hinder cultural diversity (middle) and fairness (right), even when it benefits Western-centric (left) tasks. This observation holds for both the widely-used CLIP filter and a classifier filter trained on web data.

5.2. Language Rebalancing

The low-resource languages in our raw data collectively represent only 0.5% of the data, which prevents sufficient model learning of the concepts existing in these languages or areas. To address this, we upsample each low-resource language to a fixed 1% representation. This rebalancing, visualized in Figure 5, improves model performance on the low-resource language benchmark. In turn, performance on the high-resource languages decreases slightly but remains comparable (the same applies to other English-only zero-shot retrieval tasks), which results in an overall improvement on the entire multilingual benchmark. Additionally, we observe a mild improvement in cultural diversity tasks, while other tasks show slightly worse results, potentially due to the reduction in Western-centric examples, as most evaluations are based on the English language. Full evaluation results can be found in Appendix E.

5.3. Qualitative Examples

We visualize the attention maps from the vision models trained on different scales of data in Table 1. Models trained on larger data tend to have more focused attention on semantically relevant regions. For example, in the “Igorot Dance” image, the 100B-trained model captures finer details, such as intricate patterns on traditional decorations and culturally significant objects. In the “Igloo” image, the 100B-trained model accurately focuses on the igloo’s structural details (its dome shape), unlike other models, which are distracted by background elements like mountains and ice. Beyond low-resource concepts, 100B data can also improve performance on common concepts. As shown in the “Bison” image, models trained on larger datasets more precisely capture the bison, rather than the surrounding landscape. More visualized examples can be found in Table 7.

6. Discussion

Data Filtering. Data filtering is a common technique used to improve data quality in vision-language pre-training. As demonstrated in Section 5.1, the CLIP filter remarkably improves the model’s performance on the traditional tasks.
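As a rough illustration of the score-based filtering discussed here, the sketch below keeps the highest-scoring fraction of image-text pairs, ranked by cosine similarity between precomputed image and text embeddings. The function name `filter_by_similarity`, the `keep_fraction` parameter, the toy embeddings, and the top-fraction selection rule are all assumptions made for illustration; the text states only that CLIP-L/14 was used to filter the raw data down to 5 billion English pairs, not how the threshold was chosen.

```python
import numpy as np


def filter_by_similarity(image_emb, text_emb, keep_fraction):
    """Return indices of the top `keep_fraction` pairs by cosine similarity.

    Hypothetical helper: `image_emb` and `text_emb` are (n, d) arrays of
    embeddings assumed to come from a CLIP-like model.
    """
    # Normalise rows so that the row-wise dot product is the cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    scores = np.sum(image_emb * text_emb, axis=1)
    n_keep = max(1, int(round(keep_fraction * len(scores))))
    # Take the n_keep highest scores, then report them in dataset order.
    keep = np.sort(np.argsort(scores)[::-1][:n_keep])
    return keep, scores


# Toy data: pairs 0 and 2 are perfectly aligned, pair 1 is orthogonal.
toy_images = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_texts = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
kept, toy_scores = filter_by_similarity(toy_images, toy_texts, keep_fraction=0.5)
print(kept, toy_scores)
```

A filter of this kind ranks pairs purely by agreement with the scoring model, which is one plausible way for concepts that the scoring model represents poorly to be removed disproportionately.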
Given the noted impact of filtering on cultural diversity in our experiments, we focus on the impact of scaling raw, unfiltered data, and leave the improvement of data quality at the 100 billion scale for future work. We encourage the community to conduct further research into new data filtering techniques that preserve cultural diversity, as well as novel training architectures or methods that improve model inclusivity without requiring additional training data.

Limitations. The benchmarks used in this paper to evaluate VLM inclusivity are necessarily limited, since inclusivity is a broad societal concept that should not be reduced to a handful of metrics. For instance, while we utilize Crossmodal-3600 in a zero-shot setting to assess multilinguality, it only covers 36 languages.

7. Conclusion

In this paper, we investigate the impact of scaling image-text data up to 100 billion unique examples on vision-language pre-training. We demonstrate that a scale of 100 billion image-text pairs is beneficial for vision-language models in areas beyond traditional Western-centric benchmarks, such as cultural diversity, multilinguality, and reducing performance disparity across subgroups. Hence, this data scale remains fundamentally important for the development of truly inclusive multimodal systems. We also investigate the impact of applying quality filters, such as those based on CLIP, to large-scale image-text datasets. These filters, though often beneficial for traditional tasks, can negatively impact data diversity by reducing the representation of certain cultural contexts. Overall, our results highlight the importance of data scale for VLMs. While traditional benchmarks may not benefit significantly from the scaling of noisy, raw web data to 100 billion, this data scale remains crucial for training inclusive vision-language models.
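To make the disparity measure concrete, the short sketch below recomputes it as the maximum gap across subgroups, using the Dollar Street income-level results reported in Table 5 for the B-sized model pretrained on the 1B dataset. The helper name `disparity` is hypothetical, and the inputs are the rounded table entries, so the result can differ from the reported value (32.5) in the last digit.

```python
def disparity(performance_by_subgroup):
    """Maximum gap in performance across subgroups (hypothetical helper)."""
    values = list(performance_by_subgroup.values())
    return max(values) - min(values)


# Rounded 0-shot Dollar Street results per income bracket
# (B-sized model, 1B dataset), copied from Table 5.
dollar_street_b_1b = {
    "0-200": 29.4,
    "200-685": 43.9,
    "685-1998": 56.5,
    ">1998": 62.0,
}

gap = disparity(dollar_street_b_1b)
print(f"disparity = {gap:.1f}")
```

Because the measure depends only on the best and worst subgroups, it can stay flat or even grow when all subgroups improve, which is worth keeping in mind when reading the per-subgroup numbers alongside the disparity column.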
Acknowledgments

We thank Daniel Keysers and Jeremiah Harmse for their insightful reviews and suggestions; Matthias Minderer for valuable discussions and experiments on scaling open-vocabulary detection; Lucas Beyer for input on multilingual rebalancing; and Google DeepMind at large for providing a supportive research environment.

References

[1] A. Abbas, K. Tirumala, D. Simig, S. Ganguli, and A. S. Morcos. Semdedup: Data-efficient learning at web-scale through semantic deduplication. arXiv preprint arXiv:2303.09540, 2023.
[2] I. Alabdulmohsin, B. Neyshabur, and X. Zhai. Revisiting neural scaling laws in language and vision. In NeurIPS, 2022.
[3] I. Alabdulmohsin, X. Wang, A. Steiner, P. Goyal, A. D’Amour, and X. Zhai. Clip the bias: How useful is balancing data in multimodal learning? In ICLR, 2024.
[4] I. Alabdulmohsin, X. Zhai, A. Kolesnikov, and L. Beyer. Getting ViT in shape: Scaling laws for compute-optimal model design. 2024.
[5] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
[6] A. Ananthram, E. Stengel-Eskin, C. Vondrick, M. Bansal, and K. McKeown. See it from my perspective: Diagnosing the western cultural bias of large vision-language models in image understanding. arXiv preprint arXiv:2406.11665, 2024.
[7] Y. Bahri, E. Dyer, J. Kaplan, J. Lee, and U. Sharma. Explaining neural scaling laws. arXiv preprint arXiv:2102.06701, 2021.
[8] Y. Bansal, B. Ghorbani, A. Garg, B. Zhang, M. Krikun, C. Cherry, B. Neyshabur, and O. Firat. Data scaling laws in NMT: The effect of noise and architecture. arXiv preprint arXiv:2202.01994, 2022.
[9] C. Beleites, U. Neugebauer, T. Bocklitz, C. Krafft, and J. Popp. Sample size planning for classification models. Analytica chimica acta, 760:25–33, 2013.
[10] L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024.
[11] A. Birhane, V. U. Prabhu, and E. Kahembwe. Multimodal datasets: misogyny, pornography, and malignant stereotypes. arXiv preprint arXiv:2110.01963, 2021.
[12] L. Cao, B. Zhang, C. Chen, Y. Yang, X. Du, W. Zhang, Z. Lu, and Y. Zheng. Less is more: Removing text-regions improves clip training efficiency and robustness. arXiv preprint arXiv:2305.05095, 2023.
[13] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
[14] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[15] X. Chen, X. Wang, S. Changpinyo, A. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, et al. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.
[16] X. Chen, X. Wang, L. Beyer, A. Kolesnikov, J. Wu, P. Voigtlaender, B. Mustafa, S. Goodman, I. Alabdulmohsin, P. Padlewski, et al. Pali-3 vision language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199, 2023.
[17] M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
[18] J. Cho, K. Lee, E. Shin, G. Choy, and S. Do. How much data is needed to train a medical image deep learning system to achieve necessary high accuracy? arXiv preprint arXiv:1511.06348, 2015.
[19] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
[20] C. Crawl. Common crawl dataset, 2021. URL https://commoncrawl.org/.
[21] M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, J. Gilmer, A. P. Steiner, M. Caron, R. Geirhos, I. Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In ICML, 2023.
[22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[23] H. Dong, Z. Kang, W. Yin, X. Liang, C. Feng, and J. Ran. Scalable vision language model training via high quality data curation. arXiv preprint arXiv:2501.05952, 2025.
[24] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2020.
[25] A. Fang, A. M. Jose, A. Jain, L. Schmidt, A. Toshev, and V. Shankar. Data filtering networks. arXiv preprint arXiv:2309.17425, 2023.
[26] R. L. Figueroa, Q. Zeng-Treitler, S. Kandula, and L. H. Ngo. Predicting sample size required for classification performance. BMC medical informatics and decision making, 12(1):1–10, 2012.
[27] S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, E. Orgad, R. Entezari, G. Daras, S. Pratt, V. Ramanujan, Y. Bitton, K. Marathe, S. Mussmann, R. Vencu, M. Cherti, R. Krishna, P. W. W. Koh, O. Saukh, A. J. Ratner, S. Song, H. Hajishirzi, A. Farhadi, R. Beaumont, S. Oh, A. Dimakis, J. Jitsev, Y. Carmon, V. Shankar, and L. Schmidt. Datacomp: In search of the next generation of multimodal datasets. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 27092–27112. Curran Associates, Inc., 2023.
[28] N. Garcia, Y. Hirota, Y. Wu, and Y. Nakashima.
Uncurated image-text datasets: Shedding light on demographic bias. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6957–6966, 2023.
[29] B. Ghorbani, O. Firat, M. Freitag, A. Bapna, M. Krikun, X. Garcia, C. Chelba, and C. Cherry. Scaling laws for neural machine translation. arXiv preprint arXiv:2109.07740, 2021.
[30] P. Goyal, Q. Duval, I. Seessel, M. Caron, I. Misra, L. Sagun, A. Joulin, and P. Bojanowski. Vision models are more robust and fair when pretrained on uncurated images without supervision. arXiv preprint arXiv:2202.08360, 2022.
[31] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning, 2016. URL https://arxiv.org/abs/1610.02413.
[32] T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse, J. Jackson, H. Jun, T. B. Brown, P. Dhariwal, S. Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.
[33] J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M. Patwary, M. Ali, Y. Yang, and Y. Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
[34] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. Training compute-optimal large language models. In NeurIPS, 2022.
[35] M. Hutter. Learning curve theory. arXiv preprint arXiv:2102.04074, 2021.
[36] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
[37] M. Johnson, P. Anderson, M. Dras, and M. Steedman. Predicting accuracy on large datasets from smaller pilot data. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 450–455, Melbourne, Australia, 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-2072. URL https://aclanthology.org/P18-2072.
[38] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
[39] K. Karkkainen and J. Joo. Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1548–1558, 2021.
[40] J. N. Kather, C.-A. Weis, F. Bianconi, S. M. Melchers, L. R. Schad, T. Gaiser, A. Marx, and F. G. Zöllner. Multi-class texture analysis in colorectal cancer histology. Scientific reports, 6:27988, 2016.
[41] A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby. Big transfer (BiT): General visual representation learning. In ECCV, pages 491–507, 2020.
[42] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
[43] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[44] F.-F. Li, M. Andreeto, M. Ranzato, and P. Perona. Caltech 101, Apr 2022.
[45] J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.
[46] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
[47] P. Maini, S. Goyal, Z. C. Lipton, J. Z. Kolter, and A. Raghunathan. T-mars: Improving visual representations by circumventing text feature learning. arXiv preprint arXiv:2307.03132, 2023.
[48] M. Minderer, A. Gritsenko, and N. Houlsby. Scaling open-vocabulary object detection. Advances in Neural Information Processing Systems, 36, 2024.
[49] S. Mukherjee, P. Tamayo, S. Rogers, R. Rifkin, A. Engle, C. Campbell, T. R. Golub, and J. P. Mesirov. Estimating dataset size requirements for classifying dna microarray data. Journal of computational biology, 10(2):119–142, 2003.
[50] T. Nguyen, M. Wallingford, S. Santy, W.C. Ma, S. Oh, L. Schmidt, P. W. Koh, and R. Krishna. Multilingual diversity improves vision-language representations, 2024. URL https://arxiv.org/abs/2405.16915.
[51] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar. Cats and dogs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505, 2012. doi: 10.1109/CVPR.2012.6248092.
[52] H. Pham, Z. Dai, G. Ghiasi, K. Kawaguchi, H. Liu, A. W. Yu, J. Yu, Y.-T. Chen, M.-T. Luong, Y. Wu, et al. Combined scaling for zero-shot transfer learning. Neurocomputing, 555:126658, 2023.
[53] A. Pouget, L. Beyer, E. Bugliarello, X. Wang, A. P. Steiner, X. Zhai, and I. Alabdulmohsin. No filter: Cultural and socioeconomic diversity in contrastive vision-language models. In NeurIPS, 2024.
[54] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[55] V. V. Ramaswamy, S. Y. Lin, D. Zhao, A. Adcock, L. van der Maaten, D. Ghadiyaram, and O. Russakovsky. Geode: a geographically diverse evaluation dataset for object recognition. Advances in Neural Information Processing Systems, 36, 2024.
[56] M. Richards, P. Kirichenko, D. Bouchacourt, and M. Ibrahim. Does progress on object recognition benchmarks improve real-world generalization? In ICLR, 2024.
[57] W. A. G. Rojas, S. Diamos, K. R. Kini, D. Kanter, V. J. Reddi, and C. Coleman. The dollar street dataset: Images representing the geographic and socioeconomic diversity of the world. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
[58] J. S. Rosenfeld, A. Rosenfeld, Y. Belinkov, and N. Shavit. A constructive prediction of the generalization error across scales. arXiv preprint arXiv:1909.12673, 2019.
[59] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
[60] P. Sharma, N. Ding, S. Goodman, and R. Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018.
[61] U. Sharma and J. Kaplan. Scaling laws from the data manifold dimension. JMLR, 23(9):1–34, 2022.
[62] A. Steiner, A. Kolesnikov, X. Zhai, R. Wightman, J. Uszkoreit, and L. Beyer. How to train your vit? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270, 2021.
[63] A. Steiner, A. S. Pinto, M. Tschannen, D. Keysers, X. Wang, Y. Bitton, A. Gritsenko, M. Minderer, A. Sherbondy, S. Long, S. Qin, R. Ingle, E. Bugliarello, S. Kazemzadeh, T. Mesnard, I. Alabdulmohsin, L. Beyer, and X. Zhai. Paligemma 2: A family of versatile vlms for transfer, 2024. URL https://arxiv.org/abs/2412.03555.
[64] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, pages 843–852, 2017.
[65] A. V. Thapliyal, J. Pont-Tuset, X. Chen, and R. Soricut. Crossmodal-3600: A massively multilingual multimodal evaluation dataset. arXiv preprint arXiv:2205.12522, 2022.
[66] M. Tschannen, M. Kumar, A. Steiner, X. Zhai, N. Houlsby, and L. Beyer. Image captioners are scalable vision learners too. Advances in Neural Information Processing Systems, 36, 2024.
[67] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
[68] B. Wan, M. Tschannen, Y. Xian, F. Pavetic, I. Alabdulmohsin, X. Wang, A. S. Pinto, A. Steiner, L. Beyer, and X. Zhai. Locca: Visual pretraining with location-aware captioners. arXiv preprint arXiv:2403.19596, 2024.
[69] T. Weyand, A. Araujo, B. Cao, and J. Sim. Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2575–2584, 2020.
[70] F. Wilcoxon. Individual comparisons by ranking methods. In Breakthroughs in statistics: Methodology and distribution, pages 196–202. Springer, 1992.
[71] B. Xiao, H. Wu, W. Xu, X. Dai, H. Hu, Y. Lu, M. Zeng, C. Liu, and L. Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4818–4829, 2024.
[72] L. Xue. mt5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934, 2020.
[73] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
[74] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
[75] L. Yuan, D. Chen, Y.-L. Chen, N. Codella, X. Dai, J. Gao, H. Hu, X. Huang, B. Li, C. Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
[76] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer. Scaling vision transformers. In CVPR, 2022.
[77] X. Zhai, X. Wang, B. Mustafa, A. Steiner, D. Keysers, A. Kolesnikov, and L. Beyer. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18123–18133, 2022.
[78] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training, 2023. URL https://arxiv.org/abs/2303.15343.
[79] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
[80] W. Zhu, J. Hessel, A. Awadalla, S. Y. Gadre, J. Dodge, A. Fang, Y. Yu, L. Schmidt, W. Y. Wang, and Y. Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text. Advances in Neural Information Processing Systems, 36, 2024.

A. Qualitative Examples

Table 7 | The attention map visualization of the ViT-L/16 models trained on different scales of data. Images are selected to represent cultures in Western-centric countries and countries where low-resource languages are spoken.
[Table 7 columns: Concept, Image, 1B, 10B, 100B. The images and attention maps did not survive extraction; the concepts shown, with their image-source footnotes, are listed below.]

| Concept | Image source (footnote) |
|---|---|
| Street (New York) | 3 |
| Pub (London) | 4 |
| Bison (Yellowstone) | 5 |
| Igorot Dance (Igorot) | 6 |
| Kathputli Kala Chitra (Hindi) | 7 |
| Igloo (Inuit) | 8 |
| Pohela Boishakh (Bengali) | 9 |

3 By Terabass, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=134418052
4 By Ricardalovesmonuments - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=122810839
5 Source: Yellowstone National Park, https://www.yellowstonenationalparklodges.com/connect/yellowstone-hotspot/yellowstone-where-the-bison-roam/
6 Source: Itogon, https://itogon.wordpress.com/2012/04/26/book-goes-to-heart-of-igorot-people/
7 Source: The Better India, https://thebetterindia.com/57220/journey-indian-handicraft-landscape/
8 Source: https://commons.wikimedia.org/w/index.php?curid=3648025
9 Source: EyeNews, https://www.eyenews.news/english/Today-is-Pahela-Baishakh-the-first-day-of-Bengal-1430/757

B. Evaluations of Data Scaling

Table 8 | Detailed evaluation results of ViT-B/L/H models on 1/10/100 billion scale datasets. All metrics are measured by error rate, with the exception of “Representation Bias”, which is measured by disparity, where lower values are better.
ViT-B/16 Metric Category 1B 10B ViT-L/16 100B 1B 10B ViT-H/16 100B 1B 10B 100B ImageNet 0-shot Classification 41.21 39.35 39.04 31.23 29.70 28.49 Cifar100 0-shot Classification 36.62 35.87 36.80 25.02 23.75 23.36 Pet 0-shot Classification 25.40 23.71 22.27 14.36 12.46 9.46 ImageNet 10-shot Classification 46.65 45.63 44.74 35.11 34.95 33.71 Cifar100 10-shot Classification 38.73 38.63 39.02 27.50 26.70 25.49 Pet 10-shot Classification 22.95 23.19 22.08 12.32 12.48 11.80 Bird 10-shot Classification 53.80 53.47 53.90 44.05 45.25 44.29 Caltech 10-shot Classification Western 8.37 8.33 8.23 6.41 7.40 7.53 Cars 10-shot Classification 18.29 16.79 17.60 11.14 11.33 11.47 Colorectal 10-shot Classification 26.53 29.23 27.00 24.00 23.53 22.57 DTD 10-shot Classification 29.73 30.85 30.90 28.46 27.07 27.93 56.46 51.62 53.44 49.70 47.18 45.28 COCO Image-Text 0-shot Retrieval COCO Text-Image 0-shot Retrieval 70.90 68.84 70.01 68.16 64.32 62.51 Flickr Image-Text 0-shot Retrieval 24.20 21.20 21.10 20.40 15.50 16.60 Flickr Text-Image 0-shot Retrieval 43.12 40.26 40.42 39.94 32.32 32.52 ............................................................... Dollar Street 0-shot Classification 52.04 51.88 51.60 50.23 48.10 49.03 Dollar Street 10-shot Classification 77.69 75.81 72.12 63.56 64.09 58.29 GeoDE 0-shot Classification 7.85 8.27 8.65 6.01 5.90 4.88 Culture 72.75 71.47 71.36 61.94 62.31 57.85 GeoDE/country 10-shot Classification GeoDE/region 10-shot Classification 61.09 60.80 59.18 54.21 53.59 48.29 GLDv2 0-shot Classification 65.05 60.96 59.40 50.39 46.37 45.72 ............................................................... 
Representation Bias 33.15 34.54 35.21 38.18 36.35 35.51 Income 0-200 Classification 70.57 68.43 67.97 66.30 64.35 66.30 Income 200-285 Classification 56.07 55.98 55.70 55.33 52.18 53.38 43.45 44.57 43.73 42.71 41.32 40.48 Income 285-685 Classification Income >1998 Classification 38.05 38.51 38.98 36.56 34.51 35.91 GeoDE: Africa 10.58 11.56 11.15 7.99 8.24 6.55 GeoDE: Americas Fairness 7.94 8.16 8.58 6.03 5.57 4.92 GeoDE: EastAsia 8.15 8.57 8.99 5.98 5.96 4.56 GeoDE: Europe 5.92 6.02 6.75 4.81 4.20 3.75 GeoDE: SouthEastAsia 7.51 7.81 8.26 5.78 5.78 5.02 GeoDE: WestAsia 6.57 7.01 7.85 5.11 5.30 4.19 ............................................................... XM3600 Image-Text: Arabic 61.78 53.42 53.36 53.58 45.00 44.56 XM3600 Image-Text: Bengali 95.69 80.64 77.06 90.81 66.36 63.75 XM3600 Image-Text: Czech 60.78 51.89 50.83 52.31 43.81 42.22 XM3600 Image-Text: Danish 55.58 45.39 45.75 45.08 35.06 31.00 XM3600 Image-Text: German 39.47 31.53 31.78 30.61 24.28 24.03 XM3600 Image-Text: Greek 74.36 63.00 61.86 67.86 53.64 50.14 XM3600 Image-Text: English 56.53 55.03 55.50 54.14 52.42 51.67 XM3600 Image-Text: Spanish 49.17 42.94 44.22 41.56 38.44 35.81 XM3600 Image-Text: Persian 58.94 51.17 51.58 49.64 38.97 40.17 XM3600 Image-Text: Finnish 70.64 53.83 53.61 59.25 42.67 39.06 87.86 82.06 81.92 82.72 72.86 71.36 XM3600 Image-Text: Filipino Multiling XM3600 Image-Text: French 47.08 38.92 39.06 39.08 31.78 29.92 XM3600 Image-Text: Hindi 83.53 74.78 72.39 77.67 65.67 63.47 XM3600 Image-Text: Croatian 64.53 53.28 51.33 53.08 37.94 35.78 XM3600 Image-Text: Hungarian 64.50 49.06 47.53 53.81 38.64 34.42 29.60 23.49 10.33 32.44 25.76 10.85 41.65 5.70 11.32 25.17 29.20 48.62 64.86 16.80 34.26 25.60 19.79 7.47 29.76 23.79 9.13 39.13 6.02 10.30 26.17 26.12 42.04 60.32 13.50 28.46 24.90 21.42 7.17 29.34 24.21 8.67 36.31 8.93 9.60 25.87 26.76 42.48 59.29 13.90 28.00 50.00 64.60 5.99 56.94 54.56 48.05 48.58 59.10 4.87 50.22 47.63 40.08 47.35 53.69 4.81 47.55 44.68 38.78 36.76 67.69 
55.14 41.60 35.53 8.46 5.60 5.30 4.83 5.86 5.50 35.01 66.11 53.66 41.41 33.12 6.56 4.57 5.01 3.53 4.89 4.42 36.61 65.92 51.81 37.79 33.86 6.40 4.86 4.68 3.75 4.76 4.19 52.25 88.17 49.94 43.03 29.17 65.67 53.22 40.03 46.61 57.39 81.31 36.58 76.92 47.81 51.22 41.64 61.22 40.11 29.92 22.75 49.50 51.42 33.89 33.72 34.83 66.14 28.53 62.33 32.44 32.67 41.00 56.69 39.44 28.75 21.89 47.33 49.64 34.28 34.06 32.86 63.03 28.19 60.64 30.44 30.36 19 Scaling Pre-training to One Hundred Billion Data for Vision Language Models XM3600 Image-Text: Indonesian XM3600 Image-Text: Italian XM3600 Image-Text: Hebrew XM3600 Image-Text: Japanese XM3600 Image-Text: Korean XM3600 Image-Text: Maori XM3600 Image-Text: Dutch XM3600 Image-Text: Norwegian XM3600 Image-Text: Polish XM3600 Image-Text: Portuguese XM3600 Image-Text: Quechua XM3600 Image-Text: Romanian XM3600 Image-Text: Russian XM3600 Image-Text: Swedish XM3600 Image-Text: Swahili XM3600 Image-Text: Telugu XM3600 Image-Text: Thai XM3600 Image-Text: Turkish XM3600 Image-Text: Ukrainian XM3600 Image-Text: Vietnamese XM3600 Image-Text: Chinese XM3600 Text-Image: Arabic XM3600 Text-Image: Bengali XM3600 Text-Image: Czech XM3600 Text-Image: Danish XM3600 Text-Image: German XM3600 Text-Image: Greek XM3600 Text-Image: English XM3600 Text-Image: Spanish XM3600 Text-Image: Persian XM3600 Text-Image: Finnish XM3600 Text-Image: Filipino XM3600 Text-Image: French XM3600 Text-Image: Hindi XM3600 Text-Image: Croatian XM3600 Text-Image: Hungarian XM3600 Text-Image: Indonesian XM3600 Text-Image: Italian XM3600 Text-Image: Hebrew XM3600 Text-Image: Japanese XM3600 Text-Image: Korean XM3600 Text-Image: Maori XM3600 Text-Image: Dutch XM3600 Text-Image: Norwegian XM3600 Text-Image: Polish XM3600 Text-Image: Portuguese XM3600 Text-Image: Quechua XM3600 Text-Image: Romanian XM3600 Text-Image: Russian XM3600 Text-Image: Swedish XM3600 Text-Image: Swahili XM3600 Text-Image: Telugu XM3600 Text-Image: Thai XM3600 Text-Image: Turkish XM3600 Text-Image: 
Ukrainian XM3600 Text-Image: Vietnamese XM3600 Text-Image: Chinese Avg Western 0-shot Classification Avg Western 10-shot Classification Avg Western 0-shot Retrieval Multiling Western 44.81 48.58 67.06 67.36 58.64 99.61 53.97 56.56 53.97 51.03 95.53 64.56 51.56 54.03 92.14 98.06 79.33 60.33 62.39 54.31 63.92 73.77 97.19 71.81 68.23 55.15 82.61 62.32 57.35 71.80 81.00 93.60 56.70 91.01 75.52 74.24 60.08 57.90 76.50 76.74 70.82 99.78 63.50 70.36 63.73 62.16 98.46 74.48 61.65 66.11 96.30 98.76 86.81 72.31 75.01 70.38 73.98 34.41 30.63 48.67 38.14 41.00 50.28 55.67 49.61 99.50 47.47 46.78 44.89 44.19 94.08 51.39 42.36 44.25 88.17 87.08 68.67 50.03 52.25 45.33 51.08 67.79 89.25 64.49 59.97 47.80 75.69 59.41 52.74 65.18 70.80 90.28 50.23 86.55 67.53 63.83 52.90 51.51 64.76 69.20 64.88 99.78 59.25 63.58 57.39 57.16 97.94 65.48 53.83 59.05 94.01 92.69 80.38 65.24 66.08 64.82 64.78 32.98 30.77 45.48 37.08 40.86 49.86 55.42 49.53 99.42 48.78 47.89 44.22 44.39 93.89 52.03 42.28 45.69 88.72 80.53 67.47 50.06 49.78 45.22 51.19 68.49 89.53 65.48 61.73 49.18 75.71 60.78 55.49 65.58 68.28 91.07 50.57 86.09 66.85 63.53 53.96 52.08 62.76 68.99 67.23 99.78 59.05 63.44 57.71 57.93 97.85 65.11 54.17 60.50 94.73 90.40 79.47 65.17 65.35 64.64 64.96 32.70 30.43 46.24 35.83 38.42 56.75 59.00 50.75 99.58 47.11 45.33 45.97 43.33 94.64 52.19 42.78 44.50 89.94 96.08 72.61 52.78 55.19 43.19 53.67 67.49 95.17 65.52 60.01 45.85 77.96 58.97 52.64 62.65 72.96 90.89 48.33 87.43 66.68 66.49 50.28 47.96 69.11 69.56 64.52 99.73 57.41 61.54 56.06 54.54 97.88 65.20 53.47 58.78 94.55 97.76 81.83 65.21 68.84 61.84 64.87 23.54 23.62 44.55 28.47 33.33 39.44 45.42 40.33 99.22 41.14 36.11 35.50 36.03 93.53 38.31 35.14 34.94 81.33 76.67 59.47 40.72 41.25 34.00 42.47 59.74 79.72 58.57 51.18 39.88 69.11 57.57 49.06 55.06 59.11 83.98 43.31 81.38 54.42 53.73 44.05 42.80 56.25 62.34 56.76 99.56 52.02 53.81 47.92 49.48 98.14 54.05 47.58 50.72 90.09 87.47 74.60 55.12 57.74 54.00 59.03 21.97 23.59 39.83 28.53 30.97 
35.72 44.97 38.31 99.25 38.39 34.28 34.11 34.56 93.92 35.39 33.22 34.78 79.47 69.69 58.86 39.72 37.83 32.44 42.50 59.86 77.31 58.18 49.50 39.75 67.35 56.32 48.31 56.09 56.24 83.70 42.10 80.01 54.22 50.75 43.97 42.60 54.14 58.44 56.51 99.62 51.48 52.99 47.09 48.72 98.04 52.41 45.36 51.82 89.57 83.03 73.67 56.70 55.32 53.39 57.33 20.44 23.10 39.23 33.39 36.47 52.03 58.47 46.81 99.31 44.56 43.39 41.75 41.14 94.58 47.92 41.19 40.69 88.92 96.36 71.25 48.56 52.75 40.75 54.17 65.87 94.22 63.59 56.72 43.80 75.68 58.15 51.24 59.79 70.79 89.55 47.52 87.71 63.21 64.26 49.27 48.03 65.88 69.16 61.52 99.75 55.49 60.04 53.28 52.44 98.18 61.69 51.60 55.34 93.85 98.18 82.21 62.35 66.07 58.46 65.25 21.14 22.76 41.13 24.86 29.64 33.86 42.22 35.39 98.92 38.06 31.81 33.00 32.69 93.06 32.36 31.97 31.14 76.86 73.08 56.86 36.56 36.94 29.06 40.53 56.22 76.36 55.79 46.53 36.56 65.45 56.40 47.27 52.93 51.07 80.61 40.48 79.21 50.71 48.31 41.45 40.62 51.49 57.06 53.57 99.67 49.88 49.16 45.05 47.48 98.28 48.77 43.58 47.66 87.47 84.44 73.31 53.59 54.18 50.29 56.15 17.62 21.30 36.08 20 25.33 28.89 30.81 37.94 35.08 99.17 37.44 30.19 31.06 32.28 92.78 30.11 30.31 30.78 74.14 65.31 52.78 34.94 33.25 29.08 38.42 54.91 72.42 55.07 45.46 36.99 64.10 55.82 46.62 49.69 49.35 77.92 39.96 78.22 48.53 45.72 40.81 40.34 49.99 54.78 52.67 99.51 49.10 48.20 44.49 46.64 98.26 47.09 43.08 47.93 85.67 79.57 69.67 52.19 50.84 48.76 56.68 17.83 21.21 35.92 Scaling Pre-training to One Hundred Billion Data for Vision Language Models Avg Western Classification 31.66 31.37 31.05 23.60 23.15 22.37 ............................................................... Avg Dollar Street Classification 64.87 63.85 61.86 56.89 56.09 53.66 Culture Avg GeoDE Classification 47.23 46.85 46.39 40.72 40.60 37.01 ............................................................... 
Avg Income Classification 52.03 51.87 51.59 50.22 48.09 49.02 Avg Geographic Classification Fairness 7.78 8.19 8.59 5.95 5.84 4.83 25.24 24.43 27.49 24.91 26.47 25.50 Avg Demography Classification Avg Multiling: Low-Resource Lang 91.22 84.27 83.16 87.73 77.14 75.01 Multiling Avg Multiling: High-Resource Lang 63.66 55.42 55.53 55.54 46.75 45.43 Average Western-centric Average Cultural Diversity Average Fairness Average Multilinguality 36.20 56.08 25.44 65.23 35.13 54.87 25.46 56.09 35.10 53.72 26.08 55.61 29.19 47.72 23.87 57.52 27.60 46.72 23.36 47.23 26.87 44.01 23.01 45.40 22.32 20.30 20.29 57.30 39.16 53.84 34.24 50.52 32.35 49.99 5.92 25.50 48.57 4.83 25.13 47.35 4.77 27.22 86.58 53.38 73.69 43.11 70.93 41.81 27.34 46.69 23.88 55.38 24.51 41.75 22.80 43.33 24.46 39.48 22.70 41.63

C. Evaluations of Transferability to Generative Models

The downstream tasks in Table 9 are categorized into the following groups and reported in Table 6:

1. Semantics: “COCOcap”, “NoCaps”, “COCO-35L (en)”, “XM3600 (en)”, “OKVQA”, “AOKVQA-MC (val)”, “AOKVQA-DA (val)”, “GQA”, “NLVR2”, “MARVL (avg5)”, “VizWizVQA (val)”, “TallyQA (simple)”, “TallyQA (complex)”, “CountBenchQA”, “RefCOCO (testA)”, “RefCOCO (testB)”, “RefCOCO+ (testA)”, “RefCOCO+ (testB)”, “RefCOCOg (test)”
2. OCR: “DocVQA (val)”, “OCR-VQA”, “ChartQA (avg)”, “ChartQA (human)”, “ChartQA (aug)”, “SciCap”, “AI2D”, “ScienceQA”, “InfoVQA (val)”, “TextCaps”, “TextVQA (val)”, “ST-VQA (val)”, “Screen2Words”, “WidgetCap”
3. Multilinguality: “xGQA (avg8)”, “XM3600 (avg36)”, “COCO-35L (avg35)”
4. Remote Sensing: “RSVQA-lr”, “RSVQA-hr (test)”, “RSVQA-hr (test2)”

Table 9 | Detailed evaluation results of the transferability of contrastively trained vision models (ViT-L/16) to generative vision-language models (PaliGemma), with both frozen and unfrozen setups.
Task-specific numbers are reported for vision models trained on 1 billion, 10 billion and 100 billion raw data respectively, using PaliGemma’s default fine-tuning configuration.

| Metric | Frozen ViT 1B Data | Frozen ViT 10B Data | Frozen ViT 100B Data | Unfrozen ViT 1B Data | Unfrozen ViT 10B Data | Unfrozen ViT 100B Data |
|---|---|---|---|---|---|---|
| COCOcap | 134.6 | 132.9 | 134.4 | 135.0 | 132.1 | 134.0 |
| NoCaps | 114.1 | 110.5 | 112.8 | 113.4 | 111.4 | 113.3 |
| COCO-35L (avg35) | 107.6 | 105.9 | 108.0 | 107.7 | 106.8 | 107.8 |
| COCO-35L (avg34) | 106.9 | 105.2 | 107.3 | 107.0 | 106.0 | 107.1 |
| COCO-35L (en) | 130.6 | 130.4 | 133.4 | 132.4 | 132.5 | 133.4 |
| XM3600 (en) | 75.5 | 74.9 | 75.2 | 75.3 | 75.4 | 76.0 |
| XM3600 (avg36) | 37.9 | 36.9 | 38.0 | 37.7 | 37.5 | 38.0 |
| Screen2Words | 108.9 | 107.5 | 109.9 | 105.0 | 105.3 | 105.5 |
| TextCaps | 86.5 | 79.3 | 93.2 | 87.6 | 81.8 | 83.8 |
| SciCap | 149.7 | 146.9 | 150.0 | 146.1 | 144.6 | 147.1 |
| WidgetCap | 120.1 | 109.6 | 117.9 | 113.3 | 108.4 | 114.9 |
| VQAv2 (minival) | 79.4 | 78.8 | 79.8 | 79.2 | 78.6 | 78.6 |
| OKVQA | 60.4 | 59.6 | 59.7 | 59.6 | 59.7 | 59.9 |
| AOKVQA-MC (val) | 74.2 | 72.7 | 73.0 | 73.0 | 72.7 | 74.2 |
| AOKVQA-DA (val) | 58.5 | 56.8 | 57.3 | 59.1 | 57.7 | 57.9 |
| GQA | 63.4 | 63.5 | 63.6 | 63.8 | 63.0 | 63.5 |
| NLVR2 | 87.5 | 86.7 | 87.2 | 86.4 | 86.4 | 87.0 |
| MARVL (avg5) | 76.7 | 76.2 | 76.6 | 76.3 | 76.8 | 77.0 |
| AI2D | 69.8 | 70.0 | 70.6 | 68.2 | 68.5 | 68.6 |
| ScienceQA | 95.4 | 94.9 | 94.4 | 94.5 | 92.9 | 94.7 |
| RSVQA-lr | 93.0 | 92.4 | 92.3 | 93.6 | 92.8 | 93.0 |
| RSVQA-hr (test) | 92.5 | 92.5 | 92.7 | 92.6 | 92.6 | 92.6 |
| RSVQA-hr (test2) | 90.4 | 90.4 | 90.5 | 90.5 | 90.4 | 90.6 |
| ChartQA (avg) | 45.1 | 43.6 | 45.0 | 41.4 | 40.3 | 42.5 |
| ChartQA (human) | 31.8 | 31.8 | 32.6 | 29.8 | 28.3 | 30.5 |
| ChartQA (aug) | 58.5 | 55.4 | 57.4 | 53.0 | 52.3 | 54.5 |
| VizWizVQA (val) | 72.3 | 71.2 | 72.8 | 72.0 | 71.6 | 71.9 |
| TallyQA (simple) | 76.6 | 75.7 | 75.9 | 76.6 | 75.7 | 76.9 |
| TallyQA (complex) | 65.0 | 65 | 65.5 | 65.4 | 64.5 | 65.3 |
| CountBenchQA | 68.2 | 69.0 | 67.3 | 60.6 | 61.2 | 63.7 |
| OCR-VQA | 68.3 | 67.5 | 68.2 | 66.9 | 66.0 | 67.1 |
| TextVQA (val) | 44.5 | 41.4 | 44.7 | 41.2 | 40.4 | 41.2 |
| DocVQA (val) | 25.0 | 23.5 | 25.8 | 23.4 | 21.7 | 23.1 |
| InfoVQA (val) | 22.3 | 22.2 | 23 | 21.4 | 22.0 | 22.1 |
| ST-VQA (val) | 46.6 | 42.8 | 46.7 | 43.5 | 40.1 | 43.2 |
| xGQA (avg8) | 55.2 | 55.2 | 55 | 55.6 | 54.5 | 54.8 |
| xGQA (avg7) | 54.1 | 54.0 | 53.8 | 54.5 | 53.3 | 53.6 |
| RefCOCO (testA) | 67.4 | 67.5 | 67.9 | 64.5 | 64.2 | 65.1 |
| RefCOCO (testB) | 62.7 | 62.0 | 63.8 | 60.2 | 59.6 | 60.9 |
| RefCOCO+ (testA) | 63 | 62.7 | 63.5 | 60.2 | 59.9 | 60.3 |
| RefCOCO+ (testB) | 55.6 | 54.9 | 56.2 | 53.2 | 52.5 | 53.3 |
| RefCOCOg (test) | 59.1 | 58.9 | 60 | 56.5 | 56.1 | 57.2 |
| Avg Semantics | 77.1 | 76.4 | 77.2 | 76.0 | 75.4 | 76.4 |
| Avg OCR | 69.5 | 66.9 | 70.0 | 66.8 | 65.2 | 67.0 |
| Avg Multilinguality | 66.9 | 66.0 | 67.0 | 67.0 | 66.3 | 66.9 |
| Avg Remote Sensing | 92.0 | 91.8 | 91.8 | 92.3 | 91.9 | 92.1 |
| Avg | 75.1 | 73.7 | 75.3 | 73.6 | 72.7 | 73.9 |

D. Evaluations of Data Quality Filtering

Table 10 | Detailed evaluation results of data quality filtering on ViT-L/16 models. All evaluations are conducted on datasets of 5 billion image-text pairs and across different numbers of seen examples. All metrics are measured by error rate, with the exception of “Representation Bias”, which is measured by disparity.

| Metric | Filter | 1B | 5B | 10B | 20B | 30B |
|---|---|---|---|---|---|---|
| ImageNet 0-shot Classification | Baseline (en) | 34.67 | 28.17 | 26.68 | 26.15 | 24.32 |
| | CLIP filtered | 31.18 | 26.76 | 25.14 | 24.39 | 23.90 |
| | Other filtered | 34.50 | 29.52 | 28.13 | 26.70 | 26.45 |
| Cifar100 0-shot Classification | Baseline (en) | 33.05 | 26.08 | 24.37 | 24.52 | 23.99 |
| | CLIP filtered | 31.69 | 26.96 | 25.37 | 24.68 | 25.76 |
| | Other filtered | 36.07 | 35.27 | 29.95 | 32.58 | 30.78 |
| Pet 0-shot Classification | Baseline (en) | 17.25 | 11.99 | 11.69 | 9.13 | 8.72 |
| | CLIP filtered | 13.68 | 10.49 | 8.78 | 8.59 | 8.23 |
| | Other filtered | 14.04 | 9.62 | 8.99 | 7.28 | 6.62 |
| ImageNet 10-shot Classification | Baseline (en) | 42.41 | 35.25 | 33.17 | 33.17 | 30.68 |
| | CLIP filtered | 38.57 | 32.53 | 30.60 | 29.20 | 28.72 |
| | Other filtered | 38.32 | 32.32 | 30.42 | 29.05 | 28.46 |
| Cifar100 10-shot Classification | Baseline (en) | 36.61 | 30.02 | 27.39 | 27.23 | 26.82 |
| | CLIP filtered | 32.83 | 28.44 | 28.04 | 26.20 | 27.40 |
| | Other filtered | 35.30 | 35.56 | 31.18 | 32.26 | 31.79 |
| Pet 10-shot Classification | Baseline (en) | 22.95 | 16.93 | 15.32 | 15.26 | 11.72 |
| | CLIP filtered | 17.31 | 11.72 | 10.44 | 8.97 | 8.83 |
| | Other filtered | 14.15 | 10.38 | 9.08 | 7.63 | 7.52 |
| Bird 10-shot Classification | Baseline (en) | 41.18 | 31.69 | 29.91 | 29.60 | 27.37 |
| | CLIP filtered | 32.38 | 25.20 | 23.85 | 22.21 | 21.95 |
| | Other filtered | 34.57 | 27.01 | 26.30 | 24.65 | 23.73 |
| Caltech 10-shot Classification | Baseline (en) | 10.45 | 9.94 | 9.34 | 9.63 | 9.60 |
| | CLIP filtered | 11.18 | 10.68 | 10.44 | 10.50 | 10.50 |
| | Other filtered | 8.97 | 9.25 | 9.01 | 8.30 | 9.06 |
| Cars 10-shot Classification | Baseline (en) | 16.47 | 11.03 | 10.16 | 10.05 | 8.94 |
| | CLIP filtered | 13.07 | 9.70 | 8.89 | 7.75 | 8.01 |
| | Other filtered | 16.84 | 13.07 | 12.52 | 11.30 | 11.30 |
| Colorectal Histology 10-shot Classification | Baseline (en) | 27.80 | 27.17 | 24.77 | 27.03 | 25.33 |
| | CLIP filtered | 25.97 | 22.90 | 20.80 | 24.23 | 27.13 |
| | Other filtered | 24.53 | 24.70 | 25.47 | 27.10 | 26.53 |
| DTD 10-shot Classification | Baseline (en) | 31.12 | 26.91 | 26.33 | 26.97 | 26.86 |
| | CLIP filtered | 29.20 | 25.69 | 25.37 | 23.51 | 23.72 |
| | Other filtered | 28.09 | 26.81 | 24.73 | 24.52 | 23.56 |
| COCO Image-Text 0-shot Retrieval | Baseline (en) | 46.80 | 40.28 | 39.30 | 39.18 | 37.04 |
| | CLIP filtered | 41.06 | 36.04 | 36.48 | 34.84 | 34.02 |
| | Other filtered | 42.92 | 38.32 | 36.80 | 35.96 | 36.24 |
| COCO Text-Image 0-shot Retrieval | Baseline (en) | 62.26 | 56.78 | 54.78 | 55.22 | 53.20 |
| | CLIP filtered | 59.11 | 55.27 | 54.45 | 53.12 | 53.03 |
| | Other filtered | 60.53 | 56.01 | 54.60 | 53.23 | 53.27 |
| Flickr Image-Text 0-shot Retrieval | Baseline (en) | 16.70 | 11.30 | 11.30 | 11.30 | 10.90 |
| | CLIP filtered | 14.80 | 9.90 | 9.70 | 9.60 | 8.90 |
| | Other filtered | 16.70 | 13.80 | 12.60 | 13.10 | 12.00 |
| Flickr Text-Image 0-shot Retrieval | Baseline (en) | 32.26 | 24.78 | 24.74 | 24.90 | 22.66 |
| | CLIP filtered | 29.52 | 24.98 | 23.34 | 22.12 | 22.02 |
| | Other filtered | 32.84 | 27.18 | 26.48 | 24.82 | 24.32 |
| Dollar Street 0-shot Classification | Baseline (en) | 54.67 | 50.44 | 49.81 | 49.98 | 49.37 |
| | CLIP filtered | 53.71 | 52.58 | 51.88 | 50.63 | 51.44 |
| | Other filtered | 50.23 | 47.63 | 47.86 | 47.45 | 47.08 |
| Dollar Street 10-shot Classification | Baseline (en) | 84.87 | 79.27 | 77.18 | 76.21 | 72.54 |
| | CLIP filtered | 88.86 | 84.59 | 84.73 | 82.80 | 82.80 |
| | Other filtered | 90.16 | 89.46 | 87.91 | 88.72 | 87.77 |
| GeoDE 0-shot Classification | Baseline (en) | 8.98 | 6.48 | 6.43 | 6.26 | 6.23 |
| | CLIP filtered | 9.64 | 8.54 | 8.02 | 7.42 | 7.22 |
| | Other filtered | 9.50 | 7.69 | 7.50 | 7.50 | 7.53 |
| GeoDE (country) 10-shot Classification | Baseline (en) | 84.29 | 77.28 | 73.22 | 73.37 | 68.85 |
| | CLIP filtered | 85.82 | 81.98 | 80.11 | 78.08 | 78.24 |
| | Other filtered | 91.37 | 89.52 | 88.30 | 87.65 | 86.76 |
| GeoDE (region) 10-shot Classification | Baseline (en) | 66.67 | 61.66 | 57.71 | 58.77 | 55.78 |
| | CLIP filtered | 70.68 | 68.16 | 66.99 | 64.81 | 63.68 |
| | Other filtered | 75.82 | 72.39 | 72.95 | 72.13 | 71.27 |
| GLDv2 0-shot Classification | Baseline (en) | 65.50 | 53.18 | 50.13 | 49.48 | 44.16 |
| | CLIP filtered | 61.15 | 52.46 | 49.55 | 47.41 | 46.37 |
| | Other filtered | 80.87 | 74.06 | 72.37 | 72.37 | 70.17 |
| Representation Bias | Baseline (en) | 33.89 | 28.22 | 36.00 | 33.52 | 30.96 |
| | CLIP filtered | 11.46 | 19.14 | 20.03 | 26.57 | 14.05 |
| | Other filtered | 39.31 | 36.44 | 39.01 | 40.57 | 35.51 |
| Income 0-200 Classification | Baseline (en) | 71.31 | 67.22 | 68.34 | 67.50 | 67.04 |
| | CLIP filtered | 69.36 | 69.36 | 68.71 | 66.67 | 67.87 |
| | Other filtered | 69.36 | 67.97 | 65.65 | 66.11 | 66.67 |
| Income 200-285 Classification | Baseline (en) | 60.15 | 55.33 | 54.87 | 54.49 | 55.33 |
| | CLIP filtered | 58.48 | 57.46 | 56.63 | 54.59 | 56.63 |
| | Other filtered | 54.22 | 50.88 | 52.64 | 51.16 | 51.16 |
| Income 285-685 Classification | Baseline (en) | 46.61 | 42.99 | 41.04 | 42.43 | 40.76 |
| | CLIP filtered | 46.43 | 44.75 | 44.20 | 42.90 | 43.45 |
| | Other filtered | 40.95 | 39.09 | 39.37 | 39.37 | 37.70 |
| Income >1998 Classification | Baseline (en) | 40.56 | 36.19 | 34.98 | 35.44 | 34.33 |
| | CLIP filtered | 40.56 | 38.70 | 37.95 | 38.33 | 37.77 |
| | Other filtered | 36.37 | 32.56 | 33.77 | 33.12 | 32.74 |
| Africa | Baseline (en) | 11.51 | 8.19 | 7.88 | 7.72 | 7.85 |
| | CLIP filtered | 11.00 | 9.74 | 9.37 | 9.28 | 8.44 |
| | Other filtered | 12.04 | 9.97 | 9.51 | 9.85 | 9.88 |
| Americas | Baseline (en) | 8.59 | 6.74 | 6.15 | 6.37 | 6.27 |
| | CLIP filtered | 9.57 | 8.60 | 8.30 | 7.29 | 7.16 |
| | Other filtered | 9.63 | 7.68 | 7.32 | 7.53 | 7.48 |
| EastAsia | Baseline (en) | 9.90 | 7.10 | 7.37 | 7.29 | 6.71 |
| | CLIP filtered | 10.45 | 9.34 | 8.88 | 7.72 | 7.67 |
| | Other filtered | 10.52 | 8.92 | 8.63 | 8.21 | 8.48 |
| Europe | Baseline (en) | 6.75 | 4.82 | 5.29 | 5.01 | 5.17 |
| | CLIP filtered | 7.71 | 6.89 | 6.52 | 5.52 | 6.01 |
| | Other filtered | 7.29 | 5.62 | 5.57 | 5.45 | 5.51 |
| SouthEastAsia | Baseline (en) | 8.69 | 6.23 | 6.00 | 5.77 | 6.01 |
| | CLIP filtered | 9.74 | 8.47 | 7.40 | 7.74 | 7.32 |
| | Other filtered | 8.89 | 7.28 | 7.47 | 7.16 | 7.11 |
| WestAsia | Baseline (en) | 8.14 | 5.61 | 5.64 | 5.17 | 5.08 |
| | CLIP filtered | 9.24 | 8.16 | 7.59 | 6.75 | 6.52 |
| | Other filtered | 8.32 | 6.34 | 6.17 | 6.47 | 6.35 |
| Perceived Gender | Baseline (en) | 8.41 | 6.42 | 5.78 | 5.98 | 5.64 |
| | CLIP filtered | 8.43 | 7.63 | 8.08 | 7.56 | 6.35 |
| | Other filtered | 13.08 | 10.64 | 11.13 | 11.02 | 9.55 |
| Perceived Race | Baseline (en) | 37.87 | 44.74 | 43.93 | 48.30 | 44.95 |
| | CLIP filtered | 33.08 | 40.63 | 38.98 | 41.89 | 43.04 |
| | Other filtered | 53.52 | 52.46 | 53.21 | 52.83 | 56.43 |
| Average Western 0-shot Classification | Baseline (en) | 28.33 | 22.08 | 20.91 | 19.93 | 19.01 |
| | CLIP filtered | 25.52 | 21.40 | 19.76 | 19.22 | 19.30 |
| | Other filtered | 28.20 | 24.81 | 22.36 | 22.18 | 21.28 |
| Average Western 10-shot Classification | Baseline (en) | 28.62 | 23.62 | 22.05 | 22.37 | 20.92 |
| | CLIP filtered | 25.06 | 20.86 | 19.80 | 19.07 | 19.53 |
| | Other filtered | 25.10 | 22.39 | 21.09 | 20.60 | 20.25 |
| Average Western 0-shot Retrieval | Baseline (en) | 39.50 | 33.29 | 32.53 | 32.65 | 30.95 |
| | CLIP filtered | 36.12 | 31.55 | 30.99 | 29.92 | 29.49 |
| | Other filtered | 38.25 | 33.83 | 32.62 | 31.78 | 31.46 |
| Average Western Classification | Baseline (en) | 28.54 | 23.20 | 21.74 | 21.70 | 20.40 |
| | CLIP filtered | 25.19 | 21.01 | 19.79 | 19.11 | 19.47 |
| | Other filtered | 25.94 | 23.05 | 21.43 | 21.03 | 20.53 |
| Average Dollar Street Classification | Baseline (en) | 69.77 | 64.86 | 63.50 | 63.09 | 60.96 |
| | CLIP filtered | 71.29 | 68.58 | 68.30 | 66.71 | 67.12 |
| | Other filtered | 70.19 | 68.55 | 67.89 | 68.08 | 67.42 |
| Average GeoDE Classification | Baseline (en) | 53.32 | 48.48 | 45.79 | 46.13 | 43.62 |
| | CLIP filtered | 55.38 | 52.89 | 51.71 | 50.10 | 49.71 |
| | Other filtered | 58.90 | 56.54 | 56.25 | 55.76 | 55.18 |
| Average Income Classification | Baseline (en) | 54.66 | 50.43 | 49.81 | 49.97 | 49.36 |
| | CLIP filtered | 53.71 | 52.57 | 51.87 | 50.62 | 51.43 |
| | Other filtered | 50.22 | 47.62 | 47.86 | 47.44 | 47.07 |
| Average Geographic Classification | Baseline (en) | 8.93 | 6.44 | 6.39 | 6.22 | 6.18 |
| | CLIP filtered | 9.62 | 8.53 | 8.01 | 7.39 | 7.19 |
| | Other filtered | 9.45 | 7.63 | 7.44 | 7.45 | 7.47 |
| Average Demography Classification | Baseline (en) | 23.14 | 25.58 | 24.86 | 27.14 | 25.30 |
| | CLIP filtered | 20.76 | 24.13 | 23.53 | 24.72 | 24.70 |
| | Other filtered | 33.30 | 31.55 | 32.17 | 31.93 | 32.99 |
| Average Western-centric | Baseline (en) | 31.47 | 25.89 | 24.62 | 24.62 | 23.21 |
| | CLIP filtered | 28.10 | 23.82 | 22.78 | 21.99 | 22.14 |
| | Other filtered | 29.22 | 25.92 | 24.42 | 23.90 | 23.44 |
| Average Cultural Diversity | Baseline (en) | 60.83 | 54.72 | 52.41 | 52.34 | 49.49 |
| | CLIP filtered | 61.64 | 58.05 | 56.88 | 55.19 | 54.96 |
| | Other filtered | 66.33 | 63.46 | 62.82 | 62.64 | 61.76 |
| Average Fairness | Baseline (en) | 26.54 | 24.30 | 23.94 | 24.29 | 23.76 |
| | CLIP filtered | 26.17 | 25.81 | 25.22 | 24.69 | 24.85 |
| | Other filtered | 27.02 | 24.95 | 25.04 | 24.86 | 24.92 |

E. Evaluations of Language Rebalancing

[Figure 5: six panels (Average Multilingual: Low-Resource Lang; Average Multilingual: High-Resource Lang; Average Multilingual; Average Fairness; Average Western; Average Culture Diversity), each plotting err % against Data Scale (1B, 10B, 100B), with Language Rebalance shown Before and After.]

Figure 5 | Rebalancing low-resource languages leads to significant improvements on corresponding benchmarks and slight improvements on aggregated multilingual/cultural diversity tasks. However, other tasks may experience decreased performance due to fewer Western-centric examples.

Table 11 | Detailed evaluation results of the rebalancing of low-resource languages on ViT-L/16 models and datasets of 1/10/100 billion scales, with 100 billion examples seen in training. All metrics are measured by error rate, with the exception of “Representation Bias”, which is measured by disparity.
| Metric | 1B Data Before | 1B Data After | 10B Data Before | 10B Data After | 100B Data Before | 100B Data After |
|---|---|---|---|---|---|---|
| ImageNet 0-shot Classification | 31.23 | 31.39 | 29.70 | 30.47 | 28.49 | 28.80 |
| Cifar100 0-shot Classification | 25.02 | 24.96 | 23.75 | 24.04 | 23.36 | 23.51 |
| Pet 0-shot Classification | 14.36 | 13.00 | 12.46 | 12.05 | 9.46 | 11.23 |
| ImageNet 10-shot Classification | 35.11 | 34.94 | 34.95 | 34.99 | 33.71 | 33.89 |
| Cifar100 10-shot Classification | 27.50 | 27.82 | 26.70 | 26.50 | 25.49 | 25.05 |
| Pet 10-shot Classification | 12.32 | 13.71 | 12.48 | 15.59 | 11.80 | 13.46 |
| Bird 10-shot Classification | 44.05 | 42.75 | 45.25 | 45.29 | 44.29 | 42.89 |
| Caltech 10-shot Classification | 6.41 | 8.09 | 7.40 | 8.97 | 7.53 | 8.35 |
| Cars 10-shot Classification | 11.14 | 11.34 | 11.33 | 11.54 | 11.47 | 11.21 |
| Colorectal Histology 10-shot Classification | 24.00 | 25.50 | 23.53 | 24.43 | 22.57 | 28.00 |
| DTD 10-shot Classification | 28.46 | 29.31 | 27.07 | 27.39 | 27.93 | 29.04 |
| COCO Image-Text 0-shot Retrieval | 49.70 | 52.92 | 47.18 | 50.28 | 45.28 | 45.90 |
| COCO Text-Image 0-shot Retrieval | 68.16 | 67.50 | 64.32 | 63.60 | 62.51 | 62.16 |
| Flickr Image-Text 0-shot Retrieval | 20.40 | 24.30 | 15.50 | 20.30 | 16.60 | 16.40 |
| Flickr Text-Image 0-shot Retrieval | 39.94 | 37.88 | 32.32 | 32.64 | 32.52 | 33.30 |
| Dollar Street 0-shot Classification | 50.23 | 51.16 | 48.10 | 49.42 | 49.03 | 49.23 |
| Dollar Street 10-shot Classification | 63.56 | 65.04 | 64.09 | 65.51 | 58.29 | 59.42 |
| GeoDE 0-shot Classification | 6.01 | 6.03 | 5.90 | 5.97 | 4.88 | 5.42 |
| GeoDE (country) 10-shot Classification | 61.94 | 59.79 | 62.31 | 60.52 | 57.85 | 53.34 |
| GeoDE (region) 10-shot Classification | 54.21 | 53.99 | 53.59 | 53.30 | 48.29 | 48.05 |
| GLDv2 0-shot Classification | 50.39 | 51.82 | 46.37 | 47.73 | 45.72 | 44.29 |
| Representation Bias | 38.18 | 35.21 | 36.35 | 32.61 | 35.51 | 32.74 |
| Income 0-200 Classification | 66.30 | 67.32 | 64.35 | 65.83 | 66.30 | 65.37 |
| Income 200-285 Classification | 55.33 | 54.22 | 52.18 | 53.48 | 53.38 | 53.20 |
| Income 285-685 Classification | 42.71 | 44.75 | 41.32 | 42.80 | 40.48 | 40.76 |
| Income >1998 Classification | 36.56 | 38.33 | 34.51 | 35.53 | 35.91 | 37.58 |
| Africa | 7.99 | 8.34 | 8.24 | 7.81 | 6.55 | 7.46 |
| Americas | 6.03 | 5.51 | 5.57 | 5.84 | 4.92 | 5.20 |
| EastAsia | 5.98 | 6.07 | 5.96 | 5.90 | 4.56 | 5.27 |
| Europe | 4.81 | 4.41 | 4.20 | 4.23 | 3.75 | 4.00 |
| SouthEastAsia | 5.78 | 6.21 | 5.78 | 6.15 | 5.02 | 5.50 |
| WestAsia | 5.11 | 5.30 | 5.30 | 5.67 | 4.19 | 4.79 |
| Perceived Gender | 5.25 | 5.27 | 6.06 | 5.96 | 4.97 | 5.03 |
| Perceived Race | 44.57 | 49.02 | 46.88 | 45.89 | 46.04 | 47.35 |
| Crossmodal-3600 Image-Text Retrieval: Arabic | 53.58 | 56.44 | 45.00 | 45.89 | 44.56 | 44.78 |
| Crossmodal-3600 Image-Text Retrieval: Bengali | 90.81 | 76.03 | 66.36 | 63.53 | 63.75 | 61.47 |
| Crossmodal-3600 Image-Text Retrieval: Czech | 52.31 | 52.81 | 43.81 | 43.36 | 42.22 | 41.61 |
| Crossmodal-3600 Image-Text Retrieval: Danish | 45.08 | 45.22 | 35.06 | 34.81 | 31.00 | 32.53 |
| Crossmodal-3600 Image-Text Retrieval: German | 30.61 | 32.00 | 24.28 | 24.36 | 24.03 | 23.11 |
| Crossmodal-3600 Image-Text Retrieval: Greek | 67.86 | 70.17 | 53.64 | 53.42 | 50.14 | 51.94 |
| Crossmodal-3600 Image-Text Retrieval: English | 54.14 | 54.58 | 52.42 | 51.58 | 51.67 | 50.89 |
| Crossmodal-3600 Image-Text Retrieval: Spanish | 41.56 | 43.50 | 38.44 | 38.00 | 35.81 | 35.89 |
| Crossmodal-3600 Image-Text Retrieval: Persian | 49.64 | 55.33 | 38.97 | 41.97 | 40.17 | 38.11 |
| Crossmodal-3600 Image-Text Retrieval: Finnish | 59.25 | 60.11 | 42.67 | 42.42 | 39.06 | 40.28 |
| Crossmodal-3600 Image-Text Retrieval: Filipino | 82.72 | 72.56 | 72.86 | 62.72 | 71.36 | 60.22 |
| Crossmodal-3600 Image-Text Retrieval: French | 39.08 | 39.72 | 31.78 | 31.47 | 29.92 | 29.61 |
| Crossmodal-3600 Image-Text Retrieval: Hindi | 77.67 | 71.67 | 65.67 | 65.44 | 63.47 | 63.53 |
| Crossmodal-3600 Image-Text Retrieval: Croatian | 53.08 | 53.72 | 37.94 | 38.86 | 35.78 | 35.64 |
| Crossmodal-3600 Image-Text Retrieval: Hungarian | 53.81 | 54.61 | 38.64 | 37.81 | 34.42 | 34.78 |
| Crossmodal-3600 Image-Text Retrieval: Indonesian | 35.83 | 37.47 | 28.47 | 30.94 | 28.53 | 28.42 |
| Crossmodal-3600 Image-Text Retrieval: Italian | 38.42 | 40.69 | 33.33 | 33.50 | 30.97 | 31.03 |
| Crossmodal-3600 Image-Text Retrieval: Hebrew | 56.75 | 47.75 | 39.44 | 37.39 | 35.72 | 34.19 |
| Crossmodal-3600 Image-Text Retrieval: Japanese | 59.00 | 61.58 | 45.42 | 45.78 | 44.97 | 46.69 |
| Crossmodal-3600 Image-Text Retrieval: Korean | 50.75 | 53.06 | 40.33 | 40.00 | 38.31 | 38.58 |
| Crossmodal-3600 Image-Text Retrieval: Maori | 99.58 | 97.94 | 99.22 | 95.00 | 99.25 | 96.08 |
| Crossmodal-3600 Image-Text Retrieval: Dutch | 47.11 | 48.06 | 41.14 | 41.42 | 38.39 | 39.94 |
| Crossmodal-3600 Image-Text Retrieval: Norwegian | 45.33 | 46.81 | 36.11 | 36.72 | 34.28 | 34.47 |
| Crossmodal-3600 Image-Text Retrieval: Polish | 45.97 | 45.81 | 35.50 | 35.61 | 34.11 | 34.33 |
| Crossmodal-3600 Image-Text Retrieval: Portuguese | 43.33 | 42.53 | 36.03 | 38.33 | 34.56 | 34.11 |
| Crossmodal-3600 Image-Text Retrieval: Quechua | 94.64 | 94.97 | 93.53 | 93.83 | 93.92 | 93.42 |
| Crossmodal-3600 Image-Text Retrieval: Romanian | 52.19 | 52.72 | 38.31 | 38.06 | 35.39 | 34.86 |
| Crossmodal-3600 Image-Text Retrieval: Russian | 42.78 | 45.00 | 35.14 | 35.11 | 33.22 | 33.42 |
| Crossmodal-3600 Image-Text Retrieval: Swedish | 44.50 | 46.19 | 34.94 | 36.06 | 34.78 | 34.19 |
| Crossmodal-3600 Image-Text Retrieval: Swahili | 89.94 | 75.06 | 81.33 | 67.64 | 79.47 | 65.81 |
| Crossmodal-3600 Image-Text Retrieval: Telugu | 96.08 | 81.00 | 76.67 | 67.78 | 69.69 | 66.33 |
| Crossmodal-3600 Image-Text Retrieval: Thai | 72.61 | 74.72 | 59.47 | 60.50 | 58.86 | 59.92 |
| Crossmodal-3600 Image-Text Retrieval: Turkish | 52.78 | 54.94 | 40.72 | 41.25 | 39.72 | 39.89 |
| Crossmodal-3600 Image-Text Retrieval: Ukrainian | 55.19 | 57.33 | 41.25 | 40.97 | 37.83 | 39.19 |
| Crossmodal-3600 Image-Text Retrieval: Vietnamese | 43.19 | 42.22 | 34.00 | 35.22 | 32.44 | 32.86 |
| Crossmodal-3600 Image-Text Retrieval: Chinese | 53.67 | 54.81 | 42.47 | 44.67 | 42.50 | 43.97 |
| Crossmodal-3600 Text-Image Retrieval: Arabic | 67.49 | 65.43 | 59.74 | 59.02 | 59.86 | 59.70 |
| Crossmodal-3600 Text-Image Retrieval: Bengali | 95.17 | 83.83 | 79.72 | 75.56 | 77.31 | 73.33 |
| Crossmodal-3600 Text-Image Retrieval: Czech | 65.52 | 65.19 | 58.57 | 59.19 | 58.18 | 57.56 |
| Crossmodal-3600 Text-Image Retrieval: Danish | 60.01 | 59.93 | 51.18 | 52.77 | 49.50 | 49.74 |
| Crossmodal-3600 Text-Image Retrieval: German | 45.85 | 47.48 | 39.88 | 40.72 | 39.75 | 39.50 |
| Crossmodal-3600 Text-Image Retrieval: Greek | 77.96 | 75.46 | 69.11 | 69.24 | 67.35 | 68.25 |
| Crossmodal-3600 Text-Image Retrieval: English | 58.97 | 56.93 | 57.57 | 57.52 | 56.32 | 56.51 |
| Crossmodal-3600 Text-Image Retrieval: Spanish | 52.64 | 52.79 | 49.06 | 49.90 | 48.31 | 48.76 |
| Crossmodal-3600 Text-Image Retrieval: Persian | 62.65 | 63.27 | 55.06 | 55.54 | 56.09 | 54.64 |
| Crossmodal-3600 Text-Image Retrieval: Finnish | 72.96 | 72.06 | 59.11 | 58.61 | 56.24 | 56.42 |
| Crossmodal-3600 Text-Image Retrieval: Filipino | 90.89 | 83.32 | 83.98 | 78.41 | 83.70 | 74.94 |
| Crossmodal-3600 Text-Image Retrieval: French | 48.33 | 49.81 | 43.31 | 44.62 | 42.10 | 42.34 |
| Crossmodal-3600 Text-Image Retrieval: Hindi | 87.43 | 83.45 | 81.38 | 80.96 | 80.01 | 79.22 |
| Crossmodal-3600 Text-Image Retrieval: Croatian | 66.68 | 65.73 | 54.42 | 56.10 | 54.22 | 53.60 |
| Crossmodal-3600 Text-Image Retrieval: Hungarian | 66.49 | 66.66 | 53.73 | 54.57 | 50.75 | 51.16 |
| Crossmodal-3600 Text-Image Retrieval: Indonesian | 50.28 | 49.62 | 44.05 | 44.58 | 43.97 | 44.30 |
| Crossmodal-3600 Text-Image Retrieval: Italian | 47.96 | 49.51 | 42.80 | 45.41 | 42.60 | 42.66 |
| Crossmodal-3600 Text-Image Retrieval: Hebrew | 69.11 | 60.25 | 56.25 | 55.62 | 54.14 | 51.65 |
| Crossmodal-3600 Text-Image Retrieval: Japanese | 69.56 | 71.62 | 62.34 | 63.34 | 58.44 | 61.42 |
| Crossmodal-3600 Text-Image Retrieval: Korean | 64.52 | 64.72 | 56.76 | 57.83 | 56.51 | 57.58 |
| Crossmodal-3600 Text-Image Retrieval: Maori | 99.73 | 97.92 | 99.56 | 96.30 | 99.62 | 96.19 |
| Crossmodal-3600 Text-Image Retrieval: Dutch | 57.41 | 58.78 | 52.02 | 53.88 | 51.48 | 51.82 |
| Crossmodal-3600 Text-Image Retrieval: Norwegian | 61.54 | 61.46 | 53.81 | 54.35 | 52.99 | 53.50 |
| Crossmodal-3600 Text-Image Retrieval: Polish | 56.06 | 56.43 | 47.92 | 49.96 | 47.09 | 47.16 |
| Crossmodal-3600 Text-Image Retrieval: Portuguese | 54.54 | 54.07 | 49.48 | 51.03 | 48.72 | 48.34 |
| Crossmodal-3600 Text-Image Retrieval: Quechua | 97.88 | 97.89 | 98.14 | 98.03 | 98.04 | 97.88 |
| Crossmodal-3600 Text-Image Retrieval: Romanian | 65.20 | 65.55 | 54.05 | 54.79 | 52.41 | 51.93 |
| Crossmodal-3600 Text-Image Retrieval: Russian | 53.47 | 53.75 | 47.58 | 48.43 | 45.36 | 46.83 |
| Crossmodal-3600 Text-Image Retrieval: Swedish | 58.78 | 59.12 | 50.72 | 52.50 | 51.82 | 50.97 |
| Crossmodal-3600 Text-Image Retrieval: Swahili | 94.55 | 84.91 | 90.09 | 80.20 | 89.57 | 78.20 |
| Crossmodal-3600 Text-Image Retrieval: Telugu | 97.76 | 87.85 | 87.47 | 82.04 | 83.03 | 80.15 |
| Crossmodal-3600 Text-Image Retrieval: Thai | 81.83 | 80.83 | 74.60 | 75.72 | 73.67 | 75.03 |
| Crossmodal-3600 Text-Image Retrieval: Turkish | 65.21 | 64.41 | 55.12 | 58.01 | 56.70 | 56.82 |
| Crossmodal-3600 Text-Image Retrieval: Ukrainian | 68.84 | 68.01 | 57.74 | 59.49 | 55.32 | 57.30 |
| Crossmodal-3600 Text-Image Retrieval: Vietnamese | 61.84 | 61.28 | 54.00 | 55.01 | 53.39 | 53.51 |
| Crossmodal-3600 Text-Image Retrieval: Chinese | 64.87 | 65.56 | 59.03 | 61.21 | 57.33 | 59.49 |
| Average Western 0-shot Classification | 23.54 | 23.12 | 21.97 | 22.18 | 20.44 | 21.18 |
| Average Western 10-shot Classification | 23.62 | 24.18 | 23.59 | 24.34 | 23.10 | 23.99 |
| Average Western 0-shot Retrieval | 44.55 | 45.65 | 39.83 | 41.70 | 39.23 | 39.44 |
| Average Western Classification | 23.60 | 23.89 | 23.15 | 23.75 | 22.37 | 23.22 |
| Average Dollar Street Classification | 56.89 | 58.10 | 56.09 | 57.46 | 53.66 | 54.33 |
| Average GeoDE Classification | 40.72 | 39.94 | 40.60 | 39.93 | 37.01 | 35.60 |
| Average Income Classification | 50.22 | 51.15 | 48.09 | 49.41 | 49.02 | 49.23 |
| Average Geographic Classification | 5.95 | 5.97 | 5.84 | 5.93 | 4.83 | 5.37 |
| Average Demography Classification | 24.91 | 27.14 | 26.47 | 25.93 | 25.50 | 26.19 |
| Average Multilingual: Low-Resource Lang | 87.73 | 78.82 | 77.14 | 72.04 | 75.01 | 70.10 |
| Average Multilingual: High-Resource Lang | 55.54 | 56.21 | 46.75 | 47.53 | 45.43 | 45.75 |
| Average Western-centric | 29.19 | 29.69 | 27.60 | 28.54 | 26.87 | 27.55 |
| Average Cultural Diversity | 47.72 | 47.97 | 46.72 | 47.07 | 44.01 | 43.29 |
| Average Fairness | 23.87 | 24.56 | 23.36 | 23.76 | 23.01 | 23.46 |
| Average Multilinguality | 57.52 | 56.64 | 47.23 | 46.43 | 45.40 | 44.61 |

F. Distribution of Languages

We reuse the 35 languages¹⁰ reported in the Crossmodal-3600 benchmark [65] for multilingual experiments.

Table 12 | Distribution of the 35 languages used in multilingual evaluations.
| Language | Type | Pages (%) |
|---|---|---|
| Maori | Low-resource | 0.001 |
| Telugu | Low-resource | 0.036 |
| Swahili | Low-resource | 0.046 |
| Filipino | Low-resource | 0.111 |
| Bengali | Low-resource | 0.113 |
| Hebrew | Low-resource | 0.240 |
| Hindi | Low-resource | 0.267 |
| Croatian | High-resource | 0.284 |
| Norwegian | High-resource | 0.290 |
| Finnish | High-resource | 0.296 |
| Danish | High-resource | 0.370 |
| Hungarian | High-resource | 0.378 |
| Ukrainian | High-resource | 0.476 |
| Romanian | High-resource | 0.489 |
| Greek | High-resource | 0.560 |
| Swedish | High-resource | 0.660 |
| Czech | High-resource | 0.727 |
| Persian | High-resource | 0.881 |
| Thai | High-resource | 1.167 |
| Dutch | High-resource | 1.173 |
| Arabic | High-resource | 1.258 |
| Vietnamese | High-resource | 1.337 |
| Turkish | High-resource | 1.554 |
| Polish | High-resource | 1.825 |
| Italian | High-resource | 1.964 |
| Korean | High-resource | 2.519 |
| Portuguese | High-resource | 3.054 |
| Indonesian | High-resource | 3.181 |
| French | High-resource | 3.354 |
| Chinese | High-resource | 3.544 |
| German | High-resource | 3.869 |
| Russian | High-resource | 6.981 |
| Spanish | High-resource | 8.214 |
| Japanese | High-resource | 8.752 |
| English | High-resource | 35.353 |
| Low-resource All | Low-resource | 0.814 |
| High-resource All | High-resource | 94.510 |

¹⁰ “Quechua” is excluded as it is not supported by the language detection method we used.

[Figure 6: bar chart over the 35 languages of Table 12, in the same order (Maori through English), each marked “L” or “H”; axis ticks run from 0% to 40%.]

Figure 6 | Visualization of the language distribution, where “L” and “H” denote low-resource and high-resource language respectively.
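As a quick consistency check on Table 12, the following minimal sketch sums the per-language page shares within each group and compares them with the “Low-resource All” and “High-resource All” rows. The numbers are copied from the table; the variable and function names are illustrative and do not come from the paper.

```python
# Per-language share of pages (%), copied from Table 12.
# Names below (LOW_RESOURCE, HIGH_RESOURCE, group_share) are illustrative only.
LOW_RESOURCE = {
    "Maori": 0.001, "Telugu": 0.036, "Swahili": 0.046, "Filipino": 0.111,
    "Bengali": 0.113, "Hebrew": 0.240, "Hindi": 0.267,
}
HIGH_RESOURCE = {
    "Croatian": 0.284, "Norwegian": 0.290, "Finnish": 0.296, "Danish": 0.370,
    "Hungarian": 0.378, "Ukrainian": 0.476, "Romanian": 0.489, "Greek": 0.560,
    "Swedish": 0.660, "Czech": 0.727, "Persian": 0.881, "Thai": 1.167,
    "Dutch": 1.173, "Arabic": 1.258, "Vietnamese": 1.337, "Turkish": 1.554,
    "Polish": 1.825, "Italian": 1.964, "Korean": 2.519, "Portuguese": 3.054,
    "Indonesian": 3.181, "French": 3.354, "Chinese": 3.544, "German": 3.869,
    "Russian": 6.981, "Spanish": 8.214, "Japanese": 8.752, "English": 35.353,
}


def group_share(pages: dict[str, float]) -> float:
    """Total share of pages (%) for a group, rounded to the table's precision."""
    return round(sum(pages.values()), 3)


low_share = group_share(LOW_RESOURCE)
high_share = group_share(HIGH_RESOURCE)

print(f"Low-resource All:  {low_share:.3f}%")
print(f"High-resource All: {high_share:.3f}%")
```

Under these assumptions the two totals agree with the summary rows of Table 12 (0.814 and 94.510), which confirms that the seven languages with the smallest shares form the low-resource group.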