The Platonic Representation Hypothesis

Minyoung Huh, Brian Cheung, Tongzhou Wang, Phillip Isola

Abstract

We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in a more and more alike way. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato's concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, we discuss the implications of these trends, their limitations, and counterexamples to our analysis.

Project Page: phillipi.github.io/prh
Code: github.com/minyoungg/platonic-rep

Keywords: Machine Learning, Representation, Artificial Intelligence, Multimodality

Contents

1 Introduction
2 Representations are converging
  Preliminaries
  2.1 Different models, with different architectures and objectives, can have aligned representations
  2.2 Alignment increases with scale and performance
  2.3 Representations are converging across modalities
  2.4 Models are increasingly aligning to brains
  2.5 Does alignment predict downstream performance?
3 Why are representations converging?
  3.1 Convergence via Task Generality
  3.2 Convergence via Model Capacity
  3.3 Convergence via Simplicity Bias
4 What representation are we converging to?
  4.1 An idealized world
  4.2 A family of contrastive learners converge to a representation of $\mathbb{P}(\mathbf{Z})$
  A study in color
5 What are the implications of convergence?
  Scaling is sufficient, but not necessarily efficient
  Training data can be shared across modalities
  Ease of translation and adaptation across modalities
  Scaling may reduce hallucination and bias
6 Counterexamples and limitations
  Different modalities may contain different information
  Not all representations are presently converging
  Sociological bias in producing AI models
  Special-purpose intelligences might not converge
  How do we measure alignment?
  Lots left to explain
A Mutual $k$-Nearest Neighbor Alignment Metric
  The choice to use mutual nearest-neighbors
  Relationship between CKA and Mutual Nearest-Neighbors
B Consistency across various metrics
  Vision-vision comparison
  Cross-modal comparison
C Experiments on Evaluating Alignment and Convergence
  C.1 Vision-Vision Alignment and Representation Quality
  C.2 Cross-Modal Alignment
D Color Cooccurrence Experiment
  Perceptual representation from CIELAB color space
  Three representations from cooccurrence in VISION and LANGUAGE
E Caption Density Experiments
F Analysis of Contrastive Learners
  F.1 Contrastive objectives learn pointwise mutual information
  F.2 Contrastive learners can represent $K_{\mathsf{PMI}}$ exactly under smoothness conditions

1 Introduction

AI systems are rapidly evolving into highly multifunctional entities.
For example, whereas in the past we had special-purpose solutions for different language processing tasks (e.g., sentiment analysis, parsing, dialogue), modern large language models (LLMs) are competent at all these tasks using a single set of weights (Srivastava et al., 2022). Unified systems are also being built across data modalities: instead of using a different architecture for processing images versus text, recent models, such as GPT4-V (OpenAI, 2023), Gemini (Google, 2023), and LLaVA (Liu et al., 2023), handle both modalities with a combined architecture. More and more systems are built off of general-purpose pretrained backbones, sometimes called foundation models (Bommasani et al., 2021), that support a large range of tasks, including robotics (Driess et al., 2023; Brohan et al., 2023), bioinformatics (Ma et al., 2024), and healthcare (Steinberg et al., 2021). In short, AI systems are becoming increasingly homogeneous in both their architectures and their capabilities.

The Platonic Representation Hypothesis: Neural networks, trained with different objectives on different data and modalities, are converging to a shared statistical model of reality in their representation spaces.

Figure 1: The Platonic Representation Hypothesis: Images ($X$) and text ($Y$) are projections of a common underlying reality ($Z$). We conjecture that representation learning algorithms will converge on a shared representation of $Z$, and scaling model size, as well as data and task diversity, drives this convergence.

This paper explores one aspect of this trend: representational convergence. We argue that there is a growing similarity in how datapoints are represented in different neural network models. This similarity spans across different model architectures, training objectives, and even data modalities. What has led to this convergence? Will it continue? And ultimately, where does it end?

Our central hypothesis, stated above in Figure 1, is that there is indeed an endpoint to this convergence and a principle that drives it: different models are all trying to arrive at a representation of reality, meaning a representation of the joint distribution over events in the world that generate the data we observe. Figure 1 conveys this hypothesis: there exists a real world (labeled $Z$), which we measure with various sensors, such as the camera shown to the left ($X$). Other projections of these measurements, such as the textual description shown, can be produced from the first set of measurements or mediated by some other set of measurements, e.g., touch or other camera views (dotted arrow from $X$ to $Y$).[1] Representation learning algorithms find vector embeddings that statistically model the various measurements and projections. The resulting vector embeddings are all derived from the underlying reality in $Z$ and thereby become aligned. As models are trained on more data and for more tasks, they require representations that capture more and more information about $Z$, and hence alignment toward $Z$ increases toward a convergent point as a function of scale.

[1] Touch could convey the shapes in this example but not the colors. This is an important limitation to our hypothesis that we discuss at several points in the paper: different sensors and views might capture different information, which may limit their potential to converge to identical representations.
We call this converged hypothetical representation the "platonic representation" in reference to Plato's Allegory of the Cave (Plato, c. 375 BC), and his idea of an ideal reality that underlies our sensations. The training data for our algorithms are shadows on the cave wall, yet, we hypothesize, models are recovering ever better representations of the actual world outside the cave. This idea is not unique to Plato; our hypothesis is also related to the notion of "convergent realism" (Newton-Smith, 1981; Putnam, 1982; Doppelt, 2007; Hardin & Rosenberg, 1982) in the philosophy of science (i.e., that science is converging on truth), and to many arguments that have been put forth in the representation learning literature (e.g., Tian et al. (2020a); Zimmermann et al. (2021); Richens & Everitt (2024); Cao & Yamins (2024)). Also closely related to our hypothesis is the "Anna Karenina scenario" described by Bansal et al. (2021), referring to the possibility that all well-performing neural nets represent the world in the same way. We discuss the evidence they give for this possibility in Section 2.[2] The platonic representation hypothesis refers to the situation where we are in an Anna Karenina scenario and the "happy representation" that is converged upon is one that reflects a statistical model of the underlying reality. We discuss the potential nature of this statistical model in more detail in Section 4.

[2] Borrowed from Tolstoy (1877), similar analogies have been made in other domains, such as the "Anna Karenina principle" popularized by Diamond (1998) to explain animal domestication.

2 Representations are converging

Preliminaries

We restrict our attention to representations that are vector embeddings. We characterize such a representation by the similarity structure it induces, referred to as its kernel. Kernels are commonly used to assess representations (Kornblith et al., 2019; Klabunde et al., 2023); this can be justified by the fact that they capture the relative structures among data samples, which are also the learning signal for many machine learning algorithms (Aronszajn, 1950; Smola & Schölkopf, 1998). Following prior literature, we define representational alignment as a measure of the similarity of the similarity structures induced by two representations, i.e., a similarity metric over kernels. We give the mathematical definition of these concepts below:

• A representation is a function $f\colon \mathcal{X} \rightarrow \mathbb{R}^n$ that assigns a feature vector to each input in some data domain $\mathcal{X}$.
• A kernel, $K\colon \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$, characterizes how a representation measures distance/similarity between datapoints: $K(x_i, x_j) = \langle f(x_i), f(x_j) \rangle$, where $\langle \cdot, \cdot \rangle$ denotes inner product, $x_i, x_j \in \mathcal{X}$, and $K \in \mathcal{K}$.

• A kernel-alignment metric, $m\colon \mathcal{K} \times \mathcal{K} \rightarrow \mathbb{R}$, measures the similarity between two kernels, i.e., how similar the distance measure induced by one representation is to the distance measure induced by another. Examples include Centered Kernel Alignment (CKA) (Kornblith et al., 2019), SVCCA (Raghu et al., 2017), and nearest-neighbor metrics (Klabunde et al., 2023).

In our experiments, we use a mutual nearest-neighbor metric that measures the mean intersection of the $k$-nearest neighbor sets induced by two kernels, $K_1$ and $K_2$, normalized by $k$. This metric is a variant of those proposed in Park et al. (2024), Klabunde et al. (2023), and Oron et al. (2017). See Appendix A for the exact definition and Appendix B for comparisons with alternative alignment metrics.
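To make this concrete, here is a minimal NumPy sketch of such a mutual $k$-nearest-neighbor metric. The exact definition we use is given in Appendix A; this is an illustrative version rather than the paper's implementation:

```python
import numpy as np

def mutual_knn_alignment(feats_a, feats_b, k=10):
    """Mean intersection of the k-NN sets induced by two representations,
    normalized by k. feats_a, feats_b: (n, d_a) and (n, d_b) arrays of
    features for the same n inputs (or paired inputs across modalities)."""
    def knn_sets(feats):
        feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        sim = feats @ feats.T                    # inner-product kernel K
        np.fill_diagonal(sim, -np.inf)           # exclude self-matches
        return np.argsort(-sim, axis=1)[:, :k]   # k nearest neighbors per row

    knn_a, knn_b = knn_sets(feats_a), knn_sets(feats_b)
    overlap = [len(set(a) & set(b)) for a, b in zip(knn_a, knn_b)]
    return np.mean(overlap) / k
```

The score is 1 when both representations induce identical neighborhoods and near 0 when the neighborhoods are unrelated.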
Next, we explore several ways in which representations are converging. First, we argue that different neural networks are converging to aligned representations. Then, we show that this continues to hold across modalities, where image embeddings in vision models align with text embeddings in language models.

Figure 2: VISION models converge as COMPETENCE increases: We measure alignment among 78 models using mutual nearest-neighbors on Places-365 (Zhou et al., 2017), and evaluate their performance on downstream tasks from the Visual Task Adaptation Benchmark (VTAB; Zhai et al. (2019)). LEFT: Models that solve more VTAB tasks tend to be more aligned with each other. Error bars show standard error. RIGHT: We use UMAP to embed models into a 2D space, based on $\mathsf{distance} \triangleq -\log(\mathsf{alignment})$. More competent and general models (blue) have more similar representations.

2.1 Different models, with different architectures and objectives, can have aligned representations

One indication of representational convergence is the rising number of systems built on top of pre-trained foundation models. These models are becoming standard backbones across a growing spectrum of tasks. Their versatility across numerous applications implies a level of universality in the way they represent data.

While this trend implies convergence toward a relatively small set of foundation models, it does not imply that different foundation models will arrive at the same representation. Yet that is what has been observed by several recent papers. Lenc & Vedaldi (2015) conducted one such study, in which they measured representational similarity through a technique called model stitching. Given two models, $f$ and $g$, each composed of multiple layers ($f = f_1 \circ \cdots \circ f_n$, $g = g_1 \circ \cdots \circ g_m$), an intermediate representation from $f$ is integrated into $g$ via a learned affine stitching layer $h$, resulting in a new stitched model $F = f_1 \circ \cdots \circ f_k \circ h \circ g_{k+1} \circ \cdots \circ g_m$. If $F$ has good performance, it indicates that $f$ and $g$ have compatible representations at layer $k$, up to the transform $h$.

In their study, Lenc & Vedaldi (2015) made two notable findings: (1) a vision model trained on ImageNet (Russakovsky et al., 2015) can be aligned with a model trained on Places-365 (Zhou et al., 2017) while maintaining good performance; (2) the early layers of these convolutional networks are more interchangeable than later layers. The first finding illustrates a level of data independence where distinct image datasets lead to similar representations. The second finding agrees with extensive research that oriented Gabor-like filters are common in both artificial and biological vision systems, suggesting a convergence to a similar initial layer of representation across various neural network architectures (Olshausen & Field, 1996; Krizhevsky et al., 2017).

Bansal et al. (2021) expanded on the idea of model stitching, showing that models trained using self-supervised objectives align closely with their supervised counterparts. Moschella et al. (2022) further demonstrated the feasibility of "zero-shot" model stitching without learning a stitching layer. Despite the fact that different text models were trained on different data, they found that the models often embed data in remarkably similar ways. In particular, they considered the kernel $K$ defined by learned representations and showed that $K$ serves as a bridge between models, allowing an encoder trained in one language, like English, to work effectively with a decoder in another, like French. Dravid et al. (2023) extended this idea to individual neurons, and found "Rosetta Neurons" that are activated by the same pattern across a range of vision models. Such neurons form a common dictionary independently discovered by all models.
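To ground the stitching construction, here is a minimal sketch assuming PyTorch and models whose layers are exposed as sequential lists; `dim_f` and `dim_g` are the (hypothetical) feature dimensions at the stitch point, and for spatial convolutional features the affine layer would typically be a 1x1 convolution instead of a linear map:

```python
import torch.nn as nn

class StitchedModel(nn.Module):
    """F = f_1, ..., f_k, then affine h, then g_{k+1}, ..., g_m
    (layers listed in order of application)."""
    def __init__(self, f_layers, g_layers, k, dim_f, dim_g):
        super().__init__()
        self.front = nn.Sequential(*f_layers[:k])   # first k layers of f, frozen
        self.stitch = nn.Linear(dim_f, dim_g)       # learned affine stitching layer h
        self.back = nn.Sequential(*g_layers[k:])    # remaining layers of g, frozen
        for p in list(self.front.parameters()) + list(self.back.parameters()):
            p.requires_grad_(False)                 # only h is trained

    def forward(self, x):
        return self.back(self.stitch(self.front(x)))
```

Good performance of the stitched model, with only $h$ trained, is the evidence of representational compatibility at layer $k$.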
Figure 3: LANGUAGE and VISION models align: We measure alignment using mutual nearest-neighbors on the Wikipedia caption dataset (WIT) (Srinivasan et al., 2021). The x-axis is the language model performance measured over 4M tokens from the OpenWebText dataset (Gokaslan & Cohen, 2019) (see Appendix B for plots with model names). We measure performance using $1 - \texttt{bits-per-byte}$, where bits-per-byte normalizes the cross-entropy by the total bytes in the input text string. The results show a linear relationship between language-vision alignment and language modeling score, where a general trend is that more capable language models align better with more capable vision models. We find that CLIP models, which are trained with explicit language supervision, exhibit a higher level of alignment. However, this alignment decreases after being fine-tuned on ImageNet classification (labeled CLIP (I12K ft)).

2.2 Alignment increases with scale and performance

Kornblith et al. (2019) and Roeder et al. (2021) observed that model alignment not only exists but also increases with model scale and dataset size. On CIFAR-10 classification (Krizhevsky et al., 2009), larger models exhibit greater alignment with each other compared to smaller ones. Theoretically, Balestriero & Baraniuk (2018) showed that models with similar outputs (e.g., as a result of having high performance) also have similar internal activations. With the continuing trend of models scaling up, this suggests model alignment will increase over time: we might expect that the next generation of bigger, better models will be even more aligned with each other.

We expand upon this observation by evaluating the transfer performance of 78 vision models. These models were trained with varying architectures, training objectives, and datasets (detailed in Section C.1). In Figure 2 (left), we bin these models based on their average transfer performance on the VTAB dataset (Zhai et al., 2019), and then measure the average kernel alignment of the models within each bin. The results indicate that models with high transfer performance form a tightly clustered set of representations, while models with weak performance have more variable representations. We further visualize this structure with UMAP (McInnes et al., 2018) over model representations in Figure 2 (right). This suggests that models that are competent all represent data in a similar way. Echoing Bansal et al. (2021) and Tolstoy (1877), we might say: all strong models are alike, each weak model is weak in its own way.

The discussion so far indicates that various models are aligning toward a unified representation. But does the convergence extend to model weights? While models with different architectures might not have compatible weight spaces, there exists ample evidence that models with the same architecture will often converge to the same basin of weights (Nagarajan & Kolter, 2019; Garipov et al., 2018; Lubana et al., 2023). This holds even for models with different initializations, up to permutations over weight space (Ainsworth et al., 2022). Because of this, it is possible to merge separately trained models of the same architecture, and achieve some of the capabilities of all models in the mixture (Stoica et al., 2023; Jordan et al., 2022; Wortsman et al., 2022).
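As a minimal illustration of such merging, here is a sketch of plain weight averaging in the spirit of model soups (Wortsman et al., 2022); note that for independently initialized models, methods like Ainsworth et al. (2022) first align the weight spaces by permuting neurons, a step omitted here:

```python
import torch

def average_weights(state_dict_a, state_dict_b, alpha=0.5):
    """Interpolate two same-architecture checkpoints parameter by parameter."""
    return {name: alpha * state_dict_a[name] + (1 - alpha) * state_dict_b[name]
            for name in state_dict_a}

# merged = average_weights(model_a.state_dict(), model_b.state_dict())
# model_a.load_state_dict(merged)
```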
Figure 4: Alignment predicts downstream performance: We visualize the correlation between an LLM's alignment score to DINOv2 (Oquab et al., 2023) and its downstream task performance on Hellaswag (common-sense) (Zellers et al., 2019) and GSM8K (math) (Cobbe et al., 2021). LLMs are plotted with radii proportional to the size of the model, and color-coded by their rank order in language modeling scores ($1 - \texttt{bits-per-byte}$). We observe that models aligned more closely with vision also show better performance on downstream language tasks. For Hellaswag, there is a linear relationship with alignment score, while GSM8K exhibits an "emergence"-esque trend.

2.3 Representations are converging across modalities

Do models trained on different data modalities also converge? Several works indicate that the answer is yes. Merullo et al. (2022) extended model stitching to the cross-modal setting, finding that a single linear projection is sufficient to stitch a vision model to an LLM and achieve good performance on visual question answering and image captioning. Koh et al. (2023) showed that linear stitching can also work in the opposite direction, aligning text inputs to visual outputs. In fact, many recent language-vision models stitch pre-trained language and vision models together. For example, LLaVA (Liu et al., 2023) demonstrated state-of-the-art results by projecting visual features into a language model with a 2-layer MLP.

Other works show further kinds of evidence of cross-modal synergy. OpenAI (2023) found that jointly training a language model with a vision model improves performance on language tasks, compared to training the language model on its own. Sorscher et al. (2022) show a setting in which word embeddings of visual concept names can be isometrically mapped to image embeddings for those same concepts. In work concurrent to ours, Maniparambil et al. (2024) show that vision encoders well-trained on large datasets exhibit high semantic similarity with language encoders regardless of the training paradigm (supervised, self-supervised, or language-supervised). Sharma et al. (2024) probed the visual knowledge of LLMs trained only on language data, by converting images into code that an LLM can process. They found that LLMs have rich knowledge of visual structures, to the extent that decent visual representations can be trained on images generated solely by querying an LLM to produce code and rendering the response. In visual generation, LLMs show abilities to augment captions with visual structures (e.g., bounding boxes) and improve generation quality (Betker et al., 2023; Lian et al., 2023a, b; Wu et al., 2023). Over other modalities, Ngo & Kim (2024) showed auditory models are also roughly aligned with LLMs up to a linear transformation, and Ng et al. (2023) demonstrated the effectiveness of using pre-trained LLMs for facial motion prediction.

We set out to address these claims in a broader scope to determine whether models are indeed learning an increasingly modality-agnostic representation of the world. We sampled a variety of models trained either solely on vision or solely on language, and compared their representations as they became larger and more competent over many tasks.

In Figure 3, we assess alignment between a suite of language models and vision models. So far we have only defined alignment for two kernels defined over the same input space. To measure cross-modal alignment, we use paired datasets to bridge the two modalities. For vision and text, we use the Wikipedia captions dataset $\{(x_i, y_i)\}_i$ (Srinivasan et al., 2021), composed of images from Wikipedia ($x_i$) and their corresponding captions ($y_i$).
We then measure alignment between a language model $f_{\texttt{text}}$ and a vision model $f_{\texttt{img}}$ as the alignment of the two following kernels:

$$K_{\texttt{img}}(i, j) = \langle f_{\texttt{img}}(x_i), f_{\texttt{img}}(x_j) \rangle \tag{1}$$
$$K_{\texttt{text}}(i, j) = \langle f_{\texttt{text}}(y_i), f_{\texttt{text}}(y_j) \rangle \tag{2}$$

Using this analysis, we find that the better an LLM is at language modeling, the more it tends to align with vision models, as shown in Figure 3. The converse effect also holds: the better a vision model is, the more it tends to align with LLMs. See Section C.2 for more details.
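A minimal sketch of this cross-modal measurement, reusing the `mutual_knn_alignment` sketch from the preliminaries; `vision_encoder` and `text_encoder` are hypothetical stand-ins for any pretrained feature extractors:

```python
import numpy as np

def cross_modal_alignment(images, captions, vision_encoder, text_encoder, k=10):
    # Paired data: images[i] and captions[i] come from the same Wikipedia entry.
    img_feats = np.stack([vision_encoder(x) for x in images])   # rows define K_img
    txt_feats = np.stack([text_encoder(y) for y in captions])   # rows define K_text
    return mutual_knn_alignment(img_feats, txt_feats, k=k)
```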
2.4 Models are increasingly aligning to brains

Neural networks also show substantial alignment with biological representations in the brain (Yamins et al., 2014). This commonality may be due to similarities in the task and data constraints both systems are confronted with. Even though the mediums may differ (silicon transistors versus biological neurons), the fundamental problem faced by brains and machines is the same: efficiently extracting and understanding the underlying structure in images, text, sounds, etc. (Barlow et al., 1961; Olshausen & Field, 1997). Sorscher et al. (2022) developed a theoretical framework for how the efficient extraction of novel concepts occurs for both the human visual system and deep networks. The tasks that the human visual system has been honed to perform through evolution, like segmentation, detection, and whole-image classification, are also the ones that we train our neural nets to perform. Yamins et al. (2014) went as far as to title their work in the spirit that performance over such tasks implies brain alignment. Antonello & Huth (2024) posited that it is less the particular task and more the generality of the representations that explains their alignment with biological representations. Further, Conwell et al. (2022) showed that training data plays a large role in alignment. Psychophysical studies have also shown agreement between how humans perceive visual similarity and how models do, even when the models are trained on tasks, such as self-supervised prediction, that are seemingly unrelated to mimicking human perception (Zhang et al., 2018).

2.5 Does alignment predict downstream performance?

If models are converging towards a more accurate representation of reality, we expect that alignment should correspond to improved performance on downstream tasks. Figure 4 supports this hypothesis by demonstrating improved performance on commonsense reasoning (Hellaswag; Zellers et al. (2019)) and mathematical problem solving (GSM8K; Cobbe et al. (2021)) as alignment improves.

3 Why are representations converging?

Modern machine learning models are generally trained to minimize the empirical risk with possible implicit and/or explicit regularization:

$$\underbrace{f^*}_{\textsf{trained model}} = \operatorname*{arg\,min}_{f \in \underbrace{\mathcal{F}}_{\textsf{function class}}} \; \mathbb{E}_{x \sim \mathsf{dataset}} \big[ \underbrace{\mathcal{L}}_{\textsf{training objective}}(f, x) \big] + \underbrace{\mathcal{R}}_{\textsf{regularization}}(f)$$

In the following sections, we lay out how each component in this optimization process (the function class $\mathcal{F}$, the dataset, the training objective $\mathcal{L}$, and the regularization $\mathcal{R}$) potentially plays a role in facilitating representational convergence.
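As a concrete referent for these components, here is a schematic training step, assuming a PyTorch-style model, a task loss standing in for $\mathcal{L}$, and an explicit L2 weight penalty standing in for $\mathcal{R}$; this is illustrative, not the paper's code:

```python
import torch

def train_step(model, x, loss_fn, optimizer, reg_weight=1e-4):
    # model's architecture fixes the function class F; loss_fn is L(f, x)
    loss = loss_fn(model, x)
    # explicit regularizer R(f); implicit regularization (e.g., from SGD
    # dynamics) also acts here but has no corresponding line of code
    reg = reg_weight * sum(p.pow(2).sum() for p in model.parameters())
    optimizer.zero_grad()
    (loss + reg).backward()
    optimizer.step()
    return float(loss)
```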
Figure 5: The Capacity Hypothesis: If an optimal representation exists in function space, larger hypothesis spaces are more likely to cover it. LEFT: Two small models might not cover the optimum and thus find different solutions (marked by the outlined ★). RIGHT: As the models become larger, they cover the optimum and converge to the same solution (marked by the filled ★).

Figure 6: The Multitask Scaling Hypothesis: Models trained with an increasing number of tasks are subjected to pressure to learn a representation that can solve all the tasks.

3.1 Convergence via Task Generality

Each training datapoint and objective (task) places an additional constraint on the model. As data and tasks scale, the volume of representations that satisfy these constraints must proportionately grow smaller, as visualized in Figure 6 and stated below:

The Multitask Scaling Hypothesis: There are fewer representations that are competent for $N$ tasks than there are for $M < N$ tasks.

Using prompting showed similar trends to average pooling but had slightly lower alignment scores.

Appendix D Color Cooccurrence Experiment

Here we describe the details of how we created the four color representations visualized in Figure 8, from left to right.

Perceptual representation from CIELAB color space

We embed pixels taken from the CIFAR-10 image dataset (Krizhevsky et al., 2009; Torralba et al., 2008) based on the CIELAB color space, which is designed as a perceptually uniform space in which equal numerical changes correspond to similar perceived changes in color.

Three representations from cooccurrence in VISION and LANGUAGE

For these three representations, we first obtain a dissimilarity matrix over colors (in different ways detailed below), then use multidimensional scaling (Shepard, 1980) to find a 3-dimensional embedding in which the Euclidean distance between the embeddings for $A$ and $B$, $z_A$ and $z_B$, best matches this dissimilarity matrix. We use 1,000 fits and take the best match. Afterward, we visually align it with the CIELAB space by finding the best rotation, translation, scaling, and flipping, by running the Kabsch-Umeyama algorithm (Kabsch, 1976, 1978; Umeyama, 1991) twice, once on $\mathbf{z}$ and once on $-\mathbf{z}$, to account for flipping. The dissimilarity matrix we used in each case is described as follows:

• VISION: Pixel cooccurrence. We collect color cooccurrence statistics from the CIFAR-10 dataset, and estimate a joint distribution $p(A, B)$ over 300,000 randomly sampled pixel colors $A$ and $B$ that occur within a radius of at most 4 pixels of one another. Colors are quantized on a grid in RGB space and represented as discrete variables, and $p(A, B)$ is modeled as a table of normalized counts, from which we compute the empirical pointwise mutual information matrix $K_{\mathsf{PMI}}(A, B)$. Quantization ensures that there is no bias from how color distances are represented in RGB space. The dissimilarity matrix is defined as $-K_{\mathsf{PMI}}(A, B) + c$, where $c = \max_{A,B} K_{\mathsf{PMI}}(A, B)$ is an offset to ensure non-negativity (similar to the constant in Section 4.2 and Proposition F.1 that ensures neural networks can express $K_{\mathsf{PMI}}$). A code sketch of this pipeline follows the list below.

• LANGUAGE: We used an approach similar to Abdou et al. (2021).

– We take 20 pairs of (color, word) that appeared in the dataset collected by Lindsey & Brown (2014), where 51 participants were asked to freely name each of the 330 colors from the Munsell Color Chart. We filtered out words that appeared fewer than 100 times, and computed each word's associated color by taking the centroid in CIELAB space. Our filtering process followed Abdou et al. (2021) exactly, but resulted in 20 colors, a slightly different set than the 18 colors they claimed.

– For each of the 20 color words, we construct three sentences: "The color [word].", "This color is [word].", and "The color of this thing is [word].", and obtain the average sentence embedding from the language encoder as the embedding for that word (details below). We find this approach more effective than that of Abdou et al. (2021), which uses object names that potentially have color biases, even though the objects may appear in multiple colors.

– Unlike Abdou et al. (2021), we did not perform linear regression from language embedding to CIELAB space, which distorts distances and easily overfits with only 20 samples. Instead, we used multidimensional scaling to best preserve distances, as described above.

– Masked language contrastive learning (SimCSE) embedding: We used sentence embeddings from the unsupervised SimCSE RoBERTa-L (Gao et al., 2021) to encode the above sentences into 1024-dimensional embeddings, and used the pairwise Euclidean distances among embeddings as the dissimilarity matrix.

– Masked language predictive learning (RoBERTa) embedding: We concatenated the hidden states of the last four layers of RoBERTa-L (Liu et al., 2019), following Devlin et al. (2018). We averaged across token dimensions to obtain a 4096-dimensional embedding for each of the above sentences, and used the pairwise Euclidean distances among embeddings as the dissimilarity matrix.
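A minimal sketch of the VISION pipeline described above: estimate the joint $p(A, B)$ from co-occurrence counts, compute pointwise mutual information, and embed colors with MDS on the dissimilarity $-K_{\mathsf{PMI}} + c$. The smoothing constant and the omitted Kabsch-Umeyama alignment step are simplifications:

```python
import numpy as np
from sklearn.manifold import MDS

def pmi_color_embedding(pair_counts, n_dims=3, eps=1e-8):
    """pair_counts: symmetric (C, C) array of co-occurrence counts over
    C quantized colors; eps smooths empty cells (an assumption here)."""
    counts = pair_counts + eps
    p_ab = counts / counts.sum()                    # joint p(A, B)
    p_a = p_ab.sum(axis=1, keepdims=True)           # marginals
    pmi = np.log(p_ab) - np.log(p_a) - np.log(p_a.T)  # K_PMI(A, B)
    dissim = -pmi + pmi.max()                       # offset c for non-negativity
    mds = MDS(n_components=n_dims, dissimilarity="precomputed")
    return mds.fit_transform(dissim)                # (C, 3) color embedding
```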
Appendix E Caption Density Experiments

We use LLaMA3-8B-Instruct (Meta, 2024) to generate summary captions at various densities for images in the Densely Captioned Images dataset (Urbanek et al., 2023) from the train split. Following Urbanek et al. (2023), we prompt the language model with the following instructions to generate captions at differing granularity:

system: You are given a full-text description of an image. You should summarize it into about [N] words, being sure to include as much salient visual information as possible given the word constraint, especially information from the start of the original description. The new description should apply for the original image. Respond with only the summary, in one line.

user: [full-text description of the image]

We measure the alignment with these generated captions to test our hypothesis that denser captions would result in higher alignment scores.
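A minimal sketch of this prompt construction; the word budget `n_words` fills the elided slot in the instruction above, and the message format is a generic chat-style interface rather than a specific API:

```python
def summarization_messages(full_description, n_words):
    """Build the Appendix E prompt for a summary of about n_words words."""
    system = (
        "You are given a full-text description of an image. You should "
        f"summarize it into about {n_words} words, being sure to include as much "
        "salient visual information as possible given the word constraint, "
        "especially information from the start of the original description. "
        "The new description should apply for the original image. "
        "Respond with only the summary, in one line."
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": full_description}]

# e.g., a 20-word-density caption: summarization_messages(description, 20)
```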
In Figure 9, we find that the alignment score also improves as caption length increases.

Appendix F Analysis of Contrastive Learners

F.1 Contrastive objectives learn pointwise mutual information

There are two widely used forms of contrastive objectives. We now discuss each form in detail and show how they both are minimized by the pointwise mutual information (PMI) as stated in Equation 5. To simplify notation, we consider learning the bivariate model $g(x_a, x_b) \in \mathbb{R}$. In Section 4, such $g$ is optimized within the family $\{g = \langle f_X, f_X \rangle \colon f_X \in \mathcal{F}_X\}$. Recall that our positive pairs are sampled from $(x, x_+) \sim P_{\mathsf{coor}}$, and that the negative pairs are sampled independently from its marginals, which we denote as $(x, x_-) \overset{\text{i.i.d.}}{\sim} P$ where $P(x) = \sum_{x_+} P_{\mathsf{coor}}(x, x_+)$.

1. The binary NCE loss (Gutmann & Hyvärinen, 2010) is defined with a certain prior over sampling positive vs. negative pairs. Let $p_{\mathsf{pos}}$ be the probability of sampling a positive pair. Then the loss is given by

$$\mathcal{L}_{\mathsf{binary\text{-}NCE}}(g) \triangleq p_{\mathsf{pos}} \cdot \mathbb{E}_{(x, x_+) \sim P_{\mathsf{coor}}}\left[-\log \sigma(g(x, x_+))\right] + (1 - p_{\mathsf{pos}}) \cdot \mathbb{E}_{(x, x_-) \overset{\text{i.i.d.}}{\sim} P}\left[-\log \sigma(-g(x, x_-))\right]. \tag{20}$$
The Bayes optimal solution is given by

$$\begin{aligned}
g(x_a, x_b) &= \log \frac{P(\texttt{pos} \mid x_a, x_b)}{1 - P(\texttt{pos} \mid x_a, x_b)} && (21)\\
&= \log \frac{P(\texttt{pos}, x_a, x_b)}{P(\texttt{neg}, x_a, x_b)} && (22)\\
&= \log \frac{p_{\mathsf{pos}} \cdot P_{\mathsf{coor}}(x_a, x_b)}{(1 - p_{\mathsf{pos}}) P(x_a) P(x_b)} && (23)\\
&= \log \frac{P_{\mathsf{coor}}(x_a, x_b)}{P(x_a) P(x_b)} + \log \frac{p_{\mathsf{pos}}}{1 - p_{\mathsf{pos}}} && (24)\\
&= K_{\mathsf{PMI}}(x_a, x_b) + c_X. && (25)
\end{aligned}$$

2. The InfoNCE loss (Oord et al., 2018) is defined with randomly sampling one positive pair along with $K$ negative ones.
2. The InfoNCE loss (Oord et al., 2018) is defined by randomly sampling one positive pair along with $K$ negative ones. With some hyperparameter $\tau > 0$, the loss is given by
\[
\mathcal{L}_{\mathsf{InfoNCE}}(g) \triangleq \mathbb{E}_{\substack{(x, x_+) \sim P_{\mathsf{coor}} \\ (x_-^{(1)}, x_-^{(2)}, \dots, x_-^{(K)}) \overset{\text{i.i.d.}}{\sim} P}} \left[-\log \frac{e^{g(x, x_+)/\tau}}{e^{g(x, x_+)/\tau} + \sum_{i=1}^{K} e^{g(x, x_-^{(i)})/\tau}}\right]. \tag{26}
\]
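In implementations, Equation 26 is a softmax cross-entropy over similarity logits, with the positive in slot 0. A minimal sketch (ours; the array shapes, the helper name `info_nce`, and the choice $g = \langle f(x), f(x') \rangle$ on precomputed embeddings are assumptions for illustration):

```python
# Minimal sketch (ours) of the InfoNCE loss in Eq. (26), with the critic
# g(x, x') = <f(x), f(x')> evaluated on precomputed embeddings.
import numpy as np

def info_nce(f_x, f_pos, f_neg, tau=0.07):  # tau = 0.07 is a common default
    """f_x, f_pos: (B, n) anchor/positive embeddings; f_neg: (B, K, n) negatives."""
    pos = np.einsum("bn,bn->b", f_x, f_pos) / tau         # g(x, x_+)/tau, shape (B,)
    neg = np.einsum("bn,bkn->bk", f_x, f_neg) / tau       # g(x, x_-^(i))/tau, (B, K)
    logits = np.concatenate([pos[:, None], neg], axis=1)  # (B, 1 + K)
    m = logits.max(axis=1)                                # stabilize the log-sum-exp
    lse = m + np.log(np.exp(logits - m[:, None]).sum(axis=1))
    return float(np.mean(lse - pos))                      # -log softmax at the positive

rng = np.random.default_rng(0)
B, K, n = 4, 16, 32
loss = info_nce(rng.standard_normal((B, n)), rng.standard_normal((B, n)),
                rng.standard_normal((B, K, n)))
```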
The Bayes optimal solution is given by
\begin{align}
\frac{e^{g(x, x_+)/\tau}}{e^{g(x, x_+)/\tau} + \sum_{i=1}^{K} e^{g(x, x_-^{(i)})/\tau}}
&= \frac{P_{\mathsf{coor}}(x_+ \mid x) \prod_j P(x_-^{(j)})}{P_{\mathsf{coor}}(x_+ \mid x) \prod_j P(x_-^{(j)}) + \sum_i P_{\mathsf{coor}}(x_-^{(i)} \mid x) \, P(x_+) \prod_{j \neq i} P(x_-^{(j)})} \tag{27} \\
&= \frac{P_{\mathsf{coor}}(x_+ \mid x) / P(x_+)}{P_{\mathsf{coor}}(x_+ \mid x) / P(x_+) + \sum_i P_{\mathsf{coor}}(x_-^{(i)} \mid x) / P(x_-^{(i)})}. \tag{28}
\end{align}
For $\tau = 1$, this optimum corresponds to choices of $g$ where
\begin{align}
g(x_a, x_b) &= \log \frac{P_{\mathsf{coor}}(x_b \mid x_a)}{P(x_b)} + c_X(x_a) \tag{29} \\
&= K_{\mathsf{PMI}}(x_a, x_b) + c_X(x_a). \tag{30}
\end{align}
For the general case of $\tau \neq 1$, $g$ (and the corresponding $f_X$) recovers $K_{\mathsf{PMI}}$ up to an offset and a scale, so our main argument in Section 4, that $f_X$ recovers $K_{\mathsf{PMI}}$, still holds.
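The per-anchor offset $c_X(x_a)$ in Equation 30 is a genuine degree of freedom: it cancels inside the softmax, so the InfoNCE objective cannot determine it. The following sketch (ours; the toy joint and the particular anchor/negative indices are illustrative) checks numerically that adding an arbitrary row offset to $K_{\mathsf{PMI}}$ still attains the Bayes-optimal posterior of Equation 28:

```python
# Sketch (ours): for tau = 1, plugging g = K_PMI + c(x_a) into the softmax of
# Eq. (26) reproduces the Bayes-optimal posterior of Eq. (28); the per-anchor
# offset c(x_a) cancels, which is why InfoNCE pins g down only up to c_X(x_a).
import numpy as np

rng = np.random.default_rng(1)
N = 6
P_coor = rng.random((N, N)) + 0.1
P_coor = (P_coor + P_coor.T) / 2
P_coor /= P_coor.sum()
P = P_coor.sum(axis=1)

K_pmi = np.log(P_coor / np.outer(P, P))
c = rng.standard_normal(N)                     # arbitrary per-anchor offset
g = K_pmi + c[:, None]

x, xp = 0, 1                                   # anchor and positive
xn = [2, 3, 4]                                 # K = 3 negatives
logits = np.array([g[x, xp]] + [g[x, j] for j in xn])
softmax = np.exp(logits) / np.exp(logits).sum()

ratio = lambda j: P_coor[x, j] / (P[x] * P[j])  # P_coor(x'|x)/P(x') = e^{K_PMI}
bayes = ratio(xp) / (ratio(xp) + sum(ratio(j) for j in xn))
print(softmax[0], bayes)                        # equal up to float error
```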
F.2 Contrastive learners can represent $K_{\mathsf{PMI}}$ exactly under smoothness conditions

We want to express $K_{\mathsf{PMI}} + C$ using some representation function $f_X \colon \mathcal{X} \rightarrow \mathbb{R}^n$ so that
\[
\langle f_X(x_a), f_X(x_b) \rangle = K_{\mathsf{PMI}}(x_a, x_b) + C, \qquad \text{for some } C. \tag{31}
\]
For such an $f_X$ to exist, an equivalent criterion is that $K_{\mathsf{PMI}} + C$ is positive semi-definite (PSD), as can be seen from an eigendecomposition.
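To make the eigendecomposition argument concrete, here is a short sketch (ours; the PSD matrix is synthetic rather than an actual PMI kernel) of how features realizing Equation 31 are read off the eigenvectors:

```python
# Sketch (ours): when a symmetric kernel matrix K (standing in for
# K_PMI + C on N points) is PSD, its eigendecomposition yields exact
# features with <f(x_i), f(x_j)> = K[i, j], the criterion of Eq. (31).
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
K = A @ A.T                                  # a synthetic PSD kernel matrix

evals, evecs = np.linalg.eigh(K)
assert evals.min() >= -1e-9                  # PSD check
F = evecs * np.sqrt(evals.clip(min=0))       # row i of F is the feature f(x_i)
print(np.abs(F @ F.T - K).max())             # ~1e-13: inner products recover K
```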
Proposition F.1. Suppose that the off-diagonal elements of $K_{\mathsf{PMI}}$ are bounded within $[\log \rho_{\mathsf{min}}, \log \rho_{\mathsf{min}} + \delta] \subset (-\infty, 0]$. Then $K_{\mathsf{PMI}} + C$ is positive semi-definite (PSD) for some $C$ if the joint distribution is sufficiently smooth:
\[
\frac{P_{\mathsf{coor}}(z_i \mid z_i)}{P_{\mathsf{coor}}(z_i)} \geq e^{N\delta} \rho_{\mathsf{min}}, \qquad \forall i. \tag{32}
\]

Proof. Note that $K_{\mathsf{PMI}} + C$ still has only non-positive off-diagonal elements if
\[
-C \geq \log \rho_{\mathsf{min}} + \delta. \tag{33}
\]
For such $C$, the matrix is diagonally dominant (and thus PSD) if
\[
\forall i, \qquad K_{\mathsf{PMI}}(z_i, z_i) + C \geq \sum_{j \neq i} \left\lvert K_{\mathsf{PMI}}(z_i, z_j) + C \right\rvert = -(N-1)C - \sum_{j \neq i} K_{\mathsf{PMI}}(z_i, z_j), \tag{34}
\]
or equivalently,
\[
\forall i, \qquad NC + \sum_j K_{\mathsf{PMI}}(z_i, z_j) \geq 0. \tag{35}
\]
The following choice of $C$ readily satisfies Equation 35:
\[
C \triangleq -\min_i \frac{1}{N} \sum_j K_{\mathsf{PMI}}(z_i, z_j). \tag{36}
\]
Therefore, it remains to show that Equation 33 holds. Note that
\[
-C = \min_i \frac{1}{N} \sum_j K_{\mathsf{PMI}}(z_i, z_j) \geq \frac{N-1}{N} \log \rho_{\mathsf{min}} + \frac{1}{N} \min_i K_{\mathsf{PMI}}(z_i, z_i). \tag{37}
\]
Therefore, it suffices to have
\[
\log \rho_{\mathsf{min}} + \delta \leq \frac{N-1}{N} \log \rho_{\mathsf{min}} + \frac{1}{N} \min_i K_{\mathsf{PMI}}(z_i, z_i). \tag{38}
\]
Rearranging terms gives the desired condition
\[
\frac{P_{\mathsf{coor}}(z_i \mid z_i)}{P_{\mathsf{coor}}(z_i)} \geq e^{N\delta} \rho_{\mathsf{min}}, \qquad \forall i. \tag{39}
\]
∎
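As a numerical sanity check on Proposition F.1 (ours; the constants $N$, $\rho_{\mathsf{min}}$, and $\delta$ are illustrative), note that the condition of Equation 39 in kernel form reads $K_{\mathsf{PMI}}(z_i, z_i) \geq N\delta + \log \rho_{\mathsf{min}}$, by rearranging Equation 38. One can build a kernel satisfying the premise of the proposition and confirm PSD-ness directly:

```python
# Sketch (ours): a direct numerical check of Proposition F.1. We build a
# symmetric K with off-diagonals in [log rho_min, log rho_min + delta] and
# diagonals K[i, i] = N*delta + log rho_min (the smoothness condition of
# Eq. (39) in kernel form, taken tight), then verify that K + C with C
# chosen as in Eq. (36) is PSD.
import numpy as np

rng = np.random.default_rng(3)
N, rho_min, delta = 50, 0.2, 0.01

K = np.log(rho_min) + delta * rng.random((N, N))
K = (K + K.T) / 2                                  # off-diagonals stay in the band
np.fill_diagonal(K, N * delta + np.log(rho_min))   # smoothness condition, tight

C = -np.min(K.mean(axis=1))                        # Eq. (36)
print(np.linalg.eigvalsh(K + C).min())             # non-negative: K + C is PSD
```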
Remark F.2. Proposition F.1 is one example showing that a sufficiently smooth world, or a sufficiently high sampling rate, allows the PMI kernel $K_{\mathsf{PMI}}$ to be exactly represented as inner products in a learned feature space (up to a scale). The condition here can be satisfied, for example, if the off-diagonal terms decay linearly with respect to $N$ and stay sufficiently close to each other. While the condition is somewhat strict, it captures the essence that smoothness and continuity allow easier learning. Nonetheless, we note that exact representation is not necessary for convergence, and thus this requirement can likely be relaxed. Please see Section 6 for discussions on practical settings.