Supervised Metric Learning to Rank for Retrieval via Contextual Similarity Optimization Christopher Liao 1 Theodoros Tsiligkaridis 2 Brian Kulis 1 Abstract There is extensive interest in metric learning methods for image retrieval. Many metric learning loss functions focus on learning a correct ranking of training samples, but strongly overfit semantically inconsistent labels and require a large amount of data. To address these shortcomings, we propose a new metric learning method, called contextual loss, which optimizes contextual similarity in addition to cosine similarity. Our contextual loss implicitly enforces semantic consistency among neighbors while converging to the correct ranking. We empirically show that the proposed loss is more robust to label noise, and is less prone to overfitting even when a large portion of train data is withheld. Extensive experiments demonstrate that our method achieves a new state-of-the-art across four image retrieval benchmarks and multiple different evaluation settings. Code is available Figure 1. Examples of metric learning labels which are inconsistent with semantic information from two standard benchmarks: CUB (top) and SOP (bottom). These labels are caused by a visual feature which is not present or barely visible. supervision, and train an embedding space where images with the same label are closer together than images with different labels. However, binary supervision is unreliable, since it does not capture the complexity of relationships in the data. Furthermore, methods which overly rely on the binary supervision can be brittle in the presence of noise, since the supervision is either correct or incorrect. Multilabel datasets (Ranjan et al., 2015) mitigate this problem, but can be expensive to procure, so developing a metric learning method that is robust to label noise and generalizable to test data is a challenging yet important problem. at: https://github.com/Chris210634/ metric-learning-using-contextualsimilarity 1. Introduction 1 Department of Electrical and Computer Engineering, Boston University 2 MIT Lincoln Laboratory. Correspondence to: Christopher Liao , Theodoros Tsiligkaridis , Brian Kulis . Existing image retrieval approaches fall into two main categories: classification and pairwise ranking losses. Classification losses optimize a classifier on top of the embedding layer and discard the classifier at the end of training. Pairwise ranking losses train the embedding layer directly by pulling together pairs of samples with the same label and pushing apart pairs of samples with different labels. Pairwise ranking methods include losses which explicitly optimize a ranking metric such as AP (average precision) surrogates: Fast-AP (Cakir et al., 2019), Smooth-AP (Brown et al., 2020), Blackbox AP (Rolı́nek et al., 2020) and Roadmap (Ramzi et al., 2021). They also include the standard contrastive, triplet and multi-similarity (MS) losses (Wang et al., 2019). Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). Empirically, classification methods, such as proxy anchor (Kim et al., 2020), proxy NCA (Teh et al., 2020), and HIST (Lim et al., 2022) perform well on small benchmark datasets, Image retrieval refers to learning a ranking of instances from a gallery set relative to a query image such that the highest ranked instances are the most relevant to the query. 
Several real-world applications are powered by this technology, such as person re-identification (Ye et al., 2021), face recognition (Guillaumin et al., 2009), vehicle re-identification (Chu et al., 2019), landmark retrieval (Weyand et al., 2020), and product retrieval (Cakir et al., 2019). Current metric learning techniques often use a dataset with single discrete labels for 1 Supervised Metric Learning to Rank for Retrieval via Contextual Similarity Optimization Inspired by STML, we show that optimizing contextual similarity directly in the supervised setting is beneficial. As far as we know, we are the first to treat contextual similarity as a loss function for supervised learning. This is non-trivial since contextual similarity involves non-differentiable counting operations, and as a consequence, is not amenable to off-the-shelf optimization techniques. We propose a simple but effective optimization strategy in Section 3, using heuristic gradients. We analytically justify this optimization approach in Section 4.1. Our contributions are as follows: Figure 2. Comparison of our contextual loss with popular metric learning losses. We plot the test R@1 accuracy against the train R@1 accuracy over the course of training on the CUB and Cars benchmarks. The dashed line tracks the R@1 values over the course of training, and the star indicates the R@1 values at the end of training. Compared to baselines, the contextual loss achieves higher test R@1 at the expense of lower train R@1. 1. We introduce the contextual loss, which establishes a new state-of-the-art across all standard image retrieval benchmarks, even when compared to more complicated (e.g. Metrix, HIST and AVSL) and less scalable methods (AP surrogates). 2. Our contextual loss mitigates overfitting by implicitly enforcing semantic consistency among neighbors in the embedding space. As a result, we achieve a 4% improvement in R@1 accuracy over baselines in the presence of label noise. while multi-similarity and AP surrogates perform well on large datasets. This general trend is supported by our main results in Section 5. We hypothesize that pairwise ranking methods tend to overfit the training labels while sacrificing semantic consistency of the embedding space. This can be possible even if the labels are “correct”, as Figure 1 illustrates. For instance, pulling apart samples of white-necked ravens from common ravens would likely lead to overfitting, since the distinguishing visual attribute is absent from the images. To address this issue, we propose to optimize contextual similarity in addition to cosine similarity. The resulting loss function implicitly regularizes the embedding space for semantic consistency among neighbors (see results in Section 4). Figure 2 clearly shows that our method reduces overfitting, since we achieve the best test R@1 accuracy despite lower train R@1 accuracy than some baselines. Results in Section 5 show that our method outperforms all baselines across all standard benchmarks in terms of R@1 accuracy. 3. We conduct an extensive experimental study of our method and several popular baselines. This includes empirical results across five different benchmarks, two different experimental settings, accompanied by a comprehensive ablation study. In addition, we tune baselines extensively to promote fair comparisons. 4. The optimization of non-differentiable steps in the loss calculation may be of interest to some readers. This paper is organized as follows. Section 2 summarizes related work. 
Section 3 states our method, including how we optimize the non-differentiable steps in calculating contextual similarity. Section 4.1 checks that minimizing the contextual loss corresponds to learning the correct ranking of samples and that the proposed optimization procedure converges. The rest of Section 4 explores why the proposed contextual loss is less prone to overfitting than other pairwise ranking losses by analyzing gradients and running targeted experiments. Section 5 and the Appendix present an extensive experimental study. Code is available at: https://github.com/Chris210634/metric-learning-using-contextual-similarity

Contextual similarity is a widely used evaluation-time technique to boost retrieval accuracy. In simple terms, the contextual similarity is the fraction of neighbors two samples have in common in embedding space. Intuitively, two samples are more likely to share the same label if they have many neighbors in common, regardless of their cosine similarity. Many retrieval frameworks (Zhong et al. (2017), Cao et al. (2020)) use a combination of cosine similarity and contextual similarity for evaluation, but only explicitly optimize the cosine similarity when training. In this paper, we propose to explicitly optimize both similarities, since contextual similarity captures crucial semantic information. In another line of work, some unsupervised metric learning methods such as STML (Kim et al., 2022) use contextual similarity to estimate the true similarity between unlabeled samples.

2. Related Work

Classification Methods We refer to any method which optimizes class centroids in conjunction with embeddings as a classification method. These methods scale with the number of classes in the training set and are usually sensitive to the learning rate of class centroids (Teh et al., 2020). Classification losses have traditionally performed well on small metric learning benchmarks; these include normalized-softmax (Zhai & Wu, 2018), arcface (Deng et al., 2019), proxy NCA and proxy anchor. More recently, IBC (Seidenschwarz et al., 2021) and HIST (Lim et al., 2022) report an improvement in R@1 when learning a graph neural network in conjunction with class centroids. However, even with these additional tricks, classification methods lag behind pairwise methods on larger benchmarks.

Pairwise Ranking Methods Pairwise ranking losses include the contrastive loss (Hadsell et al., 2006), triplet loss (Weinberger et al., 2005; Wu et al., 2017), multi-similarity, and AP surrogates (cited in the previous section). Despite being more than a decade old, contrastive and triplet losses remain the go-to methods for metric learning, and Musgrave et al. (2020a) show that they are comparable in performance to many recent methods. Multi-similarity includes a hard pair mining scheme that is effectively learning to rank. AP maximization methods explicitly learn to rank samples within a mini-batch. AP maximization is challenging because it involves back-propagating through the non-differentiable heaviside function, similar to the current work. As a workaround, Fast-AP uses soft-binning; Smooth-AP uses a low-temperature sigmoid; Roadmap uses an upper bound on the heaviside instead of an approximation. We find that using heuristic gradients works better for optimizing contextual similarity.

Unsupervised Metric Learning The concept of contextual similarity is extensively studied in the unsupervised metric learning literature, mainly in the context of person re-ID (see the survey by Ye et al. (2021)). Most unsupervised person re-ID methods use the k-reciprocal re-rank distance (Zhong et al., 2017), which is a weighted combination of Euclidean distance and Jaccard distance between reciprocal-neighbor sets, calculated over the entire dataset. More recently, STML (Kim et al., 2022) proposes an unsupervised metric learning framework for image retrieval using a simpler batch-wise contextual similarity measure. We loosely follow STML's contextual similarity definition, making significant changes to accommodate the change in problem setting and to address optimization issues (these changes are enumerated in Appendix E.1). We emphasize that prior work on contextual similarity optimizes the cosine similarity towards the contextual similarity, focusing on the unsupervised scenario, while our work optimizes the contextual similarity towards the true similarity, requiring full supervision.

Robust Metric Learning Over-reliance on binary supervision is a long-standing problem in metric learning. Many studies overcome this issue by taking advantage of the hierarchical nature of labels in metric learning datasets. Sun et al. (2021), Zheng et al. (2022), and Ramzi et al. (2022) explicitly use hierarchical labels for training. These methods assign a higher cost to mistakes in discriminating labels that are farther apart in the hierarchy, leading to a more robust embedding space. Yan et al. (2021) propose to generate synthetic hierarchical labels for unsupervised metric learning, and Yan et al. (2023) extend this idea to metric learning with synthetic label noise. These two works use a hyperbolic embedding space to better capture hierarchical relationships (Khrulkov et al., 2020). Ermolov et al. (2022) show that simply using a hyperbolic embedding space instead of a Euclidean embedding space improves metric learning performance. Our work has a similar motivation to the above hierarchical and hyperbolic metric learning works, but we use contextual similarity instead of hierarchical labels to mitigate label inconsistency. Appendix I.4 contains some results on hierarchical retrieval metrics.

3. Method

Notation Denote the normalized output of the embedding network as f_i ∈ R^d. s_ij = ⟨f_i, f_j⟩ ∈ [−1, 1] denotes the cosine similarity between samples i and j. There are n samples in a mini-batch. We always use balanced sampling, where k images are selected from each of n/k randomly sampled labels. n is divisible by k, and k is divisible by 2, but we always use k ≥ 4 in experiments. y_ij ∈ {0, 1} denotes the true similarity between i and j, defined as y_ij = 1 if samples i and j share the same label and 0 otherwise. We use uppercase letters to denote matrices, math script to denote sets, and lowercase letters to denote scalars. i, j and p are reserved for sample indices. 1_N is used to denote the binary indicator matrix for a set N. For instance, let N(i) denote the set of neighbors of sample i; then 1_N(i, j) = 1 if j ∈ N(i), and 0 otherwise.
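To make the notation concrete, the following is a minimal sketch of these quantities on a toy balanced batch; the random embeddings and the 128-dimensional size are placeholders of our own, not values used in the paper.

import torch
import torch.nn.functional as F

# Toy balanced batch: n = 8 samples, k = 4 images from each of n/k = 2 labels.
labels = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
f = F.normalize(torch.randn(8, 128), dim=1)                 # stand-in for embedding-network outputs f_i

s = f @ f.T                                                 # cosine similarity matrix, s_ij in [-1, 1]
y = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()    # true similarity matrix, y_ij in {0, 1}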
Contextual Similarity Definition We loosely follow the definition of contextual similarity proposed in STML (Kim et al., 2022), with significant modifications to accommodate the change in problem setting and to address optimization issues (these modifications are enumerated in Appendix E.1). In this section, we present the similarity definition using indicator matrices in order to show an efficient implementation in PyTorch. Note that the binary "and" is replaced by multiplication for differentiability. Algorithm 1 contains PyTorch-like pseudo-code for Equations 1 - 7. We include the code here for reproducibility and to show that our contextual loss can be compactly implemented despite the cumbersome mathematical notation.

We denote the contextual similarity between samples i and j as w_ij. The matrix with entries w_ij is entirely a function of the cosine similarity matrix with entries s_ij. The goal of Equations 1 - 4 is to calculate w_ij in terms of s_ij. This is implemented as get_contextual_similarity in Algorithm 1. For readability, we present the w_ij calculation as three sequential steps.

Algorithm 1 Pseudo-code, PyTorch-like

# Hyperparameters: alpha, k, eps, s_tilde, lam, gamma
# The symbol '@' means matrix multiplication in Python
# Note: In PyTorch, set keepdim=True when calling sum(.)

class GreaterThan(autograd.Function):
    # Implements theta with a heuristic gradient (Eq. 1)
    def forward(x, y):
        return (x >= y).float()
    def backward(g):
        # Returns gradient w.r.t. (x, y)
        return g * alpha, -g * alpha

def get_contextual_similarity(s, k, eps):
    D = 2 - 2 * s                                   # Squared Euclidean distance
    Dk = -(-D).topk(k).values[:, -1:]               # Distance to k-th neighbor
    Dk_over_2 = -(-D).topk(k // 2).values[:, -1:]   # Distance to k/2-th neighbor

    # Step 1: neighborhood indicator (Eq. 2)
    Nk_mask = GreaterThan(-D + eps, -Dk.detach())

    # Step 2: intersection of neighborhoods (Eq. 3)
    M_plus = (Nk_mask @ Nk_mask.T) / Nk_mask.sum(dim=1).detach()
    Nk_mask_not = 1 - Nk_mask
    M_minus = (Nk_mask_not @ Nk_mask_not.T) / Nk_mask_not.sum(dim=1).detach()
    W_1 = 0.5 * (M_plus + M_minus) * Nk_mask

    # Step 3: query expansion and symmetrization (Eq. 4)
    Nk_over_2_mask = GreaterThan(-D + eps, -Dk_over_2.detach())
    Rk_over_2_mask = Nk_over_2_mask * Nk_over_2_mask.T
    W_2 = (Rk_over_2_mask @ W_1) / Rk_over_2_mask.sum(dim=1)
    return 0.5 * (W_2 + W_2.T)

for data, labels in loader:
    f = F.normalize(model(data))                  # normalized embeddings
    s = f @ f.T                                   # cosine similarity matrix
    y = (labels.T == labels)                      # true similarity matrix
    w = get_contextual_similarity(s, k, eps)      # contextual similarity matrix
    I_neg = 1 - eye(w.shape[0])                   # ones with zeros on diagonal
    L_contrast = contrastive(s, y)                # Standard, code omitted
    L_reg = (s.mean() - s_tilde).square()         # Eq. 7
    L_context = ((w - y.float()).square() * I_neg).mean()   # MSE to true similarity, Eq. 5
    loss = lam * L_context + (1 - lam) * L_contrast + gamma * L_reg   # Eq. 6
    loss.backward()
    optimizer.step()

Step 1 Neighborhood Calculation The first step calculates a binary matrix 1_{N_{k+ϵ}}(i, j) indicating whether sample j is a neighbor of i. This binary value can be thought of as a preliminary prediction of y_ij. The neighborhood indicator calculation can be defined in terms of the heaviside function θ, which has no gradient. We set a constant positive gradient in the backward pass, which is reasonable since θ is a (non-strictly) increasing function:

Forward: \theta(x) = 1 if x \geq 0; 0 otherwise.    Backward: \frac{\partial \theta(x)}{\partial x} = \alpha.    (1)

Let D(i, j) denote the squared Euclidean distance between samples i and j. By definition, D(i, j) = 2 − 2 s_ij and D(i, j) ∈ [0, 4]. The sg operator denotes stop gradient. Using θ, we calculate the indicator function for whether sample j is in the k + ϵ neighborhood of sample i:

\mathbb{1}_{N_{k+\epsilon}}(i, j) = \theta\big(-D(i, j) + \mathrm{sg}(D(i, p)) + \epsilon\big),    (2)

where p denotes the k-th closest neighbor of i. In words, 1_{N_{k+ϵ}}(i, j) is a binary value indicating whether or not D(i, j) ≤ D(i, p) + ϵ. By convention, the sample itself is always included in the closest neighbor count (e.g. if k = 2, then the "k-th closest neighbor" is the closest neighbor to a sample). We now proceed to calculate the intersection of neighborhood sets.

Step 2 Intersection of Neighborhoods This step refines the similarity prediction by counting the number of neighbors two samples have in common. Intuitively, samples with the same label should have a similar set of neighbors.

W_1(i, j) = \frac{\mathbb{1}_{N_{k+\epsilon}}(i, j)}{2} \left( \frac{M_+(i, j)}{\mathrm{sg}\big(\sum_p \mathbb{1}_{N_{k+\epsilon}}(i, p)\big)} + \frac{M_-(i, j)}{\mathrm{sg}\big(\sum_p 1 - \mathbb{1}_{N_{k+\epsilon}}(i, p)\big)} \right),    (3)

where M_+(i, j) = \sum_{p=1}^{n} \mathbb{1}_{N_{k+\epsilon}}(i, p)\, \mathbb{1}_{N_{k+\epsilon}}(j, p) and M_-(i, j) = \sum_{p=1}^{n} \big(1 - \mathbb{1}_{N_{k+\epsilon}}(i, p)\big)\big(1 - \mathbb{1}_{N_{k+\epsilon}}(j, p)\big).

W_1(i, j) ∈ [0, 1] is an intermediary similarity value. M_+(i, j) counts the number of neighbors i and j have in common. M_-(i, j) counts the number of non-neighbors i and j have in common. Appendix E.3 Figure 12 explains why both M_+ and M_- are necessary. The normalization factors in Eq. 3 ensure that the similarity value is between 0 and 1. We do not backpropagate gradients through the normalization factors, because it is undesirable to optimize the number of samples in the neighborhood set. As further justification for the stop gradient, note that \sum_p \mathbb{1}_{N_{k+\epsilon}}(i, p) = k for any i when ϵ = 0.

Step 3 Query Expansion This final step further refines the similarity prediction by averaging W_1 across close neighbors (known as query expansion, see Arandjelović & Zisserman (2012)):

\mathbb{1}_{R_{k/2+\epsilon}}(i, j) = \mathbb{1}_{N_{k/2+\epsilon}}(i, j)\, \mathbb{1}_{N_{k/2+\epsilon}}(j, i), \quad W_2(i, j) = \frac{\sum_p \mathbb{1}_{R_{k/2+\epsilon}}(i, p)\, W_1(p, j)}{\sum_p \mathbb{1}_{R_{k/2+\epsilon}}(i, p)}, \quad w_{ij} = \frac{1}{2}\big(W_2(i, j) + W_2(j, i)\big).    (4)

1_{R_{k/2+ϵ}}(i, j) is a binary value which equals 1 if j is a k/2 + ϵ neighbor of i and i is a k/2 + ϵ neighbor of j, and 0 otherwise. This type of reciprocal relationship is widely used in the retrieval literature, most notably by Zhong et al. (2017). W_2(i, j) is an intermediary similarity value representing the entries of W_1 averaged over the smaller R_{k/2+ϵ} neighborhood. W_2 is then symmetrized to yield the final contextual similarity values w_ij ∈ [0, 1].

Loss Function We use the MSE loss to optimize w_ij against the true similarity labels y_ij:

L_{\mathrm{context}} = \frac{1}{n^2} \sum_{i,j \,|\, i \neq j} (y_{ij} - w_{ij})^2    (5)

Our final loss function L_ours is a sum of three loss functions:

L_{\mathrm{ours}} = \lambda L_{\mathrm{context}} + (1 - \lambda) L_{\mathrm{contrast}} + \gamma L_{\mathrm{reg}}    (6)

L_{\mathrm{reg}} = \Big( \tilde{s} - \frac{1}{n^2} \sum_{i,j} s_{ij} \Big)^2    (7)

L_contrast is the standard contrastive loss (see Appendix E.3 Eq. 29). In our work, L_contrast is best viewed as a regularizer that reduces the decomposability gap between the batch-wise contextual loss and the contextual loss over the entire dataset. We justify this interpretation in Appendix D Fig. 9. L_reg is a similarity regularizer that encourages the model to use the entire embedding space by pushing the average cosine similarity between all pairs towards the constant s̃.
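For readers who want to sanity-check the vectorized pseudo-code in Algorithm 1, the following is a minimal, non-differentiable reference implementation of Equations 2 - 4. It is our own illustrative sketch, not the authors' code: it omits the heuristic gradients and the stop-gradient bookkeeping, so it is only suitable for inspecting w_ij, not for training, and the function name and default margin are ours.

import torch

def contextual_similarity_reference(s, k, eps=0.05):
    # Non-differentiable reference for Eqs. 2-4 (no heuristic gradients, no stop-gradient).
    D = 2 - 2 * s                                      # squared Euclidean distance

    def neighbor_mask(kk):
        # Distance to the kk-th closest neighbor of each sample (the sample itself is included).
        Dk = torch.sort(D, dim=1).values[:, kk - 1:kk]
        return (D <= Dk + eps).float()                 # indicator for the kk + eps neighborhood

    N = neighbor_mask(k)                               # Eq. 2
    M_plus = N @ N.T                                   # shared neighbors
    M_minus = (1 - N) @ (1 - N).T                      # shared non-neighbors
    W1 = 0.5 * N * (M_plus / N.sum(dim=1, keepdim=True)
                    + M_minus / (1 - N).sum(dim=1, keepdim=True))   # Eq. 3

    N_half = neighbor_mask(k // 2)
    R = N_half * N_half.T                              # reciprocal k/2 + eps neighborhood
    W2 = (R @ W1) / R.sum(dim=1, keepdim=True)         # query expansion, Eq. 4
    return 0.5 * (W2 + W2.T)                           # symmetrized contextual similarity w_ij

In the forward pass this should agree with get_contextual_similarity from Algorithm 1; only the backward behavior differs.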
4. Analysis

This section discusses intuition behind the contextual loss in Eq. 5. Section 4.1 provides empirical evidence that L_context converges and shows that L_context = 0 coincides with the correct ranking of samples. Sections 4.2 - 4.4 carefully justify the semantic consistency argument outlined in the introduction.

4.1. Contextual Loss and Optimization

Remarks The parameter w_ij is a function of s_ij, so all three components of our loss function in Eq. 6 optimize the cosine similarity matrix with entries s_ij. However, L_context is the main contribution of the current work, and experiments verify that most of the improvement over baselines can be attributed to this contextual loss. The value of k is not arbitrary; it must be set to the number of samples per label in the mini-batch. Although the time and space to calculate the contextual loss scale as O(n^3), all operations are implemented as matrix multiplication, which is highly optimized on modern hardware. Appendix C Figure 8 shows that the cubic scaling is negligible for all practical batch sizes.

Proposition 4.1. For a batch of size n with exactly k samples from each class (n divisible by k, n > 2k, and k ≥ 2), assuming that ϵ = 0, L_context = 0 if and only if all samples are correctly ranked with respect to every other sample within the batch, i.e. s_ip > s_ij, ∀p, j where y_ij = 0 and y_ip = 1, ∀i ∈ [1, n].

We defer the proof to Appendix B. This Proposition shows that L_context is a valid ranking objective, similar to AP surrogates, multi-similarity, and triplet losses. Note that Proposition 4.1 does not hold for L_contrast, since L_contrast continues to provide gradients up to fixed margins, regardless of whether the correct ranking is satisfied.

Figure 3 (left) shows that L_context is approximately a linearly scaled version of 1-mAP over the course of training. Figure 3 (right) suggests that the value of L_context converges when optimized using gradient descent. The Appendix contains more empirical evidence that the value of L_context converges (Fig. 9 and 15). Figure 4 justifies the choice of heuristic gradient in Eq. 1. In this simple 2-D example, the gradient is always positive and non-zero in the direction away from the minimum, until the minimum is reached.

Figure 3. Left: plot of contextual loss value vs. 1-mAP (mean AP) over the course of training on CUB for different choices of λ. Training proceeds from upper right to bottom left. Observe that 1-mAP decreases as contextual loss decreases. This shows that L_context is a valid surrogate for learning to rank. Right: convergence plot of L_context on CUB without mini-batching. L_context decreases almost monotonically when the contextual loss is minimized (λ = 1), while there is a large amount of noise when the contrastive loss is minimized (λ = 0). γ = 0.

Figure 4. Illustration of L_context and its gradient. We generate 64 random points on the unit circle from two classes, centered around the coordinates (0,1) and (1,0) (see middle plot). We move one orange point x from the blue centroid to the orange centroid. The value of the loss function is plotted in blue, and the amount of gradient pointing away from x is plotted in orange. L_context decreases as x rolls to the right, and the gradient closely approximates what it should look like if L_context were continuous. We include the same illustration for L_contrast on the right for comparison.

Hyperparameters α controls the magnitude of the heaviside gradient. Tuning α is unnecessary, since it is redundant with the learning rate. ϵ is the desired similarity margin between positive and negative samples; ϵ is analogous to the margin parameter of the triplet and multi-similarity losses. δ+ and δ− (Appendix E.3 Eq. 29) are the positive and negative margins, respectively, for the contrastive loss. s̃ is the desired average cosine similarity between all pairs. λ and γ control the relative weighting between the three losses. The choice of (1 − λ) for the weight on the contrastive loss instead of a separate hyperparameter is completely arbitrary, as tuning the contrastive loss weight separately would be redundant with tuning the learning rate.

4.2. Intuition

In the previous subsection, we proved that the minimum of L_context corresponds to a correct ranking of samples within a batch. We also showed that the value of L_context converges empirically. However, we still need some intuition as to why gradients from L_context work better than simple pairwise contrastive loss functions. This discussion will naturally lead to the semantic consistency intuition promised at the beginning of the paper. Let us start by asking: what is the value of optimizing the intersection of neighborhood sets in the manner of Eq. 3? We offer a straightforward intuition: maximizing the intersection between the neighborhood sets of two samples is equivalent to pushing one sample towards the context of the other sample and vice versa; minimizing the intersection is equivalent to pulling apart the contexts of the two samples. We show this intuition by analyzing the gradient.

For simplicity, consider ϵ = 0, such that the normalization factors in Eq. 3 are constant: \sum_p \mathbb{1}_{N_{k+\epsilon}}(i, p) = k and \sum_p 1 - \mathbb{1}_{N_{k+\epsilon}}(i, p) = n - k. For the remainder of the section we drop the k + ϵ subscript from N for readability. Further consider a negative pair of samples i and j (y_ij = 0), where j is wrongly ranked w.r.t. i (i.e. 1_N(i, j) = 1). For the sake of developing intuition, let us further assume that all entries of W_1 are correct except index i, j; also, 1_{R_{k/2+ϵ}}(i, p) = 0 ∀p ≠ i. Under these assumptions, w_ij = W_1(i, j) and L_context = (1/(2n^2)) (y_ij − w_ij)^2. This allows us to focus on interpreting Eq. 3. Under these assumptions, Eq. 3 simplifies to (see algebra in Appendix A):

W_1(i, j) = \underbrace{\mathbb{1}_N(i, j)}_{\text{(1)}}\; \underbrace{\big(a \langle \mathbb{1}_N(i), \mathbb{1}_N(j) \rangle + b\big)}_{\text{(2)}}, \quad \text{where } a = \frac{1}{2k} + \frac{1}{2(n-k)} \text{ and } b = \frac{n-2k}{2(n-k)}.    (8)

a and b are positive constants, assuming that n > 2k. ⟨·⟩ denotes the inner product between the i-th and j-th rows of the indicator matrix. This inner product is clearly positive. W_1(i, j) ∈ (0, 1] must be a non-zero positive number under our assumptions. The gradient w.r.t. W_1(i, j) must be non-zero positive because y_ij = 0. We are now ready for the backward pass. For simplicity of notation, ∂W_1(i, j) := g_ij > 0 denotes the gradient of the loss w.r.t. W_1(i, j). The gradient w.r.t. the distance matrix, ∂D, can be split into two parts (1) and (2), added together by the chain rule (here ∂D(i) and 1_N(i) denote the i-th rows of ∂D and 1_N, respectively):

(1) \quad \partial D(i, j) = -\alpha\, \partial \mathbb{1}_N(i, j) = -\alpha\, g_{ij} \big( a \langle \cdot \rangle + b \big)
(2) \quad \partial D(i) = -\alpha\, \partial \mathbb{1}_N(i) = -\alpha\, a\, g_{ij}\, \mathbb{1}_N(i, j)\, \mathbb{1}_N(j), \qquad \partial D(j) = -\alpha\, \partial \mathbb{1}_N(j) = -\alpha\, a\, g_{ij}\, \mathbb{1}_N(i, j)\, \mathbb{1}_N(i)    (9)

Note that the −α factors in Eq. 9 come from going backward through θ(·). The negative sign accounts for optimizing distance instead of similarity. Intuitively, the two components of the gradient perform different functions. The gradient in Eq. 9 (1) enforces correct ranking. The gradient in Eq. 9 (2) pulls i away from the context of j and j away from the context of i. More precisely, ∂D(i, p) < 0 when 1_N(j, p) = 1 and ∂D(i, p) = 0 otherwise, for all samples p in the batch. In words, we increase the distance between i and all samples in the neighborhood of j. See Figure 5 for an illustration.

Figure 5. This figure complements Eq. 9. Part (1) of the gradient (left) enforces correct ranking w.r.t. sample i. Part (2) of the gradient (right) increases the distance between i and all samples in the neighborhood of j, and vice versa. As shown, sometimes this implies that samples with the same label are pulled apart. The k + ϵ neighborhood as defined in Eq. 2 is shaded. k = 4.

4.3. Semantic Consistency

The previous subsection showed that the gradients of the contextual loss optimize distances between neighbors of samples, not just pairwise distances. This is important because the neighborhood of a sample contains semantically similar images, regardless of whether they have the same label or not. Indeed, optimizing contextual similarity may result in pulling apart samples with the same label, and to a lesser extent pushing together samples with different labels. For example, in Figure 5, sample i is pulled away both from sample j, which has a different label, and from the two neighbors of sample j, which have the same label as sample i.

This behavior is novel to the contextual loss. All of our baselines which take the cosine similarity matrix as input satisfy the condition that ∂D(i, j) ≥ 0 when y_ij = 1 and ∂D(i, j) ≤ 0 when y_ij = 0. Note that this discussion is limited to pairwise ranking losses; classification methods cannot be analyzed in this way. We analyze the effect of these "wrong" gradients in Table 1 by adding a custom autograd function between the distance matrix and the contextual loss. This function truncates the gradient of the distance matrix according to the label matrix:

\partial D(i, j) = \begin{cases} \max(\partial D(i, j), 0), & \text{if } y_{ij} = 1 \\ \min(\partial D(i, j), 0), & \text{otherwise.} \end{cases}    (10)

Table 1. Comparison of train and test accuracy on CUB between L_context and L_context with gradient corrected according to Eq. 10.
CUB | Train R@1 | Test R@1
L_context | 87.0 | 71.4
L_context with gradient correction | 92.9 | 65.4
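Concretely, the gradient correction used for Table 1 can be implemented as a small custom autograd function inserted between the distance matrix and the loss. The sketch below is our own illustrative implementation of Eq. 10, not the authors' code; the class name and usage line are hypothetical.

import torch

class TruncateGrad(torch.autograd.Function):
    # Identity in the forward pass; in the backward pass, clamp the gradient
    # w.r.t. the distance matrix according to the true-similarity matrix y (Eq. 10).
    @staticmethod
    def forward(ctx, D, y):
        ctx.save_for_backward(y)
        return D.view_as(D)

    @staticmethod
    def backward(ctx, grad_D):
        (y,) = ctx.saved_tensors
        grad_D = torch.where(y.bool(), grad_D.clamp(min=0.0), grad_D.clamp(max=0.0))
        return grad_D, None

# Usage sketch: D = TruncateGrad.apply(D, y) before the contextual loss is computed from D.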
Clearly, clamping the distance gradients according to Eq. 10 raises the R@1 on training data at the expense of a lower test R@1. This raises the question: why does discarding seemingly wrong gradients lower the R@1 accuracy by 6%? We hypothesize that L_context implicitly regularizes the embedding space against the label and image noise illustrated in Fig. 1. The contextual loss considers relationships between groups of k samples, which intuitively should be more robust to random label variations than solely relying on pairwise relationships. In other words, sample pairs with gradients ∂D(i, j) that violate the true labels y_ij are pairs where y_ij is inconsistent with semantic information; these labels likely do not generalize to test data. We justify this hypothesis in the next section by testing the robustness and generalizability of our approach.

4.4. Robustness and Generalizability Experiments

The previous subsection proposed that optimizing contextual similarity with L_context leads to an embedding space with higher test R@1 accuracy because contextual similarity is more robust to label noise and more generalizable to test data. We verify this claim with three sets of experiments on the CUB benchmark in Figure 6.

Label Noise We experiment with assigning random labels to 5, 10, and 20% of randomly selected training samples.

Image Noise We scraped around 1,800 generic bird images from Bing using search terms such as "bird" and "flying bird". We manually filtered the images such that each image contains at least one bird. We then replace a percentage of randomly selected CUB training samples with random images from our Bing dataset. This experiment simulates a typical web-scraped dataset, where images are often only lightly proofread by a human. In most modern image datasets, a small portion of labels are either wrong or do not accurately reflect the content of the image. We experiment with 5, 10, and 20% image noise.

Limited Training Data According to traditional machine learning wisdom, small datasets are easier to overfit. We demonstrate that our method achieves reasonable results even when some training data is withheld. CUB-200 is already the smallest standard benchmark, with 5,994 training samples belonging to 100 classes. We experiment with using only the first 50, 60, 70, 80, and 90 classes for training.

The results from these three sets of experiments are presented in Fig. 6. Our method clearly outperforms other pairwise ranking losses. We tune learning rates individually for each method. We use batch size 256 and k = 8. We use L_context by itself instead of the entire loss to study the contextual loss without additional regularization.

Figure 6. Robustness and generalizability comparison of the contextual loss against other pairwise ranking losses. See description in Section 4.4. For this experiment, we test L_context without additional regularization, i.e. λ = 1 and γ = 0.

Figure 7. Ablation results. We test the contribution of each component in Eq. 6 by trying different values for λ and γ on CUB and Cars. The dashed line indicates results without the similarity regularizer (γ = 0). The three different marker symbols represent different learning rates. We show results for different learning rates because the optimal learning rate varies with λ. Observe that the optimal R@1 is always achieved by a combination of L_contrast and L_context. The similarity regularizer is sometimes helpful, but never detrimental.

5. Experiments

Datasets We experiment on two small-scale and two large-scale datasets: Caltech-UCSD Birds (CUB-200) (Wah et al.,
CUB Cars Method R@1 R@2 R@4 R@8 mAP R@1 R@2 R@4 R@8 mAP DRML (Zheng et al., 2021) DIML (Zhao et al., 2021) DiVA (Milbich et al., 2020) Proxy Anchor (Kim et al., 2020) MS (Wang et al., 2019) IBC (Seidenschwarz et al., 2021) S2SD (Roth et al., 2020) Proxy NCA + Metrix PA + Metrix MS + Metrix (Venkataramanan et al., 2021) HIST (Lim et al., 2022) MHGL (Ebrahimpour et al., 2022) MS † (Wang et al., 2019) Contrastive † (Hadsell et al., 2006) Roadmap † (Ramzi et al., 2021) Triplet † (Weinberger et al., 2005) MS + miner † (Wang et al., 2019) Proxy Anchor † (Kim et al., 2020) Proxy NCA † (Teh et al., 2020) Contextual (Ours) † 68.7 68.2 69.2 69.7 67.8 70.3 70.1 70.4 71.0 71.4 71.4 70.6 65.9 65.9 66.0 64.8 68.0 69.1 66.3 71.9 78.6 79.3 80.0 77.8 80.3 79.7 80.6 81.8 80.6 81.1 80.9 76.6 76.6 76.8 75.9 78.4 79.5 77.1 81.5 86.3 87.0 85.6 87.6 88.7 88.2 86.8 88.1 88.0 84.9 84.8 85.3 84.5 86.2 87.0 85.5 88.5 91.6 92.4 92.3 90.8 90.8 91.1 90.6 91.7 92.2 91.5 93.1 36.1 35.3 36.0 33.8 36.4 37.4 35.1 40.2 86.9 87.0 87.6 87.7 87.8 88.1 89.5 88.5 89.1 89.6 89.6 90.1 82.1 82.4 83.5 88.1 90.5 89.0 88.2 91.1 92.1 92.9 92.9 92.7 93.3 93.9 93.4 93.6 94.2 93.9 94.2 88.6 88.7 89.8 93.1 94.7 93.5 93.2 95.0 95.2 95.8 95.3 96.2 96.5 96.7 96.0 96.4 96.4 92.9 92.8 93.7 96.0 97.0 96.2 96.1 97.1 97.4 97.9 98.1 95.7 95.6 96.3 97.7 98.4 97.9 97.9 98.4 35.7 35.3 37.3 41.8 42.7 38.8 38.2 43.3 PA + AVSL (Zhang et al., 2022) Contextual (Ours) † 71.9 72.7 81.7 82.2 88.1 88.8 93.2 93.4 40.5 91.5 91.8 95.0 95.4 97.0 97.4 98.4 98.6 43.3 2011), Stanford Cars-196 (Krause et al., 2013), Stanford Online Products (SOP) (Oh Song et al., 2016), and miniiNaturalist-2021 (Van Horn et al., 2018). CUB-200 and Cars-196 are smaller fine-grain classification datasets with 200 and 196 unique labels, respectively. SOP is a large-scale dataset with 120,053 product images from 22,634 classes. mini-iNaturalist-2021 is a subset of the iNaturalist-2021 species classification competition dataset, with 50 images from each of 10,000 species. iNaturalist results are deferred to Appendix G Table 4. Table 3. State-of-the-art comparisons on SOP. We use ResNet-50. The last two rows use embedding size 1536, while the other rows use 512. All results indicated by † are an average of 2 trials with different random seeds but with the same train-test split. Method R@1 R@10 R@100 R@1000 DRML DIML DiVA Proxy Anchor MS IBC S2SD Proxy NCA + Metrix PA + Metrix MS + Metrix HIST MHGL MS † Contrastive † Roadmap † Triplet † MS + miner † Proxy Anchor † Proxy NCA † Contextual (Ours) † 79.9 79.3 79.6 79.1 76.9 81.4 80.0 81.3 81.3 81.0 81.4 81.7 79.9 80.9 81.9 81.9 82.2 79.7 78.8 82.6 90.7 91.2 90.8 89.8 91.3 91.4 92.7 91.7 92.0 92.0 92.0 90.4 90.9 92.0 92.5 92.5 91.1 90.7 92.5 96.1 96.2 95.9 95.9 97.1 96.9 97.2 96.7 96.6 95.8 95.6 96.3 96.8 96.7 96.2 96.3 96.7 98.6 98.3 98.7 98.9 98.8 98.7 98.8 98.8 PA + AVSL Contextual (Ours) † 79.6 83.2 91.4 92.9 96.4 96.8 98.8 Baselines We compare against a diverse set of baselines in Tables 2 and 3. Some baselines (such as DRML, Metrix, S2SD, HIST, MHGL, and AVSL) use complicated tricks to achieve published results. We simply copy the results for these baselines from their original paper. We then choose a representative set of baselines which only modify the loss function, and reproduce their results under identical experimental conditions (indicated by † in the tables). We first compare against contrastive and triplet losses, which are the accepted standard in the field. Multi-similarity (MS) is a popular pairwise ranking loss. 
From classification methods, we compare against proxy anchor and proxy NCA. From AP maximization methods, we compare against Fast-AP, Smooth-AP, and Roadmap (Fast-AP and Smooth-AP comparisons are deferred to Appendix G Table 4). The benchmark results for many of the above-mentioned loss functions 8 Supervised Metric Learning to Rank for Retrieval via Contextual Similarity Optimization contextual and contrastive losses. The optimal value of λ is usually between 0.8 and 0.9. The similarity regularizer is only needed to achieve state-of-the-art on some datasets. This property is reasonable for a regularizer, and it is more important to observe that the similarity regularizer improves the results of many diverse metric learning losses (see Table 6 Appendix I.1). are under-represented in the literature. For fair comparison, we tune learning rates separately for each loss function. We use default values for any other hyperparameters. Hyperparameters and Setup On our method, we tune λ and ϵ separately for each dataset. We use fixed values for remaining hyperparameters: γ = 0.1, α = 10.0, k = 4, δ+ = 0.75, δ− = 0.6, s̃ = 0.3. The results in the main paper all use 224×224 image resolution. Some recent studies use 256×256 image resolution; comparisons in this setting are included in Appendix G Table 4. We use Adam with a decaying learning-rate schedule. We report results on the model with the best test R@1 metric, as is standard in the literature. We tune learning rates separately for each method and dataset combination. We use a batch size of 256 for iNaturalist, 128 for SOP and CUB, and 64 for Cars; the larger batch size is necessary to achieve reasonable performance on iNaturalist, while the smaller batch size appears to reduce overfitting on Cars. We use a 4 per class balanced sampler. For SOP and iNaturalist, we use hierarchical sampling (Cakir et al., 2019), following prior work. We use an embedding size of 512 for most comparisons; we only use an embedding size of 1536 for comparison with AVSL. We use ResNet-50 with a linear embedding layer. For CUB and Cars, we add an additional linear projector layer, which is discarded at the end of training. We use GeM pooling (Radenović et al., 2018), a widely used generalization of max and mean pooling. We always freeze batch-norm. We emphasize that the setup described above is used for all baseline results marked by †. Additionally, we add the linear projector and/or similarity regularization with γ = 0.1 only when it improves the baseline, for fair comparison. 6. Conclusion In this work, we proposed a novel contextual loss based on contextual similarity optimization. Our contextual loss improves the R@1 performance significantly over the current state-of-the-art across four benchmarks, when regularized by the contrastive loss and a novel similarity regularizer. We established that our loss function reduces overfitting by regularizing the embedding space for semantic consistency among neighbors. We justified this interpretation both analytically by inspecting the gradient, and empirically by showing that our loss function is more robust to label noise. We carefully supported each component of our loss function through extensive experiments across two different experimental settings, accompanied by exhaustive ablation studies. E THICS S TATEMENT We note that metric learning can be applied to controversial problems such as person re-identification and face reidentification. 
Our work is mainly foundational, so does not contribute directly to these applications. We also limit our experimentation to the image retrieval aspect of metric learning. Performance Metrics We report Recall @ k for select k. R@k is the percentage of test samples where at least one of the k closest neighbors have the same label. We also report mAP, a standard ranking metric defined in Appendix I.3. We report the average of 6 trials on CUB and Cars, and 2 trials on SOP. We omit standard deviations for readability. R EPRODUCIBILITY S TATEMENT Instructions on how to run the code is provided in a README file. We include details on hardware requirements in Appendix H.1. Code is released here: https://github.com/Chris210634/metriclearning-using-contextual-similarity Discussion Our R@1 results are better than the best baseline across all datasets and embedding sizes. We achieve R@1 gains of 0.5%, 0.6 %, and 0.4 % on CUB, Cars, and SOP resp. for the 512 embedding size. We achieve gains of 0.8%, 0.3%, and 3.6% on these datasets for the 1536 embedding size. These results are an average of 6 trials with different random seeds on CUB and Cars, and 2 trials on SOP. Additionally, we achieve R@1 gains of 0.8%, 0.6%, 0.2%, and 0.3% over the best baseline on CUB, Cars, SOP, and iNaturalist, resp. averaged over 3 trials, under slightly different experimental settings using 256×256 image resolution (see Table 4 in the appendix). ACKNOWLEDGMENTS DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited. This material is based upon work supported by the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Under Secretary of Defense for Research and Engineering. Ablation Figure 7 plots R@1 on CUB and Cars with different values of λ and γ. This figure shows that the best R@1 performance is always achieved by a combination of 9 Supervised Metric Learning to Rank for Retrieval via Contextual Similarity Optimization References Guillaumin, M., Verbeek, J., and Schmid, C. Is that you? metric learning approaches for face identification. In 2009 IEEE 12th international conference on computer vision, pp. 498–505. IEEE, 2009. Arandjelović, R. and Zisserman, A. Three things everyone should know to improve object retrieval. In 2012 IEEE conference on computer vision and pattern recognition, pp. 2911–2918. IEEE, 2012. Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pp. 1735– 1742. IEEE, 2006. Brown, A., Xie, W., Kalogeiton, V., and Zisserman, A. Smooth-ap: Smoothing the path towards large-scale image retrieval. In European Conference on Computer Vision, pp. 677–694. Springer, 2020. Khrulkov, V., Mirvakhabova, L., Ustinova, E., Oseledets, I., and Lempitsky, V. Hyperbolic image embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6418–6428, 2020. Cakir, F., He, K., Xia, X., Kulis, B., and Sclaroff, S. Deep metric learning to rank. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1861–1870, 2019. Cao, B., Araujo, A., and Sim, J. Unifying deep local and global features for image search. In European Conference on Computer Vision, pp. 726–743. Springer, 2020. 
Kim, S., Kim, D., Cho, M., and Kwak, S. Proxy anchor loss for deep metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3238–3247, 2020. Chang, D., Pang, K., Zheng, Y., Ma, Z., Song, Y.-Z., and Guo, J. Your” flamingo” is my” bird”: fine-grained, or not. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11476– 11485, 2021. Kim, S., Kim, D., Cho, M., and Kwak, S. Self-taught metric learning without labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7431–7441, 2022. Krause, J., Stark, M., Deng, J., and Fei-Fei, L. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pp. 554–561, 2013. Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. On the properties of neural machine translation: Encoderdecoder approaches. arXiv preprint arXiv:1409.1259, 2014. Lim, J., Yun, S., Park, S., and Choi, J. Y. Hypergraphinduced semantic tuplet loss for deep metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 212–222, 2022. Chu, R., Sun, Y., Li, Y., Liu, Z., Zhang, C., and Wei, Y. Vehicle re-identification with viewpoint-aware metric learning. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 8282–8291, 2019. Liu, Z., Luo, P., Qiu, S., Wang, X., and Tang, X. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1096–1104, 2016. Deng, J., Guo, J., Xue, N., and Zafeiriou, S. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4690–4699, 2019. Ebrahimpour, M. K., Qian, G., and Beach, A. Multi-head deep metric learning using global and local representations. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3031–3040, 2022. McFee, B. and Lanckriet, G. R. Metric learning to rank. In ICML, 2010. Milbich, T., Roth, K., Bharadhwaj, H., Sinha, S., Bengio, Y., Ommer, B., and Cohen, J. P. Diva: Diverse visual feature aggregation for deep metric learning. In European Conference on Computer Vision, pp. 590–607. Springer, 2020. El-Nouby, A., Neverova, N., Laptev, I., and Jégou, H. Training vision transformers for image retrieval. arXiv preprint arXiv:2102.05644, 2021. Musgrave, K., Belongie, S., and Lim, S.-N. A metric learning reality check. In European Conference on Computer Vision, pp. 681–699. Springer, 2020a. Ermolov, A., Mirvakhabova, L., Khrulkov, V., Sebe, N., and Oseledets, I. Hyperbolic vision transformers: Combining improvements in metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7409–7419, 2022. Musgrave, K., Belongie, S., and Lim, S.-N. Pytorch metric learning, 2020b. 10 Supervised Metric Learning to Rank for Retrieval via Contextual Similarity Optimization Oh Song, H., Xiang, Y., Jegelka, S., and Savarese, S. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4004–4012, 2016. Van Horn, G., Mac Aodha, O., Song, Y., Cui, Y., Sun, C., Shepard, A., Adam, H., Perona, P., and Belongie, S. The inaturalist species classification and detection dataset. 
In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8769–8778, 2018. Radenović, F., Tolias, G., and Chum, O. Fine-tuning cnn image retrieval with no human annotation. IEEE transactions on pattern analysis and machine intelligence, 41(7): 1655–1668, 2018. Venkataramanan, S., Psomas, B., Kijak, E., Amsaleg, L., Karantzalos, K., and Avrithis, Y. It takes two to tango: Mixup for deep metric learning. arXiv preprint arXiv:2106.04990, 2021. Ramzi, E., Thome, N., Rambour, C., Audebert, N., and Bitot, X. Robust and decomposable average precision for image retrieval. Advances in Neural Information Processing Systems, 34:23569–23581, 2021. Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The caltech-ucsd birds-200-2011 dataset. 2011. Wang, X., Han, X., Huang, W., Dong, D., and Scott, M. R. Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5022–5030, 2019. Ramzi, E., Audebert, N., Thome, N., Rambour, C., and Bitot, X. Hierarchical average precision training for pertinent image retrieval. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23– 27, 2022, Proceedings, Part XIV, pp. 250–266. Springer, 2022. Weinberger, K. Q., Blitzer, J., and Saul, L. Distance metric learning for large margin nearest neighbor classification. Advances in neural information processing systems, 18, 2005. Ranjan, V., Rasiwasia, N., and Jawahar, C. Multi-label crossmodal retrieval. In Proceedings of the IEEE international conference on computer vision, pp. 4094–4102, 2015. Weyand, T., Araujo, A., Cao, B., and Sim, J. Google landmarks dataset v2-a large-scale benchmark for instancelevel recognition and retrieval. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2575–2584, 2020. Revaud, J., Almazán, J., Rezende, R. S., and Souza, C. R. d. Learning with average precision: Training image retrieval with a listwise loss. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5107– 5116, 2019. Wu, C.-Y., Manmatha, R., Smola, A. J., and Krahenbuhl, P. Sampling matters in deep embedding learning. In Proceedings of the IEEE international conference on computer vision, pp. 2840–2848, 2017. Rolı́nek, M., Musil, V., Paulus, A., Vlastelica, M., Michaelis, C., and Martius, G. Optimizing rank-based metrics with blackbox differentiation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7620–7630, 2020. Yan, J., Luo, L., Deng, C., and Huang, H. Unsupervised hyperbolic metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12465–12474, 2021. Roth, K., Milbich, T., Ommer, B., Cohen, J. P., and Ghassemi, M. S2sd: simultaneous similarity-based selfdistillation for deep metric learning. arXiv preprint arXiv:2009.08348, 2020. Yan, J., Luo, L., Deng, C., and Huang, H. Adaptive hierarchical similarity metric learning with noisy labels. IEEE Transactions on Image Processing, 32:1245–1256, 2023. Seidenschwarz, J. D., Elezi, I., and Leal-Taixé, L. Learning intra-batch connections for deep metric learning. In International Conference on Machine Learning, pp. 9410– 9421. PMLR, 2021. Ye, M., Shen, J., Lin, G., Xiang, T., Shao, L., and Hoi, S. C. Deep learning for person re-identification: A survey and outlook. IEEE transactions on pattern analysis and machine intelligence, 44(6):2872–2893, 2021. 
Sun, Y., Zhu, Y., Zhang, Y., Zheng, P., Qiu, X., Zhang, C., and Wei, Y. Dynamic metric learning: Towards a scalable metric space to accommodate multiple semantic scales. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5393–5402, 2021. Zhai, A. and Wu, H.-Y. Classification is a strong baseline for deep metric learning. arXiv preprint arXiv:1811.12649, 2018. Teh, E. W., DeVries, T., and Taylor, G. W. Proxynca++: Revisiting and revitalizing proxy neighborhood component analysis. In European Conference on Computer Vision, pp. 448–464. Springer, 2020. Zhang, B., Zheng, W., Zhou, J., and Lu, J. Attributable visual similarity learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7532–7541, 2022. 11 Supervised Metric Learning to Rank for Retrieval via Contextual Similarity Optimization Zhao, W., Rao, Y., Wang, Z., Lu, J., and Zhou, J. Towards interpretable deep metric learning with structural matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9887–9896, 2021. Zheng, W., Zhang, B., Lu, J., and Zhou, J. Deep relational metric learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12065–12074, 2021. Zheng, W., Huang, Y., Zhang, B., Zhou, J., and Lu, J. Dynamic metric learning with cross-level concept distillation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV, pp. 197–213. Springer, 2022. Zhong, Z., Zheng, L., Cao, D., and Li, S. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1318–1327, 2017. 12 Supervised Metric Learning to Rank for Retrieval via Contextual Similarity Optimization A. Simplification of Step 2 in Main Paper Section 3 Under the assumption that ϵ = 0, the formula for W1 in terms of 1Nk simplifies. We drop the subscript k from N for P readability. When ϵ = 0, there are exactly k samples in the neighborhood set N . Specifically, 1 (i, p) = k and N p P p 1 − 1N (i, p) = n − k. Eq. 3 becomes: W1 (i, j) = 1N (i, j) 2 where M+ (i, j) = and M− (i, j) =  · n X M+ (i, j) M− (i, j) + k n−k  , 1N (i, p)1N (j, p), (11) p=1 n X (1 − 1N (i, p)) (1 − 1N (j, p)) . p=1 Under our assumptions, M− (i, j) simplifies: M− (i, j) = n X (1 − 1N (i, p) − 1N (j, p) + 1N (i, p)1N (j, p)) p=1 =n−k−k+ n X 1N (i, p)1N (j, p) (12) p=1 = n − 2k + ⟨1N (i), 1N (j)⟩ Note that M+ (i, j) = ⟨1N (i), 1N (j)⟩, by definition. Eq. 11 becomes:   1N (i, j) ⟨1N (i), 1N (j)⟩ n − 2k + ⟨1N (i), 1N (j)⟩ · + W1 (i, j) = 2 k n−k (13) This can be written as Eq. 8 in the main paper, restated here: W1 (i, j) = 1N (i, j) (a⟨1N (i), 1N (j)⟩ + b) , where a = 1 n − 2k 1 + and b = 2k 2(n − k) 2(n − k) (14) Under the assumption that n > 2k, a > 0 and b > 0. B. Proof of Proposition 4.1 Forward direction If all samples are correctly ranked w.r.t. every other sample within the batch, show that Lcontext = 0. Proof Recall that the Proposition assumes that there are exactly k samples from each class in the batch. This means that the k neighborhood set N (i) of every sample i contains exactly the k samples in the batch with the same label. In other words, 1N (i, j) = yij , ∀i, j. Further, if 1N (i, j) = 1, then the inner product ⟨1N (i), 1N (j)⟩ = k because the neighborhood sets of i and j are identical and there are exactly k samples in this set. When ⟨1N (i), 1N (j)⟩ = k, W1 (i, j) = 1N (i, j). 
When 1N (i, j) = 0, W1 (i, j) = 0. Therefore, W1 (i, j) = 1N (i, j) = yij , ∀i, j. Following Eq. 4, it is also clear that wij = W1 (i, j) when samples are correctly ranked. This is because, the Rk/2 neighborhood is a subset of the larger Nk neighborhood. That is, if 1Rk/2 (i, j) = 1, then 1N (i, j) = 1. Since 1N (i, p) = 1N (j, p), ∀p, when 1N (i, j) = 1, averaging the entries of W1 over Rk/2 as indicated in Eq. 4 does not change W1 . Therefore, W2 (i, j) = W1 (i, j). Finally, yij is symmetric, so W2 (i, j) = wij . This proves that yij = wij and Lcontext = 0 by Eq. 5. Backward direction If Lcontext = 0, show that all samples are correctly ranked w.r.t. every other samples within the batch. Proof We prove this by contradiction. Suppose that Lcontext = 0 and there exists a pair of samples i, j where j is not correctly ranked w.r.t. i. There are two possible cases: (1) yij = 1 but 1N (i, j) = 0 or (2) yij = 0 but 1N (i, j) = 1. (1) 13 Supervised Metric Learning to Rank for Retrieval via Contextual Similarity Optimization Figure 8. Comparison of scalability in time and space between our loss function and Roadmap. Our method is plotted in blue, while Roadmap is plotted in orange. For this experiment, we pre-calculate the 2048-dimensional ResNet-50 features and train a linear embedding layer using each loss function. Both the time taken per batch (left) and GPU memory used (right) scale cubicly. However, the time and space taken to calculate our loss function is negligible compared to backbone evaluation, even for batch size 6400. On the other hand, the time and space needed to calculate the Roadmap loss quickly becomes unpractical for large batch sizes. Note: we pre-compute all features and store them in GPU memory. The memory used by the pre-computation is freed, but the GPU memory being used, as indicated by the Nvidia tool, remains at the maximum memory used over the lifetime of the process. Therefore, the memory consumed by calculating the loss function is not visible until it exceeds the memory used by the pre-computation. is the case where i and j have the same label, but j is not ranked as one of the top-k neighbors of i. (2) is the case where i and j have different labels, but j is ranked as one of the top-k neighbors of i. Since we assume that there are exactly k samples with the same label per batch, both (1) and (2) must be true. In particular, assume that (2) is true. Under Eq. 14, if 1N (i, j) = 1, then W1 (i, j) > 0 because 1N (i, j) is multiplied by a non-zero positive number. Similarly, under Eq. 4, if W1 (i, j) > 0, then wij > 0 because W1 (i, j) is multiplied by non-zero positive numbers and added to positive numbers. Since wij > 0 and yij = 0, Lcontext > 0, which contradicts the assumption that Lcontext = 0 and concludes the proof. C. Scalability While the calculation of our loss function scales cubicly in time and space, GPU implementations of matrix multiplication are highly optimized. Consequently, our contextual loss remains practical even for batch sizes as high as 6400. See Figure 8. In comparison, AP surrogates such as Roadmap become impractical beyond batch sizes of about 2000. We used one A40 GPU for this experiment, so we were unable to plot points above 32 GB of GPU memory. D. Decomposability Gap Ramzi et al. (2021) demonstrated mathematically the idea of a decomposability gap in their Roadmap paper. When optimizing a ranking metric such as AP, the average batch-wise AP is a loose upper bound of AP over the entire dataset. Given a reasonable embedding space (e.g. 
D. Decomposability Gap

Ramzi et al. (2021) demonstrated mathematically the idea of a decomposability gap in their Roadmap paper. When optimizing a ranking metric such as AP, the average batch-wise AP is a loose upper bound on AP over the entire dataset. Given a reasonable embedding space (e.g. halfway through training), most randomly sampled batches are perfectly ranked, even if the ranking over the entire dataset is far from optimal. Ramzi et al. (2021) call this difference between the batch-wise objective and the dataset-wise objective the "decomposability gap" and propose to use the standard contrastive loss to reduce it. The resulting optimization objective is a convex combination of the ranking loss and $\mathcal{L}_{\text{contrast}}$.

This decomposability gap could partially explain why we need to regularize $\mathcal{L}_{\text{context}}$ with $\mathcal{L}_{\text{contrast}}$. Specifically, $\mathcal{L}_{\text{context}}$ reaches a minimum of zero when all samples within a mini-batch are correctly ranked. At this point, the embedding no longer changes until an incorrectly ranked mini-batch is sampled. This is problematic because, without hard offline sampling, the probability that a randomly sampled batch contains the "hard" triplet that is incorrectly ranked diminishes. This justifies the need for $\mathcal{L}_{\text{contrast}}$ to complement our contextual loss. See Figure 9 for an empirical verification of the existence of a decomposability gap on the CUB dataset.

Figure 9. Empirical verification of the existence of a decomposability gap when optimizing the contextual loss. We run this experiment on CUB, training a linear embedding layer on top of fixed ResNet-50 features, with different values of λ and different batch sizes. λ = 0 corresponds to the contrastive loss, while λ = 1 corresponds to the contextual loss. We plot only the train R@1 accuracy in this figure. Note that a batch size of 6400 covers the entire dataset (i.e. the decomposability gap is zero). The plots on the right show R@1, AP@R, and the contextual loss over the course of training for λ = 0 and λ = 1 at both the largest and smallest batch sizes. Observe that with batch size 400, optimizing the contextual loss (λ = 1, blue) does not converge to the correct ranking on the entire dataset: training stalls at around 1000 iterations, because the embedding space is by then good enough that the batch-wise ranking is perfect for all sampled batches. With batch size 6400 (no mini-batching), the decomposability gap is zero and the contextual loss clearly converges to the correct ranking over the entire dataset. Further observe that at the larger batch sizes, the best train R@1 is achieved at λ = 1, suggesting that the contrastive loss is not needed at large batch sizes, where the decomposability gap is smaller.

E. Contextual Similarity Definition in Set Notation

In this section, we restate the contextual similarity definition of Section 3 in the main paper, following the notation of Kim et al. (2022). We compute the contextual similarity on a single batch, not the entire dataset. Two samples are contextually similar if they share the same neighbors, i.e. the intersection of their k-neighbor sets is large. Following this intuition, we calculate $\tilde{w}_{ij}$, the preliminary contextual similarity between samples $i$ and $j$:

$$N_{k+\epsilon}(i) = \{\, j \mid s_{ij} \le s_{ip} + \epsilon,\ \text{where } p \text{ denotes the } k\text{-th closest neighbor of } i \,\} \tag{15}$$

$$\tilde{w}_{ij} = \begin{cases} |N_{k+\epsilon}(i) \cap N_{k+\epsilon}(j)| \,/\, |N_{k+\epsilon}(i)| & \text{if } j \in N_{k+\epsilon}(i) \\ 0 & \text{otherwise.} \end{cases} \tag{16}$$

We include $i$ in $N_{k+\epsilon}(i)$ (so if $k = 2$, then $N_{k+0}(i)$ includes two elements: $i$ and its closest neighbor).
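A small PyTorch sketch of Eqs. 15–16 in their discrete (non-differentiable) form, useful for checking the definition on a batch of embeddings. Here the ε-margin is applied to squared Euclidean distances D = 2 − 2s between normalized embeddings, which is how the margin enters the matrix formulation of Appendix E.3; the function and variable names are ours.

```python
import torch

def neighbor_indicator(s, k, eps):
    """Discrete k+eps neighborhood of Eq. 15, computed on squared distances D = 2 - 2s.

    s: (n, n) cosine similarity matrix of L2-normalized embeddings (s_ii = 1).
    Returns a {0, 1} matrix with one_N[i, j] = 1 iff D(i, j) <= D(i, p) + eps, where p is
    the k-th closest neighbor of i (i itself is always included, since D[i, i] = 0).
    """
    D = 2.0 - 2.0 * s
    d_k = D.topk(k, dim=1, largest=False).values[:, -1:]   # distance to the k-th closest neighbor
    return (D <= d_k + eps).float()

def preliminary_contextual_similarity(s, k, eps):
    """Eq. 16: |N(i) ∩ N(j)| / |N(i)| if j ∈ N(i), else 0 (asymmetric in general)."""
    one_N = neighbor_indicator(s, k, eps)
    intersection = one_N @ one_N.T                          # |N(i) ∩ N(j)|
    w_tilde = intersection / one_N.sum(dim=1, keepdim=True)
    return w_tilde * one_N                                  # zero out pairs with j not in N(i)
```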
Then, we use query expansion and symmetrize to obtain the final contextual similarity. Query expansion (Arandjelović & Zisserman, 2012) is a standard evaluation-time trick in metric learning: it boosts retrieval performance by retrieving neighbors of a sample's neighbors. Analogously, we adjust the contextual similarity by averaging $\tilde{w}_{ij}$ over the set of close reciprocal neighbors $R_{k/2}(i)$:

$$N_{k/2+\epsilon}(i) = \{\, j \mid s_{ij} \le s_{ip} + \epsilon,\ \text{where } p \text{ denotes the } k/2\text{-th closest neighbor of } i \,\} \tag{17}$$

$$R_{k/2+\epsilon}(i) = \{\, j \mid j \in N_{k/2+\epsilon}(i) \text{ and } i \in N_{k/2+\epsilon}(j) \,\} \tag{18}$$

$$\hat{w}_{ij} = \frac{1}{|R_{k/2+\epsilon}(i)|} \sum_{p \in R_{k/2+\epsilon}(i)} \tilde{w}_{pj}, \qquad w_{ij} = \frac{1}{2}\big(\hat{w}_{ij} + \hat{w}_{ji}\big). \tag{19}$$

Note that $w_{ij} \in [0,1]$ and depends only on the cosine similarities $s_{ij}$ between embeddings. We want to optimize the embeddings such that $w_{ij}$ converges to $y_{ij}$. The value of $k$ is not arbitrary; it must be set to the number of samples per label in the mini-batch. For example, a standard metric learning setup is to randomly sample 32 labels and then sample 4 images per label; in this case, we set $k = 4$.

E.1. Differences with STML

This subsection enumerates in detail the technical differences between our definition of contextual similarity ($w_{ij}$ in Eq. 19) and the definition of contextual similarity in STML (Kim et al., 2022). Note that by necessity, we reference equations and notation in the STML paper.

1. STML averages a non-linear function of the cosine similarity ($w^P_{ij}$ in their Eq. 5) with the contextual similarity ($w^C_{ij}$ in their Eq. 5) to arrive at an estimate of the ground-truth similarity. In their work, the resulting similarity is used to "teach" a student embedding network. In our work, the contextual similarity is optimized directly by the contextual loss $\mathcal{L}_{\text{context}}$, and the cosine similarity is optimized directly by the contrastive loss $\mathcal{L}_{\text{contrast}}$.

2. We add a margin term ε to the neighborhood definition; STML uses ε = 0. In our loss, the margin is necessary to encourage separation between embeddings with different labels.

3. (Minor) We use the set of k + ε neighbors to calculate $\tilde{w}_{ij}$ in our Eq. 16; STML uses the set of k reciprocal neighbors to calculate $\tilde{w}_{ij}$ in their Eq. 7.

4. (Minor) We use the set of k reciprocal neighbors for query expansion in our Eq. 18; STML uses the k/2 neighbors in their Eq. 8.

Additionally, our work contains the following technical innovations not found in Kim et al. (2022):

• The implementation of contextual similarity in Kim et al. (2022) involves indexing a matrix and setting those values to a constant; consequently, the output depends on the input only indirectly through an indexing operation. We re-implement the contextual similarity calculation in a way that uses only greater-than, logical-and, addition, and multiplication. This process is explained in detail in Appendices E.2 and E.3.

• We determine suitable differentiable substitutes for the logical-and and greater-than functions. Logical-and could be replaced by simple averaging, multiplication, or min; greater-than could be replaced by a sigmoid, a smooth upper bound, or the constant-gradient heaviside. We ultimately choose multiplication and the constant-gradient heaviside. These two choices are intuitively motivated in Appendix E.2 and Fig. 10, and experimentally justified in Table 5.

• (Minor) Optimizing the intersection of neighborhood sets in Eq. 16 directly is problematic and leads to shrinkage of the embedding space. We mitigate this issue by additionally optimizing the intersection of the complements of the neighborhood sets; this is explained in Figure 12.
E.2. Detailed Optimization Motivation

The definition in the previous subsection contains three discrete operations: (1) greater-than, (2) logical-and, and (3) intersection. For optimization, we deal extensively with indicator matrices: we denote by $\mathbb{1}_N \in \mathbb{R}^{n \times n}$ the indicator matrix with $\mathbb{1}_N(i,j) = 1$ if $j \in N(i)$, for some set-valued map $N$. We use $\odot$ to denote element-wise multiplication.

Greater-than. This operation is used in Eq. 15 and 17 and is equivalent to the non-differentiable heaviside function θ. Our approach is to use the exact value of θ(·) in the forward pass and a constant positive gradient α in the backward pass. A constant positive gradient is reasonable since θ is a (non-strictly) increasing function:

$$\text{Forward: } \theta(x) = \begin{cases} 1 & x \ge 0 \\ 0 & \text{otherwise} \end{cases} \qquad\qquad \text{Backward: } \frac{\partial \theta(x)}{\partial x} := \alpha. \tag{20}$$

In Section F.1 we show with a toy experiment that this approach is robust despite being somewhat heuristic. In contrast, θ is traditionally approximated by a sigmoid (e.g. for AP approximation and in gated recurrent networks (Cho et al., 2014)):

$$\theta_\sigma(x) = \frac{1}{1 + \exp(-x/\tau)}. \tag{21}$$

$\theta_\sigma$ trades off the quality of the approximation against the domain where gradients are non-zero. As the temperature τ decreases, $\theta_\sigma$ approaches θ, but the gradient vanishes everywhere except in a small region around the boundary. This behavior is not intuitive: it is undesirable to have a large gradient only at the boundary. Some prior work (e.g. ROADMAP) side-steps this issue by using an upper bound on the heaviside function (where the right side of the heaviside increases linearly), which solves the gradient issue at the expense of grossly over-estimating the true objective. In our case this is especially concerning, since we multiply indicator functions together. In Section G.1, we show that our approach in Eq. 20 achieves better empirical results than a sigmoid approximation.

Logical-and. A logical-and is used explicitly in Eq. 18 and implicitly in Eq. 16 and 19. There are two differentiable substitutes for logical-and: min and multiplication. Multiplication is smooth, but has no gradient at the origin; min is logically consistent on the continuous domain [0, 1], but its gradient is discontinuous when the inputs are equal. We found experimentally that multiplication is the best option (see row 5 of Table 5). A gradient of zero at the origin is desirable in the case of Eq. 18 and 19 (the query expansion step). A work-around for the zero-gradient issue in Eq. 16 is presented in the next subsection.

Intersection. The number of elements in the intersection in Eq. 16 is calculated using matrix multiplication. A sample $p \in N_{k+\epsilon}(i) \cap N_{k+\epsilon}(j)$ if $p \in N_{k+\epsilon}(i)$ and $p \in N_{k+\epsilon}(j)$. Using multiplication for the logical-and, the number of elements in the intersection can be expressed compactly as a function of the indicator matrices:

$$M_+ = \mathbb{1}_{N_{k+\epsilon}} \mathbb{1}_{N_{k+\epsilon}}^\top, \quad\text{where } M_+(i,j) = |N_{k+\epsilon}(i) \cap N_{k+\epsilon}(j)|. \tag{22}$$

We are now ready to state the loss function by combining the definition of contextual similarity with the optimization method.
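The constant-gradient heaviside of Eq. 20 can be written as a custom autograd function. The sketch below is our illustration of the idea rather than the authors' released code; the class name and the way α is passed are our choices.

```python
import torch

class HeavisideConstGrad(torch.autograd.Function):
    """theta(x): exact step function in the forward pass, constant gradient alpha in the backward pass (Eq. 20)."""

    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return (x >= 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        # Pass a constant positive "slope" alpha to the input; no gradient for alpha itself.
        return ctx.alpha * grad_output, None

def greater_than(x, alpha=10.0):
    return HeavisideConstGrad.apply(x, alpha)

# Example usage for Eq. 23: indicator that D(i, j) <= D(i, p) + eps, with gradients
# flowing only through D(i, j) because D(i, p) is wrapped in a stop-gradient (detach).
# one_N = greater_than(-D + D_k.detach() + eps)
```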
E.3. Loss in Matrix Notation

Our loss function straightforwardly combines the previous two subsections, with one exception for the intersection calculation in Eq. 16. Optimizing the intersection through $M_+$ alone tends to focus on pairs of neighboring samples, because using multiplication for logical-and zeros out the gradient when both inputs are 0. This is problematic since the resulting embedding becomes clustered regardless of true similarity (see Figure 12). We mitigate this problem by optimizing the intersection of the complements $N^c_{k+\epsilon}(i) \cap N^c_{k+\epsilon}(j)$ in addition to the original intersection in Eq. 16; the superscript $c$ denotes the complement of a set. Maximizing the size of the intersection of two sets is equivalent to maximizing the size of the intersection of their complements, and vice versa; this can be trivially proven under the assumptions that the universal set is fixed and the sizes of both sets are constant.

Let $D \in \mathbb{R}^{n \times n}$ denote the matrix whose entry $D(i,j) \in [0,4]$ is the squared Euclidean distance between the normalized features of samples $i$ and $j$: $D(i,j) = 2 - 2 s_{ij}$. sg denotes the stop-gradient operator.

Step 1 (Neighborhood Optimization):

$$\mathbb{1}_{N_{k+\epsilon}}(i,j) = \theta\big(-D(i,j) + \text{sg}(D(i,p)) + \epsilon\big), \quad\text{where } p \text{ denotes the } k\text{-th closest neighbor of } i. \tag{23}$$

Step 2 (Optimizing Intersection of Neighborhoods):

$$\tilde{W} = \frac{1}{2}\left( \frac{M_+}{\text{sg}(|N_{k+\epsilon}(i)|)} + \frac{M_-}{\text{sg}(|N^c_{k+\epsilon}(i)|)} \right) \odot \mathbb{1}_{N_{k+\epsilon}}, \quad\text{where } M_+ = \mathbb{1}_{N_{k+\epsilon}}\mathbb{1}_{N_{k+\epsilon}}^\top \text{ and } M_- = \mathbb{1}_{N^c_{k+\epsilon}}\mathbb{1}_{N^c_{k+\epsilon}}^\top \tag{24}$$

(the division by $|N_{k+\epsilon}(i)|$ and $|N^c_{k+\epsilon}(i)|$ is row-wise).

Step 3 (Query Expansion):

$$\mathbb{1}_{N_{k/2+\epsilon}}(i,j) = \theta\big(-D(i,j) + \text{sg}(D(i,p)) + \epsilon\big), \quad\text{where } p \text{ denotes the } k/2\text{-th closest neighbor of } i \tag{25}$$

$$\mathbb{1}_{R_{k/2+\epsilon}} = \mathbb{1}_{N_{k/2+\epsilon}} \odot \mathbb{1}_{N_{k/2+\epsilon}}^\top \tag{26}$$

$$\hat{W} = \frac{\mathbb{1}_{R_{k/2+\epsilon}}\, \tilde{W}}{|R_{k/2+\epsilon}(i)|}, \qquad W = \frac{1}{2}\big(\hat{W} + \hat{W}^\top\big). \tag{27}$$

Finally, we use the MSE loss to optimize $w_{ij} := W(i,j)$ against the true similarity labels $y_{ij}$:

$$\mathcal{L}_{\text{context}} = \frac{1}{n^2} \sum_{i,j \,|\, i \ne j} (y_{ij} - w_{ij})^2. \tag{28}$$

Figure 10. Illustration of $N_{k+\epsilon}$ optimization using the simplified loss function $\mathcal{L}_1$ in Eq. 32. Negative samples which are in the set $N_{k+\epsilon}(i)$ are pushed away from $i$ by a constant gradient; positive samples which are not in the set $N_{k+\epsilon}(i)$ are pulled closer. This behavior is analogous to the standard contrastive loss: $\mathcal{L}_{\text{contrast}}$ applies a constant gradient to pairs violating a fixed margin, while $\mathcal{L}_1$ applies a constant gradient to pairs violating the flexible margin defined by the k-th neighbor. k = 4 in the illustration.

$\mathcal{L}_{\text{context}}$ is the key to our framework and works reasonably well on its own. However, it has two vulnerabilities: (1) similar to learning-to-rank losses, $\mathcal{L}_{\text{context}}$ can be small even if the distance between positive pairs is large, so long as there are no closer negative pairs (see Figure 10); (2) $\mathcal{L}_{\text{context}}$ tends to converge to a solution where the average cosine similarity between all pairs is large, suggesting that only a small portion of the available embedding space is utilized. In response to problem (1), we add the standard contrastive loss (Hadsell et al., 2006), which explicitly optimizes the cosine similarity toward fixed margins $\delta_+$ and $\delta_-$. In response to problem (2), we add a non-standard but straightforward similarity regularizer, which regularizes the average cosine similarity between all pairs toward a fixed value $\tilde{s}$.
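To make the three steps concrete, here is a self-contained PyTorch sketch of Eqs. 23–28, written from the equations above rather than copied from the released implementation. The constant-gradient heaviside is realized with a detach trick that is equivalent to the autograd function shown earlier; all function names are ours, and hyperparameter defaults are only illustrative.

```python
import torch

def heaviside_const_grad(x, alpha):
    # Exact step in the forward pass; constant gradient alpha in the backward pass (Eq. 20).
    return (x >= 0).float().detach() + alpha * (x - x.detach())

def contextual_loss(z, labels, k=4, eps=0.05, alpha=10.0):
    """Sketch of L_context (Eqs. 23-28). z: (n, d) L2-normalized embeddings."""
    n = z.shape[0]
    y = (labels[:, None] == labels[None, :]).float()         # y_ij
    s = z @ z.T                                               # cosine similarities
    D = 2.0 - 2.0 * s                                         # squared Euclidean distances

    def neighborhood(num):
        # 1_{N_{num+eps}}(i, j) = theta(-D_ij + sg(D_ip) + eps), p = num-th closest neighbor of i
        d_k = D.detach().topk(num, dim=1, largest=False).values[:, -1:]
        return heaviside_const_grad(-D + d_k + eps, alpha)

    # Step 1: neighborhood indicators and their complements
    one_N = neighborhood(k)
    one_Nc = 1.0 - one_N

    # Step 2: intersections of neighborhoods and of their complements (Eq. 24)
    M_plus = one_N @ one_N.T
    M_minus = one_Nc @ one_Nc.T
    W_tilde = 0.5 * (M_plus / one_N.sum(1, keepdim=True).detach()
                     + M_minus / one_Nc.sum(1, keepdim=True).detach()) * one_N

    # Step 3: query expansion over close reciprocal neighbors (Eqs. 25-27)
    one_Nhalf = neighborhood(max(k // 2, 1))
    one_R = one_Nhalf * one_Nhalf.T
    W_hat = (one_R @ W_tilde) / one_R.sum(1, keepdim=True)    # normalization NOT detached (Appendix H.6)
    W = 0.5 * (W_hat + W_hat.T)

    # Eq. 28: MSE against the binary labels, excluding the diagonal
    off_diag = 1.0 - torch.eye(n, device=z.device)
    return ((y - W) ** 2 * off_diag).sum() / (n * n)
```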
Our final framework (Eq. 31) minimizes a weighted combination of the three losses:

$$\mathcal{L}_{\text{contrast}} = \frac{\sum_{i,j \,|\, y_{ij}=1} (\delta_+ - s_{ij})_+}{\big|\{(i,j) \mid y_{ij}=1 \text{ and } \delta_+ - s_{ij} > 0\}\big|} + \frac{\sum_{i,j \,|\, y_{ij}=0} (s_{ij} - \delta_-)_+}{\big|\{(i,j) \mid y_{ij}=0 \text{ and } s_{ij} - \delta_- > 0\}\big|} \tag{29}$$

$$\mathcal{L}_{\text{reg}} = \left( \tilde{s} - \frac{1}{n^2} \sum_{i,j} s_{ij} \right)^2 \tag{30}$$

$$\mathcal{L}_{\text{ours}} = \lambda\, \mathcal{L}_{\text{context}} + (1-\lambda)\, \mathcal{L}_{\text{contrast}} + \gamma\, \mathcal{L}_{\text{reg}}. \tag{31}$$

Note that El-Nouby et al. (2021) propose a superficially similar similarity regularization scheme, which "maximizes the distance between every point and its nearest neighbor". Our $\mathcal{L}_{\text{reg}}$ regularizes the average similarity over all pairs of samples. Nearest-neighbor pairs are dominated by positive pairs, while all pairs are dominated by negative pairs; therefore, our regularizer is fundamentally different from El-Nouby et al. (2021).
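A short sketch of the combined objective in Eqs. 29–31, assuming s is the (n, n) cosine-similarity matrix, y the binary label matrix, and context_loss_value any implementation of Eq. 28 (such as the sketch in Appendix E.3 above). The clamp-based averaging over violating pairs is our reading of the normalization in Eq. 29, and the default margins simply echo the values reported in Appendix G.

```python
import torch

def contrastive_loss(s, y, delta_pos=0.75, delta_neg=0.6):
    """Eq. 29: margin violations averaged over the violating pairs only."""
    pos_viol = torch.clamp(delta_pos - s, min=0) * y
    neg_viol = torch.clamp(s - delta_neg, min=0) * (1 - y)
    pos_count = (pos_viol > 0).sum().clamp(min=1)   # avoid division by zero when nothing violates
    neg_count = (neg_viol > 0).sum().clamp(min=1)
    return pos_viol.sum() / pos_count + neg_viol.sum() / neg_count

def similarity_regularizer(s, s_target=0.25):
    """Eq. 30: pull the mean pairwise cosine similarity toward s_target."""
    return (s_target - s.mean()) ** 2

def total_loss(s, y, context_loss_value, lam=0.4, gamma=0.1):
    """Eq. 31: weighted combination of the three losses."""
    return (lam * context_loss_value
            + (1 - lam) * contrastive_loss(s, y)
            + gamma * similarity_regularizer(s))
```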
F. Slightly Different Analysis

The contextual similarity loss function $\mathcal{L}_{\text{context}}$ is highly non-trivial. In this section we demystify each of the three steps using a toy experiment and gradient analysis. This analysis is slightly different from the one given in the main paper.

F.1. Step 1: Neighborhood Optimization

Consider the following simplified version of $\mathcal{L}_{\text{context}}$:

$$\mathcal{L}_1 = \mathcal{L}_{\text{MSE}}\big(y_{ij}, \mathbb{1}_{N_{k+\epsilon}}(i,j)\big). \tag{32}$$

This is a valid loss function that works reasonably well on its own (see Figure 11). During each iteration, we sample a batch with exactly k samples from each label. Intuitively, $\mathcal{L}_1$ reaches a minimum of zero when $N_{k+\epsilon}(i)$ includes only $i$ and the $k-1$ other samples with the same label, i.e. all samples are correctly ranked. Using the gradient of θ(·) defined in Eq. 20, we see that negative samples which are in $N_{k+\epsilon}(i)$ receive a gradient of magnitude α pushing them away from $i$, while positive samples which are not in $N_{k+\epsilon}(i)$ receive a gradient of magnitude α pulling them toward $i$. See Figure 10.

Figure 11. This figure justifies the non-standard approach of optimizing $\mathcal{L}_1$ (Eq. 32) by overriding the gradient of θ. We generate synthetic 2-D data consisting of five concentric circles; each color represents a label. We use a standard three-layer MLP with ReLU activations to output a 2-D embedding. We train on increasingly noisy data and plot the resulting embeddings on clean data. We sample only 4 points per label per batch (k = 4). $\mathcal{L}_1$ minimization results in reasonable embeddings despite its simplicity.

Figure 12. This figure justifies the $M_-$ term in Eq. 24. The left plot shows the distribution of distances between pairs in the normalized embedding space when we only optimize $M_+$; the right plot shows the same when we optimize both $M_-$ and $M_+$. A uniformly distributed embedding space has an average pairwise distance of $\sqrt{2}$ because most directions are orthogonal in high dimensions; this is indicated by the green distribution. Clearly, more of the embedding space is utilized when we optimize both $M_-$ and $M_+$.

F.2. Step 2: Optimizing Intersection of Neighborhoods

While $\lambda\mathcal{L}_1 + (1-\lambda)\mathcal{L}_{\text{contrast}}$ is already a reasonable loss function, row 1 of Table 5 shows that $\mathcal{L}_1$ collapses without the contrastive loss (λ = 1). We propose that this is because $\mathcal{L}_1$ provides very sparse gradients: $\partial\mathcal{L}_1/\partial s_{ij} = 0$ for most $(i,j)$ pairs. Consider the following more complicated loss:

$$\mathcal{L}_2 = \mathcal{L}_{\text{MSE}}\big(y_{ij}, \tilde{W}(i,j)\big). \tag{33}$$

$\tilde{W}$ is defined in Eq. 24. $\tilde{W}(i,j)$ consists of two terms multiplied together element-wise. The $\mathbb{1}_{N_{k+\epsilon}}$ term was already addressed in the previous subsection, so we focus on the first term, which optimizes the intersection between neighborhood sets, $M_+$. We offer a straightforward intuition: maximizing the intersection between the neighborhood sets of two samples is equivalent to pushing one sample toward the context of the other sample and vice versa; minimizing the intersection is equivalent to pulling apart the contexts of the two samples. We support this intuition by analyzing the gradient. For simplicity, consider ε = 0, so that the normalization factors in Eq. 24 equal $k$ and $n-k$, respectively. Further consider a negative pair of samples $i$ and $j$ ($y_{ij} = 0$), where $j$ is wrongly ranked w.r.t. $i$ (i.e. $j \in N_k(i)$). Following the loss function equations:

$$M_+(i,j) = \langle \mathbb{1}_{N_k}(i), \mathbb{1}_{N_k}(j)\rangle, \qquad M_-(i,j) = \langle \mathbb{1}_{N_k^c}(i), \mathbb{1}_{N_k^c}(j)\rangle,$$

$$\tilde{W}(i,j) = \frac{1}{2}\left(\frac{1}{k} M_+(i,j) + \frac{1}{n-k} M_-(i,j)\right) \cdot \underbrace{\mathbb{1}_{N_k}(i,j)}_{=1}. \tag{34}$$

$\tilde{W}(i,j) \in (0,1]$ is a strictly positive number. Using the equation for $\mathcal{L}_2$, the gradient w.r.t. $\tilde{W}(i,j)$ must be positive because $y_{ij} = 0$. We are now ready for the backward pass. For simplicity of notation, $\partial\tilde{W}(i,j) := g_{ij} > 0$ denotes the gradient of the loss w.r.t. $\tilde{W}(i,j)$.

$$\partial M_+(i,j) = \frac{1}{2k}\, g_{ij}, \qquad \partial M_-(i,j) = \frac{1}{2(n-k)}\, g_{ij},$$
$$\partial \mathbb{1}_{N_k}(i) = \frac{1}{2k}\, g_{ij}\, \mathbb{1}_{N_k}(j), \qquad \partial \mathbb{1}_{N_k}(j) = \frac{1}{2k}\, g_{ij}\, \mathbb{1}_{N_k}(i),$$
$$\partial \mathbb{1}_{N_k^c}(i) = \frac{1}{2(n-k)}\, g_{ij}\, \mathbb{1}_{N_k^c}(j), \qquad \partial \mathbb{1}_{N_k^c}(j) = \frac{1}{2(n-k)}\, g_{ij}\, \mathbb{1}_{N_k^c}(i). \tag{35}$$

Using the facts that $\mathbb{1}_{N_k^c} = 1 - \mathbb{1}_{N_k}$ and $\partial\mathbb{1}_{N_k} = -\partial\mathbb{1}_{N_k^c}$ (and vice versa):

$$\begin{cases} \partial \mathbb{1}_{N_k}(i) = \frac{g_{ij}}{2k}\, \mathbb{1}_{N_k}(j) \\[4pt] \partial \mathbb{1}_{N_k}(j) = \frac{g_{ij}}{2k}\, \mathbb{1}_{N_k}(i) \end{cases} \tag{36}$$

$$\begin{cases} \partial \mathbb{1}_{N_k}(i) = \frac{g_{ij}}{2(n-k)}\, \big(\mathbb{1}_{N_k}(j) - 1\big) \\[4pt] \partial \mathbb{1}_{N_k}(j) = \frac{g_{ij}}{2(n-k)}\, \big(\mathbb{1}_{N_k}(i) - 1\big) \end{cases} \tag{37}$$

Eq. 36 is the contribution to the gradient from optimizing $\mathbb{1}_{N_k}$, while Eq. 37 is the contribution to the gradient from optimizing $\mathbb{1}_{N_k^c}$.
These can be added together:

$$\begin{cases} \partial \mathbb{1}_{N_k}(i) = \frac{g_{ij}}{2(n-k)}\big(\mathbb{1}_{N_k}(j) - 1\big) + \frac{g_{ij}}{2k}\,\mathbb{1}_{N_k}(j) \\[4pt] \partial \mathbb{1}_{N_k}(j) = \frac{g_{ij}}{2(n-k)}\big(\mathbb{1}_{N_k}(i) - 1\big) + \frac{g_{ij}}{2k}\,\mathbb{1}_{N_k}(i) \end{cases} \tag{38}$$

Going backward through θ(·) multiplies the gradient by −α (the negative sign accounts for optimizing a distance instead of a similarity):

$$\partial D(i) = \frac{-\alpha g_{ij}}{2(n-k)}\big(\mathbb{1}_{N_k}(j) - 1\big) - \frac{\alpha g_{ij}}{2k}\,\mathbb{1}_{N_k}(j), \quad\text{and vice versa for } \partial D(j). \tag{39}$$

From the above equation, $\partial D(i,p) < 0$ when $\mathbb{1}_{N_k}(j,p) = 1$ and $\partial D(i,p) > 0$ when $\mathbb{1}_{N_k}(j,p) = 0$, for all samples $p$ in the batch. In words, we increase the distance between $i$ and all samples in the neighborhood of $j$, and we decrease the distance between $i$ and all points outside the neighborhood of $j$. Recall that we assumed $i$ and $j$ to be a negative pair, so it makes sense to pull $i$ away from the context of $j$ in this fashion.

F.3. Step 3: Query Expansion

Query expansion (QE) is an established trick for image retrieval (Arandjelović & Zisserman, 2012). QE expands the neighborhood set by additionally retrieving the neighbors of very close neighbors. In the unsupervised setting, QE refers to averaging the contextual similarity scores over samples in the $R_{k/2+\epsilon}$ neighborhood (see Eq. 27); this leads to more robust pseudo-supervision. In our setting, we have two reasons for using QE (Eq. 25, 26, and 27): (1) averaging $\tilde{w}_{ij}$ with very close neighbors could have an effect similar to label smoothing, and (2) the $N_{k/2+\epsilon}$ neighborhood is optimized by Eq. 26 and 27. Empirically, QE is necessary to achieve the optimal performance of our framework, since $\mathcal{L}_{\text{context}}$ achieves higher R@1 than $\mathcal{L}_2$ in Table 5.

Table 4. R@k Results. We use ResNet-50 with an embedding size of 512 for all experiments. † indicates results reported by the original authors; we re-run all other baselines using the implementation by Musgrave et al. (2020b). Standard deviations are based on three trials with the same train-test split. Our R@1 performance is at least two standard deviations better than the next best baseline on all datasets, and at least comparable to baselines on the other R@k metrics.

Method         CUB R@1    CUB R@2    CUB R@4    CUB R@8    Cars R@1   Cars R@2   Cars R@4   Cars R@8
Contrastive    68.5±0.3   78.3±0.1   86.0±0.2   91.3±0.1   85.4±0.2   91.1±0.3   94.6±0.3   96.8±0.1
Triplet        67.3±0.2   77.9±0.1   85.6±0.2   91.2±0.1   77.6±1.3   85.4±0.8   90.8±0.7   94.1±0.4
NtXent         65.7±0.4   76.3±0.2   84.3±0.4   90.0±0.4   79.0±0.6   86.0±0.3   91.0±0.2   94.4±0.3
MS             68.9±0.5   78.5±0.4   86.0±0.6   91.4±0.5   88.7±0.4   93.0±0.2   95.7±0.1   97.3±0.1
N-Softmax†     61.3       73.9       83.5       90.0       84.2       90.4       94.4       96.9
Proxy NCA++†   69.0±0.8   79.8±0.7   87.3±0.7   92.7±0.4   86.5±0.4   92.5±0.3   95.7±0.2   97.7±0.1
Fast-AP        63.3±0.1   73.7±0.4   82.2±0.3   88.5±0.2   74.7±0.4   82.5±0.7   88.0±0.6   92.2±0.2
Smooth-AP      66.5±0.9   76.6±0.5   84.8±0.6   90.8±0.4   81.1±0.2   87.8±0.4   92.2±0.3   95.1±0.3
ROADMAP        68.7±0.5   78.3±0.3   86.1±0.3   91.1±0.1   84.5±0.5   90.3±0.0   93.9±0.0   96.2±0.1
Ours           69.8±0.2   79.8±0.1   87.1±0.1   92.3±0.2   89.3±0.0   93.7±0.2   96.3±0.1   97.8±0.2

Method         SOP R@1    SOP R@10   SOP R@100  iNat R@1   iNat R@4   iNat R@16  iNat R@32
Contrastive    82.4±0.0   91.9±0.0   96.0±0.0   43.5±0.1   62.7±0.1   77.6±0.1   83.2±0.1
Triplet        82.0±0.0   92.5±0.1   96.7±0.0   35.4±0.1   56.5±0.1   74.7±0.1   81.7±0.1
NtXent         79.7±0.2   90.8±0.0   96.1±0.0   40.8±0.1   61.6±0.1   78.0±0.0   83.9±0.0
MS             81.4±0.0   91.4±0.0   96.1±0.1   44.9±0.1   63.9±0.1   78.4±0.1   83.9±0.1
N-Softmax†     78.2       90.6       96.2       –          –          –          –
Proxy NCA++†   80.7±0.5   92.0±0.3   96.7±0.1   –          –          –          –
Fast-AP        80.3±0.1   91.0±0.1   96.0±0.0   35.6±0.2   55.8±0.1   72.8±0.0   79.3±0.0
Smooth-AP      82.0±0.0   92.6±0.0   96.9±0.0   42.7±0.0   63.3±0.0   79.0±0.0   84.7±0.0
ROADMAP        83.1±0.1   92.6±0.0   96.6±0.0   45.9±0.1   65.8±0.0   80.4±0.1   85.7±0.0
Ours           83.3±0.0   92.9±0.1   96.7±0.0   46.2±0.0   65.8±0.1   80.2±0.1   85.4±0.1

Table 5. Ablation Results. Here, we experiment with variations of $\mathcal{L}_{\text{context}}$, both by itself (λ = 1) and regularized by a small amount of contrastive loss (λ = 0.8). We exhaustively test various ways to simplify or modify the contextual loss presented in Eq. 23–28. On the SOP dataset, all of the modifications to $\mathcal{L}_{\text{context}}$ decrease R@1.

Ablation (γ = 0)                             SOP R@1 (λ = 0.8)   SOP R@1 (λ = 1.0)   Explanation
λL1 + (1−λ)Lcontrast                         83.13               41.99               Eq. 32 (step 1 only)
λL1,σ + (1−λ)Lcontrast                       79.20               75.61               Eq. 32 but using θσ in Eq. 21
λL2 + (1−λ)Lcontrast                         82.39               81.05               Eq. 33 (skip step 3)
W̃ = 1_{N_{k+ε}} (skip step 2)                82.74               81.74               –
λLcontext,min + (1−λ)Lcontrast               66.57               –                   Use min instead of ⊙ for logical-and
λLcontext,σ + (1−λ)Lcontrast                 82.39               77.84               Use θσ for all steps
λLcontext,M+ + (1−λ)Lcontrast                82.31               80.72               Remove M− term from Eq. 24
No stop gradient in Eq. 24                   73.03               –                   –
λLcontext + (1−λ)Lcontrast                   83.20               82.04               Final results without Lreg

G. 256 × 256 Resolution Experiments

We include 256 × 256 image-resolution results in this section. This setup is slightly different from the main paper; the results are comparable to those reported in the ROADMAP paper.

Hyperparameters and Setup for 256 × 256 Experiments. We use hyperparameter values that are approximately optimal across all datasets: λ = 0.4, γ = 0.1, α = 10.0, ε = 0.05, k = 4, δ+ = 0.75, δ− = 0.6, s̃ = 0.25. We follow the same training procedure as ROADMAP, but with a slightly faster learning-rate schedule for time efficiency. We use Adam with a schedule that multiplies the learning rate by 0.3 at epochs 15, 30, and 45, and we train for 80 epochs. We report results on the model with the best test R@1 metric, as is standard in the literature. We use an initial learning rate of 8 × 10⁻⁵ on CUB, 1.6 × 10⁻⁴ on Cars, 4 × 10⁻⁵ on SOP, and 8 × 10⁻⁵ on iNaturalist.
We use a batch size of 256 for iNaturalist and 128 for the other datasets; the larger batch size is necessary to achieve reasonable performance on iNaturalist. We use a 4-per-class sampler. For CUB and Cars, we use random sampling (sample 32 classes at random, then sample 4 images per class). For SOP and iNaturalist, we use hierarchical sampling (Cakir et al., 2019), following prior work. We use an embedding size of 512 and ResNet-50 with a linear embedding layer. Following prior work, we use layer-norm and max-pooling on the smaller CUB and Cars datasets. We always freeze batch-norm.

Discussion for 256 × 256 Experiments. Our R@1 results are at least two standard deviations better than the best baseline across all datasets. Our results are especially good on CUB and Cars, where we achieve R@1 gains of 0.8% and 0.6%, respectively. We achieve more modest R@1 gains of 0.2% and 0.3% on SOP and iNaturalist, respectively. These gains are significant because the standard deviations are relatively small. We also note that while Proxy NCA++ and MS are the best baselines on CUB and Cars, ROADMAP is the best baseline on SOP and iNaturalist. Our method is the best across all datasets, suggesting that it is more versatile.

G.1. Ablation Results

In Table 5, we evaluate the contribution of each step in the calculation of $\mathcal{L}_{\text{context}}$. These experiments focus on dissecting $\mathcal{L}_{\text{context}}$, so we set γ = 0 (no similarity regularizer) and λ = 0.8 or 1.0. Overall, all of the modifications tested in Table 5 decrease the R@1 performance of $\mathcal{L}_{\text{context}}$. In particular: row 1 shows that including only step 1 leads to a collapsed representation when λ = 1.0; rows 3 and 4 show that steps 3 and 2 of the loss calculation are necessary; row 7 shows that the $M_-$ term in Eq. 24 is necessary; and rows 2 and 6 show that our approach to optimizing θ in Eq. 20 is better than using a sigmoid.

H. Minor Experimental Details

H.1. Code and Environment

We include code with our submission. We run experiments on one V100 GPU with 16 GB of memory. The CUB and Cars experiments take under one hour; the SOP and iNaturalist experiments take 4 hours and 6 hours, respectively. Some of our code is borrowed from ROADMAP. For faster experimentation, we use mixed-precision floating point. Experiments take more than 12 hours on P100 GPUs, partially because mixed-precision arithmetic does not appear to speed up experiments as much on P100 GPUs as on V100 GPUs.

H.2. Augmentation (Specific to 256 × 256 Experiments)

On CUB, Cars and SOP, we use random resized crop with default parameters to crop the image to 256 × 256 pixels, and we horizontally flip the image with 50% probability. For testing, we resize the image to 288 pixels, then center-crop to 256 pixels. On iNaturalist, we random-resize-crop to 224 × 224 pixels, then flip horizontally with 50% probability; we use a smaller image size on iNaturalist so that the larger batch size fits in the 16 GB of GPU memory. For testing, we resize the image to 256 pixels, then center-crop to 224 pixels.
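A sketch of this augmentation pipeline using standard torchvision transforms is given below. Parameters beyond those stated in the text (for example, the normalization statistics) are our assumptions, not taken from the released code.

```python
from torchvision import transforms

# ImageNet statistics are assumed here; the appendix does not state the normalization used.
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

# CUB / Cars / SOP (256x256); for iNaturalist, replace 256 with 224 and 288 with 256.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(256),       # default scale/ratio parameters
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    normalize,
])
test_transform = transforms.Compose([
    transforms.Resize(288),                  # resize the shorter side to 288
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    normalize,
])
```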
H.3. Sampling (Specific to 256 × 256 Experiments)

We use a batch size of 128 on CUB, Cars and SOP, and a batch size of 256 on iNaturalist. On all datasets, we train for 80 epochs. An epoch is defined as 15, 30, 655, and 576 batches for CUB, Cars, SOP, and iNaturalist, respectively. On CUB and Cars, we use a random 4-per-class sampler. On SOP and iNaturalist, we use a balanced hierarchical sampling strategy, since these datasets have super-labels. In particular, half of each batch is randomly sampled from one super-label, and the other half is randomly sampled from another super-label. In order to sample in a balanced manner (to prevent over-emphasis of uncommon super-labels), we arrange all samples in the training dataset into half-batches with the same super-label at the beginning of each epoch; we then form batches by pairing up half-batches in a round-robin fashion. A sketch of this procedure is shown below.
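The following is a minimal sketch of the balanced super-label sampling just described, written from the description rather than the released code; the shuffling details, function name, and the omission of the 4-per-class constraint within each half-batch are our simplifications.

```python
import random
from collections import defaultdict

def super_label_batches(super_labels, batch_size, seed=0):
    """Yield batches of sample indices; each batch joins two half-batches drawn from super-labels.

    super_labels: list where super_labels[i] is the super-label of training sample i.
    (A full implementation would additionally enforce 4 samples per fine label inside each half-batch.)
    """
    rng = random.Random(seed)
    half = batch_size // 2

    # Group sample indices by super-label and cut each group into same-super-label half-batches.
    groups = defaultdict(list)
    for idx, s in enumerate(super_labels):
        groups[s].append(idx)
    half_batches = []
    for idxs in groups.values():
        rng.shuffle(idxs)
        half_batches += [idxs[i:i + half] for i in range(0, len(idxs) - half + 1, half)]

    # Pair half-batches round-robin so uncommon super-labels are not over-emphasized.
    rng.shuffle(half_batches)
    for a, b in zip(half_batches[0::2], half_batches[1::2]):
        yield a + b
```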
H.4. Loss Plot

Figure 15 plots the contextual loss and test R@1 over the course of training on SOP, with varying λ and γ = 0. These plots show that the loss decreases over the course of training and that, in general, a lower contextual loss corresponds to a higher test R@1. This suggests that the contextual loss is a reasonable objective for image retrieval.

H.5. Note on the Contrastive Loss

The contrastive loss has two margin parameters, δ− and δ+. The optimal value of the positive margin δ+ depends on the value of γ (how much similarity regularization is used). For the "Contrastive" results in row 1 of Table 4, we use δ− = 0.6 and δ+ = 0.9. For our results with γ = 0.1, we set δ− = 0.6 and δ+ = 0.75; we find this tighter positive margin to be optimal under similarity regularization. However, δ+ = 0.75 is likely too small when only the contrastive loss is used, since positive pairs are then not encouraged to exceed a cosine similarity of 0.75.

H.6. Note on Stop Gradient

We detach the normalization factors in Eq. 24 because optimizing the size of the neighborhood is undesirable. In contrast, we do not detach the normalization factor in Eq. 27; this is intentional, even if somewhat unintuitive. We show in Figure 17 that detaching $|R_{k/2+\epsilon}(i)|$ in Eq. 27 increases R@1 on SOP when λ is high, but decreases R@1 in all other scenarios.

I. Additional Experimental Results

I.1. Experimental Results on the Similarity Regularizer

In the main paper, we presented the similarity regularizer $\mathcal{L}_{\text{reg}}$ as complementary to the contextual similarity loss $\mathcal{L}_{\text{context}}$. However, there is no theoretical reason to limit the regularizer to our contextual similarity framework. We show in Table 6 that the similarity regularizer improves retrieval performance when combined with some baselines. In particular, the regularizer consistently improves the R@1 performance of the Triplet, Multi-Similarity, Smooth-AP, and ROADMAP losses across the Cars, SOP and iNaturalist benchmarks. This is a promising result.

Table 6. Similarity Regularizer Results. Here, we experiment with naively adding our similarity regularizer to a subset of the baseline methods. The overall loss is γLreg + (1 − γ)(baseline loss) with γ = 0.1. We try two values of s̃ and compare the resulting performance metrics with the original baseline; the signed value in parentheses denotes the increase or decrease in each metric when our similarity regularizer is added. Observe that in most cases, with an appropriate s̃ value, the regularizer improves the performance of the baseline or has a negligible effect.

Cars
Method       mAP: none / s̃=0.25 / s̃=0.3         mAP@R: none / s̃=0.25 / s̃=0.3       R@1: none / s̃=0.25 / s̃=0.3
Contrastive  38.2 / 40.6 (+2.4) / 39.4 (+1.2)    28.8 / 31.2 (+2.4) / 30.0 (+1.2)    85.4 / 88.0 (+2.6) / 87.1 (+1.7)
Triplet      34.9 / 39.2 (+4.3) / 38.5 (+3.6)    24.7 / 29.9 (+5.2) / 29.2 (+4.5)    77.6 / 87.8 (+10.2) / 86.7 (+9.1)
NtXent       36.2 / 37.0 (+0.8) / 35.8 (−0.4)    26.3 / 27.0 (+0.7) / 26.0 (−0.3)    79.0 / 79.9 (+0.9) / 78.7 (−0.3)
MS           43.1 / 42.7 (−0.4) / 43.6 (+0.5)    33.7 / 33.3 (−0.4) / 34.1 (+0.4)    88.7 / 88.8 (+0.1) / 89.3 (+0.6)
Smooth-AP    37.5 / 38.9 (+1.4) / 38.1 (+0.6)    27.5 / 28.8 (+1.3) / 28.2 (+0.7)    81.1 / 82.1 (+1.0) / 82.4 (+1.3)
ROADMAP      38.3 / 39.2 (+0.9) / 39.6 (+1.3)    28.7 / 29.6 (+0.9) / 30.0 (+1.3)    84.5 / 85.7 (+1.2) / 85.7 (+1.2)
Ours         –    / 42.4        / –              –    / 33.0        / –              –    / 89.3        / –

SOP
Method       mAP: none / s̃=0.25 / s̃=0.3         mAP@R: none / s̃=0.25 / s̃=0.3       R@1: none / s̃=0.25 / s̃=0.3
Contrastive  62.9 / 62.8 (−0.1) / 62.7 (−0.2)    57.0 / 56.9 (−0.1) / 56.8 (−0.2)    82.4 / 82.3 (−0.1) / 82.3 (−0.1)
Triplet      63.1 / 64.4 (+1.3) / 64.6 (+1.5)    56.4 / 57.8 (+1.4) / 58.0 (+1.6)    82.0 / 83.1 (+1.1) / 83.3 (+1.3)
NtXent       60.1 / 60.3 (+0.2) / 60.0 (−0.1)    53.4 / 53.7 (+0.3) / 53.4 (+0.0)    79.7 / 80.0 (+0.3) / 79.7 (+0.0)
MS           62.5 / 62.4 (−0.1) / 62.6 (+0.1)    56.2 / 56.1 (−0.1) / 56.3 (+0.1)    81.4 / 81.4 (+0.0) / 81.8 (+0.4)
Smooth-AP    63.1 / 63.2 (+0.1) / 63.5 (+0.4)    56.4 / 56.5 (+0.1) / 56.8 (+0.4)    82.0 / 82.1 (+0.1) / 82.3 (+0.3)
ROADMAP      64.4 / 64.4 (+0.0) / 64.4 (+0.0)    58.2 / 58.2 (+0.0) / 58.2 (+0.0)    83.1 / 83.1 (+0.0) / 83.1 (+0.0)
Ours         –    / 65.0        / –              –    / 58.6        / –              –    / 83.3        / –

iNaturalist
Method       mAP: none / s̃=0.25 / s̃=0.3         mAP@R: none / s̃=0.25 / s̃=0.3       R@1: none / s̃=0.25 / s̃=0.3
Contrastive  16.0 / 15.1 (−0.9) / 15.5 (−0.5)    11.6 / 11.2 (−0.4) / 11.5 (−0.1)    43.5 / 43.4 (−0.1) / 43.7 (+0.2)
Triplet      12.1 / 13.3 (+1.2) / 13.6 (+1.5)    7.9  / 9.8 (+1.9)  / 10.0 (+2.1)    35.4 / 41.4 (+6.0) / 41.7 (+6.3)
NtXent       15.9 / 15.9 (+0.0) / 15.9 (+0.0)    10.6 / 10.7 (+0.1) / 10.6 (+0.0)    40.8 / 40.7 (−0.1) / 40.7 (−0.1)
MS           16.6 / 16.6 (+0.0) / 16.6 (+0.0)    11.9 / 11.9 (+0.0) / 11.9 (+0.0)    44.9 / 44.9 (+0.0) / 45.0 (+0.1)
Smooth-AP    16.5 / 16.6 (+0.1) / 16.7 (+0.2)    11.3 / 11.4 (+0.1) / 11.4 (+0.1)    42.7 / 43.1 (+0.4) / 43.2 (+0.5)
ROADMAP      17.8 / 18.1 (+0.3) / 18.0 (+0.2)    12.7 / 13.1 (+0.4) / 13.0 (+0.3)    45.9 / 47.1 (+1.2) / 46.8 (+0.9)
Ours         –    / 17.1        / –              –    / 12.4        / –              –    / 46.2        / –

Figure 13. Hyperparameter tuning on SOP. We experiment with various values of ε, γ and s̃. The R@1 values are similar across the different hyperparameter choices.

I.2. In-Shop Results

We report R@k results for standard k values on the In-Shop dataset in Table 8. In-Shop (Liu et al., 2016) is a popular image retrieval benchmark. We use the standard train-test split: 25,882 images from 3,997 classes for training; for testing, a query set with 14,218 images and a gallery set (also called the reference set) with 12,612 images, both from 3,985 classes. We use the same augmentation and batch size as SOP. An epoch is defined as 600 batches.

I.3. Average Precision Results for Table 4

We report Average Precision (AP) results in Table 7. We present results for two flavors of AP (McFee & Lanckriet, 2010): mAP and mAP@R. mAP is the precision at k, averaged across correctly retrieved samples and then across the query set. mAP@R (Musgrave et al., 2020a) only averages over retrievals up to the number of positive samples in the gallery set; hence, mAP@R is always smaller than mAP. The two versions of AP are equally valid bases for comparison. In the following equations, $X_i^+$ denotes the set of samples in the gallery set with the same label as query $i$, and $X_i^-$ denotes the set of samples in the gallery set with a different label from query $i$.
Prec@k denotes the precision at k, i.e. the percentage of samples ranked at or above k that share the query's label.

$$\text{mAP}_i = \frac{1}{|X_i^+|} \sum_{k=1}^{|X_i^+| + |X_i^-|} \text{Prec@}k \cdot \mathbb{1}[k \in X_i^+]$$

$$\text{mAP@R}_i = \frac{1}{|X_i^+|} \sum_{k=1}^{|X_i^+|} \text{Prec@}k \cdot \mathbb{1}[k \in X_i^+]$$

I.4. Comparison on Hierarchical Retrieval Metrics

Over-reliance on binary supervision is a long-standing problem in metric learning and motivates our contextual loss. Another interesting line of work uses hierarchical labels during training to mitigate this over-reliance (Sun et al. (2021), Zheng et al. (2022), and Ramzi et al. (2022)); these works fall under the umbrella of "dynamic metric learning" and "hierarchical metric learning". One potential drawback of these works is their requirement for hierarchical labels, which may not always be available. Broadly speaking, hierarchical metric learning losses enforce a larger penalty on mistakes in discriminating between labels that are farther apart in the hierarchy during training. The resulting embedding space is more consistent in the sense that most retrieval mistakes are between labels close together in the hierarchy. For example, if the query is an image of a bird, any incorrectly retrieved images are likely to be birds from a closely related species, rather than images of other animals.

Figure 14. Test R@1 vs. train R@1 accuracy for varying λ (γ = 0). This figure shows that adding the standard contrastive loss (lowering λ) increases the R@1 on training data. It also increases the test R@1 up to a point (approximately λ = 0.9), before overfitting.

Figure 15. Plot of the contextual similarity loss $\mathcal{L}_{\text{context}}$ with varying λ (each color is a different λ), on the SOP dataset, with γ = 0. Observe that the contrastive loss implicitly minimizes the contextual loss (blue line), and that, in general, a lower contextual loss corresponds to a higher test R@1. This shows that our loss function is a reasonable objective. Note that the plot on the left shows $\mathcal{L}_{\text{context}}$, not the complete training loss.

Table 7. mAP and mAP@R Results. We use ResNet-50 with an embedding size of 512 for all experiments. We re-run all baselines in this table using the implementation by Musgrave et al. (2020b). Standard deviations are based on three trials with the same train-test split.

Method       CUB mAP    CUB mAP@R  Cars mAP   Cars mAP@R  SOP mAP    SOP mAP@R  iNat mAP   iNat mAP@R
Contrastive  36.6±0.4   26.6±0.5   38.2±0.4   28.8±0.5    62.9±0.1   57.0±0.1   16.0±0.1   11.6±0.0
Triplet      36.9±0.4   26.5±0.2   34.9±0.8   24.7±0.7    63.1±0.0   56.4±0.0   12.1±0.0   7.9±0.0
NtXent       36.3±0.3   26.0±0.3   36.2±0.7   26.3±0.7    60.1±0.1   53.4±0.1   15.9±0.0   10.6±0.0
MS           38.0±0.1   27.7±0.1   43.1±0.3   33.7±0.3    62.5±0.0   56.2±0.1   16.6±0.0   11.9±0.0
Fast-AP      34.4±0.8   24.4±0.7   32.7±0.0   23.5±0.0    60.7±0.1   54.2±0.1   13.9±0.1   9.1±0.1
Smooth-AP    36.8±0.4   26.4±0.4   37.5±0.3   27.5±0.2    63.1±0.2   56.4±0.2   16.5±0.0   11.3±0.0
ROADMAP      37.7±0.1   27.5±0.2   38.3±0.4   28.7±0.4    64.4±0.1   58.2±0.1   17.8±0.0   12.7±0.0
Ours         38.3±0.2   28.0±0.2   42.4±0.4   33.0±0.3    65.0±0.1   58.6±0.1   17.1±0.0   12.4±0.0

Figure 16. Ablation results on CUB and Cars. The blue line represents the test R@1 of our framework for varying λ with γ = 0. The other colors represent various modifications or simplifications of the loss function (see Table 5 and Section F for full descriptions). We perform each ablation experiment with varying λ since the optimal λ is not constant across experiments. Clearly, the modified versions of $\mathcal{L}_{\text{context}}$ are sub-optimal.
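As a concrete reference for the metrics defined in Appendix I.3, the following NumPy sketch computes mAP and mAP@R for a single query given its ranked gallery; the function name is ours and ties or empty-positive edge cases are handled only minimally.

```python
import numpy as np

def average_precision(ranked_hits):
    """ranked_hits[k] = True if the (k+1)-th retrieved gallery sample shares the query's label.

    Returns (mAP_i, mAP@R_i) for this single query, following the definitions in Appendix I.3.
    """
    ranked_hits = np.asarray(ranked_hits, dtype=bool)
    num_pos = int(ranked_hits.sum())
    if num_pos == 0:
        return 0.0, 0.0
    prec_at_k = np.cumsum(ranked_hits) / np.arange(1, len(ranked_hits) + 1)

    map_i = (prec_at_k * ranked_hits).sum() / num_pos                  # sum over all ranks
    map_at_r_i = (prec_at_k * ranked_hits)[:num_pos].sum() / num_pos   # ranks 1..|X_i^+| only
    return map_i, map_at_r_i

# Example: 3 relevant gallery items, retrieved at ranks 1, 3, and 6.
print(average_precision([True, False, True, False, False, True]))
```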
Figure 17. Investigating the effect of detaching $|R_{k/2+\epsilon}(i)|$ in Eq. 27. The blue line shows the R@1 of our framework with varying λ (γ = 0); the orange line shows R@1 after detaching $|R_{k/2+\epsilon}(i)|$. Overall, detaching this normalization factor is undesirable.

Table 8. In-Shop Results. We use ResNet-50 with an embedding size of 512 for all experiments. † indicates results reported by the original authors; we re-run all other baselines using the implementation by Musgrave et al. (2020b). Our R@1 performance is better than all baselines, and we are at least comparable to baselines on the other R@k metrics.

Method         R@1    R@10   R@20   R@30   R@40   R@50
Contrastive    90.1   97.4   98.3   98.6   98.8   98.9
Triplet        90.2   98.0   98.7   99.0   99.2   99.3
NtXent         89.3   97.6   98.3   98.7   98.9   99.0
MS             86.9   96.1   97.3   97.9   98.1   98.3
N-Softmax†     88.6   97.5   98.4   98.8   –      –
Proxy NCA++†   90.4   98.1   98.8   99.0   99.2   –
Fast-AP        89.7   97.5   98.3   98.6   98.8   98.9
Smooth-AP      90.2   97.9   98.7   99.0   99.2   99.2
ROADMAP        90.4   97.6   98.3   98.6   98.8   99.0
Ours           90.7   97.8   98.5   98.9   99.1   99.2

Table 9. Hierarchical Retrieval Results. We use ResNet-50 with an embedding size of 512 for all experiments. We re-run all baselines in this table using the implementation by Musgrave et al. (2020b). The image size is 224 × 224, and the experimental settings follow Section 5 of the main paper. The results in this table are from one random trial. "fine R@1" indicates R@1 at the fine-grained label level (the same R@1 as in the rest of the paper). On CUB, there are two higher levels in the label hierarchy, the family and the order that the bird species belongs to, indicated as "mid" and "coarse", respectively. On Cars, the coarse label is the make of the vehicle. On SOP, the coarse label is the category of the product. We follow the hierarchical labels used in Chang et al. (2021) for CUB and Cars; we use the coarse labels given by the original SOP dataset.

Method             CUB fine R@1  CUB fine AP  CUB mid AP  CUB coarse AP  Cars fine R@1  Cars fine AP  Cars coarse AP  SOP fine R@1  SOP fine AP  SOP coarse AP
Contrastive        66.2          35.9         52.0        85.7           81.1           33.4          39.8            80.7          60.1         13.0
Roadmap            66.2          36.0         53.8        88.1           83.5           37.3          41.9            82.1          62.6         13.9
Triplet            65.0          34.5         53.8        89.8           89.3           43.3          42.7            81.8          62.5         15.0
MS+miner           69.4          37.4         53.5        88.5           90.7           42.7          39.4            82.1          63.2         13.5
Proxy Anchor       68.0          36.6         53.9        89.1           88.8           39.0          38.6            79.8          59.4         16.2
Proxy NCA          65.7          34.9         55.5        88.0           88.0           37.6          38.5            78.9          58.6         16.8
Contextual (ours)  72.7          40.3         54.3        86.4           91.0           43.4          39.1            82.7          63.7         13.5

The contextual loss as presented in this paper does not use hierarchical labels, since our focus is advancing the state of the art in fine-grained retrieval, irrespective of coarse-grained retrieval. With this in mind, we report results on hierarchical retrieval metrics in Table 9. From Table 9, we conclude that our contextual loss always achieves the highest fine-grained performance, but under-performs baselines at higher levels of the hierarchy. This trade-off between fine and coarse retrieval is consistent with results from HAPPIER (Ramzi et al., 2022). We hope that these results will be helpful for future work on adapting the contextual loss to the hierarchical retrieval setting.
I.5. Contextual Similarity Re-ranking at Test Time

Using contextual similarity for re-ranking at test time is a common trick for improving recall accuracy. However, this is not a common evaluation setting, so we omitted it from the main paper. We report R@1 results with contextual re-ranking in Table 10. The contextual re-ranking procedure is as follows. After training, we calculate the cosine similarity $s_{ij}$ between all pairs of samples $i, j$ in the test set. We then calculate a simplified version of the contextual similarity $\tilde{w}_{ij}$ between each pair of test samples using Equations 15 and 16 in Appendix E. Note that $s_{ij} \in [-1, 1]$ while $\tilde{w}_{ij} \in [0, 1]$, so we rectify $s_{ij}$ to occupy the same interval as $\tilde{w}_{ij}$: define $\tilde{s}_{ij} = \exp(2 s_{ij} - 2)$. For retrieval, we use a weighted sum of the rectified cosine similarity and the contextual similarity: $\beta \tilde{w}_{ij} + (1 - \beta) \tilde{s}_{ij}$.

Table 10. Contextual re-ranking R@1 results. β = 0.1 is the weight on the contextual similarity, and we experiment with different values of k. We use ResNet-50 with an embedding size of 512 for all experiments. We re-run all baselines in this table using the implementation by Musgrave et al. (2020b). The image size is 224 × 224, and the experimental settings follow Section 5 of the main paper. The results in this table are from one random trial.

Method             CUB: no re-rank / k=16 / k=32    Cars: no re-rank / k=16 / k=32    SOP: no re-rank / k=4
Contrastive        66.2 / 66.6 / 66.8               81.1 / 81.2 / 80.8                80.7 / 81.0
Roadmap            66.2 / 67.4 / 66.9               83.5 / 83.5 / 83.3                82.1 / 82.3
Triplet            65.0 / 66.6 / 65.8               89.3 / 90.0 / 89.9                81.8 / 82.2
MS+miner           69.4 / 71.1 / 70.7               90.7 / 91.2 / 91.0                82.1 / 82.4
Proxy Anchor       68.0 / 69.1 / 69.0               88.8 / 89.3 / 89.1                79.8 / 80.1
Proxy NCA          65.7 / 67.2 / 66.9               88.0 / 88.5 / 88.5                78.9 / 79.1
Contextual (ours)  72.7 / 74.3 / 73.9               91.0 / 91.4 / 91.1                82.7 / 82.9

Table 10 shows that all methods benefit from contextual re-ranking, and our contextual loss remains the best method across the three benchmarks. Embeddings optimized by the contextual loss do not benefit more from contextual re-ranking than baselines (i.e. the relative increase in R@1 with contextual re-ranking is about the same across all methods). There is no theory to suggest that optimizing contextual similarity at train time results in better contextual-similarity re-ranking at test time, and we make no claims regarding this.
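A sketch of this test-time re-ranking rule is given below, re-deriving the discrete Eqs. 15–16 inline; β = 0.1 follows the table above, while the ε used at test time and the function names are our assumptions.

```python
import torch

def rerank_scores(s, k=4, eps=0.0, beta=0.1):
    """Test-time re-ranking score: beta * w_tilde + (1 - beta) * exp(2s - 2).

    s: (n, n) cosine similarity matrix over the test set (s_ii = 1).
    """
    D = 2.0 - 2.0 * s
    d_k = D.topk(k, dim=1, largest=False).values[:, -1:]
    one_N = (D <= d_k + eps).float()                                    # Eq. 15 on distances
    w_tilde = (one_N @ one_N.T) / one_N.sum(1, keepdim=True) * one_N    # Eq. 16
    s_rect = torch.exp(2.0 * s - 2.0)                                   # rectified cosine similarity
    return beta * w_tilde + (1.0 - beta) * s_rect

# Retrieval: for query i, rank gallery items j by rerank_scores(s)[i, j], excluding j = i.
```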
I.6. Additional Comparisons to AP Surrogates

Figures 18 through 29 present a comparison between our contextual loss function and various AP surrogates. The goal of these figures is to show that our contextual loss function is definitively better at ranking than AP surrogates. We optimize a convex combination of the ranking loss and the standard contrastive loss. We present results for different choices of learning rate, λ (the weight on the ranking loss), and positive margin on the contrastive loss, since the optimal hyperparameters vary between methods. We include results for all three benchmarks. The AP surrogates included in this comparison are Roadmap (Ramzi et al., 2021), Smooth-AP (Brown et al., 2020), Softbin-AP (Revaud et al., 2019), and Blackbox-AP (Rolínek et al., 2020).

Figure 18. CUB R@1 results for our contextual loss function, with varying λ, learning rate, and positive margin on the contrastive loss. We plot both train and test R@1 results. "lr" and "pm" in the legend stand for learning rate and positive margin δ+, respectively.

Figures 19–22. CUB R@1 results for Roadmap, Smooth-AP, Softbin-AP, and Blackbox-AP, respectively, with varying λ, learning rate, and positive margin on the contrastive loss (same plotting conventions as Figure 18).

Figures 23–27. Cars R@1 results for our contextual loss function, Roadmap, Smooth-AP, Softbin-AP, and Blackbox-AP, respectively (same plotting conventions as Figure 18).

Figures 28–29. SOP R@1 results for our contextual loss function and Roadmap, respectively (same plotting conventions as Figure 18).