Learning Dense Correspondences between Photos and Sketches
Xuanchen Lu 1 Xiaolong Wang 1 Judith E. Fan 1 2

Figure 1: We propose a self-supervised method for learning the dense correspondence between sketches and photos. For
each photo-sketch pair, we show the annotated keypoints from our benchmark dataset PSC6K (first column), the predicted
correspondences (second column), and the result of warping the photo to the sketch (third column).

Abstract

loy et al., 2016) with fine-grained correspondence
metadata. Second, we propose a self-supervised
method for learning dense correspondences
between sketch-photo pairs, building upon
recent advances in correspondence learning
for pairs of photos. Our model uses a spatial
transformer network to estimate the warp flow
between latent representations of a sketch and
photo extracted by a contrastive learning-based
ConvNet backbone. We found that this approach
outperformed several strong baselines and
produced predictions that were quantitatively
consistent with other warp-based methods. However, our benchmark also revealed systematic
differences between predictions of the suite of
models we tested and those of humans. Taken
together, our work suggests a promising path
towards developing artificial systems that achieve
more human-like understanding of visual images
at different levels of abstraction. Project page:

Humans effortlessly grasp the connection
between sketches and real-world objects, even
when these sketches are far from realistic.
Moreover, human sketch understanding goes
beyond categorization — critically, it also
entails understanding how individual elements
within a sketch correspond to parts of the
physical world it represents. What are the
computational ingredients needed to support this
ability? Towards answering this question, we
make two contributions: first, we introduce a
new sketch-photo correspondence benchmark,
PSC6k, containing 150K annotations of 6250
sketch-photo pairs across 125 object categories,
augmenting the existing Sketchy dataset (Sangk1
University of California, San Diego 2 Stanford University. Correspondence to: Judith Fan <jefan@stanford.edu>.

Proceedings of the 40 th International Conference on Machine
Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright
2023 by the author(s).

https://photo-sketch-correspondence.
github.io

1

Learning Dense Correspondences between Photos and Sketches

1. Introduction

bust feature representations learned by deep neural networks
are more robust to variations in appearance and shape. However, finding correspondence between photos and sketches
is particularly challenging as human-generated sketches are
inherently selective, highlighting the most relevant aspects
of an object’s appearance at the expense of other aspects
(Fan et al., 2020; Huey et al., 2021). Moreover, sketches
typically lack the texture and color cues that can facilitate
dense correspondence learning for color photos. As a consequence, the task of learning dense semantic correspondences
between photos and sketches relies on a substantial degree
of visual abstraction in order to establish strong semantic
alignment between images from different modalities.

Sketching is a powerful technique humans use to create images that capture key aspects of the visual world. It is also
among the most enduring and versatile of image generation
techniques, with the earliest known sketch-like images dating to at least 40,000-60,000 years ago (Hoffmann et al.,
2018; Aubert et al., 2014). Although the retinal image cast
by a sketch and a real-world object are highly distinct, humans are nevertheless able to grasp the meaning of that
sketch at multiple levels of abstraction, including the category label that best applies to it, the specific object instance
it represents, as well as detailed correspondences between
elements in the sketch and the parts of the object (Fan et al.,
2018; Mukherjee et al., 2019; Yang & Fan, 2021). What are
the computational ingredients needed to achieve such robust
image understanding across domains and at multiple levels
of abstraction?

Self-supervised representation learning. A robust finding
from the past decade is that deep neural networks trained
with supervision at large, labeled image datasets can achieve
state-of-the-art performance (Krizhevsky et al., 2017; Simonyan & Zisserman, 2014; He et al., 2016). Moreover,
models trained in this way currently provide the most quantitatively accurate models of biological vision in non-human
primates and humans (Yamins et al., 2014; Khaligh-Razavi
& Kriegeskorte, 2014; Rajalingham et al., 2018; Cadena
et al., 2019). Nevertheless, such models are unlikely to
explain how humans are capable of achieving such robust
image understanding across different modalities given the
implausibility that such large, labeled datasets were available to or necessary for humans to learn to understand natural visual inputs, much less to interpret sketches (Hochberg
& Brooks, 1962; Kennedy & Ross, 1975). Recent advances
in self-supervised representation learning have begun to approach the performance of supervised models without the
need for such labels (Wu et al., 2018; He et al., 2020), while
also emulating key aspects of visual processing in biological systems (Zhuang et al., 2021; Konkle & Alvarez, 2020).
However, it remains unclear to what degree these advances
are sufficient to support challenging multi-domain image understanding tasks, including predicting dense photo-sketch
correspondences.

Generalizing across photorealistic and stylized image
distributions. There has been substantial recent progress
in the development of artificial vision systems that capture
some key aspects of sketch understanding, especially sketch
categorization and sketch-based image retrieval (Eitz et al.,
2012; Sangkloy et al., 2016; Yu et al., 2016; 2017; Bhunia
et al., 2020). In addition, the availability of larger models
that have been trained on vast quantities of paired image and
text data have led to encouraging results on tasks involving
images exhibiting different visual styles (Radford et al.,
2021), including sketch generation (Vinker et al., 2022).
However, recent evidence suggests that even otherwise highperforming vision models trained on photorealistic image
data do not generalize well to other image distributions as
well as neurons in primate inferotemporal cortex (a key
brain region supporting object categorization) (Bagus et al.),
indicating that a large gap remains between the capabilities
of current computer vision systems and those achieved by
biological systems.
Perceiving semantic correspondences between images.
In particular, a core open problem in human sketch understanding concerns the computational ingredients required
to encode the internal structure of a sketch with sufficient
fidelity to establish a detailed mapping between parts of a
sketch with parts of the object it represents (Kulvicki, 2015;
Fodor, 2007). The problem of discovering semantic correspondences between images is a well established problem
in computer vision. In the typical setting, the goal is to
establish dense correspondences between images containing objects belonging to the same class. Classical methods
(Berg et al., 2005; Kim et al., 2013; Liu et al., 2010) determine the alignment with hand-crafted feature descriptors
such as SIFT (Lowe, 1999) or DOG (Dalal & Triggs, 2005).
More recently developed methods (Ham et al., 2016; Rocco
et al., 2018a; Truong et al., 2021), which benefit from the ro-

Our contributions: Evaluating a self-supervised method
for learning photo-sketch correspondences. Towards
meeting these challenges, our paper makes two key contributions: first, we establish a new benchmark for photo-sketch
dense correspondence learning: PSC6k. This benchmark
consists of 150,000 pairs of keypoint annotations for 6250
photo-sketch pairs spanning 125 object categories. Each
annotation consists of a keypoint marked by a human participant on an object in a color photo that they judged to
correspond to a given keypoint appearing on a sketch of the
same object. All photo-sketch pairs were sampled from the
well established Sketchy dataset (Sangkloy et al., 2016), a
collection of 75K sketches produced by humans to depict
objects in 12.5K color photographs of objects spanning 125
categories.
2

Learning Dense Correspondences between Photos and Sketches

Our second contribution is a self-supervised method for
learning photo-sketch correspondences that leverages a
learned nonlinear “warping” function to map one image
to the other. This approach embodies the hypothesis that
sketches preserve key information about spatial relations
between an object’s constituent parts, even if they also manifest distortions in the size and shape of these parts. This
hypothesis is motivated by the view that representational
line drawings, as sparse as they are, are meant to accurately
convey 3D shape (Hertzmann, 2020), which stands in sharp
contrast to the view that the relationship between drawings
and objects are established purely by convention (Goodman,
1976). Nevertheless, the nonlinear “warping” approach we
propose diverges from very strong versions of the 3D-shapepreservation account (Greenberg, 2021), which are not well
equipped to handle the kinds of nonlinear visual distortions
that human-generated sketches exhibit (Eitz et al., 2012;
Sangkloy et al., 2016; Fan et al., 2018).
Our system consists of two main components: the first is a
multimodal image encoder trained with a contrastive loss
(Wu et al., 2018; Zhuang et al., 2021), with photos and
sketches of the same object being treated as positive examples, and those depicting different objects as negative examples. The second component is a spatial transformer network
(Jaderberg et al., 2015) that estimates the transformation between each photo and sketch and aims to maximize the
similarity between the feature maps for both images. Using our newly developed PSC6k benchmark, we find that
our system outperforms other existing self-supervised and
weakly supervised correspondence learning methods, and
thus establishes the new state-of-the-art for sketch-photo
dense correspondence prediction. We will publicly release
PSC6k with extensive documentation and code to enhance
its usability to the research community.

Figure 2: Examples of human-annotated photo-sketch pairs
from our new photo-sketch correspondence benchmark
PSC6k.
photographed object from the wrong perspective (Sangkloy
et al., 2016). We then randomly sampled 5 sketches from
among the remaining valid sketches produced of each photo,
resulting in 6250 unique photo-sketch pairs.
2.2. Collecting Human Keypoint Annotations
We formalize the problem of identifying photo-sketch correspondences as the ability to map a keypoint located on
a sketch to the location in the source photograph that best
corresponds to it. For example, a keypoint appearing on the
left wing of a sketch of an airplane should be mapped to the
“same” location on the left wing of the photograph of that
same airplane. For each photo-sketch pair, we sampled 8
keypoints spanning as much of the object as possible. To
determine these keypoints, we first computed segmentation
masks for each sketch, relying upon the heuristic that outermost contour of the sketch naturally serves as the contour of
the object in the sketch. The pixels covered by the segmentation mask were then clustered into 8 groups to estimate 8
“pseudo-part” regions. We employ nearest-neighbor-based
spectral clustering to prioritize connectivity within each
pseudo-part. A keypoint was then placed at the centroid of
each pseudo-part.

2. Photo-Sketch Correspondence Benchmark
(PSC6k)
Our first goal was to establish a novel photo-sketch correspondence benchmark satisfying two criteria: first, it should
build directly upon existing benchmarks in sketch understanding and second, it should provide broad coverage of a
wide variety of visual concepts. Towards that end, we developed PSC6k by directly augmenting the Sketchy dataset
(Sangkloy et al., 2016), which already contains 75,471 human sketches produced from 12,500 unique photographs
spanning 125 object categories.

This approach allowed us to automatically discover regions
of the sketch that are likely to be semantically meaningful
without the need for explicit part labels. However, this approach is also less sensitive to sketch regions that constitute
only a small portion of object mask (e.g., a cat’s whiskers).
As such, future work could employ a combination of regionbased and stroke-based keypoints to gain fuller coverage of
semantically meaningful regions of sketches.

2.1. Sampling Photo-Sketch Pairs
We sampled photo-sketch pairs from the original test split
of the Sketchy dataset, which consisted of 1250 photos
and their corresponding sketches. We manually filtered out
sketches that were completely off-target or that depicted the
3

Learning Dense Correspondences between Photos and Sketches

Base
Encoder 𝜙

f = 𝜙(I)

Contrastive Loss

𝜙

Correlation Matrix
𝑊𝐻×𝑊𝐻

Momentum
Encoder

Negative
Queue

…

𝜙

Perceptual
Similarity

g

𝜙

Forward-backward
Consistency

Network
module
Objectives

Figure 3: We propose a self-supervised framework for learning photo-sketch correspondence by estimating a dense flow
that warps one image to the other. The framework consists of a multi-modal feature encoder that aligns the photo-sketch
representation with a contrastive loss, and an STN-based warp estimator to predict transformation that maximizes the
similarity between feature maps of the two images. The estimator learns to optimize a combination of weighted perceptual
similarity and forward-backward consistency.
3.1. Feature Encoder ϕ

Next, we recruited 1,384 participants using the Prolific
crowdsourcing platform to provide annotations. Participants
provided informed consent in accordance with the UC San
Diego Institutional Review Board (IRB). On each trial, participants were cued with a keypoint appearing on a sketch
and asked to indicate its corresponding location in a photo
appearing next to it. Each participant provided annotations
for 125 photo-sketch pairs, one from each category. We collected three annotations from different participants for each
keypoint in every sketch, resulting in 150,000 annotations
across all 6250 photo-sketch pairs. We defined the centroid
over these annotations as the ground-truth keypoint in the
photo. In rare cases, there was one annotation out of three
with an exceptionally large distance from the median location of all three annotations; these responses were flagged as
outliers and excluded from the determination of the centroid.
See Appendix A for additional details regarding the creation
of this photo-sketch correspondence benchmark.

Here we leverage advances in contrastive learning to develop
a weakly-supervised feature encoder on photo-sketch data
pairs. Contrastive learning obtains a feature representation
by contrasting similar and dissimilar pairs. Here, the photo
Ip and the sketch Is depicting the same object become a
natural choice to construct similar pairs. Unlike typical
contrastive learning schemes (Wu et al., 2018; Chen et al.,
2020a; He et al., 2020) that take augmented views of the
same image I as positives, our model uses augmented views
from the same photo-sketch pair (Ip , Is ). To minimize the
contrastive loss over a set of photo-sketch pairs, the encoder
must learn a feature space that attracts photo/sketch from
the same pair and separates photo/sketch from distinct pairs.
Similar to (He et al., 2020), we formulate pair-level contrastive learning as a dictionary look-up problem. For a
given photo-sketch pair (Ip , Is ), random data augmentation
is applied to generate the view pair (Iep , Ies ). One view in
the pair is randomly selected as the query and the other
becomes the corresponding key. We denote their representations encoded by ϕ as q and k + , respectively. The query
token q should match its key k + over a set of negative keys
k − sampled from other photo-sketch pairs. To optimize this
target, we minimize InfoNCE (Oord et al., 2018) as follows:

3. Weakly-supervised Photo-Sketch
Correspondence
In this section, we present our weakly-supervised model
for finding the pixel-level correspondence between photosketch pairs. We formulate the problem as estimating the
displacement field across a sketch Is ∈ Rh×w×3 and a photo
Ip ∈ Rh×w×3 that depict the same object (Figure 3). Our
goal is to find the cross-modal photo-sketch alignment in a
weakly-supervised manner, by maximizing the perceptual
similarity of an image in (Ip , Is ) and its warped counterpart.
Our framework consists of a feature encoder ϕ that learns
a shared feature space of photo and sketch, and a warp
estimator T based on the spatial transformer network (STN)
that directly predicts the displacement field F ∈ Rh×w×2 ,
where we extract the dense correspondence.

Lnce = − log

exp (q·k + /τ )
P
, (1)
exp (q·k + /τ ) + k− exp (q·k − /τ )

where τ is a temperature hyperparameter scaling the data
distribution in the metric space.
To explore the inherent similarity between photos and
sketches, we use a shared encoder ϕ for images from both
modalities. We replace the batch normalization (BN) (Ioffe
& Szegedy, 2015) in the encoder with conditional batch
normalization (De Vries et al., 2017) for better domain
4

Learning Dense Correspondences between Photos and Sketches
Image Pair

alignment. Detailed implementation and experiment are
reported in section 4.

Feature Map Weight Map

Results

3.2. Warp Estimator T
Given the source and target image Is , It and their representation Xs , Xt , the warp estimator T predicts the displacement
field FIs →It = T (Xs , Xt ). Inspired by (Sun et al., 2018),
we propose a simplified pyramidal warp estimation module
for the ResNet backbone.
Affinity function f . While it is possible to estimate the
correspondence based on the feature affinity at a specific
layer of the encoder ϕ, e.g., the final convolutional layer, it
is beneficial to evaluate affinities at multiple layers along
the feature pyramid. We select a set of n feature layers of
n−1
interest, denoted as Xs = {xis }i=0
and Xt = {xit }n−1
i=0 . We
bilinearly upsample all selected feature maps to the same
spatial resolution, and concatenate them along the channel
dimension for the multi-layer feature maps, Xs ∈ Rc×h×w
and Xt ∈ Rc×h×w .
With the source and target feature maps Xs and Xt , we
compute affinity as the correlation between feature embeddings: with pixel i in feature map Xs and pixel j in feature
map Xt , A(s,t) (i, j) = Xs (i)T Xt (j). The pairwise affinity
between every pixel in the source and target feature maps
forms the affinity matrix f (Xs , Xt ) := A(s,t) ∈ Rhw×hw .

Figure 4: Example image pairs, feature maps, weight maps,
and final results processed in our warp estimator. The weight
maps highlight semantic parts that have the largest correlation between the two images. We use PCA to project the
feature dimensions to 3 principal components as RGB.

Estimation Module g. Module g takes the affinity matrix
A(s,t) and directly estimates the displacement field F from
the source image to the target image. Following the idea of
coarse-to-fine refinement, it consists of three STN blocks at
different scales with residual connections, denoted as g1 , g2
and g3 . Each STN-block (except the first block) takes the
affinity matrix warped by the previous block and regresses
a new displacement field to refine the alignment. The first
block g1 regresses at the 4×4 scale, estimating displacement
field F (0) ∈ R4×4×2 . g2 and g3 regress at the 8 × 8 and
16 × 16 scale, respectively. The displacement field at each
block is computed as
F (1) = g1 (f (Xs , Xt )),
F

(k)

=F

(k−1)

+ gi (f (warp(Xs , F

(k−1)

ity using the warped source feature map (direct similarity),
we pass the warped source image into the feature encoder
again and evaluate similarity using the new feature map,
so that the feature encoder serves as a soft constraint that
reduces warping artifacts and stabilizes training (perceptual similarity). We use subscripts to indicate the direction
of warp; for example, the displacement field from Is to
It is denoted Fs→t . We also denote the warped image as
Is→t = warp(Is , Fs→t ).

(2)
), Xt )),

(3)

Perceptual similarity s. For an image pair (Is , It ), the
model estimates the flow Fs→t and renders the warped
source image Is→t . The warped source image is passed
through the encoder ϕ to generate its new set of feature maps
Xs→t , as well as its new affinity with the target A(s→t,t) .
The new affinity matrix represents how well the warped
source image semantically aligns with the target.

where warp(I, F ) operation warps image I to target according to the displacement field F . It is implemented with
bilinear interpolation.
After g3 generates the 16 × 16 displacement field, it is
upsampled to full image resolution as the final estimation.

In the ideal case, each pixel in the warped source Xs→t will
have the highest correlation with the pixel at the same location in the target Xt . This is reflected in the affinity space
A(s→t,t) ∈ Rn×hw×hw as a maximized diagonal along the
second and third axes. For a pixel in warped source Xs→t ,

3.3. Weighted Perceptual Similarity
We propose using weighted perceptual similarity to evaluate the quality of estimated displacement field between
the photo-sketch pair. Instead of directly evaluating similar5

Learning Dense Correspondences between Photos and Sketches

where FI is the identity displacement that maps all locations
to themselves.

we formulate the optimization as selecting the pixel that
matches correctly from all pixels in target Xt :

exp A(s→t,t) (n, i, i)/τ
 , (4)
s(n, i) = − log P
j exp A(s→t,t) (n, i, j)/τ

Overall, our final objective is
L = λsim Lsim + λcon Lcon ,

where n is the index of the feature layer to evaluate on; i, j
are indices of pixels in the source and target feature map.
Weight function w. While it is possible to optimize flow
estimation with the above formula, there are two problems.
First, sketches contain a large number of empty pixels, and
photos often suffer from background clutter. Moreover,
while the encoder activation generally lies over the entire
object in the photo, the activation concentrates along the
strokes in a sketch. As a result, optimizing the correspondence of every pixel is inefficient and biased toward the
background. To focus optimization on important matches,
we consider an intuitive rule: important pixels in one image should have greater affinities to the other image. It is
formulated as a weight function:
w(n, i) = scale(max [norm(A(s→t,t) )(n, i)])
j

4. Experiments
Here we empirically evaluate our method and compare it to
existing approaches in dense correspondence learning on the
photo-sketch correspondence benchmark. We analyze the
difference between human annotations and predictions from
existing methods. We show that our method establishes the
state-of-the-art in the photo-sketch correspondence benchmark and learns a more human-like representation from the
photo-sketch contrastive learning objectives. We conducted
additional experiments to evaluate generalization to unseen
categories in Appendix C.

(5)

4.1. Implementation Details
The input image size is set to 256 following our photosketch correspondence benchmark. We use ResNet-18 and
ResNet-101 as our feature encoder. The encoder is initialized with pretrained weights from MoCo training (He
et al., 2020) on ImageNet-2012 (Deng et al., 2009). We then
train our encoder on the training split of Sketchy for 1300
epochs. Since there are multiple sketches for each photo in
the dataset, at each epoch, we iterate through all photos and
sample a corresponding sketch for each photo. We follow
the recipe from MoCo (He et al., 2020; Chen et al., 2020c),
with dim = 128, m = 0.999, t = 0.07, lr = 0.03 and a
two-layer MLP head. Noticeably, we set the size of the
memory queue to K = 8192 to prevent multiple positive
pairs from appearing at the same time.

where norm is the normalization over the affinity matrix to
penalize pixels that have multiple large affinities in the other
image. scale is an arbitrary operation to standardize the
weight function. We use Min-Max to scale its distribution
to [0, 1].
Therefore, the final perceptual similarity loss is given by
Lsim (n, i) = w(n, i)s(n, i)

(6)

In Figure 4, we visualize the image pairs, feature maps,
weight maps, and the final alignment results of photo-sketch
pairs from PSC6k to exhibit the function of each component
in our estimator.

We then train the estimator for 1200 epochs with a learning
rate of 0.003, leading to 2500 epochs of training in total. We
set the weights of the objectives to λsim = 0.1, λcon = 1.0.
We compute Lsim using the features after ResNet stages 2
and 3, and the temperature is set to τ = 0.001.

3.4. Additional Objectives
In addition to the perceptual similarity loss, we consider an
additional self-supervised loss to assist robust warp estimation and stabilize training.
Forward-backward consistency. Forward-backward consistency is a classical idea in tracking (Vondrick et al., 2018;
Wang et al., 2019; Jabri et al., 2020) and flow estimation
(Meister et al., 2018; Rocco et al., 2017; Jeon et al., 2018;
Truong et al., 2021; Huang et al., 2019) as constraints.
Namely, we expect the estimated forward flow Fs→t to be
the inverse of the estimated backward flow Ft→s . It poses
a strict constraint on the network for symmetric prediction.
We minimize the L2 norm between the identity flow and the
composition of the forward flow and backward flow:
Lcon = ∥warp(Fs→t , Ft→s ) − FI ∥,

(8)

We apply the same set of augmentations to both feature
encoder and the warp estimator, consisting of random color
jitter, grayscale, and Gaussian blur, which are consistent
with the settings in MoCo v2 (Chen et al., 2020c) and SimCLR (Chen et al., 2020a). However, we replace random
cropping with a combination of affine and TPS transformations for a more complex spatial distortion.
We train the network with the SGD optimizer, a weight
decay of 1e − 4, a batch size of 256, and the native mixed
precision from PyTorch. We adopt a cosine learning rate
decay schedule (Loshchilov & Hutter, 2016).

(7)
6

Learning Dense Correspondences between Photos and Sketches

Methods

Encoder

Transfer
PCK-5 PCK-10

Retrain
PCK-5 PCK-10

CNNGeo (Rocco et al., 2018a)
WeakAlign (Rocco et al., 2018a)
NC-Net (Rocco et al., 2018b)
DCCNet (Huang et al., 2019)
PMD (Li et al., 2021)
WarpC-SemanticGLUNet (Truong et al., 2021)
Ours
Ours

ResNet-101
ResNet-101
ResNet-101
ResNet-101
VGG-16
VGG-16
ResNet-18
ResNet-101

27.59
35.65
40.60
42.43
35.77
48.79
–
–

19.19
43.55
–
–
–
56.78
56.01
57.92

57.71
68.76
63.50
66.53
71.24
71.43
–
–

42.57
78.60
–
–
–
79.70
82.89
84.72

Table 1: State-of-the-art comparison for photo-sketch correspondence learning.
4.2. Photo-sketch Correspondence Estimation

Training Description

PCK-5 PCK-10

We evaluate our correspondence estimation results qualitatively and quantitatively. We compare our method with
existing approaches in correspondence learning with image
or pair-level supervision, and present a state-of-the-art comparison on photo-sketch correspondence in Table 1. For
fair comparisons, we retrain existing open-sourced methods on the same photo-sketch dataset we used to develop
our own model (Sangkloy et al., 2016). We report their
PCK for α = (0.05, 0.1) in two settings: transfer (directly
evaluate on photo-sketch correspondence with pretrained
weights) and retrain (train from scratch on photo-sketch
correspondence). Methods that fail to converge on photosketch dataset are left blank. In Appendix B, we include
methods with stronger supervision to the table and detail
the training/evaluation setting of each method.

ImageNet only
CL on individual image
CL on image class
CL on image pair

17.20
44.41
54.81
56.01

48.93
75.67
81.72
82.89

Table 2: Ablation study on training feature encoder.
Ablation Description

PCK-5 PCK-10

No Lsim
No perceptual Lsim
No Lcon
No weight function w
No multiple feature layers
No conditional BN
Complete model

17.46
49.41
52.49
54.29
55.19
55.84
56.01

49.43
80.59
80.38
82.52
83.14
82.67
82.89

Table 3: Ablation study on correspondence estimation.

Our approach sets a new state-of-the-art for photo-sketch
correspondence. Although we only regress flow at 16 × 16,
which is less than the granularity of PCK-05, our ResNet101 model gains a substantial increase of +1.14%/+5.02%
compared to the second-best method WarpC-SemanticGLUNet (Truong et al., 2021). This is surprising as the latter
method benefits from flow resolution four times as large
as ours, and additional two-stage training on CityScape
(Cordts et al., 2016), DPED (Ignatov et al., 2017), and ADE
(Zhou et al., 2019). Our smaller ResNet-18 model also
outperforms most existing methods despite a significantly
shallower feature encoder, demonstrating the effectiveness
of our pair-based contrastive learning scheme in finding
dense correspondences between images from different image modalities. We visualize more examples of the dense
correspondence that our model predicts in Appendix D.

contrastive learning. The following rows compare the performance of different ways of constructing positive pairs: 1)
two augmented views from single images from the photosketch dataset, as in classical contrastive learning; 2) a photo
and a sketch randomly sampled from the same class; and
3) a photo and a sketch from the same photo-sketch pair.
We find that the pretrained model on ImageNet leads to the
worst performance due to its failure to generalize to sketch
data. Classical contrastive learning on the photo-sketch
dataset also harms model estimation, because the domains
of photo and sketch are not explicitly aligned in the representation space. The best result comes from contrastive
learning on photo-sketch pairs, as it provides the strongest
supervision for learning discriminative features. In Table 3,
we analyze the key components of our correspondence estimation framework. We first show the importance of our
objectives, by ablating the similarity loss, the perceptual
version of the similarity loss, and the consistency loss. In
addition, we show that the use of the weight function, multiple feature layers, and conditional BN further improves the
model performance.

4.3. Ablation Study
We conduct two sets of ablation experiments on the ResNet18 version of our framework. In Table 2 we analyze different training schemes for the feature encoder. In the first
row, we directly use the pretrained weights from ImageNet
7

Learning Dense Correspondences between Photos and Sketches

4.4. Comparing model and human error patterns

photo-sketch contrastive learning (46.36%), and the result of
human participants (95.04%). The model trained on photosketch contrastive learning exhibits a reliably weaker texture
bias (i.e., and thus stronger shape bias) than its photo-only
counterparts (Figure 6).

To what degree do any of the models tested generate predictions that achieve the degree of consistency that we observe
between individual human annotators? To evaluate this
question, for each pair of systems (whether two models, two
humans, or a model and a human), we computed the normalized mean pixel distance between the predictions they
generated for a given photo-sketch pair, then normalized
this distance by the image size. We find that while higherperforming models tend to produce predictions that are more
similar to one another, all of the models taken together display systematic biases that are distinct from those of humans
performing the photo-sketch correspondence task Figure 5.
These results indicate the size of the current human-model
gap and suggest that future progress on this benchmark will
entail bringing human-model consistency values closer to
that observed between individual humans.

1.0

Shape Bias

0.8
0.6
0.4
0.2
0.0

Human1 0 0.06 0.06 0.12 0.13 0.13 0.21 0.14 0.12 0.2 0.18 0.14 0.15
Human2 0.06 0 0.06 0.12 0.13 0.13 0.21 0.14 0.12 0.2 0.18 0.14 0.15
Human3 0.06 0.06 0 0.12 0.13 0.13 0.21 0.14 0.12 0.2 0.18 0.14 0.15
Ours(PS) 0.12 0.12 0.12 0 0.07 0.06 0.15 0.1 0.08 0.17 0.13 0.07 0.09
WarpC(PS) 0.13 0.13 0.13 0.07 0 0.08 0.17 0.07 0.1 0.18 0.14 0.09 0.11
Weakalign(PS) 0.13 0.13 0.13 0.06 0.08 0 0.15 0.11 0.08 0.18 0.14 0.07 0.09
CNNGeo(PS) 0.21 0.21 0.21 0.15 0.17 0.15 0 0.2 0.14 0.24 0.18 0.12 0.1
WarpC(PF) 0.14 0.14 0.14 0.1 0.07 0.11 0.2 0 0.12 0.2 0.16 0.12 0.14
PMD(PF) 0.12 0.12 0.12 0.08 0.1 0.08 0.14 0.12 0 0.18 0.14 0.09 0.1
DCCNet(PF) 0.2 0.2 0.2 0.17 0.18 0.18 0.24 0.2 0.18 0 0.2 0.18 0.2
NCNet(PF) 0.18 0.18 0.18 0.13 0.14 0.14 0.18 0.16 0.14 0.2 0 0.13 0.14
Weakalign(PF) 0.14 0.14 0.14 0.07 0.09 0.07 0.12 0.12 0.09 0.18 0.13 0 0.06
CNNGeo(PF) 0.15 0.15 0.15 0.09 0.11 0.09 0.1 0.14 0.1 0.2 0.14 0.06 0

ImageNet
CLS

ImageNet
CL

Sketch-photo
CL

Human

Figure 6: Comparing the degree of shape vs. texture bias
between models trained with different objectives. Higher
values suggest that the model recognition depends more on
shape information. Our model exhibits more human-like
performance. Each dot represents an object category from
(Geirhos et al., 2018). Error bars indicate 95% CI.

5. Related Work
Self-supervised Representation Learning. Learning with
self-supervision aims to obtain generic representations for
diverse downstream tasks with minimal dependence on human labels (Wang & Gupta, 2015; Doersch et al., 2015;
Pathak et al., 2016; Noroozi & Favaro, 2016; Zhang et al.,
2016; Gidaris et al., 2018; Wu et al., 2018). Recent research on sketch understanding also benefits from such
development (Pang et al., 2020; Xu et al., 2020; Bhunia
et al., 2021). These approaches are especially important
for making progress towards human-like image understanding, given that large numbers of labeled images are neither
available to nor necessary for humans to develop robust
perceptual abilities (Zhuang et al., 2021; Konkle & Alvarez,
2020; Rajalingham et al., 2018), including the ability to
understand sketches (Hochberg & Brooks, 1962; Kennedy
& Ross, 1975). In particular, recently proposed contrastive
learning techniques demonstrate competitive performance
with supervised baselines not only on visual recognition
(Hjelm et al., 2018; Oord et al., 2018; Wu et al., 2018;
Chen et al., 2020a; He et al., 2020; Grill et al., 2020; Chen
et al., 2020b; 2021), but also on learning visual representations from inputs varying across sensory views (Tian et al.,
2020a;b), across frames in video (Jabri et al., 2020; Xu
& Wang, 2021; Zhuang et al., 2020), and even between
text and images (Radford et al., 2021; Jia et al., 2021).

Figure 5: Measuring human and model consistency. Each
cell represents the mean pixel distance between correspondence predictions generated by two systems (whether artificial or human), normalized by the image size. We denote
models trained on Photo-sketch pairs with PS, and models
trained on PF-Pascal (Ham et al., 2016) as PF.
4.5. Shape Bias in Learned Representation
Recent work has shown that ImageNet-trained CNNs are
biased towards object texture compared to global object
shape on image recognition tasks (Geirhos et al., 2018).
Since sketch recognition requires relies on cues to object
category apart from texture, we hypothesized that our photosketch contrastive learning pre-training procedure would
mitigate this texture bias. To evaluate this hypothesis, we
followed the same evaluation protocol as in (Geirhos et al.,
2018; 2021). It devises a cue-conflict experiment in which
a model aims to classify images with conflicting shape and
texture. We report the shape bias of ResNet-18 models
from several different training objectives: ImageNet classification (20.06%), ImageNet contrastive learning (28.93%),
8

Learning Dense Correspondences between Photos and Sketches

Here, we leverage contrastive learning-based pretraining
to achieve strong performance on visual correspondence between images from highly distinct distributions (i.e., photos
and sketches). To the best of our knowledge, ours is the first
paper to successfully apply these approaches to the problem
of photo-sketch dense correspondence prediction.

eral strong correspondence learning baselines. Our results
suggest that our approach combining contrastive learning
and spatial transformer network is effective for capturing
photo-sketch correspondences, but there remain systematic
deviations from human judgments on the same task. Taken
together, we hope that these findings, along with our new
fine-grained multimodal image understanding benchmark,
will catalyze progress towards achieving more human-like
vision systems.

Weakly-supervised Semantic Correspondence Learning.
Geometric matching (Melekhov et al., 2019; Li et al., 2020;
Rocco et al., 2020; Shen et al., 2020; Truong et al., 2020)
is perhaps the most basic form of correspondence prediction, which aims to align two views of the same scene. On
the contrary, semantic matching (Ham et al., 2016; Rocco
et al., 2018a;b; Huang et al., 2019; Li et al., 2021; Truong
et al., 2021) aims to establish more abstract correspondences
between the image of objects in the same class, in a way
that is tolerant to greater variation in appearance and shape.
Due to difficulties in collecting ground-truth data for dense
correspondence learning, prior work has generally resorted
to weak supervision, such as synthetic transformation on
single images (Rocco et al., 2018a; Jeon et al., 2018; Seo
et al., 2018) and image pairs (Rocco et al., 2018b; Kim et al.,
2019; 2018; Jeon et al., 2020; Huang et al., 2019; Li et al.,
2021; Truong et al., 2021; 2022). Various objectives have
been proposed to explore the correspondence from weak
supervision, including synthetic supervision, optimization
of the cost volume, forward-backward consistency, or a combination of these objectives. Most work utilizes hierarchical
features in deep models from supervised pretraining on ImageNet. The dense correspondence is then predicted with a
dense flow field (Ham et al., 2016; Rocco et al., 2018a; Jeon
et al., 2018; Seo et al., 2018; Li et al., 2021; Truong et al.,
2021) or a cost volume (Rocco et al., 2018b; Huang et al.,
2019; Truong et al., 2022). In this work, we propose a photosketch correspondence learning framework that explicitly
estimates the dense flow field with image pair supervision.

Acknowledgements
Many thanks to the members of the Cognitive Tools Lab
and the Prof. Wang’s Lab at UC San Diego for their helpful feedback and support. This work was supported by an
NSF CAREER Award #2047191 to J.E.F.. J.E.F is additionally supported by an ONR Science of Autonomy award
and a Stanford Hoffman-Yee grant. Prof. Wang’s lab was
supported, in part, by NSF CAREER Award IIS-2240014,
DARPA LwLL, Amazon Research Award, and gifts from
Qualcomm.

References
Aubert, M., Brumm, A., Ramli, M., Sutikna, T., Saptomo,
E. W., Hakim, B., Morwood, M. J., van den Bergh, G. D.,
Kinsley, L., and Dosseto, A. Pleistocene cave art from
sulawesi, indonesia. Nature, 514(7521):223–227, 2014.
Bagus, A. M. I. G., Marques, T., Sanghavi, S., DiCarlo,
J. J., and Schrimpf, M. Primate inferotemporal cortex
neurons generalize better to novel image distributions
than analogous deep neural networks units. In SVRHM
2022 Workshop@ NeurIPS.
Berg, A. C., Berg, T. L., and Malik, J. Shape matching and
object recognition using low distortion correspondences.
In 2005 IEEE computer society conference on computer
vision and pattern recognition (CVPR’05), volume 1, pp.
26–33. IEEE, 2005.

6. Conclusions
What is needed to develop artificial systems that learn to
perceive the visual world as keenly as humans do? While artificial vision systems have made dramatic improvements in
a variety of tasks, there remain key aspects of human image
understanding that continue to pose major challenges. Here
we focused on one of these aspects: the ability to understand the semantic content of color photos and line drawings
well enough to establish a detailed mapping between them.
Our paper introduces a new photo-sketch correspondence
benchmark containing 150K human annotations of 6250
sketch-photo pairs across 125 object categories, augmenting existing photo-sketch benchmark datasets (Sangkloy
et al., 2016). In addition, we conduct several experiments
to evaluate a self-supervised approach to learning to predict
these correspondences and compare this approach to sev-

Bhunia, A. K., Yang, Y., Hospedales, T. M., Xiang, T.,
and Song, Y.-Z. Sketch less for more: On-the-fly finegrained sketch-based image retrieval. In Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 9779–9788, 2020.
Bhunia, A. K., Chowdhury, P. N., Yang, Y., Hospedales,
T. M., Xiang, T., and Song, Y.-Z. Vectorization and rasterization: Self-supervised learning for sketch and handwriting. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp. 5672–
5681, 2021.
Cadena, S. A., Denfield, G. H., Walker, E. Y., Gatys, L. A.,
Tolias, A. S., Bethge, M., and Ecker, A. S. Deep con9

Learning Dense Correspondences between Photos and Sketches

volutional models improve predictions of macaque v1
responses to natural images. PLoS computational biology, 15(4):e1006897, 2019.

Fan, J. E., Yamins, D. L., and Turk-Browne, N. B. Common
object representations for visual production and recognition. Cognitive science, 42(8):2670–2698, 2018.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A
simple framework for contrastive learning of visual representations. In International conference on machine
learning, pp. 1597–1607. PMLR, 2020a.

Fan, J. E., Hawkins, R. D., Wu, M., and Goodman, N. D.
Pragmatic inference and visual abstraction enable contextual flexibility during visual communication. Computational Brain & Behavior, 3(1):86–101, 2020.

Chen, T., Kornblith, S., Swersky, K., Norouzi, M., and
Hinton, G. E. Big self-supervised models are strong
semi-supervised learners. Advances in neural information
processing systems, 33:22243–22255, 2020b.

Fodor, J. The revenge of the given. Contemporary debates
in philosophy of mind, pp. 105–116, 2007.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained cnns are
biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231,
2018.

Chen, X., Fan, H., Girshick, R., and He, K. Improved
baselines with momentum contrastive learning. arXiv
preprint arXiv:2003.04297, 2020c.
Chen, X., Xie, S., and He, K. An empirical study of training
self-supervised vision transformers. In Proceedings of
the IEEE/CVF International Conference on Computer
Vision, pp. 9640–9649, 2021.

Geirhos, R., Narayanappa, K., Mitzkus, B., Thieringer, T.,
Bethge, M., Wichmann, F. A., and Brendel, W. Partial
success in closing the gap between human and machine
vision. Advances in Neural Information Processing Systems, 34:23885–23899, 2021.

Cho, S., Hong, S., Jeon, S., Lee, Y., Sohn, K., and Kim, S.
Cats: Cost aggregation transformers for visual correspondence. In Thirty-Fifth Conference on Neural Information
Processing Systems, 2021.

Gidaris, S., Singh, P., and Komodakis, N. Unsupervised representation learning by predicting image rotations. arXiv
preprint arXiv:1803.07728, 2018.
Goodman, N. Languages of art: An approach to a theory of
symbols. Hackett publishing, 1976.

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler,
M., Benenson, R., Franke, U., Roth, S., and Schiele, B.
The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2016.

Greenberg, G. Semantics of pictorial space. Review of
Philosophy and Psychology, 12(4):847–887, 2021.
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P.,
Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z.,
Gheshlaghi Azar, M., et al. Bootstrap your own latent-a
new approach to self-supervised learning. Advances in
neural information processing systems, 33:21271–21284,
2020.

Dalal, N. and Triggs, B. Histograms of oriented gradients
for human detection. In 2005 IEEE computer society
conference on computer vision and pattern recognition
(CVPR’05), volume 1, pp. 886–893. Ieee, 2005.
De Vries, H., Strub, F., Mary, J., Larochelle, H., Pietquin, O.,
and Courville, A. C. Modulating early visual processing
by language. Advances in Neural Information Processing
Systems, 30, 2017.

Ham, B., Cho, M., Schmid, C., and Ponce, J. Proposal flow.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 3475–3484, 2016.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei,
L. Imagenet: A large-scale hierarchical image database.
In 2009 IEEE conference on computer vision and pattern
recognition, pp. 248–255. Ieee, 2009.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition,
pp. 770–778, 2016.

Doersch, C., Gupta, A., and Efros, A. A. Unsupervised
visual representation learning by context prediction. In
Proceedings of the IEEE international conference on computer vision, pp. 1422–1430, 2015.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation
learning. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pp. 9729–9738,
2020.

Eitz, M., Richter, R., Boubekeur, T., Hildebrand, K., and
Alexa, M. Sketch-based shape retrieval. ACM Transactions on graphics (TOG), 31(4):1–10, 2012.

Hertzmann, A. Why do line drawings work? a realism
hypothesis. Perception, 49(4):439–451, 2020.
10

Learning Dense Correspondences between Photos and Sketches

Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal,
K., Bachman, P., Trischler, A., and Bengio, Y. Learning
deep representations by mutual information estimation
and maximization. arXiv preprint arXiv:1808.06670,
2018.

Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H.,
Le, Q., Sung, Y.-H., Li, Z., and Duerig, T. Scaling up
visual and vision-language representation learning with
noisy text supervision. In International Conference on
Machine Learning, pp. 4904–4916. PMLR, 2021.

Hochberg, J. and Brooks, V. Pictorial recognition as an
unlearned ability: A study of one child’s performance. the
american Journal of Psychology, 75(4):624–628, 1962.

Kennedy, J. M. and Ross, A. S. Outline picture perception
by the songe of papua. Perception, 4(4):391–406, 1975.

Hoffmann, D. L., Standish, C. D., Garcı́a-Diez, M., Pettitt, P. B., Milton, J. A., Zilhão, J., Alcolea-González,
J. J., Cantalejo-Duarte, P., Collado, H., de Balbı́n,
R., Lorblanchet, M., Ramos-Muñoz, J., Weniger, G.C., and Pike, A. W. G. U-th dating of carbonate
crusts reveals neandertal origin of iberian cave art.
Science, 359(6378):912–915, 2018. doi: 10.1126/
science.aap7778. URL https://www.science.
org/doi/abs/10.1126/science.aap7778.
Huang, S., Wang, Q., Zhang, S., Yan, S., and He, X.
Dynamic context correspondence network for semantic
alignment. 2019 IEEE/CVF International Conference on
Computer Vision (ICCV), pp. 2010–2019, 2019.

Khaligh-Razavi, S.-M. and Kriegeskorte, N. Deep supervised, but not unsupervised, models may explain it cortical representation. PLoS computational biology, 10(11):
e1003915, 2014.
Kim, J., Liu, C., Sha, F., and Grauman, K. Deformable
spatial pyramid matching for fast dense correspondences.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 2307–2314, 2013.
Kim, S., Lin, S., Jeon, S. R., Min, D., and Sohn, K. Recurrent transformer networks for semantic correspondence.
Advances in neural information processing systems, 31,
2018.
Kim, S., Min, D., Jeong, S., Kim, S., Jeon, S., and Sohn, K.
Semantic attribute matching networks. In Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 12339–12348, 2019.

Huey, H., Lu, X., Walker, C., and Fan, J. Explanatory
drawings prioritize functional properties at the expense
of visual fidelity. 2021.

Konkle, T. and Alvarez, G. A. Instance-level contrastive
learning yields human brain-like representation without
category-supervision. BioRxiv, 2020.

Ignatov, A., Kobyshev, N., Timofte, R., Vanhoey, K., and
Van Gool, L. Dslr-quality photos on mobile devices with
deep convolutional networks. In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3277–
3285, 2017.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet
classification with deep convolutional neural networks.
Communications of the ACM, 60(6):84–90, 2017.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating
deep network training by reducing internal covariate shift.
In International conference on machine learning, pp. 448–
456. PMLR, 2015.

Kulvicki, J. Analog representation and the parts principle. Review of Philosophy and Psychology, 6(1):165–180,
2015.
Li, X., Han, K., Li, S., and Prisacariu, V. Dual-resolution
correspondence networks. Advances in Neural Information Processing Systems, 33:17346–17357, 2020.

Jabri, A., Owens, A., and Efros, A. Space-time correspondence as a contrastive random walk. Advances in neural
information processing systems, 33:19545–19560, 2020.
Jaderberg, M., Simonyan, K., Zisserman, A., et al. Spatial
transformer networks. Advances in neural information
processing systems, 28, 2015.

Li, X., Fan, D.-P., Yang, F., Luo, A., Cheng, H., and Liu, Z.
Probabilistic model distillation for semantic correspondence. 2021 IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), pp. 7501–7510, 2021.

Jeon, S., Kim, S., Min, D., and Sohn, K. Parn: Pyramidal
affine regression networks for dense semantic correspondence. In Proceedings of the European Conference on
Computer Vision (ECCV), pp. 351–366, 2018.

Liu, C., Yuen, J., and Torralba, A. Sift flow: Dense correspondence across scenes and its applications. IEEE transactions on pattern analysis and machine intelligence, 33
(5):978–994, 2010.

Jeon, S., Min, D., Kim, S., Choe, J., and Sohn, K. Guided
semantic flow. In European Conference on Computer
Vision, pp. 631–648. Springer, 2020.

Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint
arXiv:1608.03983, 2016.
11

Learning Dense Correspondences between Photos and Sketches

Lowe, D. G. Object recognition from local scale-invariant
features. In Proceedings of the seventh IEEE international conference on computer vision, volume 2, pp. 1150–
1157. Ieee, 1999.

Rajalingham, R., Issa, E. B., Bashivan, P., Kar, K., Schmidt,
K., and DiCarlo, J. J. Large-scale, high-resolution comparison of the core visual object recognition behavior
of humans, monkeys, and state-of-the-art deep artificial
neural networks. Journal of Neuroscience, 38(33):7255–
7269, 2018.

Meister, S., Hur, J., and Roth, S. Unflow: Unsupervised
learning of optical flow with a bidirectional census loss.
In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.

Rocco, I., Arandjelovic, R., and Sivic, J. Convolutional
neural network architecture for geometric matching. In
Proceedings of the IEEE conference on computer vision
and pattern recognition, pp. 6148–6157, 2017.

Melekhov, I., Tiulpin, A., Sattler, T., Pollefeys, M., Rahtu,
E., and Kannala, J. Dgc-net: Dense geometric correspondence network. In 2019 IEEE Winter Conference on
Applications of Computer Vision (WACV), pp. 1034–1042.
IEEE, 2019.

Rocco, I., Arandjelović, R., and Sivic, J. End-to-end weaklysupervised semantic alignment. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 6917–6925, 2018a.

Min, J. and Cho, M. Convolutional hough matching networks. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp. 2940–
2950, 2021.

Rocco, I., Cimpoi, M., Arandjelović, R., Torii, A., Pajdla,
T., and Sivic, J. Neighbourhood consensus networks.
Advances in neural information processing systems, 31,
2018b.

Min, J., Lee, J., Ponce, J., and Cho, M. Hyperpixel flow:
Semantic correspondence with multi-layer neural features.
In ICCV, 2019.

Rocco, I., Arandjelović, R., and Sivic, J. Efficient neighbourhood consensus networks via submanifold sparse
convolutions. In European conference on computer vision, pp. 605–621. Springer, 2020.

Mukherjee, K., Hawkins, R. X., and Fan, J. W. Communicating semantic part information in drawings. In CogSci,
pp. 2413–2419, 2019.

Sangkloy, P., Burnell, N., Ham, C., and Hays, J. The sketchy
database: learning to retrieve badly drawn bunnies. ACM
Transactions on Graphics (TOG), 35(4):1–12, 2016.

Noroozi, M. and Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pp. 69–84. Springer,
2016.

Seo, P. H., Lee, J., Jung, D., Han, B., and Cho, M. Attentive
semantic alignment with offset-aware correlation kernels.
In Proceedings of the European Conference on Computer
Vision (ECCV), pp. 349–364, 2018.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint
arXiv:1807.03748, 2018.

Shen, X., Darmon, F., Efros, A. A., and Aubry, M. Ransacflow: generic two-stage image alignment. In European
Conference on Computer Vision, pp. 618–637. Springer,
2020.

Pang, K., Yang, Y., Hospedales, T. M., Xiang, T., and
Song, Y.-Z. Solving mixed-modal jigsaw puzzle for finegrained sketch-based image retrieval. In Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 10347–10355, 2020.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv
preprint arXiv:1409.1556, 2014.
Sun, D., Yang, X., Liu, M.-Y., and Kautz, J. Pwc-net: Cnns
for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 8934–8943, 2018.

Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., and
Efros, A. A. Context encoders: Feature learning by
inpainting. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 2536–2544,
2016.

Tian, Y., Krishnan, D., and Isola, P. Contrastive multiview
coding. In European conference on computer vision, pp.
776–794. Springer, 2020a.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G.,
Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J.,
et al. Learning transferable visual models from natural
language supervision. In International Conference on
Machine Learning, pp. 8748–8763. PMLR, 2021.

Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., and
Isola, P. What makes for good views for contrastive
learning? Advances in Neural Information Processing
Systems, 33:6827–6839, 2020b.
12

Learning Dense Correspondences between Photos and Sketches

Truong, P., Danelljan, M., and Timofte, R. Glu-net: Globallocal universal network for dense flow and correspondences. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pp. 6258–6268,
2020.

Yang, J. and Fan, J. E. Visual communication of object
concepts at different levels of abstraction. arXiv preprint
arXiv:2106.02775, 2021.
Yu, Q., Liu, F., Song, Y.-Z., Xiang, T., Hospedales, T. M.,
and Loy, C.-C. Sketch me that shoe. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 799–807, 2016.

Truong, P., Danelljan, M., Yu, F., and Van Gool, L. Warp
consistency for unsupervised learning of dense correspondences. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pp. 10346–10356, 2021.

Yu, Q., Yang, Y., Liu, F., Song, Y.-Z., Xiang, T., and
Hospedales, T. M. Sketch-a-net: A deep neural network
that beats humans. International journal of computer
vision, 122(3):411–425, 2017.

Truong, P., Danelljan, M., Yu, F., and Van Gool, L. Probabilistic warp consistency for weakly-supervised semantic
correspondences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
8708–8718, 2022.

Zhang, R., Isola, P., and Efros, A. A. Colorful image colorization. In European conference on computer vision,
pp. 649–666. Springer, 2016.

Vinker, Y., Pajouheshgar, E., Bo, J. Y., Bachmann, R. C.,
Bermano, A. H., Cohen-Or, D., Zamir, A., and Shamir,
A. Clipasso: Semantically-aware object sketching. arXiv
preprint arXiv:2202.05822, 2022.

Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso,
A., and Torralba, A. Semantic understanding of scenes
through the ade20k dataset. International Journal of
Computer Vision, 127(3):302–321, 2019.

Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S.,
and Murphy, K. Tracking emerges by colorizing videos.
In Proceedings of the European conference on computer
vision (ECCV), pp. 391–408, 2018.

Zhuang, C., She, T., Andonian, A., Mark, M. S., and Yamins,
D. Unsupervised learning from video with deep neural
embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
9563–9572, 2020.

Wang, X. and Gupta, A. Unsupervised learning of visual
representations using videos. In Proceedings of the IEEE
international conference on computer vision, pp. 2794–
2802, 2015.

Zhuang, C., Yan, S., Nayebi, A., Schrimpf, M., Frank, M. C.,
DiCarlo, J. J., and Yamins, D. L. Unsupervised neural network models of the ventral visual stream. Proceedings of
the National Academy of Sciences, 118(3):e2014196118,
2021.

Wang, X., Jabri, A., and Efros, A. A. Learning correspondence from the cycle-consistency of time. In Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 2566–2576, 2019.
Wu, Z., Xiong, Y., Yu, S. X., and Lin, D. Unsupervised
feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 3733–3742, 2018.
Xu, J. and Wang, X. Rethinking self-supervised correspondence learning: A video frame-level similarity perspective. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pp. 10075–10085, 2021.
Xu, P., Song, Z., Yin, Q., Song, Y.-Z., and Wang, L.
Deep self-supervised representation learning for freehand sketch. IEEE Transactions on Circuits and Systems
for Video Technology, 31(4):1503–1513, 2020.
Yamins, D. L., Hong, H., Cadieu, C. F., Solomon, E. A.,
Seibert, D., and DiCarlo, J. J. Performance-optimized
hierarchical models predict neural responses in higher
visual cortex. Proceedings of the national academy of
sciences, 111(23):8619–8624, 2014.
13

Learning Dense Correspondences between Photos and Sketches

A. Details of the Photo-Sketch
Correspondence Benchmark (PSC6k)

B.2. Training and Evaluation Details
Evaluation setting. All methods are evaluated on our
PSC6k benchmark using their original evaluation scripts.
We make necessary edits to adapt the existing codes to
PSC6k.

A.1. Keypoint Sampling
We visualize the steps we take to sample eight keypoints
spanning the object in Figure 7. First, we fill in the outermost contour detected in the sketch to generate the segmentation of the object. In cases where multiple contours are
detected due to unconnected strokes, we apply dilation and
contour filling iteratively until all strokes are connected. We
then cluster the pixels covered by the segmentation mask
into 8 pseudo-parts, by building a nearest-neighbor-based
affinity matrix over pixels and applying spectral clustering.
Since the affinity between two pixels is defined by the shortest path instead of the L2 distance, it ensures a clustering
that maintains the connectivity within each pseudo-part.

Sketch

Segmentation Pseudo-part

Training setting. In the transfer setting, we use the pretrained weights on PF-Pascal provided by each method. In
the retrain setting, we train the methods on the training split
of the Sketchy dataset (Sangkloy et al., 2016) using their
codes and default hyperparameters. Since there is no validation split, we do not select the best checkpoint and evaluate
with the last checkpoint after training.
Since the training set of Sketchy dataset is 88X larger than
that of PF-Pascal, it is impossible to keep the original training epochs and learning rate schedule in large models such
as WarpC. Therefore, we make the following changes to
the series: instead of training 100 epochs as in the original
settings, we find that training for 2 epochs on Sketchy has
already guaranteed an optimal performance (since it leads to
1.77× iterations compared to the original training scheme).
We reduce the learning rate to 0.125× in the second epoch
to approximate the original LR schedule of the method.

Keypoint

Figure 7: Example of the keypoint sampling process. We
show the sketch, segmentation mask, pseudo-parts, and final
keypoints.

Causes of blank entries. The retrain performance of several
methods are left blank for the following reasons:

A.2. Annotation Filtering

• The method requires stronger supervision than what
the Sketchy training set provides. This applies to all
methods with keypoint supervision.

In rare cases, for a given keypoint, one of the three annotations has an exceptionally large distance from the median
location x̃ of all three annotations, denoted as d = ∥x − x̃∥22 .
We gather the distance d for the 150,000 annotations that we
collect and compute its mean and standard deviation. The
annotations with d of three standard deviations away from
the mean are then considered outliers and excluded from the
final determination of the centroids. This rejects 0.74% of
the annotations.

• The method fails to converge on the photo-sketch correspondence task, which is observed in NC-Net and
DCCNet. We hypothesize that since the sketch samples are out-of-domain, their cost volume optimization
blocks fail to handle the large disparity between the
representations of photos and sketches.
• The method does not provide codes for training: PMD.
• In addition, methods that did not release source code,
failed to execute, or did not provide pre-trained weights
are excluded from the table.

B. Additional Evaluation on PSC6k
B.1. Methods with Stronger Supervision
For a more comprehensive evaluation of existing correspondence learning methods on our PSC6k benchmark, we include methods with keypoint supervision in Table 4 and
report their PCK for α = (0.05, 0.1). We report the performance of keypoint-supervised models in the transfer setting
only (directly evaluate on the photo-sketch correspondence
with pretrained weights), because they require supervision
beyond what the sketchy training set provides. Interestingly,
we observe that CATs (Cho et al., 2021) performs exceptionally well on the photo-sketch correspondence, even without
retraining on photo-sketch pairs. This suggests its good
ability of generalization.

C. Generalization to Unseen Categories
To analyze the generalization capability of our proposed
model, we evaluate its performance on categories that were
not included during the training phase. Specifically, we
randomly sample N categories from the full set of 125 categories in the Sketchy dataset, and hold them out during
the training of both the feature encoder and warp estimator.
Then we evaluate the model performance of correspondence
estimation on these N held-out categories. We conduct experiments for N=10 and N=20. The mean performance and
standard deviation were calculated based on three randomly
14

Learning Dense Correspondences between Photos and Sketches

Sup

Methods

Transfer
PCK-5 PCK-10

Retrain
PCK-5 PCK-10

KP

HPF(Min et al., 2019)
CHM (Min & Cho, 2021)
PMD(Li et al., 2021)
CATs(Cho et al., 2021)

50.55
40.52
28.62
52.36

78.18
69.91
63.95
81.80

–
–
–
–

–
–
–
–

Pair

CNNGeo (Rocco et al., 2018a)
WeakAlign (Rocco et al., 2018a)
NC-Net (Rocco et al., 2018b)
DCCNet (Huang et al., 2019)
PMD (Li et al., 2021)
WarpC-SemanticGLUNet (Truong et al., 2021)
Ours (ResNet-18)
Ours (ResNet-101)

27.59
35.65
40.60
42.43
35.77
48.79
–
–

57.71
68.76
63.50
66.53
71.24
71.43
–
–

19.19
43.55
–
–
–
56.78
56.01
57.92

42.57
78.60
–
–
–
79.70
82.89
84.72

Table 4: Comprehensive evaluation for photo-sketch correspondence learning.
sampled held-out splits for each of the two conditions. The
results are presented in Table 5.
As shown in the table, our method maintains a very decent
performance on the 10/20 categories absent during training,
with a decrease of -0.33%/-0.26% for 10 held-out categories
and a decrease of -0.50%/-0.37% for 20 held-out categories.
This shows that our method is robust in generalization to
unseen categories.

# Categories (N)

PCK-5 (±std) PCK-10 (±std)

0
10
20

56.01
55.68 (0.20)
55.51 (0.27)

82.89
82.63 (0.15)
82.52 (0.18)

Photo

Sketch

Warped
Photo

Keypoint
Correspondence

Table 5: Model performance on unseen categories.
Figure 8: Examples of three typical failure patterns. The
method has worse performance for: 1) commonly cooccurred objects, 2) fine structures, and 3) non-continuous
transformations.

D. Additional Qualitative Results
We show typical failure patterns in Figure 8. Specifically,
the model has degraded performance in 1) discriminating between commonly cooccurred objects; 2) aligning fine structures due to low resolution; and 3) handling non-continuous
transformation caused by large changes in perspective and
structure, which violates the continuity assumption in warpbased models that close points should correspond to close
locations. We believe that they are the main problems that
need to be addressed in future studies.
Lastly, we exhibit more examples of photo-sketch correspondence predicted by our model (Figure 9, Figure 10,
Figure 11).

15

Learning Dense Correspondences between Photos and Sketches

Photo

Sketch

Warped
Photo

Keypoint
Correspondence

Figure 9: More alignment examples on the PSC6k dataset.

16

Learning Dense Correspondences between Photos and Sketches

Photo

Sketch

Warped
Photo

Keypoint
Correspondence

Figure 10: More alignment examples on the PSC6k dataset.

17

Learning Dense Correspondences between Photos and Sketches

Photo

Sketch

Warped
Photo

Keypoint
Correspondence

Figure 11: More alignment examples on the PSC6k dataset.

18