Deep Models of Interactions Across Sets

Jason Hartford*¹, Devon R. Graham*¹, Kevin Leyton-Brown¹, Siamak Ravanbakhsh¹
*Equal contribution. ¹Department of Computer Science, University of British Columbia, Canada. Correspondence to: Devon Graham, Jason Hartford.
Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

We use deep learning to model interactions across two or more sets of objects, such as user–movie ratings, protein–drug bindings, or ternary user–item–tag interactions. The canonical representation of such interactions is a matrix (or a higher-dimensional tensor) with an exchangeability property: the encoding's meaning is not changed by permuting rows or columns. We argue that models should hence be Permutation Equivariant (PE): constrained to make the same predictions across such permutations. We present a parameter-sharing scheme and prove that it could not be made any more expressive without violating PE. This scheme yields three benefits. First, we demonstrate state-of-the-art performance on multiple matrix completion benchmarks. Second, our models require a number of parameters independent of the numbers of objects, and thus scale well to large datasets. Third, models can be queried about new objects that were not available at training time, but for which interactions have since been observed. In experiments, our models achieved surprisingly good generalization performance on this matrix extrapolation task, both within domains (e.g., new users and new movies drawn from the same distribution used for training) and even across domains (e.g., predicting music ratings after training on movies).

1. Introduction

Suppose we are given a set of users N = {1, ..., N}, a set of movies M = {1, ..., M}, and their interactions in the form of tuples X = ⟨n, m, x⟩ with n ∈ N, m ∈ M and x ∈ R. Our goal is to learn a function that models the interaction between users and movies, i.e., a mapping from N × M to R. The canonical representation of such a function is a matrix X ∈ R^{N×M}; of course, we want X_{n,m} = x for each ⟨n, m, x⟩ ∈ X. Learning our function corresponds to matrix completion: using patterns in X to predict values for the remaining elements of X. X is what we will call an exchangeable matrix: any row- and column-wise permutation of X represents the same set of ratings and hence the same matrix completion problem.
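For illustration (not from the paper), here is a tiny example of this representation: interaction tuples are scattered into a partially observed matrix, and permuting rows and columns leaves the set of observed ratings — and hence the completion problem — unchanged. The toy data below is made up.

```python
import numpy as np

# Toy interaction tuples (user, movie, rating), with users/movies indexed from 0.
tuples = [(0, 1, 4.0), (0, 2, 3.0), (1, 0, 5.0), (2, 2, 1.0)]
N, M = 3, 3

X = np.full((N, M), np.nan)          # NaN marks unobserved entries
for n, m, x in tuples:
    X[n, m] = x

# Permuting rows (users) and columns (movies) relabels the objects but leaves the
# multiset of observed ratings -- and hence the completion problem -- unchanged.
perm_rows, perm_cols = np.random.permutation(N), np.random.permutation(M)
X_perm = X[perm_rows][:, perm_cols]
assert np.array_equal(np.sort(X_perm[~np.isnan(X_perm)]),
                      np.sort(X[~np.isnan(X)]))
```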
Exchangeability has a long history in machine learning and statistics. de Finetti's theorem states that exchangeable sequences are conditionally i.i.d. given a latent variable — that is, they arise from a latent variable model. Extensions of this theorem characterize distributions over other exchangeable structures, from matrices to graphs; see Orbanz & Roy (2015) for a detailed treatment. In machine learning, a variety of frameworks formalize exchangeability in data, from plate notation to statistical relational models (Raedt et al., 2016; Getoor & Taskar, 2007). When dealing with exchangeable arrays (or tensors), a common approach is tensor factorization. In particular, one thread of work leverages tensor decomposition for inference in latent variable models (Anandkumar et al., 2014). However, in addition to having limited expressive power, tensor factorization requires retraining the model for each new input.

We call a learning algorithm Permutation Equivariant (PE) if it is constrained to give the same answer across all exchangeable inputs; we argue that PE is an important form of inductive bias in exchangeable domains. However, it is not trivial to achieve; indeed, all existing parametric factorization and matrix/tensor completion methods associate parameters with each row and column, and thus are not PE. How can we enforce this property? One approach is to augment the input with all M! × N! permutations of rows and columns. However, this is computationally wasteful and becomes infeasible for all but the smallest N and M. Instead, we show that a simple parameter-sharing scheme suffices to ensure that a deep model can represent only PE functions. The result is analogous to the idea of a convolution layer: a lower-dimensional effective parameter space that enforces a desired equivariance property. Indeed, parameter sharing is a generic and efficient approach for achieving equivariance in deep models (Ravanbakhsh et al., 2017).

Figure 1. Structure of our parameter matrix for the 1D (left), 2D (centre), and 3D (right) input arrays. The parameter-sharing patterns for the weight matrices of the higher-dimensional arrays can be produced via the Kronecker product of the weight matrix for the 1D array (i.e., vector).

When the matrix models the interaction between the members of a single group, one can further constrain permutations to be identical across both rows and columns. An example of such a jointly exchangeable matrix (Orbanz & Roy, 2015), modelling the interaction of the nodes in a graph, is the adjacency matrix used by graph convolution. Our approach reduces to graph convolution in the special case of 2D arrays with this additional parameter-sharing constraint, but it also applies to arbitrary matrices and higher-dimensional arrays.

Finally, we explain connections to some of our own past work. First, we introduced a similar parameter-sharing scheme in the context of behavioral game theory (Hartford et al., 2016): rows and columns represented players' actions and the exchangeable matrix encoded payoffs. The current work provides a theoretical foundation for these ideas and shows how a similar architecture can be applied much more generally. Second, our model for exchangeable matrices can be seen as a generalization of the deep sets architecture (Zaheer et al., 2017), where a set can be seen as a one-dimensional exchangeable array.

In what follows, we begin by introducing our parameter-sharing approach in Section 2, considering the cases of both dense and sparse matrices. In Section 3.2 we study two architectures for matrix completion that use an exchangeable matrix layer. In particular, the factorized autoencoding model provides a powerful alternative to commonly used matrix factorization methods; notably, it does not require retraining to be evaluated on previously unseen data. In Section 4 we present extensive results on benchmark matrix completion datasets. We generalize our approach to higher-dimensional tensors in Section 5.

2. Exchangeable Matrix Layer

Let X ∈ R^{N×M} be our "exchangeable" input matrix. We use vec(X) ∈ R^{NM} to denote its vectorized form and vec^{-1}(x) to denote the inverse of the vectorization that reshapes a vector of length NM into an N × M matrix — i.e., vec^{-1}(vec(X)) = X. Consider a fully connected layer Y := vec^{-1}(σ(W vec(X))), where σ is an element-wise nonlinear function such as the sigmoid, W ∈ R^{NM×NM}, and Y ∈ R^{N×M} is the output matrix.

Figure 2. Parameter sharing in an exchangeable matrix layer. The application of different tied parameters to input elements is illustrated using dark blue for w1, green for w2, red for w3, and light blue for w4. The same structure (with the same four parameters) repeats across all units of the output layer.
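Before formalizing this requirement, a quick numerical illustration (ours, not from the paper) of why an unconstrained weight matrix will not do: for a generic W, permuting the rows and columns of the input does not simply permute the output.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 3, 4
X = rng.normal(size=(N, M))
W = rng.normal(size=(N * M, N * M))      # unconstrained: one weight per pair of entries

def dense_layer(X, W):
    # Generic fully connected layer applied to the vectorized matrix.
    return np.tanh(W @ X.reshape(-1)).reshape(X.shape)

P = np.eye(N)[[1, 2, 0]]                 # fixed row permutation G^(N)
Q = np.eye(M)[[3, 0, 1, 2]]              # fixed column permutation G^(M)

# With generic W, permuting the input does NOT simply permute the output,
# so the layer is not permutation equivariant.
print(np.allclose(dense_layer(P @ X @ Q, W), P @ dense_layer(X, W) @ Q))  # almost surely False
```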
Exchangeability of X motivates the following property: suppose we permute the rows and columns of X using two arbitrary permutation matrices G^(N) ∈ {0,1}^{N×N} and G^(M) ∈ {0,1}^{M×M} to get X′ := G^(N) X G^(M). Permutation Equivariance (PE) requires the new output matrix Y′ := vec^{-1}(σ(W vec(X′))) to experience the same permutation of rows and columns — that is, we require Y′ = G^(N) Y G^(M).

Definition 2.1 (exchangeable matrix layer, simplified¹). Given X ∈ R^{N×M}, a fully connected layer σ(W vec(X)) with W ∈ R^{NM×NM} is called an exchangeable matrix layer if, for all permutation matrices G^(N) ∈ {0,1}^{N×N} and G^(M) ∈ {0,1}^{M×M}, permutation of the rows and columns results in the same permutation of the output:

    vec^{-1}(σ(W vec(G^(N) X G^(M)))) = G^(N) vec^{-1}(σ(W vec(X))) G^(M).    (1)

¹ This definition is simplified to ease exposition; the full definition (see Section 5) adds the additional constraint that the layer not be equivariant w.r.t. any other permutation of the elements of X. Otherwise, a trivial layer with a constant weight matrix W_{n,m} = c would also satisfy the stated equivariance property.

This requirement heavily constrains the weight matrix W: indeed, its number of effective degrees of freedom cannot even grow with N and M. Instead, the resulting layer is forced to have the following simple form:

    W_{(n,m),(n′,m′)} := w1  if n = n′ and m = m′,
                         w2  if n = n′ and m ≠ m′,
                         w3  if n ≠ n′ and m = m′,
                         w4  if n ≠ n′ and m ≠ m′.    (2)

For each output element Y_{n,m}, we have the following parameters: one connecting it to its counterpart X_{n,m}; one each connecting it to the inputs of the same row and the inputs of the same column; and one shared by all the other connections. We also include a bias parameter; see Fig. 2 for an illustration of the action of this parameter matrix, and Fig. 1 for a visual illustration of its structure. Theorem 2.1 formalizes the requirement on our parameter matrix. All proofs are deferred to the appendix.

Jointly Exchangeable Matrices. For jointly exchangeable matrices, such as the adjacency matrix of a graph, Equation (1) is constrained to have N = M and G^(N) = G^(M). This in turn constrains the corresponding parameter-sharing so that w2 = w3 in Equation (2).

Theorem 2.1. Given a strictly monotonic function σ, a neural layer σ(W vec(X)) is an exchangeable matrix layer iff the elements of the parameter matrix W are tied together such that the resulting fully connected layer simplifies to

    Y = σ( w′_1 X + (w′_2 / N) 1_N 1_N^T X + (w′_3 / M) X 1_M 1_M^T + (w′_4 / (N M)) 1_N 1_N^T X 1_M 1_M^T + w′_5 1_N 1_M^T ),    (3)

where 1_N = [1, ..., 1]^T is the all-ones vector of length N and w′_1, ..., w′_5 ∈ R.

This parameter-sharing translates to summing or averaging elements across rows and columns; more generally, PE is preserved by any commutative pooling operation.
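To make Equation (3) concrete, here is a minimal single-channel NumPy sketch of the dense layer (ours, not the released implementation), together with a numerical check of the PE property of Definition 2.1; w[0], ..., w[4] play the roles of w′_1, ..., w′_5.

```python
import numpy as np

def exchangeable_matrix_layer(X, w, sigma=np.tanh):
    # X: (N, M) dense input; w: length-5 parameter vector playing the roles of w'_1..w'_5.
    # Implements Eq. (3): per-entry term, column means, row means, global mean, bias.
    col_mean = X.mean(axis=0, keepdims=True)   # (1, M): pooled over rows
    row_mean = X.mean(axis=1, keepdims=True)   # (N, 1): pooled over columns
    all_mean = X.mean()                        # scalar: pooled over everything
    return sigma(w[0] * X + w[1] * col_mean + w[2] * row_mean + w[3] * all_mean + w[4])

# Numerical check of permutation equivariance (Definition 2.1).
rng = np.random.default_rng(0)
X, w = rng.normal(size=(4, 5)), rng.normal(size=5)
P = np.eye(4)[rng.permutation(4)]              # row permutation G^(N)
Q = np.eye(5)[rng.permutation(5)]              # column permutation G^(M)
assert np.allclose(exchangeable_matrix_layer(P @ X @ Q, w),
                   P @ exchangeable_matrix_layer(X, w) @ Q)
```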
Moreover, stacking multiple layers with the same equivariance properties preserves equivariance (Ravanbakhsh et al., 2017). This allows us to build deep PE models.

Multiple Input–Output Channels. Equation (3) describes the layer as though it has a single input and a single output matrix. However, as with convolution, we may have K input and O output channels. We use superscripts X^{⟨k⟩} and Y^{⟨o⟩} to denote such channels. Cross-channel interactions are fully connected — that is, we have four unique parameters w_1^{⟨k,o⟩}, ..., w_4^{⟨k,o⟩} for each combination of input–output channels; the bias parameter w_5^{⟨o⟩} does not depend on the input channel. As with convolution, the number of channels provides a tuning knob for the expressive power of the model. In this setting, the scalar output Y^{⟨o⟩}_{n,m} is given by

    Y^{⟨o⟩}_{n,m} = σ( Σ_{k=1}^{K} [ w_1^{⟨k,o⟩} X^{⟨k⟩}_{n,m} + (w_2^{⟨k,o⟩}/N) Σ_{n′} X^{⟨k⟩}_{n′,m} + (w_3^{⟨k,o⟩}/M) Σ_{m′} X^{⟨k⟩}_{n,m′} + (w_4^{⟨k,o⟩}/(N M)) Σ_{n′,m′} X^{⟨k⟩}_{n′,m′} ] + w_5^{⟨o⟩} ).    (4)

Input Features for Rows and Columns. In some applications, in addition to the matrix X ∈ R^{N×M×K}, where K is the number of input channels/features, we may have additional features for rows R ∈ R^{N×K′} and/or columns C ∈ R^{M×K″}. We can preserve PE by broadcasting these row/column features over the whole matrix and treating them as additional input channels.

2.1. Sparse Inputs

Real-world arrays are often extremely sparse. Indeed, matrix and tensor completion is only meaningful with missing entries. Fortunately, the equivariance properties of Theorem 2.1 continue to hold when we only consider the non-zero (observed) entries. For sparse matrices, we continue to use the same parameter-sharing scheme across rows and columns, with the only difference being that we limit the model to observed entries. We now make this precise.

Let X ∈ R^{N×M×K} be a sparse exchangeable array with K channels, where all the channels for each row–column pair ⟨n, m⟩ are either fully observed or completely missing. Let I denote the set of such non-zero indices, let R_n = {m ∣ (n, m) ∈ I} be the non-zero entries in the n-th row of X, and let C_m be the non-zero entries of its m-th column. For this sparse matrix, the pooled terms in the layer of (4) are adapted as one would expect:

    (w_2^{⟨k,o⟩}/N) Σ_{n′} X^{⟨k⟩}_{n′,m}          →   (w_2^{⟨k,o⟩}/∣C_m∣) Σ_{n′∈C_m} X^{⟨k⟩}_{n′,m}
    (w_3^{⟨k,o⟩}/M) Σ_{m′} X^{⟨k⟩}_{n,m′}          →   (w_3^{⟨k,o⟩}/∣R_n∣) Σ_{m′∈R_n} X^{⟨k⟩}_{n,m′}
    (w_4^{⟨k,o⟩}/(N M)) Σ_{n′,m′} X^{⟨k⟩}_{n′,m′}   →   (w_4^{⟨k,o⟩}/∣I∣) Σ_{(n′,m′)∈I} X^{⟨k⟩}_{n′,m′}

3. Matrix Factorization and Completion

Recommender systems are very widely applied, with many modern applications suggesting new items (e.g., movies, friends, restaurants, etc.) to users based on previous ratings of other items. The core underlying problem is naturally posed as a matrix completion task: each row corresponds to a user and each column corresponds to an item; the matrix has a value for each rating given to an item by a given user; the goal is to fill in the missing values of this matrix.

In Section 3.1 we review the two types of analysis — transductive and inductive — that arise when dealing with exchangeable matrices. Section 3.2 introduces two architectures: a self-supervised model — a simple composition of exchangeable matrix layers — that is trained to reproduce randomly removed entries of the observed matrix during training; and a factorized model that uses an auto-encoding nonlinear factorization scheme. While there are innumerable methods for (nonlinear) matrix factorization and completion, both of these models are, to our knowledge, the first to generalize to inductive settings while achieving competitive performance in transductive settings. Section 3.3 introduces two subsampling techniques for large sparse matrices, followed by a literature review in Section 3.4.
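Both architectures below are built from the sparse exchangeable layer of Section 2.1. The following single-channel NumPy sketch (ours, illustrative only) shows how its pooled terms can be computed directly from the list of observed entries, using the normalizations ∣C_m∣, ∣R_n∣ and ∣I∣ given above.

```python
import numpy as np

def sparse_exchangeable_layer(rows, cols, vals, w, sigma=np.tanh):
    # rows, cols: index arrays of the observed entries I; vals: their values (one channel).
    # w: length-5 parameter vector (w1..w4 plus bias w5), as in the sparse form of Eq. (4).
    n_rows, n_cols = rows.max() + 1, cols.max() + 1
    row_sum = np.zeros(n_rows); np.add.at(row_sum, rows, vals)
    col_sum = np.zeros(n_cols); np.add.at(col_sum, cols, vals)
    row_cnt = np.bincount(rows, minlength=n_rows).clip(min=1)
    col_cnt = np.bincount(cols, minlength=n_cols).clip(min=1)
    out = (w[0] * vals
           + w[1] * (col_sum / col_cnt)[cols]    # mean over observed entries in each column (n' in C_m)
           + w[2] * (row_sum / row_cnt)[rows]    # mean over observed entries in each row (m' in R_n)
           + w[3] * vals.mean()                  # mean over all observed entries (|I| normalization)
           + w[4])                               # bias
    return sigma(out)

# Example on toy sparse data.
rng = np.random.default_rng(0)
rows, cols = rng.integers(0, 50, 400), rng.integers(0, 40, 400)
vals, w = rng.normal(size=400), rng.normal(size=5)
out = sparse_exchangeable_layer(rows, cols, vals, w)   # one output value per observed entry
```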
3.1. Inductive and Transductive Analysis

In matrix completion, during training we are given a sparse input matrix X_tr with observed entries I_tr = {(n, m)}. At test time, we are given X_ts with observed entries I_ts = {(n′, m′)}, and we are interested in predicting (some of) the missing entries of X_ts, identified by an index set I′_ts. In the transductive or matrix interpolation setting, I_tr and I_ts have overlapping rows and/or columns — that is, at least one of {m ∣ (n, m) ∈ I_tr} ∩ {m′ ∣ (n′, m′) ∈ I_ts} ≠ ∅ or {n ∣ (n, m) ∈ I_tr} ∩ {n′ ∣ (n′, m′) ∈ I_ts} ≠ ∅ holds. In fact, X_tr and X_ts are often identical. Conversely, in the inductive or matrix extrapolation setting, we are interested in making predictions about completely unseen entries: the training and test row/column indices are completely disjoint. We will even consider cases where X_tr and X_ts are completely different datasets — e.g., movie ratings vs. music ratings. The same distinction applies in matrix factorization: training a model to factorize a particular given matrix is transductive, while factorizing unseen matrices after training is inductive.

3.2. Architectures

Self-Supervised Exchangeable Model. When the task is matrix completion, a simple deep yet PE model is a composition of exchangeable matrix layers: the function f_ss : R^{N×M×K} → R^{N×M×K} is simply a stack of such layers. Given the matrix X with observed entries I, we divide I = I_in ∪ I_pr into disjoint input and prediction entries. We then train f_ss(X_in) to predict the prediction entries X_pr.

Figure 3. Factorized exchangeable autoencoder. The encoder maps from the input tensor to an embedding layer of row/column factors via one or more hidden layers. The decoder attempts to reconstruct the input using the factors via one or more hidden layers.

Factorized Exchangeable Autoencoder (FEA). Our factorized autoencoder is composed of an encoder and a decoder. The encoder f_enc : R^{N×M×K} → R^{K_N×N} × R^{K_M×M} maps the (sparse) input matrix X ∈ R^{N×M×K} to a row factor Z_N ∈ R^{K_N×N} and a column factor Z_M ∈ R^{K_M×M}. To do so, the encoder uses a composition of exchangeable matrix layers. The output of the final layer, Y^l ∈ R^{N×M×K_l}, is pooled across rows and columns (and optionally passed through a feed-forward layer) to produce the latent factors Z_N and Z_M. The decoder g_dec : R^{K_N×N} × R^{K_M×M} → R^{N×M×K} also uses a composition of exchangeable matrix layers, and reconstructs the input matrix X from the factors. The optimization objective is to minimize the reconstruction error ℓ(X, g_dec(f_enc(X))), similar to classical autoencoders. The procedure is also analogous to classical matrix factorization, with the important distinction that once trained, we can readily factorize unseen matrices without performing any optimization. Note that the same architecture trivially extends to tensor factorization, where we use an exchangeable tensor layer (see Section 5).

Channel Dropout. Both the factorized autoencoder and the self-supervised exchangeable model are flexible enough to make regularization important for good generalization performance. Dropout (Srivastava et al., 2014) can be extended to exchangeable matrix layers by noticing that each channel in an exchangeable matrix layer is analogous to a single unit in a standard feed-forward network. We therefore randomly drop out whole channels during training (as opposed to dropping out individual elements). This procedure is equivalent to the SpatialDropout technique used in convolutional networks (Tompson et al., 2015).
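A minimal sketch of channel-wise dropout for an exchangeable matrix layer output (ours; assumes an (N, M, K) layout with channels on the last axis):

```python
import numpy as np

def channel_dropout(Y, drop_prob, rng, training=True):
    # Y: (N, M, K) output of an exchangeable matrix layer.
    # Drops whole channels (last axis) rather than individual elements,
    # analogous to SpatialDropout; inverted scaling keeps expectations unchanged.
    if not training or drop_prob == 0.0:
        return Y
    keep = (rng.random(Y.shape[-1]) >= drop_prob).astype(Y.dtype)
    return Y * keep / (1.0 - drop_prob)

# Example: drop half of the 256 channels of a hidden representation.
rng = np.random.default_rng(0)
H = rng.normal(size=(10, 12, 256))
H_dropped = channel_dropout(H, drop_prob=0.5, rng=rng)
```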
3.3. Subsampling in Large Matrices

A key practical challenge arising from our approach is that our models are designed to take the whole data matrix X as input, and will make different (and typically worse) predictions if given only a subset of the data matrix. As datasets grow, the model and input data become too large to fit within fixed-size GPU memory. This is problematic both during training and at evaluation time, because our models rely on aggregating shared representations across data points to make their predictions. To address this, we explore two subsampling procedures.

Uniform sampling. The simplest approach is to sample submatrices of X by sampling uniformly from its (typically sparse) observed elements. This has the advantage that each submatrix is an unbiased sample of the full matrix and that the procedure is computationally cheap, but it can limit the performance of the model because the relationships between the elements of X are sparsified.

Conditional sampling. Rather than sparsifying interactions between all set members, we can pick a subset of rows and columns and maintain all of their interactions; see Figure 4. This procedure is unbiased as long as each element (n, m) ∈ I has the same probability of being sampled. To achieve this, we first sample a subset of rows N′ ⊆ N = {1, ..., N} from the marginal P(n) := ∣R_n∣ / ∣I∣, followed by subsampling of columns using the marginal distribution over the columns within the selected rows: P(m ∣ N′) := ∣{(n, m) ∈ I ∣ n ∈ N′}∣ / ∣{(n, m′) ∈ I ∣ n ∈ N′, m′ ∈ M}∣. This gives us a set of columns M′ ⊆ M. We consider any observation within N′ × M′ as our subsample: I_sample := {(n, m) ∈ I ∣ n ∈ N′, m ∈ M′}. This sampling procedure is more expensive than uniform sampling, as we have to calculate conditional marginal distributions for each set of samples.

Figure 4. Uniform sampling (left) selects samples (red) uniformly from the non-zero indices of the matrix X, while conditional sampling (right) first samples a set of rows (shown in orange) from the row marginal distribution (green) and then selects samples from the resulting column conditional distribution.

Sampling at test time. At training time, all we must ensure is that we sample in an unbiased way; at test time, however, we need to ensure that the test indices I_ts of the large test matrix X_ts are all sampled. Fortunately, by the coupon collector's lemma, in expectation we only need to repeat random sampling of indices log(∣I_ts∣) times more (≈ 10× in practice) than an ideal partitioning of I_ts in order to cover all the relevant indices.
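A minimal sketch of the conditional sampling procedure (ours, not the released implementation): rows are drawn with probability proportional to ∣R_n∣/∣I∣, columns from the conditional distribution within those rows, and every observation in the resulting row–column product is kept. The finer points of unbiasedness under sampling without replacement are glossed over here.

```python
import numpy as np

def conditional_sample(rows, cols, n_rows_sample, n_cols_sample, rng):
    # rows, cols: arrays of observed indices I.
    row_counts = np.bincount(rows)
    row_ids = rng.choice(len(row_counts), size=n_rows_sample, replace=False,
                         p=row_counts / row_counts.sum())       # P(n) ~ |R_n| / |I|
    in_rows = np.isin(rows, row_ids)
    col_counts = np.bincount(cols[in_rows], minlength=cols.max() + 1)
    col_ids = rng.choice(len(col_counts), size=n_cols_sample, replace=False,
                         p=col_counts / col_counts.sum())        # P(m | N')
    keep = in_rows & np.isin(cols, col_ids)
    return np.flatnonzero(keep)   # positions of the sampled observations within I

# Example usage on a toy index set.
rng = np.random.default_rng(0)
rows = rng.integers(0, 100, size=5000)
cols = rng.integers(0, 80, size=5000)
sample_idx = conditional_sample(rows, cols, n_rows_sample=20, n_cols_sample=15, rng=rng)
```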
3.4. Related Literature

The literature on matrix factorization and completion is vast. The classical approach to the matrix completion problem is to find some low-rank (or sparse) approximation that minimizes a reconstruction loss on the observed ratings (see, e.g., Candès & Recht, 2009; Koren et al., 2009; Mnih & Salakhutdinov, 2008). Because these procedures learn embeddings for each user and item to make predictions, they are transductive, meaning they can only make predictions about users and items observed during training. To our knowledge this is also true of all recent deep factorization and other collaborative filtering techniques (e.g., Salakhutdinov et al., 2007; Deng et al.; Sedhain et al., 2015; Wang et al., 2015; Li et al., 2015; Zheng et al., 2016; Dziugaite & Roy, 2015). An exception is recent work by Yang et al. (2016) that extends factorization-style approaches to the inductive setting (where predictions can be made on unseen users and items); however, their method relies on additional side information to represent users and items. By contrast, our approach is able to make inductive completions on rating matrices that may differ from those observed during training without using any side information (though our approach can easily incorporate side information).

Matrix completion may also be posed as predicting edge weights in bipartite graphs (Berg et al., 2017; Monti et al., 2017). This approach builds on recent work applying convolutional neural networks to graph-structured data (Scarselli et al., 2009; Bruna et al., 2014; Duvenaud et al., 2015; Defferrard et al., 2016; Kipf & Welling, 2016; Hamilton et al., 2017). As we saw, parameter sharing in graph convolution is closely related to parameter sharing in exchangeable matrices, and is indeed a special case where w2 = w3 in Equation (2).

4. Empirical Results

For reproducibility we have released TensorFlow and PyTorch implementations of our model.² Details on training and architectures appear in the appendix. Section 4.1 reports experimental results in the standard transductive (matrix interpolation) setting. More interesting results are reported in Section 4.2, where we test a trained deep model on a completely different dataset. Finally, Section 4.3 compares the model's performance under different sampling procedures. The datasets used in our experiments are summarized in Table 1.

² TensorFlow: https://github.com/mravanba/deep_exchangeable_tensors. PyTorch: https://github.com/jhartford/AutoEncSets.

Dataset           Users     Items      Ratings   Sparsity
MovieLens 100K      943      1682      100,000     6.30%
MovieLens 1M       6040      3706    1,000,209     4.47%
Flixster           3000      3000       26,173     0.291%
Douban             3000      3000      136,891     1.521%
Yahoo Music        3000      3000        5,335     0.059%
Netflix         480,189    17,770  100,480,507     1.178%

Table 1. Number of users, items and ratings for the datasets used in our experiments. The MovieLens datasets are standard (Harper & Konstan, 2015), as is Netflix; for Flixster, Douban and Yahoo Music we used the 3000 × 3000 submatrices presented by Monti et al. (2017) for comparison purposes.

4.1. Transductive Setting (Matrix Interpolation)

Here, we demonstrate that exploiting the PE structure of exchangeable matrices allows us to achieve results competitive with the state of the art, while maintaining a constant number of parameters. Note that the number of parameters in all the competing methods grows with N and/or M. In Table 2 we report that the self-supervised exchangeable model achieves state-of-the-art performance on the MovieLens-100k dataset. For the MovieLens-1M dataset, we cannot fit the whole dataset into GPU memory for training and therefore use conditional subsampling; see also Section 4.3. Our results on this dataset are summarized in Table 3. On this larger dataset, both models gave comparatively weaker performance than what we observed on the smaller ML-100k dataset and in the extrapolation results. We suspect this is largely due to memory constraints: there is a trade-off between the size of the model (in terms of the number of layers and units per layer) and the batch size one can train with. We found that both larger batches and deeper models tended to offer better performance, but on these larger datasets it is not currently possible to have both.
The results for the factorized exchangeable autoencoder architecture are similar and reported in the same table.

Model                             ML-100K RMSE
MC (Candès & Recht, 2009)                0.973
GMC (Kalofolias et al., 2014)            0.996
GRALS (Rao et al., 2015)                 0.945
sRGCNN (Monti et al., 2017)              0.929
Factorized EAE (ours)                    0.920
GC-MC (Berg et al., 2017)                0.910
Self-Supervised Model (ours)             0.910

Table 2. Comparison of RMSE scores for the MovieLens-100k dataset, based on the canonical 80/20 training/test split. Baseline numbers are taken from (Berg et al., 2017).

Model                                 ML-1M RMSE
PMF (Mnih & Salakhutdinov, 2008)           0.883
Self-Supervised Model (ours)               0.863
Factorized EAE (ours)                      0.860
I-RBM (Salakhutdinov et al., 2007)         0.854
BiasMF (Koren et al., 2009)                0.845
NNMF (Dziugaite & Roy, 2015)               0.843
LLORMA-Local (Lee et al., 2013)            0.833
GC-MC (Berg et al., 2017)                  0.832
I-AUTOREC (Sedhain et al., 2015)           0.831
CF-NADE (Zheng et al., 2016)               0.829

Table 3. Comparison of RMSE scores for the MovieLens-1M dataset on a random 90/10 training/test split. Baseline numbers are taken from (Berg et al., 2017).

4.2. Inductive Setting (Matrix Extrapolation)

Because our model does not rely on any per-user or per-movie parameters, it should be able to generalize to new users and movies that were not present during training. We tested this by training an exchangeable factorized autoencoder on the MovieLens-100k dataset and then evaluating it on a subsample of data from the MovieLens-1M dataset. At test time, the model was shown a portion of the new ratings and then asked to make predictions on the remaining ratings. Figure 5 summarizes the results: we vary the amount of data shown to the model from 5% of the new ratings up to 95% and compare against K-nearest-neighbour approaches. Our model significantly outperforms the baselines in this task, and performance degrades gracefully as we reduce the amount of data observed.

Figure 5. Evaluation of our model's ability to generalize. We trained on ML-100k and evaluated on a subset of the ML-1M data. At evaluation time, p% of the ML-1M data was treated as observed and the model was required to complete the remaining (1 − p)% (p varied from 5% to 95%). The model outperforms nearest-neighbour approaches for all values of p and degrades gracefully to the baseline of predicting the mean in the small-data case.

Extrapolation to new datasets. Perhaps most surprisingly, we were able to achieve competitive results when training and testing on completely disjoint datasets. For this experiment we stress-tested our model's inductive ability by testing how it generalizes to new datasets without retraining. We used a Factorized Exchangeable Autoencoder that was trained to make predictions on the MovieLens-100k dataset and tasked it with making predictions on the Flixster, Douban and YahooMusic datasets. We then evaluated its performance against models trained on each of these individual datasets. All the datasets involve rating prediction tasks, so they share some similar semantics with MovieLens, but they have different user bases and (in the case of YahooMusic) deal with music rather than movie ratings, so we may expect some change in the rating distributions and user–item interactions. Furthermore, the Flixster ratings are in 0.5-point increments from 1–5 and YahooMusic allows ratings from 1–100, while Douban and MovieLens ratings are on a 1–5 scale.
To account for the different rating scales, we simply binned the inputs to our model to a 1–5 range and, where applicable, linearly rescaled the output before comparing it to the true rating.³ Despite all of this, Table 4 shows that our model achieves very competitive results compared with models that were trained for the task. For comparison, we also include the performance of a Factorized EAE trained on the respective datasets.

³ Because of this binning procedure, our model received input data that is considerably coarser-grained than that which was used for the comparison models.

Model                                 Flixster   Douban   YahooMusic   Netflix
GRALS (Rao et al., 2015)                 1.313    0.833         38.0         –
sRGCNN (Monti et al., 2017)              1.179    0.801         22.4         –
GC-MC (Berg et al., 2017)                0.941    0.734         20.5         –
Factorized EAE (ours)                    0.908    0.738         20.0         –
Factorized EAE (trained on ML-100k)      0.987    0.766         23.3     0.918
Netflix Baseline                             –        –            –     0.951
PMF (Mnih & Salakhutdinov, 2008)             –        –            –     0.897
LLORMA-Local (Lee et al., 2013)              –        –            –     0.834
I-AUTOREC (Sedhain et al., 2015)             –        –            –     0.823
CF-NADE (Zheng et al., 2016)                 –        –            –     0.803

Table 4. Evaluation of our model's ability to generalize across datasets (RMSE). We trained a factorized model on ML-100k and then evaluated it on four new datasets. Results for the smaller datasets are taken from (Berg et al., 2017). Netflix Baseline shows the state of the art prior to the Netflix Challenge.

Training a Factorized EAE on the respective datasets improves over previous state-of-the-art results on the Flixster and YahooMusic datasets and gives very similar performance to Berg et al. (2017)'s GC-MC model on the Douban dataset. Interestingly, we see the largest performance gains over existing approaches on the datasets in which ratings are sparse (see Table 1).

4.3. Comparison of Sampling Procedures

We evaluated the effect of subsampling the input matrix X on performance for the MovieLens-100k dataset. The results are summarized in Figure 6. The two key findings are: (i) even with large batch sizes of 20 000 examples, performance for both sampling methods is diminished compared to the full-batch case — we suspect that our models' weaker results on the larger ML-1M dataset may be partially attributed to the need to subsample; and (ii) the conditional sampling method was able to recover some of the diminished performance. We believe it is likely that more sophisticated sampling schemes which explicitly take into account the dependence between hidden nodes will lead to better performance, but we leave that to future work.

Figure 6. Performance difference between sampling methods on ML-100k. The two sampling methods use minibatches of size 20 000, while the full-batch method used all 75 000 training examples. Note that the y-axis does not begin at 0.

5. Extension to Tensors

In this section we generalize the exchangeable matrix layer to higher-dimensional arrays (tensors) and formalize its optimality. Suppose X ∈ R^{N_1×...×N_D} is our D-dimensional data tensor, and vec(X) its vectorized form. We index vec(X) by tuples (n_1, n_2, ..., n_D), corresponding to the original dimensions of X. The precise method of vectorization is irrelevant, provided it is used consistently. Let N = ∏_i N_i and let n = (n_i, n_{-i}) be an element of N_1 × ... × N_D such that n_i is the value of the i-th entry of n, and n_{-i} the values of the remaining entries (where it is understood that the ordering of the elements of n is unchanged).

Figure 7. Pooling structure implied by the tied weights for matrices (left) and 3D tensors (right). The pink cube highlights one element of the output. It is calculated as a function of the corresponding element from the input (dark blue), pooled aggregations over the rows and columns of the input (green and yellow), and a pooled aggregation over the whole input matrix (red). In the tensor case (right), we pool over all sub-tensors (orange and purple sub-matrices, green sub-vectors, and the red scalar). For clarity, the output connections are not shown in the tensor case.
We seek a layer that is equivariant only to certain, meaningful permutations of vec(X). This motivates our definition of an exchangeable tensor layer in a manner that is completely analogous to Definition 2.1 for matrices. For any positive integer N, let S^(N) denote the symmetric group of all permutations of N objects. Then S^(N_1) × ... × S^(N_D) refers to the product group of all permutations of N_1 through N_D objects, while S^(N) refers to the group of all permutations of N = ∏_i N_i objects. So S^(N_1) × ... × S^(N_D) is a proper subgroup of S^(N), having ∏_i (N_i!) members compared to (∏_i N_i)! members in S^(N). We want a layer that is equivariant to only those permutations in S^(N_1) × ... × S^(N_D), but not to any other member of S^(N).

Definition 5.1 (exchangeable tensor layer). Let g^(N) ∈ S^(N_1) × S^(N_2) × ... × S^(N_D) and let G^(N) be the corresponding permutation matrix. Then the neural layer vec^{-1}(σ(W vec(X))), with X ∈ R^{N_1×...×N_D} and W ∈ R^{N×N}, is an exchangeable tensor layer if permuting the elements of the input along any set of axes results in the same permutation of the output tensor:

    G^(N) σ(W vec(X)) = σ(W G^(N) vec(X))   for all X,    (5)

and moreover, for any other permutation of the elements of X (i.e., permutations that are not along axes), there exists an input X for which this equality does not hold.

The following theorem asserts that a simple parameter-tying scheme achieves the desired permutation equivariance for tensor-structured data.

Theorem 5.1. The layer Y = vec^{-1}(σ(W vec(X))), where σ is any strictly monotonic, element-wise nonlinearity (e.g., sigmoid), is an exchangeable tensor layer iff the parameter matrix W ∈ R^{N×N} is defined as follows: for each S ⊆ [D] = {1, ..., D}, define a distinct parameter w_S ∈ R, and tie the entries of W as

    W_{n,n′} := w_S   s.t.   n_i = n′_i ⟺ i ∈ S.    (6)

That is, the (n, n′)-th element of W is uniquely determined by the set of indices at which n and n′ are equal. In the special case that X ∈ R^{N_1×N_2} is a matrix, this gives the formulation of W described in Section 2.
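To make the weight-tying of Theorem 5.1 concrete, the following NumPy sketch (ours, illustrative only) implements the corresponding pooled form: one tied weight w_S per subset S ⊆ [D] of axes on which input and output indices agree, with each term averaging the input over the remaining axes and broadcasting the result (cf. Figure 7), plus a bias.

```python
import itertools
import numpy as np

def exchangeable_tensor_layer(X, weights, bias=0.0, sigma=np.tanh):
    # X: D-dimensional array. `weights` maps each frozenset S of axis indices
    # (the axes on which input and output indices agree) to a scalar parameter.
    # Each term averages X over the axes NOT in S and broadcasts the result.
    D = X.ndim
    out = np.zeros_like(X, dtype=float)
    for r in range(D + 1):
        for S in itertools.combinations(range(D), r):
            pooled_axes = tuple(ax for ax in range(D) if ax not in S)
            term = X.mean(axis=pooled_axes, keepdims=True) if pooled_axes else X
            out = out + weights[frozenset(S)] * term
    return sigma(out + bias)

# Example: a 3D tensor uses 2**3 = 8 tied weights plus a bias.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5, 6))
w = {frozenset(S): rng.normal()
     for r in range(4) for S in itertools.combinations(range(3), r)}
Y = exchangeable_tensor_layer(X, w)
```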
Theorem 5.1 says that a layer constructed in this manner is equivariant with respect to only those permutations of N objects that correspond to permutations of sub-tensors along the D dimensions of the input. The proof is in the Appendix. Equivariance with respect to permutations in S^(N_1) × S^(N_2) × ... × S^(N_D) follows as a simple corollary of Theorem 2.1 in (Ravanbakhsh et al., 2017). Non-equivariance with respect to other permutations follows from the propositions below.

Proposition 5.2. Let g^(N) ∈ S^(N) be an "illegal" permutation of the elements of the tensor X — that is, g^(N) ∉ S^(N_1) × S^(N_2) × ... × S^(N_D). Then there exists a dimension i ∈ [D] such that, for some n_i, n′_i, n_{-i}, n′_{-i}:

    g^(N)((n_i, n_{-i})) = (n′_i, n_{-i}),   but   g^(N)((n_i, n′_{-i})) ≠ (n′_i, n′_{-i}).

If we consider the sub-tensor of X obtained by fixing the value of the i-th dimension to n_i, we expect a "legal" permutation to move this whole sub-tensor to some n′_i (it could additionally shuffle the elements within this sub-tensor). Proposition 5.2 asserts that an "illegal" permutation g^(N) ∉ S^(N_1) × S^(N_2) × ... × S^(N_D) is guaranteed to violate this constraint for some dimension/sub-tensor combination. The next proposition asserts that whenever we can identify such an inconsistency in the permutation of sub-tensors, we can identify an entry of G^(N) W that differs from W G^(N), and therefore, for some input tensor X, the equivariance property is violated — i.e., G^(N) σ(W vec(X)) ≠ σ(W G^(N) vec(X)).

Proposition 5.3. Let g^(N) ∈ S^(N), with G^(N) ∈ {0,1}^{N×N} the corresponding permutation matrix. Suppose g^(N) ∉ S^(N_1) × S^(N_2) × ... × S^(N_D), and let W ∈ R^{N×N} be as defined above. If there exists an i ∈ [D] and some n_i, n′_i, n_{-i}, n′_{-i} such that

    g^(N)((n_i, n_{-i})) = (n′_i, n_{-i})   and   g^(N)((n_i, n′_{-i})) ≠ (n′_i, n′_{-i}),

then

    (G^(N) W)_{(n′_i, n′_{-i}), (n_i, n_{-i})} ≠ (W G^(N))_{(n′_i, n′_{-i}), (n_i, n_{-i})}.

Proposition 5.3 makes explicit a particular element at which the products G^(N) W and W G^(N) will differ, provided g^(N) is not of the desired form. Theorem 5.1 shows that this parameter-sharing scheme produces a layer that is equivariant to exactly those permutations we desire, and moreover, it is optimal in the sense that any layer having fewer ties in its parameters (i.e., more parameters) would fail to be equivariant.

Conclusion

This paper has considered the problem of predicting relationships between the elements of two or more distinct sets of objects, where the data can also be expressed as an exchangeable matrix or tensor. We introduced a weight-tying scheme enabling the application of deep models to this type of data. We proved that our scheme always produces permutation equivariant models and that no increase in model expressiveness is possible without violating this guarantee. Experimentally, we showed that our models achieve state-of-the-art or competitive performance on widely studied matrix completion benchmarks. Notably, our models achieved this performance despite having a number of parameters independent of the size of the matrix to be completed, unlike all other approaches that offer strong performance. Also unlike these other approaches, our models can achieve competitive results for the problem of matrix extrapolation: asking an already-trained model to complete a new matrix involving new objects that were unobserved at training time. Finally, we observed that our methods were surprisingly strong on various transfer learning tasks: extrapolating from MovieLens ratings to Flixster, Douban, and YahooMusic ratings. All of these contained different user populations and different distributions over the objects being rated; indeed, in the YahooMusic dataset, the underlying objects were not even of the same kind as those rated in the training data.

Acknowledgments

We thank the anonymous reviewers for their constructive feedback. This research was enabled in part by support provided by an NSERC Discovery Grant, WestGrid and Compute Canada.

References

Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.

Berg, R. v. d., Kipf, T. N., and Welling, M. Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263, 2017.

Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral networks and locally connected networks on graphs. ICLR, 2014.
Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 2009.

Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852, 2016.

Deng, Z., Navarathna, R., Carr, P., Mandt, S., Yue, Y., Matthews, I., and Mori, G. Factorized variational autoencoders for modeling audience reactions to movies.

Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, 2015.

Dziugaite, G. K. and Roy, D. M. Neural network matrix factorization. arXiv preprint arXiv:1511.06443, 2015.

Getoor, L. and Taskar, B. Introduction to Statistical Relational Learning. MIT Press, 2007.

Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 2017.

Harper, F. M. and Konstan, J. A. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 2015.

Hartford, J. S., Wright, J. R., and Leyton-Brown, K. Deep learning for predicting human strategic behavior. In Advances in Neural Information Processing Systems, pp. 2424–2432, 2016.

Kalofolias, V., Bresson, X., Bronstein, M., and Vandergheynst, P. Matrix completion on graphs. arXiv preprint arXiv:1408.1717, 2014.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. 2009.

Lee, J., Kim, S., Lebanon, G., and Singer, Y. Local low-rank matrix approximation. In Dasgupta, S. and McAllester, D. (eds.), Proceedings of the 30th International Conference on Machine Learning (ICML), volume 28 of Proceedings of Machine Learning Research, pp. 82–90, 2013.

Li, S., Kawale, J., and Fu, Y. Deep collaborative filtering via marginalized denoising auto-encoder. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pp. 811–820. ACM, 2015.

Mnih, A. and Salakhutdinov, R. R. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, 2008.

Monti, F., Bronstein, M., and Bresson, X. Geometric matrix completion with recurrent multi-graph neural networks. In Advances in Neural Information Processing Systems, 2017.

Orbanz, P. and Roy, D. M. Bayesian models of graphs, arrays and other exchangeable random structures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.

Raedt, L. D., Kersting, K., Natarajan, S., and Poole, D. Statistical relational artificial intelligence: Logic, probability, and computation. Synthesis Lectures on Artificial Intelligence and Machine Learning, 10(2):1–189, 2016.

Rao, N., Yu, H.-F., Ravikumar, P. K., and Dhillon, I. S. Collaborative filtering with graph information: Consistency and scalable methods. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 2107–2115, 2015.

Ravanbakhsh, S., Schneider, J., and Poczos, B. Equivariance through parameter-sharing. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of JMLR: WCP, August 2017.

Salakhutdinov, R., Mnih, A., and Hinton, G. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, pp. 791–798. ACM, 2007.
Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

Sedhain, S., Menon, A. K., Sanner, S., and Xie, L. AutoRec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web, pp. 111–112. ACM, 2015.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 2014.

Tompson, J., Goroshin, R., Jain, A., LeCun, Y., and Bregler, C. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 648–656, 2015.

Wang, H., Wang, N., and Yeung, D.-Y. Collaborative deep learning for recommender systems. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1235–1244. ACM, 2015.

Yang, Z., Cohen, W. W., and Salakhutdinov, R. Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861, 2016.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, 2017.

Zheng, Y., Tang, B., Ding, W., and Zhou, H. A neural autoregressive approach to collaborative filtering. In Proceedings of the 33rd International Conference on Machine Learning, pp. 764–773, 2016.

A. Notation

• X ∈ R^{N_1×...×N_D}, the data tensor
• x ∈ R^{∏_i N_i}, the vectorized X, also denoted by vec(X)
• [N]: the sequence {n}_{n=1,...,N} = (1, 2, ..., N)
• [N_1 × ... × N_D]: the sequence {(n_1, ..., n_D)}_{n_1∈[N_1],...,n_D∈[N_D]} of D-dimensional tuples over N_1 × ... × N_D
• S^(N): the symmetric group of all permutations of N objects
• G^(N) = {g_i^(N)}_i: a permutation group of N objects
• g_i^(N) or G_i^(N): both can refer to the matrix form of the permutation g_i^(N) ∈ G^(N)
• N = ∏_i N_i
• n = (n_i, n_{-i}), an element of N_1 × ... × N_D
• g_i^(N)(n) refers to the result of applying g_i^(N) to n, for any n ∈ [N]
• σ: an arbitrary, element-wise, strictly monotonic, nonlinear function such as the sigmoid

B. Proofs

B.1. Proof of Proposition 5.2

Proof. Let X ∈ R^{N_1×...×N_D} be the data tensor. We prove the contrapositive by induction on D, the dimension of X. Suppose that, for all i ∈ [D] and all n_i, n′_i, n_{-i}, n′_{-i}, we have that g^(N)((n_i, n_{-i})) = (n′_i, n_{-i}) implies g^(N)((n_i, n′_{-i})) = (n′_i, n′_{-i}). This means that, for any n = (n_i, n_{-i}) = (n_1, ..., n_i, ..., n_D), the action g^(N) takes on n_i is independent of the action g^(N) takes on n_{-i}, the remaining dimensions. Thus we can write

    g^(N)(n) = g^(N_i)(n_i) × g^(N/N_i)(n_{-i}),

where it is understood that the order of the group product is maintained (this is a slight abuse of notation). If D = 2 (base case), we have g^(N)((n_1, n_2)) = g^(N_1)(n_1) × g^(N_2)(n_2), so g^(N) ∈ S^(N_1) × S^(N_2), and we are done. Otherwise, an inductive argument on g^(N/N_i) allows us to write

    g^(N)(n) = g^(N_1)(n_1) × g^(N_2)(n_2) × ... × g^(N_D)(n_D).

And so g^(N) ∈ S^(N_1) × S^(N_2) × ... × S^(N_D), completing the proof.
B.2. Proof of Proposition 5.3

Proof. First, observe that g^(N)(n) = n′ ⟺ (G^(N))_{n′,n} = 1. Now, let i ∈ [D] be such that, for some n_i, n′_i, n_{-i}, n′_{-i}, we have

    g^(N)((n_i, n_{-i})) = (n′_i, n_{-i})   and   g^(N)((n_i, n′_{-i})) ≠ (n′_i, n′_{-i}).

Then by the observation above we have

    (G^(N))_{(n′_i, n_{-i}), (n_i, n_{-i})} = 1,   and   (G^(N))_{(n′_i, n′_{-i}), (n_i, n′_{-i})} ≠ 1.

And so:

    (G^(N) W)_{(n′_i, n′_{-i}), (n_i, n_{-i})} = (G^(N))_{(n′_i, n′_{-i}), *} (W)_{*, (n_i, n_{-i})}
        = Σ_{k ∈ [N_1×...×N_D]} (G^(N))_{(n′_i, n′_{-i}), k} (W)_{k, (n_i, n_{-i})}
        ≠ W_{(n_i, n′_{-i}), (n_i, n_{-i})}.

The last line follows from the observation above and the fact that G^(N) is a permutation matrix and so has only one 1 per row. Similarly,

    (W G^(N))_{(n′_i, n′_{-i}), (n_i, n_{-i})} = (W)_{(n′_i, n′_{-i}), *} (G^(N))_{*, (n_i, n_{-i})}
        = Σ_{k ∈ [N_1×...×N_D]} (W)_{(n′_i, n′_{-i}), k} (G^(N))_{k, (n_i, n_{-i})}
        = (W)_{(n′_i, n′_{-i}), (n′_i, n_{-i})},

where again the last line follows from the above observation. Now, consider W_{(n_i, n′_{-i}), (n_i, n_{-i})} and W_{(n′_i, n′_{-i}), (n′_i, n_{-i})}. Observe that (n_i, n′_{-i}) differs from (n_i, n_{-i}) at exactly the same indices that (n′_i, n′_{-i}) differs from (n′_i, n_{-i}). Let S ⊆ [D] be the set of indices at which n_{-i} differs from n′_{-i}. We therefore have

    W_{(n_i, n′_{-i}), (n_i, n_{-i})} = W_{(n′_i, n′_{-i}), (n′_i, n_{-i})} = w_S,

which completes the proof.

B.3. Proof of Theorem 5.1

Proof. We prove both directions.

(⇐) Suppose W has the form given by (6). We must show the layer is equivariant only with respect to permutations in S^(N_1) × ... × S^(N_D):

• Equivariance: Let g^(N) ∈ S^(N_1) × ... × S^(N_D), and let G^(N) be the corresponding permutation matrix. Then a simple extension of Theorem 2.1 in (Ravanbakhsh et al., 2017) implies G^(N) W vec(X) = W G^(N) vec(X) for all X, and thus the layer is equivariant. Intuitively, if g^(N) ∈ S^(N_1) × ... × S^(N_D), we can "decompose" g^(N)(n) into D permutations g^(N_1)(n_1), ..., g^(N_D)(n_D) which act independently on the D dimensions of X.

• No equivariance w.r.t. any other permutation: Let g^(N) ∈ S^(N), with g^(N) ∉ S^(N_1) × ... × S^(N_D), and let G^(N) be the corresponding permutation matrix. Using Propositions 5.2 and 5.3 we have G^(N) W ≠ W G^(N). So there exists an index at which these two matrices differ; call it (n, n′). Then if we take vec(X) to be the vector of all 0's with a single 1 in position n′, we will have G^(N) W vec(X) ≠ W G^(N) vec(X). Since σ is assumed to be strictly monotonic, we have σ(G^(N) W vec(X)) ≠ σ(W G^(N) vec(X)). And finally, since G^(N) is a permutation matrix and σ is applied element-wise, we have G^(N) σ(W vec(X)) ≠ σ(W G^(N) vec(X)). Therefore, the layer σ(W vec(X)) is not an exchangeable tensor layer. This proves the first direction.

(⇒) We prove the contrapositive. Suppose W_{n,n′} ≠ W_{m,m′} for some n, n′, m, m′ ∈ N_1 × ... × N_D with {i : n_i = n′_i} = {i : m_i = m′_i}. We want to show that the layer Y = vec^{-1}(σ(W vec(X))) is not equivariant to some permutation g^(N) ∈ S^(N_1) × ... × S^(N_D). We define this permutation as follows:

    g^(N)(ν) = m   if ν = n;    m′  if ν = n′;    n   if ν = m;    n′  if ν = m′;    ν   otherwise.

That is, g^(N) "swaps" n with m and n′ with m′. This is a valid permutation, first since it acts element-wise, but also since {i : n_i = n′_i} = {i : m_i = m′_i} implies that n = n′ iff m = m′ (and so g^(N) is injective, and thus bijective).
So if G^(N) is the permutation matrix of g^(N), then we have (G^(N))_{m,n} = (G^(N))_{m′,n′} = 1, and:

    (G^(N) W)_{(m,n′)} = Σ_{k ∈ [N_1×...×N_D]} (G^(N))_{(m,k)} W_{(k,n′)} = W_{(n,n′)}
                       ≠ W_{(m,m′)} = Σ_{k ∈ [N_1×...×N_D]} W_{(m,k)} (G^(N))_{(k,n′)} = (W G^(N))_{(m,n′)}.

And so G^(N) W ≠ W G^(N), and by an argument identical to the above, with an appropriate choice of X, this layer does not satisfy the requirements of an exchangeable tensor layer. This completes the proof.

B.4. Proof of Theorem 2.1

A simple reparameterization allows us to write the matrix W of (6) in the form of (3). Thus Theorem 2.1 is just the special case of Theorem 5.1 where D = 2, and the proof follows from that.

C. Details of Architecture and Training

Self-Supervised Model. We train a simple feed-forward network with 9 exchangeable matrix layers using a leaky ReLU activation function. Each hidden layer has 256 channels, and we apply channel-wise dropout with probability 0.5 after the first through seventh layers. We found this channel-wise dropout to be crucial to achieving good performance. Before the input layer, we mask out a proportion of the ratings by setting their values to 0 uniformly at random with probability 0.15. We convert the input ratings to one-hot vectors and interpret the model output as a probability distribution over potential rating levels. We tuned hyper-parameters by training on 75% of the data and evaluating on a 5% validation set. We test this model using the canonical u1.base/u1.test training/test split, which reserves 20% of the ratings for testing. For the MovieLens-1M dataset, we use the same architecture as for ML-100k and trained on 85% of the data, validating on 5% and reserving 10% for testing. The limited size of GPU memory becomes an issue for this larger dataset, so we had to employ conditional sampling for training. At validation time we used full-batch predictions on the CPU in order to avoid memory issues.

Factorized Exchangeable Autoencoder Model. We use three exchangeable matrix layers for the encoder. The first two have 220 channels, and the third layer maps the input to 100 features for each entry, with no activation function applied. This is followed by mean pooling along both dimensions of the input. Thus, each user and movie is encoded into a length-100 vector of real-valued latent features. The decoder uses five similar exchangeable matrix layers, with the final layer having five channels. We apply channel-wise dropout with probability 0.5 after the third and fourth layers, which we again found to be crucial for good performance. We convert the input ratings to one-hot vectors and optimize using a cross-entropy loss.
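As a concrete illustration of the pooling step described above, here is a minimal NumPy sketch (ours, not the released code) of how the encoder's final feature map can be reduced to per-row and per-column factors, plus one plausible way to broadcast those factors back over the grid as decoder input; the exact decoder interface shown here is an assumption for illustration only.

```python
import numpy as np

def encode_factors(Y_final):
    # Y_final: (N, M, K) output of the encoder's last exchangeable matrix layer
    # (K = 100 in the configuration above). Mean-pooling over columns gives one
    # K-dimensional factor per row (user); pooling over rows gives one per column (movie).
    Z_rows = Y_final.mean(axis=1)   # (N, K)
    Z_cols = Y_final.mean(axis=0)   # (M, K)
    return Z_rows, Z_cols

def decoder_input(Z_rows, Z_cols):
    # One plausible (assumed) interface: broadcast the factors back over the N x M
    # grid and concatenate them channel-wise before the decoder's exchangeable layers.
    N, K = Z_rows.shape
    M, _ = Z_cols.shape
    return np.concatenate([np.broadcast_to(Z_rows[:, None, :], (N, M, K)),
                           np.broadcast_to(Z_cols[None, :, :], (N, M, K))], axis=-1)

# Toy shapes: 6 users, 7 movies, 100 channels out of the encoder.
Z_r, Z_c = encode_factors(np.random.default_rng(0).normal(size=(6, 7, 100)))
X_dec = decoder_input(Z_r, Z_c)   # (6, 7, 200) tensor fed into the decoder stack
```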