Mixed batches and symmetric discriminators for GAN training

Thomas Lucas*¹  Corentin Tallec*²  Jakob Verbeek¹  Yann Ollivier³

*Equal contribution. ¹Université Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France. ²Université Paris Sud, INRIA, équipe TAU, Gif-sur-Yvette, 91190, France. ³Facebook Artificial Intelligence Research, Paris, France. Correspondence to: Corentin Tallec, Thomas Lucas.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

Generative adversarial networks (GANs) are powerful generative models based on providing feedback to a generative network via a discriminator network. However, the discriminator usually assesses individual samples. This prevents the discriminator from accessing global distributional statistics of generated samples, and often leads to mode dropping: the generator models only part of the target distribution. We propose to feed the discriminator with mixed batches of true and fake samples, and train it to predict the ratio of true samples in the batch. The latter score does not depend on the order of samples in a batch. Rather than learning this invariance, we introduce a generic permutation-invariant discriminator architecture. This architecture is provably a universal approximator of all symmetric functions. Experimentally, our approach reduces mode collapse in GANs on two synthetic datasets, and obtains good results on the CIFAR10 and CelebA datasets, both qualitatively and quantitatively.

1. Introduction

Estimating generative models from unlabeled data is one of the challenges of unsupervised learning. Recently, several latent variable approaches have been proposed to learn flexible density estimators together with efficient sampling, such as generative adversarial networks (GANs) (Goodfellow et al., 2014), variational autoencoders (Kingma & Welling, 2014; Rezende et al., 2014), iterative transformation of noise (Sohl-Dickstein et al., 2015), and non-volume preserving transformations (Dinh et al., 2017). In this work we focus on GANs, currently the most convincing source of samples of natural images (Karras et al., 2018). GANs consist of a generator and a discriminator network. The generator maps samples from a latent random variable with a basic prior, such as a multivariate Gaussian, to the observation space; this defines a probability distribution over the observation space. A discriminator network is trained to distinguish between generated samples and true samples in the observation space, while the generator is trained to fool the discriminator. In an idealized setting with unbounded capacity of both networks and infinite training data, the generator should converge to the distribution from which the training data has been sampled.

In most adversarial setups, the discriminator classifies individual data samples. Consequently, it cannot directly detect discrepancies between the distribution of generated samples and global statistics of the training distribution, such as its moments or quantiles. For instance, if the generator models a restricted part of the support of the target distribution very well, it can fool the discriminator at the level of individual samples, a phenomenon known as mode dropping. In such a case there is little incentive for the generator to model other parts of the support of the target distribution. A more thorough explanation of this effect can be found in (Salimans et al., 2016).

In order to access global distributional statistics, imagine a discriminator that could somehow take full probability distributions as its input. This is impossible in practice. Still, it is possible to feed large batches of training or generated samples to the discriminator, as an approximation of the corresponding distributions. The discriminator can compute statistics on those batches and detect discrepancies between the two distributions. For instance, if a large batch exhibits only one mode of a multimodal distribution, the discriminator notices the discrepancy right away. Even though a single batch may not encompass all modes of the distribution, it still conveys more information about missing modes than an individual example.

Training the discriminator to discriminate “pure” batches containing only real or only synthetic samples makes its task too easy, as a single bad sample reveals the whole batch as synthetic. Instead, we introduce a “mixed” batch discrimination task in which the discriminator needs to predict the ratio of real samples in a batch.

This use of batches differs from traditional minibatch learning. The batch is not used as a computational trick to increase parallelism, but as an approximate distribution on which to compute global statistics.

A naive way of doing so would be to concatenate the samples in the batch, feeding the discriminator a single tensor containing all the samples. However, this is parameter-hungry, and the computed statistics are not automatically invariant to the order of samples in the batch. To compute functions that depend on the samples only through their distribution, it is necessary to restrict the class of discriminator networks to permutation-invariant functions of the batch. For this, we adapt and extend an architecture from McGregor (2007) to compute symmetric functions of the input. We show this can be done with minimal modification to existing architectures, at a negligible computational overhead w.r.t. ordinary batch processing.

Figure 1. Graphical representation of our discriminator architecture. Each convolutional layer of an otherwise classical CNN architecture is modified to include permutation-invariant batch statistics, denoted ρ(x). This is repeated at every layer so that the network gradually builds up more complex statistics.

In summary, our contributions are the following:

• Naively training the discriminator to discriminate “pure” batches with only real or only synthetic samples makes its task too easy. We introduce a discrimination loss based on mixed batches of true and fake samples that avoids this pitfall, and we derive the associated optimal discriminator.

• We provide a principled way of defining neural networks that are permutation-invariant over a batch of samples. We formally prove that the resulting class of functions comprises all symmetric continuous functions, and only symmetric functions.

• We apply these insights to GANs, with good experimental results, both qualitatively and quantitatively.

We believe that discriminating between distributions at the batch level provides an equally principled alternative to approaches to GANs based on duality formulas (Nowozin et al., 2016; Gulrajani et al., 2017; Arjovsky et al., 2017).
2. Related work

The training of generative models via distributional rather than pointwise information has been explored in several recent contributions. Batch discrimination (Salimans et al., 2016) uses a handmade layer to compute batch statistics which are then combined with sample-specific features to enhance individual sample discrimination. Karras et al. (2018) directly compute the standard deviation of features and feed it as an additional feature to the last layer of the network. Both methods use a single layer of handcrafted batch statistics, instead of letting the discriminator learn arbitrary batch statistics useful for discrimination as in our approach. Moreover, in both methods the discriminator still assesses single samples, rather than entire batches. Radford et al. (2015) reported improved results with batch normalization in the discriminator, which may also be due to reliance on batch statistics.

Other works, such as (Li et al., 2015) and (Dziugaite et al., 2015), replace the discriminator with a fixed distributional loss between true and generated samples, the maximum mean discrepancy, as the criterion to train the generative model. This has the advantage of relieving the inherent instability of GANs, but lacks the flexibility of an adaptive discriminator.

The discriminator we introduce treats batches as sets of samples. Processing sets calls for permutation-invariant networks, for which there is a large body of work, e.g., (McGregor, 2007; 2008; Qi et al., 2016; Zaheer et al., 2017; Vaswani et al., 2017). Our processing is inspired by (McGregor, 2007; 2008), which design a special kind of layer that provides the desired invariance property. The network from McGregor (2007) is a multi-layer perceptron in which the single hidden layer performs a batchwise computation that makes the result equivariant under permutation. Here we show that stacking such hidden layers and reducing the final layer with a permutation-invariant reduction covers the whole space of continuous permutation-invariant functions. Zaheer et al. (2017) first process each element of the set independently, then aggregate the resulting representation using a permutation-invariant operation, and finally process the permutation-invariant quantity. Qi et al. (2016) process 3D point cloud data, and interleave layers that process points independently with layers that apply equivariant transformations. The outputs of their networks are either permutation equivariant for point cloud segmentation, or permutation invariant for shape recognition. In our approach we stack permutation-equivariant layers that combine batch information and sample information at every level, and aggregate these in the final layer using a permutation-invariant operation. More complex approaches to permutation invariance or equivariance appear in (Guttenberg et al., 2016). We prove, however, that our simpler architecture already covers the full space of permutation-invariant functions.

Improving the training of GANs has received a lot of recent attention. For instance, Arjovsky et al. (2017), Gulrajani et al. (2017) and Miyato et al. (2018) constrain the Lipschitz constant of the network and show that this stabilizes training and improves performance. Karras et al. (2018) achieved impressive results by gradually increasing the resolution of the generated images as training progresses.

3. Adversarial learning with permutation-invariant batch features

Using a batch of samples rather than individual samples as input to the discriminator can provide global statistics about the distributions of interest. Such statistics could be useful to avoid mode dropping.

Adversarial learning (Goodfellow et al., 2014) can easily be extended to the batch discrimination case. For a fixed batch size B, the corresponding two-player optimization procedure becomes

$$\min_G \max_D \; \mathbb{E}_{x_1,\ldots,x_B\sim\mathcal{D}}\big[\log D(x_1,\ldots,x_B)\big] \;+\; \mathbb{E}_{z_1,\ldots,z_B\sim\mathcal{Z}}\big[\log\big(1 - D(G(z_1),\ldots,G(z_B))\big)\big] \qquad (1)$$

with $\mathcal{D}$ the empirical distribution over data, $\mathcal{Z}$ a distribution over the latent variable that is the input of the generator, G a pointwise generator and D a batch discriminator.¹ This leads to a learning procedure similar to the usual GAN algorithm, except that the loss encourages the discriminator to output 1 when faced with an entire batch of real data, and 0 when faced with an entire batch of generated data.

¹The generator G could also be modified to produce batches of data, which can help to cover more modes per batch, but this deviates from the objective of learning a density estimator from which we can draw i.i.d. samples.

Unfortunately, this basic procedure makes the work of the discriminator too easy. As the discriminator is only faced with batches that consist of either only training samples or only generated samples, it can base its prediction on any subset of these samples. For example, a single poor generated sample would be enough to reject a batch. To cope with this deficiency, we propose to sample batches that mix both training and generated data. The discriminator’s task is to predict the proportion of real images in the batch, which is clearly a permutation-invariant quantity.

Figure 2. Effect of batch smoothing with different γ’s on the generator and discriminator losses (top: discriminator loss; bottom: generator loss; curves for γ = 0.2, 0.3, 0.5).

3.1. Batch smoothing as a regularizer

A naive approach to sampling mixed batches would be, for each batch index, to pick a datapoint from either real or generated images with probability 1/2. This is necessarily ill behaved: as the batch size increases, the ratio of training data to generated data in the batch tends to 1/2 by the law of large numbers. Consequently, a discriminator always predicting 1/2 would achieve very low error with large batch sizes, and provide no training signal to the generator.

Instead, for each batch we sample a ratio p from a distribution P on [0, 1], and construct the batch by picking real samples with probability p and generated samples with probability 1 − p. This forces the discriminator to predict across an entire range of possible values of p.

Formally, suppose we are given a batch of training data x ∈ R^{B×n} and a batch of generated data x̃ ∈ R^{B×n}. To mix x and x̃, a binary vector β is sampled from B(p)^B, a B-dimensional Bernoulli distribution with parameter p. The mixed batch with mixing vector β is denoted

$$m_\beta(x, \tilde{x}) := x \odot \beta + \tilde{x} \odot (1 - \beta), \qquad (2)$$

where ⊙ denotes selection along the batch dimension: the b-th sample of the mixed batch is x_b if β_b = 1 and x̃_b otherwise.
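For concreteness, here is a minimal sketch of this mixing step (PyTorch; illustrative only and not the authors' implementation — the tensor names and the helper below are ours):

```python
import torch

def mix_batches(x_real: torch.Tensor, x_fake: torch.Tensor, p: float):
    """Eq. (2): build m_beta(x, x_tilde), taking sample b from x_real where beta_b = 1."""
    B = x_real.shape[0]
    beta = torch.bernoulli(torch.full((B,), p))          # beta ~ B(p)^B
    mask = beta.view(B, *([1] * (x_real.dim() - 1)))     # broadcast over the sample dimensions
    mixed = mask * x_real + (1.0 - mask) * x_fake
    return mixed, beta

# usage: sample a ratio p ~ U([0, 1]), then mix a real and a generated batch
p = torch.rand(()).item()
x_real, x_fake = torch.randn(16, 3, 32, 32), torch.randn(16, 3, 32, 32)
mixed, beta = mix_batches(x_real, x_fake, p)
```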
This apparently wastes some samples, but we can reuse the discarded samples by using 1 − β in the next batch.

The discriminator has to predict the ratio of real images, #β/B, where #β is the sum of the components of β. As a loss on the predicted ratio, we use the Kullback–Leibler divergence between a Bernoulli distribution with the actual ratio of real images and a Bernoulli distribution with the predicted ratio. The divergence between Bernoulli distributions with parameters u and v is

$$\mathrm{KL}\big(\mathcal{B}(u)\,\|\,\mathcal{B}(v)\big) = u\log\frac{u}{v} + (1-u)\log\frac{1-u}{1-v}. \qquad (3)$$

Formally, the discriminator D minimizes the objective

$$\mathbb{E}_{p\sim P,\;\beta\sim\mathcal{B}(p)^B}\left[\mathrm{KL}\Big(\mathcal{B}\big(\tfrac{\#\beta}{B}\big)\,\Big\|\,\mathcal{B}\big(D(m_\beta(x,\tilde{x}))\big)\Big)\right], \qquad (4)$$

where the expectation is over sampling p from a distribution P, typically uniform on [0, 1], and then sampling a mixed minibatch. For clarity, we have omitted the expectation over the sampling of training and generated samples.

The generator is trained with the loss

$$\mathbb{E}_{p\sim P,\;\beta\sim\mathcal{B}(p)^B}\big[\log D(m_\beta(x,\tilde{x}))\big]. \qquad (5)$$

This loss, which is not the generator loss associated with the min-max optimization problem, is known to saturate less (Goodfellow et al., 2014).

In some experimental cases, using the discriminator loss (4) with P = U([0, 1]) made discriminator training too difficult. To alleviate some of the difficulty, we instead sampled the mixing variable p from a reduced symmetric union of intervals [0, γ] ∪ [1 − γ, 1]. With low γ, all batches are nearly purely taken from either real or fake data. We refer to this training method as batch smoothing-γ. Batch smoothing-0 corresponds to no mixing, while batch smoothing-0.5 corresponds to equation (4) with P = U([0, 1]).
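As an illustration of how the smoothing interval and the objectives (3)–(5) could be implemented, here is a hedged PyTorch sketch (ours, not the authors' code). It assumes a BGAN-style discriminator `disc` mapping a whole mixed batch to a single ratio prediction in (0, 1), reuses `mix_batches` from the sketch above, and adopts one sign convention for the non-saturating generator loss (minimizing −log D of the mixed batch):

```python
import torch

def sample_smoothing_ratio(gamma: float = 0.5) -> float:
    """Sample p uniformly from [0, gamma] ∪ [1 - gamma, 1]; gamma = 0.5 recovers U([0, 1])."""
    p = gamma * torch.rand(()).item()
    return 1.0 - p if torch.rand(()).item() < 0.5 else p

def bernoulli_kl(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """KL(B(u) || B(v)) from Eq. (3), clamped for numerical stability."""
    u = u.clamp(eps, 1 - eps)
    v = v.clamp(eps, 1 - eps)
    return u * torch.log(u / v) + (1 - u) * torch.log((1 - u) / (1 - v))

def discriminator_loss(disc, mixed, beta):
    """Eq. (4) for one sampled batch: KL between the true ratio #beta/B and the prediction."""
    return bernoulli_kl(beta.mean(), disc(mixed))

def generator_loss(disc, mixed, eps: float = 1e-6):
    """Eq. (5), negated so that minimizing it pushes log D of the mixed batch up."""
    return -torch.log(disc(mixed).clamp(eps, 1 - eps))
```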
3.2. The optimal discriminator for batch smoothing

The optimal discriminator for batch smoothing can be computed explicitly for p ∼ U([0, 1]), and extends the usual GAN discriminator when B = 1.

Proposition 1. The optimal discriminator for the loss (4), given a batch y ∈ R^{B×N}, is

$$D^*(y) = \frac{1}{2}\,\frac{p_{\mathrm{unbalanced}}(y)}{p_{\mathrm{balanced}}(y)} \qquad (6)$$

where the distributions p_balanced and p_unbalanced over batches are defined as

$$p_{\mathrm{balanced}}(y) = \frac{1}{B+1}\sum_{\beta\in\{0,1\}^B}\frac{p_1(y)^\beta\,p_2(y)^{1-\beta}}{\binom{B}{\#\beta}}, \qquad p_{\mathrm{unbalanced}}(y) = \frac{2}{B+1}\sum_{\beta\in\{0,1\}^B}\frac{\#\beta}{B}\,\frac{p_1(y)^\beta\,p_2(y)^{1-\beta}}{\binom{B}{\#\beta}}, \qquad (7)$$

in which p_1 is the data distribution and p_2 the distribution of generated samples, and where p_1(y)^β is shorthand for p_1(y_1)^{β_1} · · · p_1(y_B)^{β_B}.

The proof is technical and is deferred to the supplementary material. For non-uniform Beta distributions on p, a similar result holds, with different coefficients depending on #β and B in the sum.

These heavy expressions can be interpreted easily. First, in the case B = 1, the optimal discriminator reduces to the optimal discriminator of a standard GAN, D* = p_1(y)/(p_1(y) + p_2(y)). Moreover, p_balanced(y) is simply the distribution of batches y under our sampling procedure: sample p uniformly, then sample β ∼ B(p)^B and build the mixed batch. The binomial coefficients put contributions with different true/fake ratios on an equal footing.

The generator loss (5), when faced with the optimal discriminator, is the Kullback–Leibler divergence between p_balanced and p_unbalanced (up to sign and a constant log 2). Since p_unbalanced puts more weight on batches with higher #β (more true samples), this brings fake samples closer to true ones.

Since p_balanced and p_unbalanced differ term by term by a factor 2#β/B, the ratio D*(y) = (1/2) p_unbalanced(y)/p_balanced(y) is simply the expectation of #β/B under the probability distribution on β that is proportional to p_1(y)^β p_2(y)^{1−β} divided by the binomial coefficient (B choose #β). But this is the posterior distribution on β given the batch y under the uniform prior on the ratio p. Thus, the optimal discriminator is just the posterior mean of the ratio of true samples, D*(y) = E_{β|y}[#β/B]. This is standard when minimizing an expected divergence between Bernoulli distributions, and the approach can therefore be extended to non-uniform priors on p, as shown in Section 9 (supplementary material).
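To make this concrete, the following self-contained check (toy 1-D Gaussian densities standing in for p_1 and p_2; all values here are illustrative) enumerates β ∈ {0,1}^B for a small batch and verifies that (6)–(7) coincide with the posterior mean of #β/B:

```python
import math
from itertools import product

def gaussian_pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

B = 3
y = [0.3, 1.7, -0.5]                        # a toy batch of 1-D "samples"
p1 = [gaussian_pdf(v, 0.0) for v in y]      # toy data density, N(0, 1)
p2 = [gaussian_pdf(v, 2.0) for v in y]      # toy generator density, N(2, 1)

num, den = 0.0, 0.0
for beta in product([0, 1], repeat=B):
    k = sum(beta)                           # #beta, the number of "real" entries
    w = 1.0
    for b in range(B):
        w *= p1[b] if beta[b] == 1 else p2[b]
    w /= math.comb(B, k)                    # p1(y)^beta p2(y)^(1-beta) / C(B, #beta)
    den += w                                # proportional to p_balanced(y)
    num += (k / B) * w                      # proportional to p_unbalanced(y) / 2

p_balanced = den / (B + 1)
p_unbalanced = 2 * num / (B + 1)
d_star = 0.5 * p_unbalanced / p_balanced    # Eq. (6)
posterior_mean = num / den                  # E[#beta / B | y] under the uniform prior on p
assert abs(d_star - posterior_mean) < 1e-12
print(d_star, posterior_mean)
```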
4. Permutation invariant networks

Computing statistics of probability distributions from batches of i.i.d. samples requires computing quantities that are invariant to permutations of the samples within the batch. In this section we propose a permutation-equivariant layer that can be used, together with a permutation-invariant aggregation operation, to build networks that are permutation invariant. We also provide a sketch of the proof (fully developed in the supplementary material) that this architecture can approximate all continuous symmetric functions, and represents only such functions.

4.1. Building a permutation invariant architecture

A naive way of achieving invariance to batch permutations is to consider the batch dimension as a regular feature dimension, and to randomly reorder the samples in the batch at each step. This multiplies the input dimension by the batch size, and thus greatly increases the number of trainable parameters. Moreover, it only provides approximate invariance to batch permutation, as the network has to infer the invariance from the training data. Instead, we propose to build invariance directly into the architecture. This drastically reduces the number of parameters compared to the naive approach, bringing it back in line with ordinary networks, and ensures strict invariance to batch permutation.

Let us first formalize the notions of batch permutation invariance and equivariance. A function f from R^{B×l} to R^{B×L} is batch permutation equivariant if permuting the samples in the batch results in the same permutation of the outputs: for any permutation σ of the batch indices,

$$f(x_{\sigma(1)},\ldots,x_{\sigma(B)}) = \big(f(x)_{\sigma(1)},\ldots,f(x)_{\sigma(B)}\big). \qquad (8)$$

For instance, any regular neural network or other function treating the inputs x_1, ..., x_B independently in parallel is batch permutation equivariant. A function f from R^{B×l} to R^L is batch permutation invariant if permuting the inputs in the batch does not change the output: for any permutation σ of the batch indices,

$$f(x_{\sigma(1)},\ldots,x_{\sigma(B)}) = f(x_1,\ldots,x_B). \qquad (9)$$

The mean, the max and the standard deviation along the batch axis are all batch permutation invariant.

Permutation-equivariant and permutation-invariant functions can be obtained by combining ordinary, parallel treatment of batch samples with an additional batch-averaging operation that averages activations across the batch direction. In our architecture, this averaging is the only form of interaction between different elements of the batch. It is one of our main results that such operations are sufficient to recover all invariant functions.

Formally, on a batch of data x ∈ R^{B×n}, our proposed batch permutation invariant network f_θ is defined as

$$f_\theta(x) = \frac{1}{B}\sum_{b=1}^{B}\big(\phi_{\theta_p}\circ\phi_{\theta_{p-1}}\circ\cdots\circ\phi_{\theta_0}(x)\big)_b \qquad (10)$$

where each φ_{θ_i} is a batch permutation equivariant function from R^{B×l_{i−1}} to R^{B×l_i}, the l_i being the layer sizes.

The equivariant layer operation φ_θ with l input features and L output features comprises an ordinary weight matrix Λ ∈ R^{l×L} that treats each data point of the batch independently (“non-batch-mixing”), a batch-mixing weight matrix Γ ∈ R^{l×L}, and a bias vector β ∈ R^L. As in regular neural networks, Λ processes each data point in the batch independently. The weight matrix Γ, on the other hand, operates after computing an average across the whole batch. Defining ρ as the batch average of each feature,

$$\rho(x_1,\ldots,x_B) := \frac{1}{B}\sum_{b=1}^{B} x_b, \qquad (11)$$

the permutation-equivariant layer φ is formally defined as

$$\phi_\theta(x)_b := \mu\big(\beta + x_b\Lambda + \rho(x)\Gamma\big) \qquad (12)$$

where μ is a nonlinearity, b is a batch index, and the parameter of the layer is θ = (β, Λ, Γ).

4.2. Networks of equivariant layers provide universal approximation of permutation invariant functions

The networks constructed above are permutation invariant by construction. However, it is unclear a priori that all permutation-invariant functions can be represented this way: the functions that can be approximated to arbitrary precision by these networks could be a strict subset of the set of permutation-invariant functions. The optimal solution for the discriminator could lie outside this subset, making our construction too restrictive. We now show this is not the case: our architecture satisfies a universal approximation theorem for permutation-invariant functions.

Theorem 1. The set of networks obtained by stacking, as in Eq. (10), the layers φ defined in Eq. (12), with sigmoid nonlinearities except on the output layer, is dense in the set of continuous permutation-invariant functions (for the topology of uniform convergence on compact sets).

While the case of one-dimensional features is relatively simple, the multidimensional case is more intricate, and the detailed proof is given in the supplementary material. Let us describe the key ideas underlying the proof. The standard universal approximation theorem for neural networks states the following: for any continuous function f, we can find a network that, given a batch x = (x_1, ..., x_B), computes (f(x_1), ..., f(x_B)). This is insufficient for our purpose, as it provides no way of mixing information between samples in the batch.

First, we prove that the set of functions that can be approximated to arbitrary precision by our networks is an algebra, i.e., a vector space stable under products. From this point on, it remains to show that this algebra contains a generating family of the continuous symmetric functions. To compute the sum of two functions f_1 and f_2, compute f_1 and f_2 on different channels (this is possible even if f_1 and f_2 require different numbers of layers, by filling in with the identity if necessary), then sum across channels, which is possible in (12). To compute products, first compute f_1 and f_2 on different channels, then apply the universal approximation theorem to turn these into log f_1 and log f_2, add them, and take the exponential, again thanks to the universal approximation theorem.

The key point is then the following: the algebra of all permutation-invariant polynomials over the components of (x_1, ..., x_B) is generated, as an algebra, by the averages (1/B)(f(x_1) + · · · + f(x_B)) when f ranges over all functions of single batch elements. This non-trivial algebraic statement is proved in the supplementary material. By construction, such functions (1/B)(f(x_1) + · · · + f(x_B)) are readily available in our architecture, by computing f as in an ordinary network and then applying the batch-averaging operation ρ in the next layer. Further layers provide sums and products of those, thanks to the algebra property. We can conclude with a symmetric version of the Stone–Weierstrass theorem (permutation-invariant polynomials are dense in continuous permutation-invariant functions).
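The following is a minimal PyTorch sketch (ours, not the authors' code) of the equivariant layer (12) and of a stack of such layers reduced by a batch average as in (10), here with fully connected Λ and Γ; the convolutional variant and the |B|/(|B| + 1) reweighting used in practice are described in Section 4.3 below.

```python
import torch
import torch.nn as nn

class EquivariantLinear(nn.Module):
    """Permutation-equivariant layer of Eq. (12): phi(x)_b = mu(bias + x_b Lambda + rho(x) Gamma)."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.lam = nn.Linear(in_features, out_features, bias=True)    # per-sample weights Lambda (+ bias)
        self.gam = nn.Linear(in_features, out_features, bias=False)   # batch-mixing weights Gamma

    def forward(self, x):                     # x: (B, in_features)
        rho = x.mean(dim=0, keepdim=True)     # batch average, Eq. (11)
        return torch.sigmoid(self.lam(x) + self.gam(rho))

class InvariantNet(nn.Module):
    """Stack of equivariant layers followed by a batch average, as in Eq. (10)."""
    def __init__(self, sizes):
        super().__init__()
        self.layers = nn.ModuleList(EquivariantLinear(a, b) for a, b in zip(sizes[:-1], sizes[1:]))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x.mean(dim=0)                  # permutation-invariant reduction over the batch

# sanity check: the output is unchanged under a permutation of the batch, Eq. (9)
net = InvariantNet([8, 16, 1])
x = torch.randn(5, 8)
perm = torch.randperm(5)
assert torch.allclose(net(x), net(x[perm]), atol=1e-6)
```

The final assertion checks numerically that permuting the samples in the batch leaves the output unchanged.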
4.3. Practical architecture

In our experiments, we apply the constructions above to standard deep convolutional neural networks. In practice, for the linear operations Λ and Γ in (12) we use convolutional kernels (of size 3 × 3) acting on x_b and ρ(x) respectively. The weight tensors Λ and Γ are also rescaled so that, at the start of training, ρ(x) does not contribute disproportionately compared with the other features: Λ̃ = |B|/(|B| + 1) · Λ and Γ̃ = 1/(|B| + 1) · Γ, where |B| denotes the size of the batch B. While these coefficients could be learned, we have found this explicit initialization to improve training. Figure 1 shows how to modify standard CNN architectures to adapt each layer to our method.

In the first setup, which we refer to as BGAN, a permutation-invariant reduction is applied at the end of the discriminator, yielding a single prediction per batch, which is evaluated with the loss in (4). We also introduce a second setup, M-BGAN, in which we swap the order of averaging and applying the loss.² Namely, letting y be the single target for the batch (in our case, the proportion of real samples), the BGAN case translates into

$$L\big((o_1,\ldots,o_B), y\big) = \ell\Big(\frac{1}{B}\sum_{i=1}^{B} o_i,\; y\Big) \qquad (13)$$

while M-BGAN translates to

$$L\big((o_1,\ldots,o_B), y\big) = \frac{1}{B}\sum_{i=1}^{B} \ell(o_i, y) \qquad (14)$$

where L is the final loss function, ℓ is the KL loss used in (4), (o_1, ..., o_B) is the output of the last equivariant layer, and y is the target for the whole batch. Both losses are permutation invariant. A more detailed explanation of M-BGAN is given in Section 11 (supplementary material).

²This was initially a bug that worked.
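The difference between (13) and (14) is only where the batch average is taken; a minimal sketch, assuming `outputs` is the (B,)-shaped output of the last equivariant layer and `ell(pred, target)` a per-prediction loss such as the Bernoulli KL of Eq. (3):

```python
import torch

def bgan_loss(outputs, y, ell):
    """Eq. (13): average the per-sample outputs first, then apply the loss once."""
    return ell(outputs.mean(dim=0), y)

def mbgan_loss(outputs, y, ell):
    """Eq. (14): apply the loss to each output, then average the losses."""
    return torch.stack([ell(o, y) for o in outputs]).mean()

# example: compare a predicted ratio to the target ratio with the KL of Eq. (3)
# ell = lambda pred, target: bernoulli_kl(torch.as_tensor(target), pred)
```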
5. Experiments

5.1. Synthetic 2D distributions

The synthetic dataset from Zhang et al. (2017) is explicitly designed to test mode dropping. The data are sampled from a mixture of concentrated Gaussians in the 2D plane. We compare standard GAN training, “mixup” training (Zhang et al., 2017), and batch smoothing using the BGAN from Section 4.3. In all cases, the generators and discriminators are three-layer ReLU networks with 512 units per layer. The latent variables of the generator are 2-dimensional standard Gaussians. The models are trained on their respective losses using the Adam optimizer (Kingma & Ba, 2015) with default parameters. The discriminator is trained for five steps for each generator step.

The results are summarized in Figure 3. Batch smoothing and mixup have similar effects. Results for BGAN and M-BGAN are qualitatively similar on this dataset and we only display results for BGAN. The standard GAN setting quickly diverges, due to its inability to fit several modes simultaneously, while both batch smoothing and mixup successfully fit the majority of modes of the distribution.

Figure 3. Comparison between standard, mixup and batch smoothing GANs on a 2D experiment (Squares and Circles datasets; for each: GAN, mixup GAN, and BGAN(γ = 0.3)). Training at iterations 10, 100, 1000, 10000 and 20000.

5.2. Experimental results on CIFAR10

Next, we consider image generation on the CIFAR10 dataset. We use the simple architecture from (Miyato et al., 2018), minimally modified to obtain permutation invariance thanks to (12). All other architectural choices are unchanged. The same Adam hyperparameters as in (Miyato et al., 2018) are used for all models: α = 2e−4, β1 = 0.5, β2 = 0.999, and no learning rate decay. We performed a hyperparameter search for the number of discrimination steps between each generation step, ndisc, over the range {1, . . . , 5}, and for the batch smoothing parameter γ over [0.2, 0.5]. All models are trained for 400,000 iterations, counting both generation and discrimination steps.

We compare smoothed BGAN and M-BGAN with the same network trained with spectral normalization (SN) (Miyato et al., 2018), and with gradient penalty (Gulrajani et al., 2017) applied to both the Wasserstein loss (WGP) (Arjovsky et al., 2017) and the standard loss (GP). We also compare to a model using the batch-discrimination layer from (Salimans et al., 2016), obtained by adding a final batch-discrimination layer to the architecture of (Miyato et al., 2018). All models are evaluated by reporting the Inception Score (IS) and the Fréchet Inception Distance (FID) (Heusel et al., 2017); results are summarized in Table 1. Figure 4 displays sample images generated with our best model. Figure 6 highlights the training dynamics of each model.³

³For readability, a slight smoothing is performed on the curves.

Table 1. Comparison to the state of the art in terms of inception score (IS) and Fréchet inception distance (FID).

Model                        IS           FID
WGP (Miyato et al., 2018)    6.68 ± .06   40.2
GP (Miyato et al., 2018)     6.93 ± .08   37.7
SN (Miyato et al., 2018)     7.42 ± .08   29.3
Salimans et al.              7.09 ± .08   35.0
BGAN                         7.05 ± .06   36.47
M-BGAN                       7.49 ± .06   23.71

Figure 4. Sample images generated by our best model trained on CIFAR10.

Figure 6. Inception score over training iterations for various versions of BGAN and for batch discrimination (Salimans et al., 2016) (curves: M-BGAN with mixed batches, BGAN with mixed batches, batch discrimination, and BGAN with pure batches).

On this architecture, M-BGAN heavily outperforms both batch discrimination and our other variants, and yields results similar to, or slightly better than, (Miyato et al., 2018). Models trained with batch smoothing display results on par with batch discrimination, and much better than without batch smoothing.

5.3. Effect of batch smoothing on the generator and discriminator losses

To check the effect of the batch smoothing parameter γ on the losses, we plot the discriminator and generator losses of the network for different γ's; the smaller the γ, the purer the batches. We would expect discriminator training to be more difficult with larger γ. The results corroborate this insight (Fig. 2). BGAN and M-BGAN behave similarly and we only report on BGAN in the figure. The discriminator loss is not directly affected by an increase in γ, but the generator loss is lower for larger γ, revealing the relative advantage of the generator over the discriminator. This suggests increasing γ if the discriminator dominates learning, and decreasing γ if the discriminator's loss is stuck at a high value in spite of poor generated samples.
5.4. Qualitative results on celebA

Finally, on the celebA face dataset, we adapt the simple architecture of (Miyato et al., 2018) to the increased resolution by adding a layer to both networks. For optimization we use Adam with β1 = 0, β2 = 0.9, α = 1e−4, and ndisc = 1. Fig. 5 displays BGAN samples with pure batches, and BGAN and M-BGAN samples with γ = 0.5. The visual quality of the samples is reasonable; we believe that an improvement is visible from pure batches to M-BGAN.

Figure 5. Samples obtained after 66000 iterations on the celebA dataset. From left to right: (a) standard GAN; (b) single batch discriminator, no batch smoothing; (c) single batch discriminator, batch smoothing γ = 0.5; (d) multiple batch discriminators, batch smoothing γ = 0.5.

6. Conclusion

We introduced a method to feed batches of samples to the discriminator of a GAN in a principled way, based on two observations: first, feeding all-fake or all-genuine batches to a discriminator makes its task too easy; second, a simple architectural trick makes it possible to provably recover all functions of the batch as an unordered set. Experimentally, this provides a new, alternative method to reduce mode dropping and reach good quantitative scores in GAN training.

Acknowledgments

This work has been partially supported by the grant ANR-16-CE23-0006 “Deep in France” and LabEx PERSYVAL-Lab (ANR-11-LABX-0025-01).

References

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 214–223, 2017. URL http://proceedings.mlr.press/v70/arjovsky17a.html.

Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, Dec 1989. ISSN 1435-568X. doi: 10.1007/BF02551274. URL https://doi.org/10.1007/BF02551274.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. In ICLR, 2017.

Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906, 2015.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. CoRR, abs/1704.00028, 2017. URL http://arxiv.org/abs/1704.00028.

Guttenberg, N., Virgo, N., Witkowski, O., Aoki, H., and Kanai, R. Permutation-equivariant neural networks applied to dynamics prediction. CoRR, abs/1612.04530, 2016. URL http://arxiv.org/abs/1612.04530.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a Nash equilibrium. CoRR, abs/1706.08500, 2017. URL http://arxiv.org/abs/1706.08500.

Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.

Kingma, D. and Welling, M. Auto-encoding variational Bayes. In ICLR, 2014.

Li, Y., Swersky, K., and Zemel, R. S. Generative moment matching networks. CoRR, abs/1502.02761, 2015. URL http://arxiv.org/abs/1502.02761.
McGregor, S. Neural network processing for multiset data. In Proceedings of the 17th International Conference on Artificial Neural Networks, ICANN'07, pp. 460–470, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 3-540-74689-7, 978-3-540-74689-8. URL http://dl.acm.org/citation.cfm?id=1776814.1776866.

McGregor, S. Further results in multiset processing with neural networks. Neural Networks, 21(6):830–837, 2008. doi: 10.1016/j.neunet.2008.06.020. URL https://doi.org/10.1016/j.neunet.2008.06.020.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1QRgziT-.

Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279, 2016.

Qi, C. R., Su, H., Mo, K., and Guibas, L. J. PointNet: Deep learning on point sets for 3D classification and segmentation. CoRR, abs/1612.00593, 2016. URL http://arxiv.org/abs/1612.00593.

Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Rezende, D., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In NIPS, 2016.

Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585, 2015.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 6000–6010. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Póczos, B., Salakhutdinov, R., and Smola, A. J. Deep sets. CoRR, abs/1703.06114, 2017. URL http://arxiv.org/abs/1703.06114.

Zhang, H., Cissé, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. CoRR, abs/1710.09412, 2017. URL http://arxiv.org/abs/1710.09412.