Mixed batches and symmetric discriminators for GAN training

Thomas Lucas*¹  Corentin Tallec*²  Jakob Verbeek¹  Yann Ollivier³

*Equal contribution. ¹Université Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France. ²Université Paris Sud, INRIA, équipe TAU, Gif-sur-Yvette, 91190, France. ³Facebook Artificial Intelligence Research, Paris, France. Correspondence to: Corentin Tallec, Thomas Lucas.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

Generative adversarial networks (GANs) are powerful generative models based on providing feedback to a generative network via a discriminator network. However, the discriminator usually assesses individual samples. This prevents the discriminator from accessing global distributional statistics of generated samples, and often leads to mode dropping: the generator models only part of the target distribution. We propose to feed the discriminator with mixed batches of true and fake samples, and train it to predict the ratio of true samples in the batch. The latter score does not depend on the order of samples in a batch. Rather than learning this invariance, we introduce a generic permutation-invariant discriminator architecture. This architecture is provably a universal approximator of all symmetric functions. Experimentally, our approach reduces mode collapse in GANs on two synthetic datasets, and obtains good results on the CIFAR10 and CelebA datasets, both qualitatively and quantitatively.

1. Introduction

Estimating generative models from unlabeled data is one of the challenges of unsupervised learning. Recently, several latent variable approaches have been proposed to learn flexible density estimators together with efficient sampling, such as generative adversarial networks (GANs) (Goodfellow et al., 2014), variational autoencoders (Kingma & Welling, 2014; Rezende et al., 2014), iterative transformation of noise (Sohl-Dickstein et al., 2015), and non-volume preserving transformations (Dinh et al., 2017). In this work we focus on GANs, currently the most convincing source of samples of natural images (Karras et al., 2018). GANs consist of a generator and a discriminator network. The generator maps samples from a latent random variable with a basic prior, such as a multivariate Gaussian, to the observation space; this defines a probability distribution over the observation space. A discriminator network is trained to distinguish between generated samples and true samples in the observation space, while the generator is trained to fool the discriminator. In an idealized setting with unbounded capacity of both networks and infinite training data, the generator should converge to the distribution from which the training data has been sampled.

In most adversarial setups, the discriminator classifies individual data samples. Consequently, it cannot directly detect discrepancies between the distribution of generated samples and global statistics of the training distribution, such as its moments or quantiles. For instance, if the generator models a restricted part of the support of the target distribution very well, it can fool the discriminator at the level of individual samples, a phenomenon known as mode dropping. In such a case there is little incentive for the generator to model other parts of the support of the target distribution. A more thorough explanation of this effect can be found in (Salimans et al., 2016).

In order to access global distributional statistics, imagine a discriminator that could somehow take full probability distributions as its input. This is impossible in practice. Still, it is possible to feed large batches of training or generated samples to the discriminator, as an approximation of the corresponding distributions. The discriminator can compute statistics on those batches and detect discrepancies between the two distributions. For instance, if a large batch exhibits only one mode of a multimodal distribution, the discriminator notices the discrepancy right away. Even though a single batch may not encompass all modes of the distribution, it still conveys more information about missing modes than an individual example.

Training the discriminator to discriminate “pure” batches containing only real or only synthetic samples makes its task too easy, as a single bad sample reveals the whole batch as synthetic. Instead, we introduce a “mixed” batch discrimination task in which the discriminator needs to predict the ratio of real samples in a batch.

This use of batches differs from traditional minibatch learning. The batch is not used as a computational trick to increase parallelism, but as an approximate distribution on which to compute global statistics.

A naive way of doing so would be to concatenate the samples in the batch, feeding the discriminator a single tensor containing all the samples. However, this is parameter-hungry, and the computed statistics are not automatically invariant to the order of samples in the batch. To compute functions that depend on the samples only through their distribution, it is necessary to restrict the class of discriminator networks to permutation-invariant functions of the batch. For this, we adapt and extend an architecture from McGregor (2007) to compute symmetric functions of the input. We show this can be done with minimal modification to existing architectures, at a negligible computational overhead w.r.t. ordinary batch processing.

Figure 1. Graphical representation of our discriminator architecture. Each convolutional layer of an otherwise classical CNN architecture is modified to include permutation-invariant batch statistics, denoted ρ(x). This is repeated at every layer so that the network gradually builds up more complex statistics.

In summary, our contributions are the following:

• Naively training the discriminator to discriminate “pure” batches with only real or only synthetic samples makes its task too easy. We introduce a discrimination loss based on mixed batches of true and fake samples that avoids this pitfall, and we derive the associated optimal discriminator.

• We provide a principled way of defining neural networks that are permutation-invariant over a batch of samples. We formally prove that the resulting class of functions comprises all symmetric continuous functions, and only symmetric functions.

• We apply these insights to GANs, with good experimental results, both qualitatively and quantitatively.

We believe that discriminating between distributions at the batch level provides an equally principled alternative to approaches to GANs based on duality formulas (Nowozin et al., 2016; Gulrajani et al., 2017; Arjovsky et al., 2017).
2. Related work

The training of generative models via distributional rather than pointwise information has been explored in several recent contributions. Batch discrimination (Salimans et al., 2016) uses a handmade layer to compute batch statistics which are then combined with sample-specific features to enhance individual sample discrimination. Karras et al. (2018) directly compute the standard deviation of features and feed it as an additional feature to the last layer of the network. Both methods use a single layer of handcrafted batch statistics, instead of letting the discriminator learn arbitrary batch statistics useful for discrimination as in our approach. Moreover, in both methods the discriminator still assesses single samples, rather than entire batches. Radford et al. (2015) reported improved results with batch normalization in the discriminator, which may also be due to reliance on batch statistics.

Other works, such as (Li et al., 2015) and (Dziugaite et al., 2015), replace the discriminator with a fixed distributional loss between true and generated samples, the maximum mean discrepancy, as the criterion to train the generative model. This has the advantage of relieving the inherent instability of GANs, but lacks the flexibility of an adaptive discriminator.

The discriminator we introduce treats batches as sets of samples. Processing sets calls for permutation-invariant networks, for which there is a large body of work, e.g., (McGregor, 2007; 2008; Qi et al., 2016; Zaheer et al., 2017; Vaswani et al., 2017). Our processing is inspired by (McGregor, 2007; 2008), which design a special kind of layer that provides the desired invariance property. The network from McGregor (2007) is a multi-layer perceptron in which the single hidden layer performs a batchwise computation that makes the result equivariant under permutation. Here we show that stacking such hidden layers and reducing the final layer with a permutation-invariant reduction covers the whole space of continuous permutation-invariant functions. Zaheer et al. (2017) first process each element of the set independently, then aggregate the resulting representation using a permutation-invariant operation, and finally process the permutation-invariant quantity. Qi et al. (2016) process 3D point cloud data, and interleave layers that process points independently with layers that apply equivariant transformations. The outputs of their networks are either permutation equivariant for point cloud segmentation, or permutation invariant for shape recognition. In our approach we stack permutation-equivariant layers that combine batch information and sample information at every level, and aggregate these in the final layer using a permutation-invariant operation. More complex approaches to permutation invariance or equivariance appear in (Guttenberg et al., 2016). We prove, however, that our simpler architecture already covers the full space of permutation-invariant functions.

Improving the training of GANs has received a lot of recent attention. For instance, Arjovsky et al. (2017), Gulrajani et al. (2017) and Miyato et al. (2018) constrain the Lipschitz constant of the network and show that this stabilizes training and improves performance. Karras et al. (2018) achieved impressive results by gradually increasing the resolution of the generated images as training progresses.

3. Adversarial learning with permutation-invariant batch features

Using a batch of samples rather than individual samples as input to the discriminator can provide global statistics about the distributions of interest. Such statistics could be useful to avoid mode dropping.

Adversarial learning (Goodfellow et al., 2014) can easily be extended to the batch discrimination case. For a fixed batch size B, the corresponding two-player optimization procedure becomes

$$\min_G \max_D \; \mathbb{E}_{x_1,\ldots,x_B\sim\mathcal{D}}\big[\log D(x_1,\ldots,x_B)\big] \;+\; \mathbb{E}_{z_1,\ldots,z_B\sim\mathcal{Z}}\big[\log\big(1 - D(G(z_1),\ldots,G(z_B))\big)\big] \qquad (1)$$

with $\mathcal{D}$ the empirical distribution over data, $\mathcal{Z}$ a distribution over the latent variable that is the input of the generator, G a pointwise generator and D a batch discriminator.¹ This leads to a learning procedure similar to the usual GAN algorithm, except that the loss encourages the discriminator to output 1 when faced with an entire batch of real data, and 0 when faced with an entire batch of generated data.

¹The generator G could also be modified to produce batches of data, which can help to cover more modes per batch, but this deviates from the objective of learning a density estimator from which we can draw i.i.d. samples.

Unfortunately, this basic procedure makes the work of the discriminator too easy. As the discriminator is only faced with batches that consist of either only training samples or only generated samples, it can base its prediction on any subset of these samples. For example, a single poor generated sample would be enough to reject a batch. To cope with this deficiency, we propose to sample batches that mix both training and generated data. The discriminator’s task is to predict the proportion of real images in the batch, which is clearly a permutation-invariant quantity.

Figure 2. Effect of batch smoothing with different γ’s on the generator and discriminator losses (top: discriminator loss; bottom: generator loss; curves for γ = 0.2, 0.3, 0.5).

3.1. Batch smoothing as a regularizer

A naive approach to sampling mixed batches would be, for each batch index, to pick a datapoint from either real or generated images with probability 1/2. This is necessarily ill behaved: as the batch size increases, the ratio of training data to generated data in the batch tends to 1/2 by the law of large numbers. Consequently, a discriminator always predicting 1/2 would achieve very low error with large batch sizes, and provide no training signal to the generator.

Instead, for each batch we sample a ratio p from a distribution P on [0, 1], and construct the batch by picking real samples with probability p and generated samples with probability 1 − p. This forces the discriminator to predict across an entire range of possible values of p.

Formally, suppose we are given a batch of training data x ∈ R^{B×n} and a batch of generated data x̃ ∈ R^{B×n}. To mix x and x̃, a binary vector β is sampled from B(p)^B, a B-dimensional Bernoulli distribution with parameter p. The mixed batch with mixing vector β is denoted

$$m_\beta(x, \tilde{x}) := x \odot \beta + \tilde{x} \odot (1 - \beta), \qquad (2)$$

where ⊙ denotes selection along the batch dimension: the b-th sample of the mixed batch is x_b if β_b = 1 and x̃_b otherwise.
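For concreteness, here is a minimal sketch of this mixing step (PyTorch; illustrative only and not the authors' implementation — the tensor names and the helper below are ours):

```python
import torch

def mix_batches(x_real: torch.Tensor, x_fake: torch.Tensor, p: float):
    """Eq. (2): build m_beta(x, x_tilde), taking sample b from x_real where beta_b = 1."""
    B = x_real.shape[0]
    beta = torch.bernoulli(torch.full((B,), p))          # beta ~ B(p)^B
    mask = beta.view(B, *([1] * (x_real.dim() - 1)))     # broadcast over the sample dimensions
    mixed = mask * x_real + (1.0 - mask) * x_fake
    return mixed, beta

# usage: sample a ratio p ~ U([0, 1]), then mix a real and a generated batch
p = torch.rand(()).item()
x_real, x_fake = torch.randn(16, 3, 32, 32), torch.randn(16, 3, 32, 32)
mixed, beta = mix_batches(x_real, x_fake, p)
```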
This apparently wastes some samples, but we can reuse the discarded samples by using 1 − β in the next batch.

The discriminator has to predict the ratio of real images, #β/B, where #β is the sum of the components of β. As a loss on the predicted ratio, we use the Kullback–Leibler divergence between a Bernoulli distribution with the actual ratio of real images and a Bernoulli distribution with the predicted ratio. The divergence between Bernoulli distributions with parameters u and v is

$$\mathrm{KL}\big(\mathcal{B}(u)\,\|\,\mathcal{B}(v)\big) = u\log\frac{u}{v} + (1-u)\log\frac{1-u}{1-v}. \qquad (3)$$

Formally, the discriminator D minimizes the objective

$$\mathbb{E}_{p\sim P,\;\beta\sim\mathcal{B}(p)^B}\left[\mathrm{KL}\Big(\mathcal{B}\big(\tfrac{\#\beta}{B}\big)\,\Big\|\,\mathcal{B}\big(D(m_\beta(x,\tilde{x}))\big)\Big)\right], \qquad (4)$$

where the expectation is over sampling p from a distribution P, typically uniform on [0, 1], and then sampling a mixed minibatch. For clarity, we have omitted the expectation over the sampling of training and generated samples.

The generator is trained with the loss

$$\mathbb{E}_{p\sim P,\;\beta\sim\mathcal{B}(p)^B}\big[\log D(m_\beta(x,\tilde{x}))\big]. \qquad (5)$$

This loss, which is not the generator loss associated with the min-max optimization problem, is known to saturate less (Goodfellow et al., 2014).

In some experimental cases, using the discriminator loss (4) with P = U([0, 1]) made discriminator training too difficult. To alleviate some of the difficulty, we instead sampled the mixing variable p from a reduced symmetric union of intervals [0, γ] ∪ [1 − γ, 1]. With low γ, all batches are nearly purely taken from either real or fake data. We refer to this training method as batch smoothing-γ. Batch smoothing-0 corresponds to no mixing, while batch smoothing-0.5 corresponds to equation (4) with P = U([0, 1]).
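As an illustration of how the smoothing interval and the objectives (3)–(5) could be implemented, here is a hedged PyTorch sketch (ours, not the authors' code). It assumes a BGAN-style discriminator `disc` mapping a whole mixed batch to a single ratio prediction in (0, 1), reuses `mix_batches` from the sketch above, and adopts one sign convention for the non-saturating generator loss (minimizing −log D of the mixed batch):

```python
import torch

def sample_smoothing_ratio(gamma: float = 0.5) -> float:
    """Sample p uniformly from [0, gamma] ∪ [1 - gamma, 1]; gamma = 0.5 recovers U([0, 1])."""
    p = gamma * torch.rand(()).item()
    return 1.0 - p if torch.rand(()).item() < 0.5 else p

def bernoulli_kl(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """KL(B(u) || B(v)) from Eq. (3), clamped for numerical stability."""
    u = u.clamp(eps, 1 - eps)
    v = v.clamp(eps, 1 - eps)
    return u * torch.log(u / v) + (1 - u) * torch.log((1 - u) / (1 - v))

def discriminator_loss(disc, mixed, beta):
    """Eq. (4) for one sampled batch: KL between the true ratio #beta/B and the prediction."""
    return bernoulli_kl(beta.mean(), disc(mixed))

def generator_loss(disc, mixed, eps: float = 1e-6):
    """Eq. (5), negated so that minimizing it pushes log D of the mixed batch up."""
    return -torch.log(disc(mixed).clamp(eps, 1 - eps))
```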
3.2. The optimal discriminator for batch smoothing

The optimal discriminator for batch smoothing can be computed explicitly for p ∼ U([0, 1]), and extends the usual GAN discriminator when B = 1.

Proposition 1. The optimal discriminator for the loss (4), given a batch y ∈ R^{B×N}, is

$$D^*(y) = \frac{1}{2}\,\frac{p_{\mathrm{unbalanced}}(y)}{p_{\mathrm{balanced}}(y)} \qquad (6)$$

where the distributions p_balanced and p_unbalanced over batches are defined as

$$p_{\mathrm{balanced}}(y) = \frac{1}{B+1}\sum_{\beta\in\{0,1\}^B}\frac{p_1(y)^\beta\,p_2(y)^{1-\beta}}{\binom{B}{\#\beta}}, \qquad p_{\mathrm{unbalanced}}(y) = \frac{2}{B+1}\sum_{\beta\in\{0,1\}^B}\frac{\#\beta}{B}\,\frac{p_1(y)^\beta\,p_2(y)^{1-\beta}}{\binom{B}{\#\beta}}, \qquad (7)$$

in which p_1 is the data distribution and p_2 the distribution of generated samples, and where p_1(y)^β is shorthand for p_1(y_1)^{β_1} · · · p_1(y_B)^{β_B}.

The proof is technical and is deferred to the supplementary material. For non-uniform Beta distributions on p, a similar result holds, with different coefficients depending on #β and B in the sum.

These heavy expressions can be interpreted easily. First, in the case B = 1, the optimal discriminator reduces to the optimal discriminator of a standard GAN, D* = p_1(y)/(p_1(y) + p_2(y)). Moreover, p_balanced(y) is simply the distribution of batches y under our sampling procedure: sample p uniformly, then sample β ∼ B(p)^B and build the mixed batch. The binomial coefficients put contributions with different true/fake ratios on an equal footing.

The generator loss (5), when faced with the optimal discriminator, is the Kullback–Leibler divergence between p_balanced and p_unbalanced (up to sign and a constant log 2). Since p_unbalanced puts more weight on batches with higher #β (more true samples), this brings fake samples closer to true ones.

Since p_balanced and p_unbalanced differ term by term by a factor 2#β/B, the ratio D*(y) = (1/2) p_unbalanced(y)/p_balanced(y) is simply the expectation of #β/B under the probability distribution on β that is proportional to p_1(y)^β p_2(y)^{1−β} divided by the binomial coefficient (B choose #β). But this is the posterior distribution on β given the batch y under the uniform prior on the ratio p. Thus, the optimal discriminator is just the posterior mean of the ratio of true samples, D*(y) = E_{β|y}[#β/B]. This is standard when minimizing an expected divergence between Bernoulli distributions, and the approach can therefore be extended to non-uniform priors on p, as shown in Section 9 (supplementary material).
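To make this concrete, the following self-contained check (toy 1-D Gaussian densities standing in for p_1 and p_2; all values here are illustrative) enumerates β ∈ {0,1}^B for a small batch and verifies that (6)–(7) coincide with the posterior mean of #β/B:

```python
import math
from itertools import product

def gaussian_pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

B = 3
y = [0.3, 1.7, -0.5]                        # a toy batch of 1-D "samples"
p1 = [gaussian_pdf(v, 0.0) for v in y]      # toy data density, N(0, 1)
p2 = [gaussian_pdf(v, 2.0) for v in y]      # toy generator density, N(2, 1)

num, den = 0.0, 0.0
for beta in product([0, 1], repeat=B):
    k = sum(beta)                           # #beta, the number of "real" entries
    w = 1.0
    for b in range(B):
        w *= p1[b] if beta[b] == 1 else p2[b]
    w /= math.comb(B, k)                    # p1(y)^beta p2(y)^(1-beta) / C(B, #beta)
    den += w                                # proportional to p_balanced(y)
    num += (k / B) * w                      # proportional to p_unbalanced(y) / 2

p_balanced = den / (B + 1)
p_unbalanced = 2 * num / (B + 1)
d_star = 0.5 * p_unbalanced / p_balanced    # Eq. (6)
posterior_mean = num / den                  # E[#beta / B | y] under the uniform prior on p
assert abs(d_star - posterior_mean) < 1e-12
print(d_star, posterior_mean)
```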
4. Permutation invariant networks

Computing statistics of probability distributions from batches of i.i.d. samples requires computing quantities that are invariant to permutations of the samples within the batch. In this section we propose a permutation-equivariant layer that can be used, together with a permutation-invariant aggregation operation, to build networks that are permutation invariant. We also provide a sketch of the proof (fully developed in the supplementary material) that this architecture can approximate all continuous symmetric functions, and represents only such functions.

4.1. Building a permutation invariant architecture

A naive way of achieving invariance to batch permutations is to consider the batch dimension as a regular feature dimension, and to randomly reorder the samples in the batch at each step. This multiplies the input dimension by the batch size, and thus greatly increases the number of trainable parameters. Moreover, it only provides approximate invariance to batch permutation, as the network has to infer the invariance from the training data. Instead, we propose to build invariance directly into the architecture. This drastically reduces the number of parameters compared to the naive approach, bringing it back in line with ordinary networks, and ensures strict invariance to batch permutation.

Let us first formalize the notions of batch permutation invariance and equivariance. A function f from R^{B×l} to R^{B×L} is batch permutation equivariant if permuting the samples in the batch results in the same permutation of the outputs: for any permutation σ of the batch indices,

$$f(x_{\sigma(1)},\ldots,x_{\sigma(B)}) = \big(f(x)_{\sigma(1)},\ldots,f(x)_{\sigma(B)}\big). \qquad (8)$$

For instance, any regular neural network or other function treating the inputs x_1, ..., x_B independently in parallel is batch permutation equivariant. A function f from R^{B×l} to R^L is batch permutation invariant if permuting the inputs in the batch does not change the output: for any permutation σ of the batch indices,

$$f(x_{\sigma(1)},\ldots,x_{\sigma(B)}) = f(x_1,\ldots,x_B). \qquad (9)$$

The mean, the max and the standard deviation along the batch axis are all batch permutation invariant.

Permutation-equivariant and permutation-invariant functions can be obtained by combining ordinary, parallel treatment of batch samples with an additional batch-averaging operation that averages activations across the batch direction. In our architecture, this averaging is the only form of interaction between different elements of the batch. It is one of our main results that such operations are sufficient to recover all invariant functions.

Formally, on a batch of data x ∈ R^{B×n}, our proposed batch permutation invariant network f_θ is defined as

$$f_\theta(x) = \frac{1}{B}\sum_{b=1}^{B}\big(\phi_{\theta_p}\circ\phi_{\theta_{p-1}}\circ\cdots\circ\phi_{\theta_0}(x)\big)_b \qquad (10)$$

where each φ_{θ_i} is a batch permutation equivariant function from R^{B×l_{i−1}} to R^{B×l_i}, the l_i being the layer sizes.

The equivariant layer operation φ_θ with l input features and L output features comprises an ordinary weight matrix Λ ∈ R^{l×L} that treats each data point of the batch independently (“non-batch-mixing”), a batch-mixing weight matrix Γ ∈ R^{l×L}, and a bias vector β ∈ R^L. As in regular neural networks, Λ processes each data point in the batch independently. The weight matrix Γ, on the other hand, operates after computing an average across the whole batch. Defining ρ as the batch average of each feature,

$$\rho(x_1,\ldots,x_B) := \frac{1}{B}\sum_{b=1}^{B} x_b, \qquad (11)$$

the permutation-equivariant layer φ is formally defined as

$$\phi_\theta(x)_b := \mu\big(\beta + x_b\Lambda + \rho(x)\Gamma\big) \qquad (12)$$

where μ is a nonlinearity, b is a batch index, and the parameter of the layer is θ = (β, Λ, Γ).

4.2. Networks of equivariant layers provide universal approximation of permutation invariant functions

The networks constructed above are permutation invariant by construction. However, it is unclear a priori that all permutation-invariant functions can be represented this way: the functions that can be approximated to arbitrary precision by these networks could be a strict subset of the set of permutation-invariant functions. The optimal solution for the discriminator could lie outside this subset, making our construction too restrictive. We now show this is not the case: our architecture satisfies a universal approximation theorem for permutation-invariant functions.

Theorem 1. The set of networks obtained by stacking, as in Eq. (10), the layers φ defined in Eq. (12), with sigmoid nonlinearities except on the output layer, is dense in the set of continuous permutation-invariant functions (for the topology of uniform convergence on compact sets).

While the case of one-dimensional features is relatively simple, the multidimensional case is more intricate, and the detailed proof is given in the supplementary material. Let us describe the key ideas underlying the proof. The standard universal approximation theorem for neural networks states the following: for any continuous function f, we can find a network that, given a batch x = (x_1, ..., x_B), computes (f(x_1), ..., f(x_B)). This is insufficient for our purpose, as it provides no way of mixing information between samples in the batch.

First, we prove that the set of functions that can be approximated to arbitrary precision by our networks is an algebra, i.e., a vector space stable under products. From this point on, it remains to show that this algebra contains a generating family of the continuous symmetric functions. To compute the sum of two functions f_1 and f_2, compute f_1 and f_2 on different channels (this is possible even if f_1 and f_2 require different numbers of layers, by filling in with the identity if necessary), then sum across channels, which is possible in (12). To compute products, first compute f_1 and f_2 on different channels, then apply the universal approximation theorem to turn these into log f_1 and log f_2, add them, and take the exponential, again thanks to the universal approximation theorem.

The key point is then the following: the algebra of all permutation-invariant polynomials over the components of (x_1, ..., x_B) is generated, as an algebra, by the averages (1/B)(f(x_1) + · · · + f(x_B)) when f ranges over all functions of single batch elements. This non-trivial algebraic statement is proved in the supplementary material. By construction, such functions (1/B)(f(x_1) + · · · + f(x_B)) are readily available in our architecture, by computing f as in an ordinary network and then applying the batch-averaging operation ρ in the next layer. Further layers provide sums and products of those, thanks to the algebra property. We can conclude with a symmetric version of the Stone–Weierstrass theorem (permutation-invariant polynomials are dense in continuous permutation-invariant functions).
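The following is a minimal PyTorch sketch (ours, not the authors' code) of the equivariant layer (12) and of a stack of such layers reduced by a batch average as in (10), here with fully connected Λ and Γ; the convolutional variant and the |B|/(|B| + 1) reweighting used in practice are described in Section 4.3 below.

```python
import torch
import torch.nn as nn

class EquivariantLinear(nn.Module):
    """Permutation-equivariant layer of Eq. (12): phi(x)_b = mu(bias + x_b Lambda + rho(x) Gamma)."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.lam = nn.Linear(in_features, out_features, bias=True)    # per-sample weights Lambda (+ bias)
        self.gam = nn.Linear(in_features, out_features, bias=False)   # batch-mixing weights Gamma

    def forward(self, x):                     # x: (B, in_features)
        rho = x.mean(dim=0, keepdim=True)     # batch average, Eq. (11)
        return torch.sigmoid(self.lam(x) + self.gam(rho))

class InvariantNet(nn.Module):
    """Stack of equivariant layers followed by a batch average, as in Eq. (10)."""
    def __init__(self, sizes):
        super().__init__()
        self.layers = nn.ModuleList(EquivariantLinear(a, b) for a, b in zip(sizes[:-1], sizes[1:]))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x.mean(dim=0)                  # permutation-invariant reduction over the batch

# sanity check: the output is unchanged under a permutation of the batch, Eq. (9)
net = InvariantNet([8, 16, 1])
x = torch.randn(5, 8)
perm = torch.randperm(5)
assert torch.allclose(net(x), net(x[perm]), atol=1e-6)
```

The final assertion checks numerically that permuting the samples in the batch leaves the output unchanged.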
4.3. Practical architecture

In our experiments, we apply the constructions above to standard deep convolutional neural networks. In practice, for the linear operations Λ and Γ in (12) we use convolutional kernels (of size 3 × 3) acting on x_b and ρ(x) respectively. The weight tensors Λ and Γ are also rescaled so that, at the start of training, ρ(x) does not contribute disproportionately compared with the other features: Λ̃ = |B|/(|B| + 1) · Λ and Γ̃ = 1/(|B| + 1) · Γ, where |B| denotes the size of the batch B. While these coefficients could be learned, we have found this explicit initialization to improve training. Figure 1 shows how to modify standard CNN architectures to adapt each layer to our method.

In the first setup, which we refer to as BGAN, a permutation-invariant reduction is applied at the end of the discriminator, yielding a single prediction per batch, which is evaluated with the loss in (4). We also introduce a second setup, M-BGAN, in which we swap the order of averaging and applying the loss.² Namely, letting y be the single target for the batch (in our case, the proportion of real samples), the BGAN case translates into

$$L\big((o_1,\ldots,o_B), y\big) = \ell\Big(\frac{1}{B}\sum_{i=1}^{B} o_i,\; y\Big) \qquad (13)$$

while M-BGAN translates to

$$L\big((o_1,\ldots,o_B), y\big) = \frac{1}{B}\sum_{i=1}^{B} \ell(o_i, y) \qquad (14)$$

where L is the final loss function, ℓ is the KL loss used in (4), (o_1, ..., o_B) is the output of the last equivariant layer, and y is the target for the whole batch. Both losses are permutation invariant. A more detailed explanation of M-BGAN is given in Section 11 (supplementary material).

²This was initially a bug that worked.
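The difference between (13) and (14) is only where the batch average is taken; a minimal sketch, assuming `outputs` is the (B,)-shaped output of the last equivariant layer and `ell(pred, target)` a per-prediction loss such as the Bernoulli KL of Eq. (3):

```python
import torch

def bgan_loss(outputs, y, ell):
    """Eq. (13): average the per-sample outputs first, then apply the loss once."""
    return ell(outputs.mean(dim=0), y)

def mbgan_loss(outputs, y, ell):
    """Eq. (14): apply the loss to each output, then average the losses."""
    return torch.stack([ell(o, y) for o in outputs]).mean()

# example: compare a predicted ratio to the target ratio with the KL of Eq. (3)
# ell = lambda pred, target: bernoulli_kl(torch.as_tensor(target), pred)
```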
5. Experiments

5.1. Synthetic 2D distributions

The synthetic dataset from Zhang et al. (2017) is explicitly designed to test mode dropping. The data are sampled from a mixture of concentrated Gaussians in the 2D plane. We compare standard GAN training, “mixup” training (Zhang et al., 2017), and batch smoothing using the BGAN from Section 4.3. In all cases, the generators and discriminators are three-layer ReLU networks with 512 units per layer. The latent variables of the generator are 2-dimensional standard Gaussians. The models are trained on their respective losses using the Adam optimizer (Kingma & Ba, 2015) with default parameters. The discriminator is trained for five steps for each generator step.

The results are summarized in Figure 3. Batch smoothing and mixup have similar effects. Results for BGAN and M-BGAN are qualitatively similar on this dataset and we only display results for BGAN. The standard GAN setting quickly diverges, due to its inability to fit several modes simultaneously, while both batch smoothing and mixup successfully fit the majority of modes of the distribution.

Figure 3. Comparison between standard, mixup and batch smoothing GANs on a 2D experiment (Squares and Circles datasets; for each: GAN, mixup GAN, and BGAN(γ = 0.3)). Training at iterations 10, 100, 1000, 10000 and 20000.

5.2. Experimental results on CIFAR10

Next, we consider image generation on the CIFAR10 dataset. We use the simple architecture from (Miyato et al., 2018), minimally modified to obtain permutation invariance thanks to (12). All other architectural choices are unchanged. The same Adam hyperparameters as in (Miyato et al., 2018) are used for all models: α = 2e−4, β1 = 0.5, β2 = 0.999, and no learning rate decay. We performed a hyperparameter search for the number of discrimination steps between each generation step, ndisc, over the range {1, . . . , 5}, and for the batch smoothing parameter γ over [0.2, 0.5]. All models are trained for 400,000 iterations, counting both generation and discrimination steps.

We compare smoothed BGAN and M-BGAN with the same network trained with spectral normalization (SN) (Miyato et al., 2018), and with gradient penalty (Gulrajani et al., 2017) applied to both the Wasserstein loss (WGP) (Arjovsky et al., 2017) and the standard loss (GP). We also compare to a model using the batch-discrimination layer from (Salimans et al., 2016), obtained by adding a final batch-discrimination layer to the architecture of (Miyato et al., 2018). All models are evaluated by reporting the Inception Score (IS) and the Fréchet Inception Distance (FID) (Heusel et al., 2017); results are summarized in Table 1. Figure 4 displays sample images generated with our best model. Figure 6 highlights the training dynamics of each model.³

³For readability, a slight smoothing is performed on the curves.

Table 1. Comparison to the state of the art in terms of inception score (IS) and Fréchet inception distance (FID).

Model                        IS           FID
WGP (Miyato et al., 2018)    6.68 ± .06   40.2
GP (Miyato et al., 2018)     6.93 ± .08   37.7
SN (Miyato et al., 2018)     7.42 ± .08   29.3
Salimans et al.              7.09 ± .08   35.0
BGAN                         7.05 ± .06   36.47
M-BGAN                       7.49 ± .06   23.71

Figure 4. Sample images generated by our best model trained on CIFAR10.

Figure 6. Inception score over training iterations for various versions of BGAN and for batch discrimination (Salimans et al., 2016) (curves: M-BGAN with mixed batches, BGAN with mixed batches, batch discrimination, and BGAN with pure batches).

On this architecture, M-BGAN heavily outperforms both batch discrimination and our other variants, and yields results similar to, or slightly better than, (Miyato et al., 2018). Models trained with batch smoothing display results on par with batch discrimination, and much better than without batch smoothing.

5.3. Effect of batch smoothing on the generator and discriminator losses

To check the effect of the batch smoothing parameter γ on the losses, we plot the discriminator and generator losses of the network for different γ's; the smaller the γ, the purer the batches. We would expect discriminator training to be more difficult with larger γ. The results corroborate this insight (Fig. 2). BGAN and M-BGAN behave similarly and we only report on BGAN in the figure. The discriminator loss is not directly affected by an increase in γ, but the generator loss is lower for larger γ, revealing the relative advantage of the generator over the discriminator. This suggests increasing γ if the discriminator dominates learning, and decreasing γ if the discriminator's loss is stuck at a high value in spite of poor generated samples.
5.4. Qualitative results on celebA

Finally, on the celebA face dataset, we adapt the simple architecture of (Miyato et al., 2018) to the increased resolution by adding a layer to both networks. For optimization we use Adam with β1 = 0, β2 = 0.9, α = 1e−4, and ndisc = 1. Fig. 5 displays BGAN samples with pure batches, and BGAN and M-BGAN samples with γ = 0.5. The visual quality of the samples is reasonable; we believe that an improvement is visible from pure batches to M-BGAN.

Figure 5. Samples obtained after 66000 iterations on the celebA dataset. From left to right: (a) standard GAN; (b) single batch discriminator, no batch smoothing; (c) single batch discriminator, batch smoothing γ = 0.5; (d) multiple batch discriminators, batch smoothing γ = 0.5.

6. Conclusion

We introduced a method to feed batches of samples to the discriminator of a GAN in a principled way, based on two observations: first, feeding all-fake or all-genuine batches to a discriminator makes its task too easy; second, a simple architectural trick makes it possible to provably recover all functions of the batch as an unordered set. Experimentally, this provides a new, alternative method to reduce mode dropping and reach good quantitative scores in GAN training.

Acknowledgments

This work has been partially supported by the grant ANR-16-CE23-0006 “Deep in France” and LabEx PERSYVAL-Lab (ANR-11-LABX-0025-01).

References

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 214–223, 2017. URL http://proceedings.mlr.press/v70/arjovsky17a.html.

Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, Dec 1989. ISSN 1435-568X. doi: 10.1007/BF02551274. URL https://doi.org/10.1007/BF02551274.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. In ICLR, 2017.

Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906, 2015.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. CoRR, abs/1704.00028, 2017. URL http://arxiv.org/abs/1704.00028.

Guttenberg, N., Virgo, N., Witkowski, O., Aoki, H., and Kanai, R. Permutation-equivariant neural networks applied to dynamics prediction. CoRR, abs/1612.04530, 2016. URL http://arxiv.org/abs/1612.04530.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a Nash equilibrium. CoRR, abs/1706.08500, 2017. URL http://arxiv.org/abs/1706.08500.

Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.

Kingma, D. and Welling, M. Auto-encoding variational Bayes. In ICLR, 2014.

Li, Y., Swersky, K., and Zemel, R. S. Generative moment matching networks. CoRR, abs/1502.02761, 2015. URL http://arxiv.org/abs/1502.02761.
McGregor, S. Neural network processing for multiset data. In Proceedings of the 17th International Conference on Artificial Neural Networks, ICANN'07, pp. 460–470, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 3-540-74689-7, 978-3-540-74689-8. URL http://dl.acm.org/citation.cfm?id=1776814.1776866.

McGregor, S. Further results in multiset processing with neural networks. Neural Networks, 21(6):830–837, 2008. doi: 10.1016/j.neunet.2008.06.020. URL https://doi.org/10.1016/j.neunet.2008.06.020.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1QRgziT-.

Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279, 2016.

Qi, C. R., Su, H., Mo, K., and Guibas, L. J. PointNet: Deep learning on point sets for 3D classification and segmentation. CoRR, abs/1612.00593, 2016. URL http://arxiv.org/abs/1612.00593.

Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Rezende, D., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In NIPS, 2016.

Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585, 2015.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 6000–6010. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Póczos, B., Salakhutdinov, R., and Smola, A. J. Deep sets. CoRR, abs/1703.06114, 2017. URL http://arxiv.org/abs/1703.06114.

Zhang, H., Cissé, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. CoRR, abs/1710.09412, 2017. URL http://arxiv.org/abs/1710.09412.