Differentially Private Diffusion Models
Generate Useful Synthetic Images
Sahra Ghalebikesabi1,+ , Leonard Berrada2 , Sven Gowal2 , Ira Ktena2 , Robert Stanforth2 , Jamie Hayes2 ,
Soham De2 , Samuel L. Smith2 , Olivia Wiles2 and Borja Balle2

arXiv:2302.13861v1 [cs.LG] 27 Feb 2023

1 University of Oxford, 2 DeepMind, + Work done at DeepMind.

The ability to generate privacy-preserving synthetic versions of sensitive image datasets could unlock
numerous ML applications currently constrained by data availability. Due to their astonishing image
generation quality, diffusion models are a prime candidate for generating high-quality synthetic data.
However, recent studies have found that, by default, the outputs of some diffusion models do not preserve training data privacy. By privately fine-tuning ImageNet pre-trained diffusion models with more
than 80M parameters, we obtain SOTA results on CIFAR-10 and Camelyon17 in terms of both FID
and the accuracy of downstream classifiers trained on synthetic data. We decrease the SOTA FID on
CIFAR-10 from 26.8 to 9.8, and increase the accuracy from 51.0% to 88.0%. On synthetic data from
Camelyon17, we achieve a downstream accuracy of 91.1% which is close to the SOTA of 96.5% when
training on the real data. We leverage the ability of generative models to create infinite amounts of
data to maximise the downstream prediction performance, and further show how to use synthetic data
for hyperparameter tuning. Our results demonstrate that diffusion models fine-tuned with differential
privacy can produce useful and provably private synthetic data, even in applications with significant
distribution shift between the pre-training and fine-tuning distributions.

1. Introduction
Delivering impactful ML-based solutions for real-world applications in domains like health care and recommendation systems requires access to sensitive personal data that cannot be readily used or shared without
risk of introducing ethical and legal implications. Replacing real sensitive data with private synthetic data
following the same distribution is a clear pathway to mitigating these concerns (Chen et al., 2021b; Dankar
and Ibrahim, 2021; Patki et al., 2016). However, despite their theoretical appeal, general-purpose methods
for generating useful and provably private synthetic data remain a subject of active research (Dockhorn et al.,
2022; McKenna et al., 2022; Torfi et al., 2022). The central challenge in this line of work is how to obtain truly
privacy-preserving synthetic data free of the common pitfalls faced by classical anonymization approaches
(Stadler et al., 2021), while at the same time ensuring the resulting datasets remain useful for a wide variety
of downstream tasks, including statistical and exploratory analysis as well as machine learning model selection,
training and testing.
It is tempting to obtain synthetic data by training and then sampling from well-known generative models
like variational auto-encoders (Kingma and Welling, 2013), generative adversarial nets (Goodfellow et al.,
2020), and denoising diffusion probabilistic models (Ho et al., 2020; Song and Ermon, 2019). Unfortunately, it
is well-known that out-of-the-box generative models can potentially memorise and regenerate their training
data points1 and, thus, reveal private information. This holds for variational autoencoders (Hilprecht et al.,
2019), generative adversarial nets (Hayes et al., 2017), and also diffusion models (Carlini et al., 2023; Duan
et al., 2023; Hu and Pang, 2023; Somepalli et al., 2022). In particular, diffusion models have recently gained a
lot of attention, with pre-trained models made available online (Dhariwal and Nichol, 2021; Rombach et al.,
2022), and being fine-tuned on applications involving potentially sensitive data such as chest X-rays (Ali et al.,
2022; Chambon et al., 2022a,b) and brain MRIs (Pinaya et al., 2022; Rouzrokh et al., 2022).

1 A model does not contain its training data, but rather has “memorised” training data when the model is able to use the rules and attributes it has learned about the training data to generate elements of that training data.

Corresponding author(s): sahra.ghalebikesabi@univ.ox.ac.uk, {lberrada | bballe}@deepmind.com
© 2023 DeepMind. All rights reserved

[Figure 1: image grid with panels for Camelyon17 and CIFAR-10, each showing synthetic samples and real samples.]
Figure 1 | DP diffusion models are capable of producing high-quality images. More images can be found in
Figures 5, 6, 7.

Mitigating the privacy loss from sharing synthetic data produced by generative models trained on sensitive
data is not straightforward. Differential privacy (DP) (Dwork et al., 2006) has emerged as the gold standard
privacy mitigation when training ML models, and its application to generative models would provide guarantees
on the information the model (and synthetic data sampled from it) can leak about individual training examples.
Yet, scaling up DP training methods to modern large-scale models remains a significant challenge due to the
slowdown incurred by DP-SGD (the standard workhorse of DP for deep learning) (Wang et al., 2017) and
the stark utility degradation often observed when training large models from scratch with DP (Stadler and
Troncoso, 2022; Stadler et al., 2021; Zhang et al., 2022). Most previous works on DP generative models worked
around these issues by focusing on small models, low-complexity data (Harder et al., 2021; Torkzadehmahani
et al., 2019; Xie et al., 2018) or using non-standard models (Harder et al., 2022). However, for DP applications
to image classification it is known that using models pre-trained on public data is a method for attaining good
utility which is compatible with large-scale models and complex datasets (Bu et al., 2022a,b; Cattan et al.,
2022; De et al., 2022; Tramèr et al., 2022; Xu et al., 2022).
Contributions. In this paper we demonstrate how to accurately train standard diffusion models with differential privacy. Despite the inherent difficulty of this task, we propose a simple methodology that allows
us to generate high-quality synthetic image datasets that are useful for a variety of important downstream
tasks. In particular, we privately train denoising diffusion probabilistic models (Ho et al., 2020) with more than
80M parameters on CIFAR-10 and Camelyon17 (Koh et al., 2021), and evaluate the usefulness of synthetic
images for downstream model training and evaluation. Crucially, we show that by pre-training on publicly
available data (i.e. ImageNet), it is possible to considerably outperform the results of extremely recent work on
a similar topic (Dockhorn et al., 2022). With this method, we are able to accurately train models 45x larger
than Dockhorn et al. (2022) and to achieve a high utility on datasets that are significantly more challenging
(e.g. CIFAR-10 and a medical dataset – instead of MNIST). Please refer to Table 2 for a detailed comparison of
our works. Our contributions can be summarized as follows:
• We demonstrate that diffusion models can be trained with differential privacy to sufficient quality that
we can create accurate classifiers based on the synthesized data only. To do so, we leverage pre-training,
and we demonstrate large state-of-the-art improvements even when there exists a significant distribution
shift between the pre-training and the fine-tuning data sets.
• We propose simple and practical improvements over existing methods to further boost the performance of
the model. Namely, we employ both image and timestep augmentations when using augmentation multiplicity, and we bias the diffusion timestep sampling so as to encourage learning of the most challenging
phase of the diffusion process.
• With this approach, we fine-tune a differentially private diffusion model with more than 80 million
parameters on CIFAR-10, and beat the previous state-of-the-art by more than 50%, decreasing the Fréchet
Inception Distance (FID) from 26.8 to 9.8. Furthermore, we privately fine-tune the same model
on histopathological scans of lymph node tissue available in the Camelyon17 dataset and show that a
classifier trained on synthetic data produced by this model achieves 91.1% accuracy (the highest accuracy
reported on the WILDS leaderboard (Koh et al., 2021) is 96.5% for a non-private model trained on real
data).
• We demonstrate that the accuracy of downstream classifiers can be further improved to a significant
extent by leveraging larger synthetic datasets and ensembling, which comes at no additional privacy cost.
Finally, we show that hyperparameter tuning downstream classifiers on the synthetic data reveals trends
that are also reflected when tuning on the private data set directly.
Paper outline. We start by comparing to related work in section 2, before we provide a brief introduction to
diffusion models, differential privacy, and DP-SGD in section 3. In section 4, we describe effective strategies
to fine-tune DP diffusion models, and then present our results on CIFAR-10 and Camelyon17 in section 5. In
section 6, we assess the utility of synthetic data for model selection.

2. Related Work
Differentially private synthetic image generation. DP image generation is an active area of research (Chen
et al., 2022b; Croft et al., 2022; Fan, 2020). Most efforts have focused on applying a differentially private
stochastic gradient procedure on popular generative models, i.e. generative adversarial networks (Augenstein
et al., 2019; Chen et al., 2020, 2021a; Jordon et al., 2018; Liu et al., 2019; Torkzadehmahani et al., 2019;
Xie et al., 2018; Xu et al., 2019; Yoon et al., 2020), or variational autoencoders (Jiang et al., 2022; Pfitzner
and Arnrich, 2022). Only one other work has so far analysed the application of differentially private gradient
descent on diffusion models (Dockhorn et al., 2022) which we contrast against in Table 2. Others have instead
proposed custom architectures (Cao et al., 2021; Chen et al., 2022a; Harder et al., 2021, 2022; Wang et al.,
2021). Harder et al. (2022), for instance, pre-train a perceptual feature extractor using public data, then
privatize the mean of the feature embeddings of the sensitive data records, and use the privatised mean to
train a generative model.
Limitations of previous work. DP image generation methods based on custom training pipelines and architectures
that are not used outside of the DP literature do not profit from the constant research progress on public image
generation. Other works that instead build upon popular public generative approaches have been shown to not
be differentially private despite such claims. This could be due to either faulty implementations or faulty proofs. See
Stadler et al. (2021) for successful privacy attacks on DP GANs, or Appendix B of Dockhorn et al. (2022) for an
illustration of why DPGEN (Chen et al., 2022b) does not actually satisfy DP guarantees.
Limited success on natural images. DP synthesizers have found applications on tabular electronic healthcare
records (Fang et al., 2022; Torfi et al., 2022; Yan et al., 2022; Zhang et al., 2021), mobility trajectories (Alatrista-Salas et al., 2022) and network traffic data (Fan and Pokkunuru, 2021). In the space of image generation,
positive results have only been reported on MNIST, FashionMNIST and CelebA (downsampled to 32 × 32) (Bie
et al., 2022; Harder et al., 2021; Liew et al., 2021; Wang et al., 2021). These datasets are relatively easy to
learn thanks to plain backgrounds, class separability, and repetitive image features within and even across
classes. Meanwhile, CIFAR-10 has been established as a considerably harder generation task than MNIST,
FashionMNIST or CelebA (Radiuk, 2017). The images are not only higher dimensional than MNIST and
FashionMNIST (32 × 32 × 3 compared to 28 × 28 feature dimensions), but the dataset has wider diversity
and complexity. This is reflected by more complex features and textures, such as lighting conditions, object
orientations, and complex backgrounds (Radiuk, 2017). Moreover, MNIST and FashionMNIST are considerably
lower dimensional than CIFAR-10 and Camelyon17 (28 × 28 vs 32 × 32 × 3 features), and CelebA is downsampled
to the same feature dimensionality as CIFAR-10 but has more than 3 times as many samples as CIFAR-10 (162k
vs 50k), which considerably reduces the information loss introduced by DP training. As far as we know, only
two other concurrent works have attempted DP image generation on CIFAR-10. While Dockhorn et al. (2022)
achieve a FID of only 97.7 by training a DP diffusion model from scratch, Harder et al. (2022) used pre-training
on ImageNet and achieved the SOTA with a FID of 26.8, and a downstream accuracy of only 51%.


Limited targeted evaluation. The evaluation carried out on DP synthetic datasets is often not sufficiently
targeted towards their utility in practice. The performance of DP image synthesizers is commonly evaluated on
two types of metrics: 1) perceptual difference measures between the synthetic and real data distribution, such
as FID, and 2) predictive performance of classifiers that are trained on a synthetic dataset of the size of the
original training dataset and tested on the real test data. The former metric is known to be easy to manipulate
with factors not related to the image quality, such as the number of samples, or the version number of the
inception network (Kynkäänniemi et al., 2022). At the same time, it is not obvious how to jointly incorporate
the information from both metrics given that they may individually imply different conclusions. Dockhorn et al.
(2022), for instance, identify different diffusion model samplers to minimise either the FID or the downstream
test loss. Further, recent research has identified use cases where synthetic data is not able to capture important
first or second order statistics despite reportedly scoring highly on those metrics (Stadler and Troncoso, 2022;
Stadler et al., 2021). In this paper, we set out to provide examples of additional downstream evaluations.

3. Background
3.1. Denoising Diffusion Probabilistic Models
Denoising diffusion models (Ho et al., 2020; Sohl-Dickstein et al., 2015; Song and Ermon, 2019) are a class of
likelihood-based generative models that have recently established new SOTA results across diverse computer
vision problems (Dhariwal and Nichol, 2021; Lugmayr et al., 2022; Rombach et al., 2022). Given a forward
Markov chain that sequentially perturbs a real data sample 𝑥0 to obtain a pure noise distribution 𝑥𝑇 , diffusion
models parameterize the transition kernels of a backward chain by deep neural networks to denoise 𝑥𝑇 back to
𝑥0 .
Given 𝑥0 ∼ 𝑞 ( 𝑥0 ), one defines a forward process that generates gradually noisier samples 𝑥1 , ..., 𝑥𝑇 using a
transition kernel 𝑞 ( 𝑥𝑡 | 𝑥𝑡−1 ) typically chosen as a Gaussian perturbation. At inference time, 𝑥𝑇 , an observation
sampled from a noise distribution, is then gradually denoised 𝑥𝑇 −1 , 𝑥𝑇 −2 , ... until the final sample 𝑥0 is reached.
Ho et al. (2020) parameterize this process by a function 𝜖𝜃 ( 𝑥𝑡 , 𝑡 ) which predicts the noise component 𝜖 of a
noisy sample 𝑥𝑡 given timestep 𝑡 . They then propose a simplified training objective to learn 𝜃, namely
$$L(\theta) = \mathbb{E}_{t, x_0, \epsilon}\left[ \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2 \right], \tag{1}$$
with $t \sim \mathcal{U}[0, T]$, where $T$ is the pre-specified maximum timestep, and $\mathcal{U}[a, b]$ is the discrete uniform distribution
bounded by $a$ and $b$. The noisy sample $x_t$ is computed by $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, where $\bar{\alpha}_t$ is defined such that $x_t$
follows the pre-specified forward process. Most importantly, $\bar{\alpha}_t$ is a decreasing function of timestep $t$. Thus, the
larger $t$ is, the noisier $x_t$ will be.
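The forward noising step and one Monte Carlo term of Equation 1 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the endpoints of the linear schedule (`beta_start`, `beta_end`) are common defaults and are not given in the text, the timestep index runs over 0, ..., T-1 for simplicity, and all function names are hypothetical.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule; the endpoint values are assumed, not taken from the text."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s <= t} (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def noise_sample(x0, alpha_bar_t, rng):
    """Return (x_t, eps) with x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * xi + b * ei for xi, ei in zip(x0, eps)], eps

def simple_loss(eps, eps_pred):
    """Squared L2 error between true and predicted noise (one sample of Eq. 1)."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

T = 1000
abar = alpha_bars(linear_betas(T))
rng = random.Random(0)
x0 = [0.5, -0.25, 1.0]      # toy "image" with three pixels
t = rng.randrange(T)        # uniform timestep index in {0, ..., T-1}
xt, eps = noise_sample(x0, abar[t], rng)
# Loss of a trivial "model" that always predicts zero noise.
loss_if_predicting_zero = simple_loss(eps, [0.0] * len(eps))
```

Because `abar` decreases monotonically in this sketch, the weight on `x0` shrinks and the weight on the noise grows as `t` increases, which matches the statement that larger timesteps give noisier samples.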
3.2. Differential Privacy
Differential Privacy (DP) is a formal privacy notion that, in informal terms, bounds how much a single observation
can change the output distribution of a randomised algorithm. More formally:
Definition 3.1 (Differential Privacy (Dwork et al., 2006)). Let $A$ be a randomized algorithm, and let $\varepsilon > 0$,
$\delta \in [0, 1]$. We say that $A$ is $(\varepsilon, \delta)$-DP if for any two neighboring datasets $D, D'$ differing by a single element, we
have that
$$\forall S \subset \mathcal{S}, \quad \mathbb{P}[A(D) \in S] \le \exp(\varepsilon)\, \mathbb{P}[A(D') \in S] + \delta,$$
where $\mathcal{S}$ denotes the support of $A$.
The privacy guarantee is thus controlled by two parameters, 𝜀 and 𝛿. While 𝜀 bounds the log-likelihood
ratio of any particular output that can be obtained when running the algorithm on two datasets differing in a
single data point, 𝛿 is a small probability which bounds the occurrence of infrequent outputs that violate this
bound (typically 1/𝑛, where 𝑛 is the number of training examples). The smaller these two parameters are, the
stronger the privacy guarantee. We therefore refer to the tuple ( 𝜀, 𝛿) as the privacy budget.
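To make the definition concrete, the inequality can be checked by brute force on a toy mechanism. The example below uses randomized response on a single bit, which is not a mechanism discussed in this paper and is chosen only because its output space is small enough to enumerate every subset S. The function names are hypothetical.

```python
import math
from itertools import chain, combinations

def randomized_response_dist(bit, p_truth):
    """Output distribution of a mechanism that reports `bit` with probability p_truth, else flips it."""
    return {bit: p_truth, 1 - bit: 1.0 - p_truth}

def satisfies_dp(dist_d, dist_d_prime, eps, delta):
    """Check P[A(D) in S] <= exp(eps) * P[A(D') in S] + delta for every subset S, in both directions."""
    outcomes = sorted(set(dist_d) | set(dist_d_prime))
    subsets = chain.from_iterable(combinations(outcomes, r) for r in range(len(outcomes) + 1))
    tol = 1e-12  # guards against floating-point rounding at the boundary
    for s in subsets:
        p = sum(dist_d.get(o, 0.0) for o in s)
        q = sum(dist_d_prime.get(o, 0.0) for o in s)
        if p > math.exp(eps) * q + delta + tol or q > math.exp(eps) * p + delta + tol:
            return False
    return True

p_truth = 0.75
eps_exact = math.log(p_truth / (1.0 - p_truth))  # ln(3) for p_truth = 0.75
# Neighboring "datasets": a single record whose bit is 0 or 1.
d = randomized_response_dist(0, p_truth)
d_prime = randomized_response_dist(1, p_truth)
```

With these values, `satisfies_dp(d, d_prime, eps_exact, 0.0)` holds, while a smaller 𝜀 such as 0.5 only passes once a sufficiently large 𝛿 is allowed, which illustrates how the two parameters trade off.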


3.3. Differentially Private Stochastic Gradient Descent
Neural networks are typically privatised with Differentially Private Stochastic Gradient Descent (DP-SGD)
(Abadi et al., 2016), or alternatively a different DP optimizer like DP-Adam (McMahan et al., 2018). At each
training iteration, the mini-batch gradient is clipped per example, and Gaussian noise is added to it. More
formally, let $l_i(w) := \mathcal{L}(w, x_i, y_i)$ denote the learning objective given model parameters $w \in \mathbb{R}^p$, input features
$x_i$ and label $y_i$. Let $\mathrm{clip}_C(v) : v \in \mathbb{R}^p \mapsto \min\{1, C / \lVert v \rVert_2\} \cdot v \in \mathbb{R}^p$ denote the clipping function which re-scales
its input to have a maximal $\ell_2$ norm of $C$. For a minibatch $\mathcal{B}$ with $|\mathcal{B}| = B$ samples, the "privatised" minibatch
gradient $\hat{g}$ takes on the form
$$\hat{g} = \frac{1}{B} \sum_{i \in \mathcal{B}} \mathrm{clip}_C(\nabla l_i(w)) + \frac{\sigma C}{B}\, \xi,$$
with $\xi \sim \mathcal{N}(0, I_p)$ and $I_p \in \mathbb{R}^{p \times p}$ being the identity matrix. In practice, the choice of noise multiplier $\sigma > 0$,
batch-size 𝐵 and maximum number of training iterations are constrained by the predetermined privacy budget
( 𝜀, 𝛿). Crucially, the choice of hyper-parameters can have a large impact on the accuracy of the resulting model,
and overall DP-SGD makes it challenging to accurately train deep neural networks. On CIFAR-10 for example,
the highest reported test accuracy for a DP model trained with 𝜖 = 8 was 63.4% in 2021 (Yu et al., 2021).
De et al. (2022) improved performance on image classification tasks and in particular obtained nearly 95%
test accuracy for 𝜖 = 1 on CIFAR-10, using notably 1) pre-training, 2) large batch sizes, and 3) augmentation
multiplicity.
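The privatised minibatch gradient defined above can be sketched directly from its formula. The snippet is a minimal pure-Python illustration on toy two-dimensional gradients, not the training code used for the experiments; the helper names are hypothetical.

```python
import math
import random

def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def clip(v, C):
    """Rescale v so that its L2 norm is at most C: min(1, C / ||v||) * v."""
    norm = l2_norm(v)
    scale = 1.0 if norm == 0.0 else min(1.0, C / norm)
    return [scale * x for x in v]

def privatised_gradient(per_example_grads, C, sigma, rng):
    """(1/B) * sum_i clip_C(g_i) + (sigma * C / B) * xi, with xi ~ N(0, I_p)."""
    B = len(per_example_grads)
    p = len(per_example_grads[0])
    total = [0.0] * p
    for g in per_example_grads:
        for j, x in enumerate(clip(g, C)):
            total[j] += x
    # Noise is added once to the summed clipped gradients, then everything is divided by B.
    return [(total[j] + sigma * C * rng.gauss(0.0, 1.0)) / B for j in range(p)]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.1, 0.0], [-6.0, 8.0]]  # toy per-example gradients
g_hat = privatised_gradient(grads, C=1.0, sigma=1.0, rng=rng)
g_clipped_mean = privatised_gradient(grads, C=1.0, sigma=0.0, rng=rng)  # noise switched off
```

Setting `sigma=0.0` isolates the effect of clipping: each example contributes a vector of norm at most `C`, so the averaged gradient also has norm at most `C`.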
As part of this paper, we analyze to what extent these performance gains transfer from DP image classification
to DP image generation. Diffusion models are inherently different model architectures that exhibit different
training dynamics than standard classifiers, which introduces additional difficulties in adapting DP
training. First, diffusion models are significantly more computationally expensive to train. Indeed, they operate
on higher dimensional representations than image classifiers, so that they can output full images instead of a
single label. This makes each update step much more computationally expensive for diffusion models than
for classification ones. In addition, diffusion models also need more epochs to converge in public settings
compared to classifiers. For example, for a batch size of 128 samples Ho et al. (2020) train a diffusion model
for 800k steps on CIFAR-10, while Zagoruyko and Komodakis (2016) train a Wide ResNet for classification in
less than 80k steps. This high computational cost of training a diffusion model makes it difficult to tune the
hyperparameters, which is known to be both challenging and crucial for good performance (De et al., 2022).
Second, and related to sample inefficiency, the noise inherent to the training of diffusion models introduces
an additional variance that compounds with the one injected by DP-SGD, which makes training all the more
challenging. Thus overall, it is currently not obvious how to efficiently and accurately train diffusion models
with differential privacy.

4. Improvements towards Fine-Tuning Diffusion Models with Privacy
Recommendations from previous work. De et al. (2022) identify pre-training, large batch sizes, and
augmentation multiplicity as effective strategies in DP image classification. We adopted their recommendations
in the training of DP diffusion models, and confirmed the effectiveness of their strategies for the task of DP
image generation. In contrast to the work of Dockhorn et al. (2022), where the batch-size is only scaled up to
2,048 samples, we implemented virtual batching, which helps us to scale up to 32,768 samples per batch.
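The text does not detail how virtual batching is implemented. One common reading, assumed here, is gradient accumulation: clipped per-example gradients are summed over several micro-batches that fit in memory, and noise is added only once per logical batch. The sketch below uses hypothetical helper names and toy gradients.

```python
import math
import random

def clip(v, C):
    """Rescale v so that its L2 norm is at most C."""
    norm = math.sqrt(sum(x * x for x in v))
    scale = 1.0 if norm == 0.0 else min(1.0, C / norm)
    return [scale * x for x in v]

def accumulate_clipped(micro_batch_grads, C, running_sum):
    """Add the clipped per-example gradients of one micro-batch to the running sum (in place)."""
    for g in micro_batch_grads:
        for j, x in enumerate(clip(g, C)):
            running_sum[j] += x
    return running_sum

def virtual_batch_step(micro_batches, C, sigma, rng):
    """Process one large logical batch in chunks; add noise once, then divide by the full batch size."""
    p = len(micro_batches[0][0])
    running_sum = [0.0] * p
    B = 0
    for mb in micro_batches:
        accumulate_clipped(mb, C, running_sum)
        B += len(mb)
    return [(running_sum[j] + sigma * C * rng.gauss(0.0, 1.0)) / B for j in range(p)]

grads = [[3.0, 4.0], [0.1, 0.0], [-6.0, 8.0], [0.0, 2.0]]  # toy per-example gradients
# With the noise switched off, chunking must not change the result.
full = virtual_batch_step([grads], C=1.0, sigma=0.0, rng=random.Random(0))
chunked = virtual_batch_step([grads[:1], grads[1:]], C=1.0, sigma=0.0, rng=random.Random(0))
```

In this sketch `full` and `chunked` coincide, which is the property that lets the logical batch size grow beyond what fits in memory at once.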
Pre-training. Pre-training is especially integral to generating realistic image datasets, even if there is a
considerable distribution shift between the pre-training and fine-tuning distributions. Unless otherwise specified,
we thus pre-train all of our models on ImageNet32 (Chrabaszcz et al., 2017). ImageNet has been a popular
pre-training dataset used when little data is available (Raghu et al., 2019), to accelerate convergence times
(Liu et al., 2020), or to increase robustness (Hendrycks et al., 2019).
Augmentation multiplicity with timesteps. De et al. (2022) observe that data augmentation, as it is commonly implemented in public training, has a detrimental effect on the accuracy in DP image classification.
Instead they propose the use of augmentation multiplicity (Fort et al., 2021). In more detail, they augment
each unique training observation within a batch, e.g. with horizontal flips or random crops, and average
the gradients of the augmentations before clipping them. Similarly to Dockhorn et al. (2022), we extend
augmentation multiplicity to also sample multiple timesteps 𝑡 for each mini-batch observation in the estimation
of Equation 1, and average the corresponding gradients before clipping. In contrast to Dockhorn et al. (2022)
where only timestep multiplicity is considered, we combine it with traditional augmentation methods, namely
random crops and flipping. As a result, while Dockhorn et al. (2022) find that the FID plateaus for around 32
augmentations per image, we see increasing benefits the more augmentation samples are used (see Figure 9).
For computational reasons, we limit augmentation multiplicity to 128 samples.
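The combination of image and timestep augmentations described above can be sketched as follows. For one training example, the gradient is averaged over K draws of an (augmentation, timestep) pair and clipped only once afterwards. The gradient function `toy_grad` and the flip on a 1-D list are hypothetical stand-ins for the real loss gradient and image augmentations.

```python
import math
import random

def clip(v, C):
    """Rescale v so that its L2 norm is at most C."""
    norm = math.sqrt(sum(x * x for x in v))
    scale = 1.0 if norm == 0.0 else min(1.0, C / norm)
    return [scale * x for x in v]

def random_flip(x, rng):
    """Toy augmentation: reverse the 1-D 'image' with probability 0.5 (stand-in for a horizontal flip)."""
    return x[::-1] if rng.random() < 0.5 else list(x)

def toy_grad(x, t):
    """Hypothetical stand-in for the gradient of the Equation 1 loss with respect to the parameters."""
    return [xi * (1.0 + t / 1000.0) for xi in x]

def augmult_gradient(x, grad_fn, K, T, C, rng):
    """Average grad_fn over K (augmentation, timestep) draws for ONE example, then clip once."""
    p = len(grad_fn(x, 0))
    avg = [0.0] * p
    for _ in range(K):
        x_aug = random_flip(x, rng)
        t = rng.randrange(T)
        g = grad_fn(x_aug, t)
        for j in range(p):
            avg[j] += g[j] / K
    return clip(avg, C)

g = augmult_gradient([1.0, 2.0, 3.0], toy_grad, K=128, T=1000, C=1e-3, rng=random.Random(0))
```

Because the average is taken before clipping, each example still contributes a vector of norm at most `C` regardless of `K`, so increasing the number of augmentations reduces gradient variance without changing the per-example sensitivity.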
Modified timestep sampling. The training objective for diffusion models in Equation 1 samples the timestep
𝑡 uniformly from [0, 𝑇 ] because the model must learn to de-noise images at every noise level. However, it is
not obvious that uniform sampling is the best strategy, especially in the DP setting where the number
of model updates is limited by the privacy budget. In particular, in the fine-tuning scenario, a pre-trained
model has already learned that at small timesteps the task is to remove small amounts of Gaussian noise from
a natural-looking image, and at large timesteps the task is to project a completely noisy sample closer to
the manifold of natural-looking images. The model behavior at small and large timesteps is thus more likely
to transfer to different image distributions without further tuning. In contrast, for medium timesteps the
model must be aware of the data distribution at hand in order to compose a natural-looking image. A similar
observation has been recently made for membership inference attacks (Carlini et al., 2023; Hu and Pang,
2023): the adversary has been shown to more likely succeed in membership inference when it uses a diffusion
model to denoise images with medium amounts of noise compared to high- or low-variance noised images.
This motivates us to explore modifications of the training objective where the timestep sampling distribution is
not uniform, and instead focuses on training the regimes that contribute more to modelling the key content of
an image.
Motivated by this reasoning, we considered replacing the uniform timestep distribution with a mixture of
uniform distributions $t \sim \sum_{i=1}^{K} w_i\, \mathcal{U}[l_i, u_i]$ where $\sum_{i=1}^{K} w_i = 1$, $0 \le l_1$, $u_K \le T$ and $u_{k-1} \le l_k$ for $k \in \{2, ..., K\}$. On
CIFAR-10, we found the best performance for a distribution with probability mass focused within [30, 600]
for 𝑇 = 1, 000 where timesteps outside this interval are assigned a lower probability than timesteps within
this interval. We assume this is due to ImageNet-pre-trained diffusion models being able to denoise other
(potentially unseen) natural images if only a small amount of noise is added. Training with privacy on small
timesteps can then decrease performance, both because it takes training budget away from the timesteps
that are harder to learn and because of the noisy optimization procedure. Even when training from scratch
on MNIST, we observe that focusing the limited training capacity on the harder-to-learn moderate time steps
increases test performance.
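A mixture of uniform timestep distributions can be sampled in two stages: pick a component according to its weight, then draw a timestep uniformly inside it. The sketch below uses the weights and intervals reported for the MNIST experiment in Section 5.3. Treating the intervals as closed, so that the shared endpoints 200 and 800 belong to two components, is an assumption, as is the function name.

```python
import random

def sample_timestep(components, rng):
    """Draw t from a mixture of discrete uniforms; components = [(weight, low, high), ...]."""
    weights = [w for w, _, _ in components]
    _, low, high = rng.choices(components, weights=weights, k=1)[0]
    return rng.randint(low, high)  # inclusive on both ends

# Mixture from Section 5.3: most of the mass sits on the medium timesteps.
mnist_components = [(0.05, 0, 200), (0.9, 200, 800), (0.05, 800, 1000)]

rng = random.Random(0)
samples = [sample_timestep(mnist_components, rng) for _ in range(20000)]
fraction_medium = sum(200 <= t <= 800 for t in samples) / len(samples)
```

In this configuration roughly nine out of ten sampled timesteps fall in the medium range, so most gradient updates under the limited privacy budget are spent on the phase argued above to be hardest to transfer.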

5. Empirical Results on Differentially Private Image Generation and their Evaluation
5.1. Current Evaluation Framework for DP Image Generators
The FID (Heusel et al., 2017) has been the most widely used metric for assessing the similarity between
artificially generated and real images (Dhariwal and Nichol, 2021), and has thus been widely applied in the
DP image generation literature as the first point of comparison (Dockhorn et al., 2022). While the FID is the
most popular metric, numerous other metrics have been proposed, including the Inception Score (Salimans
et al., 2016), the Kernel Inception Distance (Bińkowski et al., 2018), and Precision and Recall (Sajjadi et al.,
2018). For the calculation of these metrics, the synthetic and real images are typically fed through an inception
network that has been trained for image classification on ImageNet, and a distance between the two data
distributions is computed based on the feature embeddings of the final layer.
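For reference, the FID is usually written as the Fréchet distance between two Gaussians fitted to the feature embeddings of the real and synthetic images. The notation below is ours rather than the paper's: $\mu_r, \Sigma_r$ and $\mu_s, \Sigma_s$ denote the empirical mean and covariance of the embeddings of the real and synthetic samples respectively.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_s \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_s - 2 \left( \Sigma_r \Sigma_s \right)^{1/2} \right)
```

Since both moments are estimated from finite samples of embeddings, the value depends on the number of images used and on the embedding network, which is relevant to the caveats discussed next.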
Even though these metrics have been designed to correlate with the visual quality of the images, they can
be misleading since they vary strongly with factors unrelated to image quality, such as the number of observations
or the version number of the inception network (Kynkäänniemi et al., 2022). They also reduce complex image
data distributions to single values that might not capture what practitioners are interested in when dealing
with DP synthetic data. Most importantly, they capture nuances in image quality less effectively the
further the observed data distribution is from ImageNet. For example, CIFAR-10 images have to be
upscaled significantly to be fed through the inception network, which will then capture undesirable artifacts
introduced by upsampling. MNIST images, in turn, are digit-based and thus exhibit other variations than
natural images, further diminishing the effectiveness of the evaluation.
An alternative way of comparing DP image generation models is by looking at the test performance of a
downstream classifier trained on a synthetic dataset of the same size as the real dataset (Dockhorn et al., 2022;
Xie et al., 2018), and tested on the real dataset. We argue that the way DP generative models are currently
evaluated downstream, i.e. by evaluating a single model or metric on a limited data set, needs to be revisited.
Instead, we propose to explore how synthetic data can be most effectively used for prediction model training
and hyperparameter tuning.
This line of thinking motivates the proposal of an evaluation framework that focuses on how DP generative
models are used by practitioners. As such, our experiments focus on two specific use cases for private synthetic
data: downstream prediction tasks, and model selection. Downstream prediction tasks include classification or
regression models trained on synthetic data and evaluated on real samples. This corresponds to the setting
where a data curator aims to build a production-ready model that achieves the highest possible performance at
test time while preserving the privacy of the training samples. Model selection refers to a use case where the
data curator shares the generative model with a third party that trains a series of models on the synthetic data
with the goal to provide guidance on the model ranking when evaluated on the sensitive real data records. We
hope that with these two experiments we cover the most important downstream tasks and set an example for
future research on the development of DP generative models. After presenting our results within the evaluation
framework that is commonly used in the current literature, we investigate the utility of the DP image diffusion
model trained on CIFAR-10 for the aforementioned tasks in section 6.
5.2. Experimental Setup
We now evaluate the empirical effectiveness of our proposed methods on three image datasets: MNIST, CIFAR-10
and Camelyon17. Please refer to Table 1 for an overview of our main results. For CIFAR-10, we provide
additional experiments to demonstrate the utility of the synthetic data for model selection in section 6. We emphasize
that while these benchmarks may be considered small-scale by modern non-private machine learning standards,
they remain an outstanding challenge for image generation with differential privacy at this time.
We train diffusion models with a U-Net architecture as used by Ho et al. (2020) and Dhariwal and Nichol
(2021). In contrast to Dockhorn et al. (2022), we found that classifier guidance led to a drop in performance.
Unless otherwise specified, all diffusion models are trained with a privatized version of Adam (Kingma and Ba,
2014; McMahan et al., 2018). The clipping norm is set to $10^{-3}$, since we observed that the gradient norm for
diffusion models is usually small, especially in the fine-tuning regime. The privacy budget is set to 𝜖 = 10, as
commonly considered in the DP image generation literature. The same architecture is used for the diffusion
model across all datasets. More specifically, the diffusion is performed over 1,000 timesteps with a linear noise
schedule, and the convolutional architecture employs the following details: a variable width with 2 residual
blocks per resolution, 192 base channels, 1 attention head with 16 channels per head, attention at the 16x16
resolution, channel multipliers of (1,2,2,2), and adaptive group normalization layers for injecting timestep and
class embeddings into residual blocks as introduced by Dhariwal and Nichol (2021). When fine-tuning, the
model is pre-trained on ImageNet32 (Chrabaszcz et al., 2017) for 200,000 iterations. All hyperparameters can
be found in Table 3.

Table 1 | A summary of the best results provided in this paper when training diffusion models with DP-SGD.
We report the test accuracy of classifiers trained on different data sets. The highest reported current SOTA
corresponds to classifiers trained on DP synthetic data, as reported in the literature. Here, [Do22] refers to
Appendix F Rebuttal Discussions in Dockhorn et al. (2022), and [Ha22] to Harder et al. (2022). Our test
accuracy (Ours) denotes the accuracy of a classifier trained on a synthetic dataset that was generated by a DP
diffusion model and is of the same size as the original training data. Note that we also report the model size of
our generative models (Diffusion M. Size). The Non-synth test accuracy corresponds to the test accuracy of a
DP classifier trained on the real dataset, using the techniques introduced by De et al. (2022). ∗ [De22] This
number is taken from De et al. (2022) for 𝜖 = 8.

Dataset    | Image Resolution | Diffusion M. Size | Pre-Training Data | Test Accuracy (%): SOTA | Test Accuracy (%): Ours | Test Accuracy (%): Non-synth | Privacy (𝜖, 𝛿)
MNIST      | 28 × 28          | 4.2M              | –                 | 98.1 [Do22]             | 98.6                    | 99.1                         | (10, 10⁻⁵)
CIFAR-10   | 32 × 32 × 3      | 80.4M             | ImageNet32        | 51.0 [Ha22]             | 88.0                    | 96.6 ∗[De22]                 | (10, 10⁻⁵)
Camelyon17 | 32 × 32 × 3      | 80.4M             | ImageNet32        | -                       | 91.1                    | 90.5                         | (10, 3 · 10⁻⁶)
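
As an illustration of the linear noise schedule over 1,000 timesteps mentioned above, the following is a minimal sketch in plain Python. The endpoint values 1e-4 and 0.02 are an assumption (they are the defaults of Ho et al. (2020)); this section does not state them, and all function names are hypothetical.

```python
import math
import random


def linear_beta_schedule(num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Return linearly spaced noise variances beta_1, ..., beta_T.

    The endpoints are assumed (common DDPM defaults), not taken from the paper.
    """
    step = (beta_end - beta_start) / (num_timesteps - 1)
    return [beta_start + i * step for i in range(num_timesteps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t, the running product of (1 - beta_s) up to each timestep."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def forward_diffuse(x0, t, alpha_bar, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I) for a flattened image x0."""
    signal = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(1.0 - alpha_bar[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * pixel + sigma * eps for pixel, eps in zip(x0, noise)]
    return x_t, noise


betas = linear_beta_schedule()
alpha_bar = cumulative_alphas(betas)

rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(32 * 32 * 3)]  # stand-in for a normalised image
x_t, noise = forward_diffuse(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

Under these assumed endpoints, alpha_bar decreases monotonically towards zero, so late timesteps are dominated by noise while early timesteps remain close to the clean image.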
As a baseline, we also train DP classifiers directly on the sensitive data, using the training pipeline introduced
by De et al. (2022) and additionally tuning the learning rate. Please refer to Table 5 for more
details. It is not surprising that some of these results outperform those of the image generators, as the training of the DP
classifiers is targeted towards maximising downstream performance.
5.3. Training from Scratch (∅ → MNIST)
The MNIST dataset (LeCun, 1998) consists of 60,000 grayscale 28 × 28 training images depicting the 10 digit classes.
Since it is the most commonly used dataset in the DP image generation literature, it is included
here for the sake of completeness.
Method. We use a DP diffusion model of 4.2M parameters without any pre-training, with 64
channels and channel multipliers (1,2,2). The diffusion model is trained for 4,000 iterations at a constant
learning rate of 5 · 10⁻⁴ and a batch size of 4,096, with augmult set to 128 and a noise multiplier of 2.852. The
timesteps are sampled uniformly within [0, 200] with probability 0.05, within [200, 800] with probability
0.9, and within [800, 1000] with probability 0.05. To evaluate the quality of the images, we generate 50,000
samples from the diffusion model. Then we train a WRN-40-4 on these synthetic images (hyperparameters
given in Table 4), and evaluate it on the MNIST test set.
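
The biased timestep sampling described above can be sketched as a two-stage draw: first choose an interval according to its probability, then choose a timestep uniformly inside it. This is a minimal sketch in plain Python; the function name is hypothetical, and treating timesteps as integers in half-open intervals is an assumption, since the text does not specify how interval boundaries are handled.

```python
import random

# (low, high, probability) triples for the MNIST setting described above.
MNIST_INTERVALS = [(0, 200, 0.05), (200, 800, 0.9), (800, 1000, 0.05)]


def sample_timestep(intervals, rng):
    """Pick an interval by its probability, then a timestep uniformly inside it."""
    weights = [probability for _, _, probability in intervals]
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interval probabilities must sum to 1")
    low, high, _ = rng.choices(intervals, weights=weights, k=1)[0]
    # Half-open interval [low, high) so that neighbouring intervals do not overlap
    # (an assumption; the boundary handling is not specified in the text).
    return rng.randrange(low, high)


rng = random.Random(0)
draws = [sample_timestep(MNIST_INTERVALS, rng) for _ in range(100_000)]
middle_fraction = sum(200 <= t < 800 for t in draws) / len(draws)
```

With these weights, roughly 90% of the training signal comes from the middle of the diffusion process, whereas uniform sampling over [0, 1000] would allocate only 60% to that range.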
Results. As reported in Table 1, this yields a state-of-the-art top-1 accuracy of 98.6%, to be compared to the
98.1% accuracy obtained by Dockhorn et al. (2022). Crucially, we find that to obtain this state-of-the-art result,
it is important to bias the timestep sampling of the diffusion model at training time: this allows the model to
get more training signal from the challenging phases of the generation process through diffusion. Without this
biasing, we obtain a classification accuracy of only 98.2%.
5.4. Fine-tuning on a Medical Application (ImageNet32 → Camelyon17)
To show that fine-tuning works even in settings characterised by dataset shift from the pre-training distribution,
we fine-tune a DP diffusion model on a medical dataset. Camelyon17 (Bandi et al., 2018; Koh et al., 2021)
comprises 455,954 histopathological 96 × 96 image patches of lymph node tissue from five different hospitals.
The label signifies whether at least one pixel in the center 32 × 32 pixels has been identified as a tumor cell.
Camelyon17 is also part of the WILDS (Koh et al., 2021) leaderboard as a domain generalization task: The
training dataset contains 302,436 images from three different hospitals whereas the validation and test set
contain respectively 34,904 and 85,054 images from a fourth and fifth hospital. Since every hospital uses a
different staining technique, it is easy to overfit to the hues of the training data with empirical risk minimisation.
At the time of writing, the highest accuracy reported in the leaderboard of official submissions is 92.1% with a
classifier that uses a special augmentation approach (Gao et al., 2022). The SOTA that does not fulfill the formal
submission guidelines achieves up to 96.5% by pre-training on a large web image data set and finetuning only
specific layers of the classification network (Kumar et al., 2022).
Method. First we pre-train an image diffusion model on ImageNet32, before fine-tuning it with (10, 3 · 10⁻⁶)-DP
on the training data, downsampled to 32 × 32, with a batch size of 16,384 for 200 steps. We tuned the
hyperparameters to achieve the lowest FID on the training data, and used the out-of-distribution validation
data to tune the downstream classifiers. The timestep is sampled with probability 0.015 from [0, 30], with
probability 0.785 from [30, 600], and with probability 0.2 from [600, 1000]. Since the diffusion model is pre-trained on
ImageNet, we assume that the data is also available for pre-training the classifier. The pre-trained classifier is
then further fine-tuned on a synthetic dataset of the same size as the original training dataset, which we find
to systematically improve results. In both cases, the classifier is trained on augmented data, where the
augmentations include flipping, rotation and color-jittering.

[Figure 2: plot of FID on CIFAR-10 at several privacy budgets for two methods, Ours and Harder et al., 2022.]
Figure 2 | FID on CIFAR-10 for different privacy budgets. Our performance at 𝜖 = 5 beats the SOTA,
when pre-training on ImageNet, for 𝜖 = ∞. Results for Harder et al. (2022) are taken from their paper.

[Figure 3: plot of test accuracy (%) against the number of synthetic samples, for a single classifier and
for an ensemble of 5 classifiers.]
Figure 3 | Downstream Top-1 accuracy of a CIFAR-10 WRN-40-4 as a function of the number of synthetic
data samples used to train it. The accuracy increases considerably as a function of the dataset size.

Results. We achieve close to state-of-the-art classification performance, with 91.1%, by training only on the
synthetic data, whereas the best DP classifier we trained on the real dataset achieved only 90.5%. The
synthetic dataset is thus not only useful for in-distribution classification: training on synthetic data is also an
effective strategy to combat overfitting to domain artifacts and to generalise in out-of-distribution classification
tasks, as noted elsewhere in the literature (Gheorghiță et al., 2022; Zhou et al., 2020).
5.5. Fine-tuning on Natural Images (ImageNet32 → CIFAR-10)
CIFAR-10 (Krizhevsky et al., 2009) is a natural image dataset of 10 different classes with 50,000 RGB training
images of size 32 × 32.
Method. We use the same pre-trained model as for Camelyon17, that is, an image diffusion model with more
than 80M parameters trained on ImageNet32. We tune the remaining hyperparameters by splitting the training
data into a set of 45,000 images for training, and 5,000 images for assessing the validation performance based
on FID. As for Camelyon17, we found that sampling the timestep with probability 0.015 in [0, 30], with 0.785 in
[30, 600] and with 0.2 in [600, 1000] gives us the lowest FID. More details can be found in Table 3.
Results. We improve the SOTA FID with ImageNet pre-training (Harder et al., 2022) from 26.8 to 9.8, which
is a drop of more than 50%, and increase the downstream accuracy from 51.0% to 72.9% without pre-training
the classifier. With pre-training the classifier on ImageNet32, we can achieve a classification accuracy of 86.6%
with a single WRN. Modifying the timestep distribution led to a reduction in the FID from 11.6 to 9.8. Note that
we achieved the results for different privacy levels by scaling the number of iterations linearly
with 𝜖 and adjusting the noise level to the given privacy budget, keeping all other hyperparameters the same.
As detailed in Figure 2, we obtain state-of-the-art accuracy for a variety of privacy levels. Even for a budget
as small as 𝜖 = 1, the FID obtained with our method is smaller than the current SOTA for 𝜖 = 10. These results
can also be compared with the very recent work by Dockhorn et al. (2022), who report an FID of 97.7 when
training diffusion models with differential privacy on CIFAR-10 without any pre-training. Due to the difficulty
of the task of learning the diffusion model with differential privacy from scratch, the model did not learn
to generate CIFAR-10 like samples, and the generated images do not seem to display any clear CIFAR-10
class instances at all. We believe that such mixed results are a clear motivation for our proposed method of
pre-training on public data, which makes the learning problem significantly more tractable and realistic, and
allows us to obtain useful image generators.
To further confirm that the diffusion model has correctly learned the distribution shift, we trained a ResNet18
model to discriminate images from CIFAR-10 and ImageNet32 achieving a test accuracy of 98.0% on that task.
We then evaluated it on 50,000 synthetically generated images out of which 92% were classified as CIFAR-10
images. This supports our hypothesis that the fine-tuned diffusion model does generate images that are more
similar to CIFAR-10 than to the pre-training data of ImageNet.
5.6. Maximizing Downstream Prediction Performance by Sampling Arbitrarily Many Data Records
(ImageNet32 → CIFAR-10)
Dataset sample size. One benefit of synthetic data generators is their ability to render infinitely many
synthetic images. As such, there is no reason why the comparison of real and synthetic samples should be
limited to predictive models trained on the same number of training samples. We, therefore, investigate whether
the performance of a downstream predictor increases with more training images. In Figure 3, we observe
that the downstream classification accuracy increases consistently as more synthetic training samples
are generated. In particular, we increase the downstream classification accuracy from 72.9% to 86.0% by
sampling 1M instead of 50K images, without pre-training the classifier. We note that this difference is much
more significant on the more challenging dataset of CIFAR-10 than e.g. MNIST, where we find that increasing
the number of samples offers virtually no benefit in terms of downstream accuracy.
We note that the classifier can also be pre-trained on the pre-training distribution, which improves performance
for smaller dataset sizes. On 50K samples, we achieve a classification accuracy of 86.6%. The predictive
performance does not increase significantly when fine-tuning on more samples (see Figure 10 in the appendix).
The benefit of pre-training the downstream classifier thus diminishes for 1M synthetic samples.
Ensembling. We observe that we can further improve the classification accuracy given by a single WRN
classifier by instead ensembling five different networks that differ only in the subsampling of the minibatches.
As reported in Table 1, we can achieve a test accuracy of 88% on CIFAR-10 using this approach (see Figure 3).
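
To make the ensembling step concrete, the following is a minimal sketch in plain Python. It assumes that the ensemble averages the softmax probabilities of its members; the text above does not state how the five predictions are combined, so this aggregation rule, the function names and the example logits are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_logits):
    """Average class probabilities over models and return (label, mean probabilities)."""
    probs = [softmax(logits) for logits in per_model_logits]
    num_models = len(probs)
    mean_probs = [sum(column) / num_models for column in zip(*probs)]
    label = max(range(len(mean_probs)), key=mean_probs.__getitem__)
    return label, mean_probs


# Hypothetical logits of five classifiers for one image and three classes.
logits = [
    [2.0, 1.0, 0.1],
    [1.5, 1.6, 0.0],
    [2.2, 0.9, 0.3],
    [0.8, 1.0, 0.2],
    [1.9, 1.1, 0.4],
]
label, mean_probs = ensemble_predict(logits)
```

In this toy example two of the five members individually prefer class 1, but the averaged probabilities still select class 0, which illustrates how averaging can smooth out the disagreement between networks that differ only in their minibatch order.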

6. Model Selection (ImageNet32 → CIFAR-10)
One important benefit of training a DP image generator over a DP classifier is the potential to use the generated
data repeatedly for training a range of different prediction models and choosing the best one across them. Each
experiment training a model on the data comes with a privacy cost, thus tuning a large number of DP classifiers
increases the required privacy budget (Papernot and Steinke, 2021). In this section we consider how synthetic
data can be used to gain initial insights on the choice of model, and to reduce the number of experiments run
on the sensitive data records. The goal here is to identify the model that performs best on real data, while
only having access to synthetic data. This becomes particularly relevant and useful when synthetic data needs
to be released for research purposes (Chambon et al., 2022b), or data challenges (de Montjoye et al., 2014;
Feuerverger et al., 2012; McFee et al., 2012).
For this purpose, we train 3 different model architectures that are commonly employed on CIFAR-10 (Bu
et al., 2022a; De et al., 2022): a WRN-28-8 (Zagoruyko and Komodakis, 2016), a ResNet50 (He et al., 2016),
and a VGG (Simonyan and Zisserman, 2014). For each architecture, we sweep over combinations of three to
five different learning rates and three different values of weight decay. Please refer to Table 6 for more details.
We now assess the utility of synthetic data for model selection in two stages of increasing difficulty. First, we
check whether models – trained on the synthetic data – rank similarly on the real and synthetic test data. This
corresponds to the application setting where a third party tunes a model on the synthetic data, and releases a
model that is trained on the same data.
Once we have established that the test performance on real and synthetic data is sufficiently correlated
to ensure that a model ranks similarly no matter which data set it was evaluated on, we train each model
separately on 50K real and on the same number of synthetic samples. In both cases, models are tested on
the same source of data they have been learned on (with sources being real or synthetic here). We then
assess whether the test performance on the real and synthetic data is still correlated between models of the
same architecture and hyperparameter constellation. This corresponds to the setting where a third party is
responsible for finding the best model pipeline, but the data curator trains the final model with the optimal
hyperparameters for their own internal use.

[Figure 4: two scatter plots.
(a) Axes: test accuracy on real samples and test accuracy on synthetic samples. Panel note: each dot is the
same model, trained on synthetic samples, evaluated on different data sets.
(b) Axes: test accuracy when trained on real samples and test accuracy when trained on synthetic samples.
Panel note: each dot is a different model with the same hyperparameters, evaluated on real samples, trained
on different data sets.]
(a) Test accuracy on synthetic data vs real data of models trained on synthetic data
(b) Test accuracy of models trained and evaluated on synthetic data vs models trained and evaluated on real data
Figure 4 | We observe that models rank similarly when evaluated on synthetic and real data. This suggests that
findings on hyperparameter selection made on synthetic data can be transferred to the private dataset.

In Figure 4, we report our findings. We see that we can select the best or second best hyperparameter
constellation in both application settings. More generally, we find that models rank similarly on both synthetic
and real data, suggesting that findings with respect to relative model performance might transfer from the DP
data to the original real data. However, we also notice that models overfit to the synthetic data distribution and
that within one model group it is not obvious which hyperparameter constellation is the best. We therefore
advise that synthetic data be used for a high-level orientation of the research direction. Note that the absolute
test performance is not of interest here, only the correlation between the test performance on real and synthetic
data, as the best model will be released externally.
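
One way to quantify whether models "rank similarly" on synthetic and real data is a rank correlation. The sketch below computes the Spearman rank correlation in plain Python. The choice of this statistic is an assumption, since the section does not name a specific measure, and the accuracies are hypothetical placeholders rather than results from our experiments.

```python
def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical test accuracies (%) of five configurations, NOT results from the paper.
acc_on_synthetic = [91.0, 88.5, 93.2, 85.0, 90.1]
acc_on_real = [84.0, 82.9, 85.5, 80.2, 84.3]

rho = spearman(acc_on_synthetic, acc_on_real)
best_by_synthetic = max(range(len(acc_on_synthetic)), key=acc_on_synthetic.__getitem__)
best_by_real = max(range(len(acc_on_real)), key=acc_on_real.__getitem__)
```

A correlation close to 1 together with agreement on the top-ranked configuration would support selecting hyperparameters on synthetic data, even when absolute accuracies differ between the two evaluation sets.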

7. Conclusion
DP image generation has long attracted interest as a way of sharing synthetic data sets in sensitive application
domains. Because of the degradation in performance introduced by DP-SGD, successful results on DP image
generation have been limited to small and low-complexity data sets, like MNIST. In this paper, we set out to scale
DP image generation to 32 × 32 × 3 RGB image datasets. We proposed a methodology for DP diffusion models
based on pre-training, timestep augmentation multiplicity, and a modified timestep sampling scheme. We are
the first to train a DP image generator on a medical dataset where we achieved a downstream classification
accuracy of 91.1% that is close to the SOTA of 96.5% with training on the real data. What is more, we
also increased the SOTA downstream classification accuracy on CIFAR-10 from 51.0% to 88.0%. Recently
proposed methods like latent diffusion models (Rombach et al., 2022) constitute a promising model class for
DP fine-tuning on higher dimensional datasets, and we hope that our findings can contribute to future research
exploring this direction.
Finally, we questioned how DP synthetic data has been currently evaluated in the literature, and proposed
an evaluation framework that is more suited to the needs of practitioners who would use the DP synthetic data
as a replacement of the private dataset. For this purpose, we first considered maximising the downstream
prediction performance by generating up to 1M data samples, and training ensembles. Second, we showed
that findings from hyperparameter tuning on synthetic data translate to the corresponding findings on the real
data.


Acknowledgements
The authors would like to thank Sylvestre-Alvise Rebuffi for feedback on an earlier version of this manuscript;
Zahra Ahmed for project management support; and Judy Shen, David Stutz, Isabela Albuquerque, Simon
Geisler, Arnaud Doucet, Taylan Cemgil and Pushmeet Kohli for useful discussions throughout the project.

References
M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with
differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications
security, pages 308–318, 2016.
H. Alatrista-Salas, P. Montalvo-Garcia, M. Nunez-del Prado, and J. Salas. Geolocated data generation and
protection using generative adversarial networks. In International Conference on Modeling Decisions for
Artificial Intelligence, pages 80–91. Springer, 2022.
H. Ali, S. Murad, and Z. Shah. Spot the fake lungs: Generating synthetic medical images using neural diffusion
models. arXiv preprint arXiv:2211.00902, 2022.
S. Augenstein, H. B. McMahan, D. Ramage, S. Ramaswamy, P. Kairouz, M. Chen, R. Mathews, et al. Generative
models for effective ml on private, decentralized datasets. arXiv preprint arXiv:1911.06679, 2019.
P. Bandi, O. Geessink, Q. Manson, M. Van Dijk, M. Balkenhol, M. Hermsen, B. E. Bejnordi, B. Lee, K. Paeng,
A. Zhong, et al. From detection of individual metastases to classification of lymph node status at the patient
level: the camelyon17 challenge. IEEE Transactions on Medical Imaging, 2018.
A. Bie, G. Kamath, and G. Zhang. Private gans, revisited. In NeurIPS 2022 Workshop on Synthetic Data for
Empowering ML Research, 2022.
M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying mmd gans. arXiv preprint
arXiv:1801.01401, 2018.

Z. Bu, J. Mao, and S. Xu. Scalable and efficient training of large convolutional neural networks with differential
privacy. arXiv preprint arXiv:2205.10683, 2022a.
Z. Bu, Y.-X. Wang, S. Zha, and G. Karypis. Automatic clipping: Differentially private deep learning made easier
and stronger. arXiv preprint arXiv:2206.07136, 2022b.
T. Cao, A. Bie, A. Vahdat, S. Fidler, and K. Kreis. Don’t generate me: Training differentially private generative
models with sinkhorn divergence. Advances in Neural Information Processing Systems, 34:12480–12492,
2021.
N. Carlini, J. Hayes, M. Nasr, M. Jagielski, V. Sehwag, F. Tramèr, B. Balle, D. Ippolito, and E. Wallace. Extracting
training data from diffusion models, 2023. URL https://arxiv.org/abs/2301.13188.
Y. Cattan, C. A. Choquette-Choo, N. Papernot, and A. Thakurta. Fine-tuning with differential privacy necessitates
an additional hyperparameter search. arXiv preprint arXiv:2210.02156, 2022.
P. Chambon, C. Bluethgen, J.-B. Delbrouck, R. Van der Sluijs, M. Połacin, J. M. Z. Chaves, T. M. Abraham,
S. Purohit, C. P. Langlotz, and A. Chaudhari. Roentgen: Vision-language foundation model for chest x-ray
generation. arXiv preprint arXiv:2211.12737, 2022a.
P. Chambon, C. Bluethgen, C. P. Langlotz, and A. Chaudhari. Adapting pretrained vision-language foundational
models to medical imaging domains. arXiv preprint arXiv:2210.04133, 2022b.
D. Chen, T. Orekondy, and M. Fritz. Gs-wgan: A gradient-sanitized approach for learning differentially private
generators. Advances in Neural Information Processing Systems, 33:12673–12684, 2020.
D. Chen, S.-c. S. Cheung, C.-N. Chuah, and S. Ozonoff. Differentially private generative adversarial networks
with model inversion. In 2021 IEEE International Workshop on Information Forensics and Security (WIFS),
pages 1–6. IEEE, 2021a.


D. Chen, R. Kerkouche, and M. Fritz. Private set generation with discriminative information. arXiv preprint
arXiv:2211.04446, 2022a.
J.-W. Chen, C.-M. Yu, C.-C. Kao, T.-W. Pang, and C.-S. Lu. Dpgen: Differentially private generative energy-guided
network for natural image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 8387–8396, 2022b.
R. J. Chen, M. Y. Lu, T. Y. Chen, D. F. Williamson, and F. Mahmood. Synthetic data in machine learning for
medicine and healthcare. Nature Biomedical Engineering, 5(6):493–497, 2021b.
P. Chrabaszcz, I. Loshchilov, and F. Hutter. A downsampled variant of imagenet as an alternative to the cifar
datasets. arXiv preprint arXiv:1707.08819, 2017.
W. L. Croft, J.-R. Sack, and W. Shi. Differentially private facial obfuscation via generative adversarial networks.
Future Generation Computer Systems, 129:358–379, 2022.
F. K. Dankar and M. Ibrahim. Fake it till you make it: Guidelines for effective synthetic data generation. Applied
Sciences, 11(5):2158, 2021.
S. De, L. Berrada, J. Hayes, S. L. Smith, and B. Balle. Unlocking high-accuracy differentially private image
classification through scale. arXiv preprint arXiv:2204.13650, 2022.
Y.-A. de Montjoye, Z. Smoreda, R. Trinquart, C. Ziemlicki, and V. D. Blondel. D4d-senegal: the second mobile
phone data for development challenge. arXiv preprint arXiv:1407.4885, 2014.
P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information
Processing Systems, 34:8780–8794, 2021.
T. Dockhorn, T. Cao, A. Vahdat, and K. Kreis. Differentially private diffusion models. arXiv preprint
arXiv:2210.09929, 2022. URL https://openreview.net/pdf?id=pX21pH4CsNB.
J. Duan, F. Kong, S. Wang, X. Shi, and K. Xu. Are diffusion models vulnerable to membership inference attacks?,
2023. URL https://arxiv.org/abs/2302.01316.
C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In
Theory of cryptography conference, pages 265–284. Springer, 2006.
L. Fan. A survey of differentially private generative adversarial networks. In The AAAI Workshop on Privacy-Preserving Artificial Intelligence, page 8, 2020.
L. Fan and A. Pokkunuru. Dpnet: Differentially private network traffic synthesis with generative adversarial
networks. In IFIP Annual Conference on Data and Applications Security and Privacy, pages 3–21. Springer,
2021.
M. L. Fang, D. S. Dhami, and K. Kersting. Dp-ctgan: Differentially private medical data generation using ctgans.
In International Conference on Artificial Intelligence in Medicine, pages 178–188. Springer, 2022.
A. Feuerverger, Y. He, and S. Khatri. Statistical significance of the netflix challenge. Statistical Science, 27(2):
202–231, 2012.
S. Fort, A. Brock, R. Pascanu, S. De, and S. L. Smith. Drawing multiple augmentation samples per image during
training efficiently decreases test error. arXiv preprint arXiv:2105.13343, 2021.
I. Gao, S. Sagawa, P. W. Koh, T. Hashimoto, and P. Liang. Out-of-distribution robustness via targeted augmentations. In NeurIPS 2022 Workshop on Distribution Shifts: Connecting Methods and Applications, 2022.
B. A. Gheorghiță, L. M. Itu, P. Sharma, C. Suciu, J. Wetzl, C. Geppert, M. A. A. Ali, A. M. Lee, S. K. Piechnik,
S. Neubauer, et al. Improving robustness of automatic cardiac function quantification from cine magnetic
resonance imaging using synthetic image data. Scientific reports, 12(1):1–12, 2022.
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio.
Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.


F. Harder, K. Adamczewski, and M. Park. Dp-merf: Differentially private mean embeddings with random features
for practical privacy-preserving data generation. In International conference on artificial intelligence and
statistics, pages 1819–1827. PMLR, 2021.
F. Harder, M. J. Asadabadi, D. J. Sutherland, and M. Park. Differentially private data generation needs better
features. arXiv preprint arXiv:2205.12900, 2022.
J. Hayes, L. Melis, G. Danezis, and E. De Cristofaro. Logan: evaluating privacy leakage of generative models
using generative adversarial networks. arXiv preprint arXiv:1705.07663, pages 506–519, 2017.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 770–778, 2016.
D. Hendrycks, K. Lee, and M. Mazeika. Using pre-training can improve model robustness and uncertainty. In
International Conference on Machine Learning, pages 2712–2721. PMLR, 2019.
M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update
rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
B. Hilprecht, M. Härterich, and D. Bernau. Monte carlo and reconstruction membership inference attacks
against generative models. Proc. Priv. Enhancing Technol., 2019(4):232–249, 2019.
J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing
Systems, 33:6840–6851, 2020.
H. Hu and J. Pang. Membership inference of diffusion models, 2023. URL https://arxiv.org/abs/2301.09956.
D. Jiang, G. Zhang, M. Karami, X. Chen, Y. Shao, and Y. Yu. Dp²-vae: Differentially private pre-trained
variational autoencoders. arXiv preprint arXiv:2208.03409, 2022.
J. Jordon, J. Yoon, and M. Van Der Schaar. Pate-gan: Generating synthetic data with differential privacy
guarantees. In International conference on learning representations, 2018.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Yasunaga, R. L. Phillips,
I. Gao, T. Lee, E. David, I. Stavness, W. Guo, B. A. Earnshaw, I. S. Haque, S. Beery, J. Leskovec, A. Kundaje,
E. Pierson, S. Levine, C. Finn, and P. Liang. WILDS: A benchmark of in-the-wild distribution shifts. In
International Conference on Machine Learning (ICML), 2021.
A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. Pre-print, 2009.
A. Kumar, R. Shen, S. Bubeck, and S. Gunasekar. How to fine-tune vision models with sgd. arXiv preprint
arXiv:2211.09359, 2022.
T. Kynkäänniemi, T. Karras, M. Aittala, T. Aila, and J. Lehtinen. The role of imagenet classes in fréchet inception
distance. arXiv preprint arXiv:2203.06026, 2022.
Y. LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
S. P. Liew, T. Takahashi, and M. Ueno. Pearl: Data synthesis via private embeddings and adversarial reconstruction learning. arXiv preprint arXiv:2106.04590, 2021.
B. Liu, Y. Zhu, K. Song, and A. Elgammal. Towards faster and stabilized gan training for high-fidelity few-shot
image synthesis. In International Conference on Learning Representations, 2020.
Y. Liu, J. Peng, J. James, and Y. Wu. Ppgan: Privacy-preserving generative adversarial network. In 2019 IEEE
25th international conference on parallel and distributed systems (ICPADS), pages 985–989. IEEE, 2019.


A. Lugmayr, M. Danelljan, R. Timofte, K.-w. Kim, Y. Kim, J.-y. Lee, Z. Li, J. Pan, D. Shim, K.-U. Song, et al.
Ntire 2022 challenge on learning the super-resolution space. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 786–797, 2022.
B. McFee, T. Bertin-Mahieux, D. P. Ellis, and G. R. Lanckriet. The million song dataset challenge. In Proceedings
of the 21st International Conference on World Wide Web, pages 909–916, 2012.
R. McKenna, B. Mullins, D. Sheldon, and G. Miklau. Aim: An adaptive and iterative mechanism for differentially
private synthetic data. arXiv preprint arXiv:2201.12677, 2022.
H. B. McMahan, G. Andrew, U. Erlingsson, S. Chien, I. Mironov, N. Papernot, and P. Kairouz. A general approach
to adding differential privacy to iterative training procedures. arXiv preprint arXiv:1812.06210, 2018.
N. Papernot and T. Steinke. Hyperparameter tuning with renyi differential privacy. arXiv preprint
arXiv:2110.03620, 2021.

N. Patki, R. Wedge, and K. Veeramachaneni. The synthetic data vault. In 2016 IEEE International Conference on
Data Science and Advanced Analytics (DSAA), pages 399–410. IEEE, 2016.
B. Pfitzner and B. Arnrich. Dpd-fvae: Synthetic data generation using federated variational autoencoders with
differentially-private decoder. arXiv preprint arXiv:2211.11591, 2022.
W. H. Pinaya, P.-D. Tudosiu, J. Dafflon, P. F. Da Costa, V. Fernandez, P. Nachev, S. Ourselin, and M. J. Cardoso.
Brain imaging generation with latent diffusion models. In MICCAI Workshop on Deep Generative Models,
pages 117–126. Springer, 2022.
P. M. Radiuk. Impact of training set batch size on the performance of convolutional neural networks for diverse
datasets. Information Technology and Management Science, 20(1):20–24, 2017.
M. Raghu, C. Zhang, J. Kleinberg, and S. Bengio. Transfusion: Understanding transfer learning for medical
imaging. Advances in neural information processing systems, 32, 2019.
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent
diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 10684–10695, 2022.
P. Rouzrokh, B. Khosravi, S. Faghani, M. Moassefi, S. Vahdati, and B. J. Erickson. Multitask brain tumor
inpainting with diffusion models: A methodological report. arXiv preprint arXiv:2210.12113, 2022.
M. S. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly. Assessing generative models via precision and
recall. Advances in neural information processing systems, 31, 2018.
T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training
gans. Advances in neural information processing systems, 29, 2016.
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv
preprint arXiv:1409.1556, 2014.
J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
G. Somepalli, V. Singla, M. Goldblum, J. Geiping, and T. Goldstein. Diffusion art or digital forgery? investigating
data replication in diffusion models. arXiv preprint arXiv:2212.03860, 2022.
Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural
Information Processing Systems, 32, 2019.
T. Stadler and C. Troncoso. Why the search for a privacy-preserving data sharing mechanism is failing. Nature
Computational Science, 2(4):208–210, 2022.
T. Stadler, B. Oprisanu, and C. Troncoso. Synthetic data–anonymisation groundhog day. arXiv preprint
arXiv:2011.07018, 2021.


A. Torfi, E. A. Fox, and C. K. Reddy. Differentially private synthetic medical data generation using convolutional
gans. Information Sciences, 586:485–500, 2022.
R. Torkzadehmahani, P. Kairouz, and B. Paten. Dp-cgan: Differentially private synthetic data and label
generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops,
pages 0–0, 2019.
F. Tramèr, G. Kamath, and N. Carlini. Considerations for differentially private learning with large-scale public
pretraining. arXiv preprint arXiv:2212.06470, 2022.
B. Wang, F. Wu, Y. Long, L. Rimanic, C. Zhang, and B. Li. Datalens: Scalable privacy preserving training via
gradient compression and aggregation. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and
Communications Security, pages 2146–2168, 2021.
D. Wang, M. Ye, and J. Xu. Differentially private empirical risk minimization revisited: Faster and more general.
Advances in Neural Information Processing Systems, 30, 2017.
L. Xie, K. Lin, S. Wang, F. Wang, and J. Zhou. Differentially private generative adversarial network. arXiv
preprint arXiv:1802.06739, 2018.
C. Xu, J. Ren, D. Zhang, Y. Zhang, Z. Qin, and K. Ren. GANobfuscator: Mitigating information leakage under
GAN via differential privacy. IEEE Transactions on Information Forensics and Security, 14(9):2358–2371, 2019.
Z. Xu, M. Collins, Y. Wang, L. Panait, S. Oh, S. Augenstein, T. Liu, F. Schroff, and H. B. McMahan. Learning to
generate image embeddings with user-level differential privacy. arXiv preprint arXiv:2211.10844, 2022.
C. Yan, Y. Yan, Z. Wan, Z. Zhang, L. Omberg, J. Guinney, S. D. Mooney, and B. A. Malin. A multifaceted
benchmarking of synthetic electronic health record generation models. Nature Communications, 13(1):1–18,
2022.
J. Yoon, L. N. Drumright, and M. Van Der Schaar. Anonymization through data synthesis using generative
adversarial networks (ADS-GAN). IEEE Journal of Biomedical and Health Informatics, 24(8):2378–2388, 2020.
D. Yu, H. Zhang, W. Chen, J. Yin, and T.-Y. Liu. Large scale private learning via low-rank reparametrization. In
International Conference on Machine Learning, pages 12208–12218. PMLR, 2021.
S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
L. Zhang, B. Shen, A. Barnawi, S. Xi, N. Kumar, and Y. Wu. FedDPGAN: federated differentially private generative
adversarial networks framework for the detection of COVID-19 pneumonia. Information Systems Frontiers, 23
(6):1403–1415, 2021.
X. Zhang, H. Gu, L. Fan, K. Chen, and Q. Yang. No free lunch theorem for security and utility in federated
learning. arXiv preprint arXiv:2203.05816, 2022.
K. Zhou, Y. Yang, T. Hospedales, and T. Xiang. Deep domain-adversarial image generation for domain
generalisation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13025–13032,
2020.

Supplementary Materials
A. Comparison to Related Work
Table 2 summarises the details that differentiate our work from that of Dockhorn et al. (2022).
                                                  Dockhorn et al. (2022)     Ours
Analysed for finetuning                           False                      True
Maximal parameter count                           1.8M                       80.4M
Time step scale                                   discrete                   continuous
Classifier guidance                               True                       False
Modifications to time step sampling               False                      True
Augmentation strategies                           time step                  time step, random crop, flipping
Number of samples for augmentation multiplicity   32                         128
Maximal batch size                                2,048                      16,384
Evaluation                                        FID, downstream accuracy   FID, downstream accuracy,
                                                                             maximal downstream performance,
                                                                             hyperparameter tuning

Table 2 | Comparison of our work and Dockhorn et al. (2022).

B. Additional Experimental Details and Results
Here we present additional experimental details that support the findings of the main paper. Our
hyperparameters are listed in Tables 3, 4, 5 and 6. Samples of our DP synthetic images are shown in Figures
5, 6 and 7. As a baseline, Figure 8 shows samples from Dockhorn et al. (2022), whose model was trained from
scratch on CIFAR-10; the comparison illustrates the importance of pre-training in the large-scale DP image
generation setting. Figure 9 illustrates the utility of augmentation multiplicity. Finally, the class-wise
metrics in Figure 11 show that our model does not overfit to a single class.
                                 ImageNet32                          MNIST               CIFAR-10               Camelyon17
Pre-training data set            -                                   -                   ImageNet32             ImageNet32
Privacy budget (𝜖, 𝛿)            ∞                                   (10, 10^-5)         (*varies, 10^-5)       (10, 10^-5)
Iterations                       200k                                4,000               200                    200
Clipping norm                    -                                   10^-3               10^-3                  10^-3
Noise schedule                   linear                              linear              linear                 linear
Model size                       80.4M                               4.2M                80.4M                  80.4M
Channels                         192                                 64                  192                    192
Depth                            2                                   1                   2                      2
Channels multiple                1,2,2,2                             1,2,2               1,2,2,2                1,2,2,2
Heads channels                   64                                  64                  64                     64
Attention resolution             16                                  16                  16                     16
Batch size                       1,024                               4,096               16,384                 16,384
Learning rate                    -                                   5×10^-4             10^-3                  10^-3
Optimizer                        Adam                                Adam                Adam                   Adam
Scheduler                        linear (from 0 to LR in 5K steps)   constant            constant               constant
w_1, w_2, w_3                    0.03, 0.77, 0.2                     0.05, 0.9, 0.05     0.015, 0.785, 0.2      0.015, 0.785, 0.2
l_1, u_1 = l_2, u_2 = l_3, u_3   0, 30, 800, 1000                    0, 200, 800, 1000   0, 30, 600, 1000       0, 30, 600, 1000
# Augmentation samples           0                                   128 (timestep)      128 (timestep, flip)   128 (timestep, flip)
Exponential moving average       0.9999                              0.9999              0.9999                 0.9999

Table 3 | Hyperparameters for diffusion models. The scale of the gradient noise is adjusted to ensure the desired
privacy budget. *We report results for 𝜖 ∈ {1, 5, 10, 32}.
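
One way to read the weight row (w_1, w_2, w_3) and the interval row (l_1, u_1 = l_2, u_2 = l_3, u_3) of Table 3 is as a three-component mixture over diffusion time steps. The following is a minimal sketch under that assumption, which this appendix does not spell out: interval i is picked with probability w_i and an integer time step is then drawn uniformly from [l_i, u_i). The function name `sample_timestep`, the half-open intervals and the choice of column are illustrative assumptions, not the authors' code.

```python
import random

def sample_timestep(weights, boundaries, rng=random):
    """Draw one time step from a mixture of uniform distributions.

    weights:    mixture weights (w_1, ..., w_k), summing to 1.
    boundaries: interval edges (l_1, u_1 = l_2, ..., u_k), length k + 1.
    """
    if len(boundaries) != len(weights) + 1:
        raise ValueError("need exactly one more boundary than weights")
    # Pick an interval index i with probability w_i.
    i = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Sample an integer time step uniformly from [l_i, u_i) (assumed half-open).
    return rng.randrange(boundaries[i], boundaries[i + 1])

# Values read from Table 3 (assumed here to belong to the CIFAR-10 column).
cifar_weights = (0.015, 0.785, 0.2)
cifar_boundaries = (0, 30, 600, 1000)

rng = random.Random(0)
samples = [sample_timestep(cifar_weights, cifar_boundaries, rng) for _ in range(10_000)]
frac_middle = sum(30 <= t < 600 for t in samples) / len(samples)
print(f"fraction of samples in [30, 600): {frac_middle:.3f}")
```

Under these assumptions, roughly 78.5% of the sampled time steps fall into the middle interval, so most of the training signal comes from intermediate noise levels.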

Figure 5 | Random samples drawn from a DP image diffusion model trained on MNIST.

Figure 6 | Random samples drawn from a DP image diffusion model trained on Camelyon17.

Figure 7 | Random samples drawn from a DP image diffusion model trained on CIFAR-10.

Figure 8 | CIFAR-10 samples as provided by Dockhorn et al. (2022) in their Appendix F, Rebuttal Discussions.

[Plot omitted. x-axis: #Samples for Augmentation Multiplicity (16 to 256); y-axis: FID.]
Figure 9 | FID on CIFAR-10 for different values of augmentation multiplicity samples. We see that the
FID is decreasing for values up to 256.

[Plot omitted. x-axis: Number of synthetic samples (200K to 1M); y-axis: Test accuracy (%); curves: Ensemble of
5 classifiers (pre-trained), Single classifier (pre-trained).]
Figure 10 | Downstream accuracy of an ImageNet32 pre-trained CIFAR-10 WRN-40-4 as a function of the
number of synthetic data samples used to train it.

[Plots omitted. Two per-class panels over the ten CIFAR-10 classes (airplane, automobile, bird, cat, deer, dog,
frog, horse, ship, truck); y-axes: FID and % classified as CIFAR-10.]
(a) Percentage of samples per class classified as CIFAR-10 images by ResNet18 model trained to discriminate CIFAR-10 and ImageNet32
images.
(b) FID per class.

Figure 11 | Class-wise metrics on CIFAR-10.

                                    MNIST        CIFAR-10                                               Camelyon17
Pre-training data set               ImageNet32   ImageNet32                                             ImageNet32
Iterations                          10,000       20,000                                                 4,000
Depth                               40           40                                                     40
Width                               4            4                                                      4
Dropout                             0.5          0.0                                                    0.0
Weight decay                        5 · 10^-4    5 · 10^-4                                              1 · 10^-5
Label smoothing                     0.05         0.0                                                    0.0
Batch size                          64           4,096                                                  512
Learning rate                       0.1          cosine decay (start at 0.1 with 𝛼 = 0, 1 decay step)   10^-4
Optimizer                           SGD          SGD                                                    SGD
Nesterov's momentum                 0.9          0.9                                                    0.9
Samples augmentation multiplicity   0            0                                                      16
Exponential moving average          0.9999       0.9999                                                 0.9999
# Augmentation samples              1 (crop)     16 (crop, flip, color)                                 16 (color, flip, crop)

Table 4 | Hyperparameters for downstream classification WRNs trained on synthetic data.

                                    MNIST         Camelyon17
Privacy budget (𝜖, 𝛿)               (10, 10^-5)   (10, 10^-5)
Pre-training data set               -             ImageNet32
Clipping norm                       1             1
Noise standard deviation            3.0           4.0
Depth                               16            40
Width                               4             4
Batch size                          16,384        4,096
Learning rate                       4.0           0.5
Optimizer                           SGD           SGD
Samples augmentation multiplicity   0             16 (color, flip, crop)
Exponential moving average          0.9999        0.9999

Table 5 | Hyperparameters for DP WRN classifiers trained on the sensitive data records. Training was stopped
once 𝜖 = 10 was reached.
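
Table 5 lists a clipping norm and a noise standard deviation, the two quantities that parameterise a DP-SGD gradient step. As an illustration only, the sketch below shows the standard clip-then-noise aggregation: each per-example gradient is rescaled to norm at most C, the clipped gradients are summed, Gaussian noise is added, and the result is averaged. It is not the authors' training code. The function name `private_mean_gradient`, the plain-list gradient representation, and the convention that the noise standard deviation is multiplied by the clipping norm are assumptions, and no privacy accounting is performed here.

```python
import math
import random

def private_mean_gradient(per_example_grads, clip_norm, noise_std, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, and average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Rescale so that the gradient norm is at most clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j, g in enumerate(grad):
            total[j] += scale * g
    # Per-coordinate noise with standard deviation noise_std * clip_norm
    # (assumed convention for how the two table entries combine).
    noisy = [t + rng.gauss(0.0, noise_std * clip_norm) for t in total]
    return [v / len(per_example_grads) for v in noisy]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [0.0, 0.0]]
# With noise_std = 0 the result is simply the mean of the clipped gradients.
clipped_mean = private_mean_gradient(grads, clip_norm=1.0, noise_std=0.0, rng=rng)
noisy_mean = private_mean_gradient(grads, clip_norm=1.0, noise_std=4.0, rng=rng)
print(clipped_mean, noisy_mean)
```

In the toy call, the first gradient has norm 5 and is scaled down to norm 1, while the other two are left unchanged, so the noise-free mean is approximately [0.3, 0.4].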

                                    ResNet50                        VGG                                  WRN-16-4
Learning rate                       5 · 10^-4, 2 · 10^-3, 4 · 10^-3   0.07, 0.04, 0.025, 0.01, 5 · 10^-3   0.01, 0.02, 0.03
Weight decay                        0, 0.1, 0.01                    0.0, 10^-3, 5 · 10^-3                0.0, 0.1, 0.01
Iterations                          15,000                          15,000                               15,000
Label smoothing                     0.05                            0.0                                  0.0
Batch size                          128                             128                                  128
Optimizer                           SGD                             SGD                                  SGD
Momentum                            0.0                             0.9                                  0.9
Scheduler                           cosine decay*                   constant                             cosine decay*
Samples augmentation multiplicity   16 (flip, crop)                 0                                    16 (flip, crop)
Exponential moving average          0.9999                          0.9999                               0.9999

Table 6 | Hyperparameters for classifiers for the model selection experiment. All combinations of different
values displayed for learning rate and weight decay were trained. *Cosine decay schedule with initial learning
rate swept over, 1 decay step, and 𝛼 = 0.0.
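
To make the size of the sweep in Table 6 concrete, the sketch below enumerates every learning rate and weight decay combination per architecture. It is a minimal illustration, assuming the values are assigned to the ResNet50, VGG and WRN-16-4 columns as the table layout suggests; the dictionary `search_space` and the helper `grid` are hypothetical names, and no training is performed.

```python
from itertools import product

# Values as listed in Table 6 (column assignment assumed from the table layout).
search_space = {
    "ResNet50": {
        "learning_rate": [5e-4, 2e-3, 4e-3],
        "weight_decay": [0.0, 0.1, 0.01],
    },
    "VGG": {
        "learning_rate": [0.07, 0.04, 0.025, 0.01, 5e-3],
        "weight_decay": [0.0, 1e-3, 5e-3],
    },
    "WRN-16-4": {
        "learning_rate": [0.01, 0.02, 0.03],
        "weight_decay": [0.0, 0.1, 0.01],
    },
}

def grid(space):
    """Return every (learning_rate, weight_decay) pair for one architecture."""
    return list(product(space["learning_rate"], space["weight_decay"]))

configs = {name: grid(space) for name, space in search_space.items()}
for name, combos in configs.items():
    print(name, len(combos))
```

Under this reading, the sweep covers 9 configurations for ResNet50, 15 for VGG and 9 for WRN-16-4, i.e. 33 classifiers in total.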
