Differentially Private Diffusion Models Generate Useful Synthetic Images

Sahra Ghalebikesabi1,+, Leonard Berrada2, Sven Gowal2, Ira Ktena2, Robert Stanforth2, Jamie Hayes2, Soham De2, Samuel L. Smith2, Olivia Wiles2 and Borja Balle2

1 University of Oxford, 2 DeepMind, + Work done at DeepMind.

arXiv:2302.13861v1 [cs.LG] 27 Feb 2023

The ability to generate privacy-preserving synthetic versions of sensitive image datasets could unlock numerous ML applications currently constrained by data availability. Due to their astonishing image generation quality, diffusion models are a prime candidate for generating high-quality synthetic data. However, recent studies have found that, by default, the outputs of some diffusion models do not preserve training data privacy. By privately fine-tuning ImageNet pre-trained diffusion models with more than 80M parameters, we obtain SOTA results on CIFAR-10 and Camelyon17 in terms of both FID and the accuracy of downstream classifiers trained on synthetic data. We decrease the SOTA FID on CIFAR-10 from 26.8 to 9.8, and increase the accuracy from 51.0% to 88.0%. On synthetic data from Camelyon17, we achieve a downstream accuracy of 91.1% which is close to the SOTA of 96.5% when training on the real data. We leverage the ability of generative models to create infinite amounts of data to maximise the downstream prediction performance, and further show how to use synthetic data for hyperparameter tuning. Our results demonstrate that diffusion models fine-tuned with differential privacy can produce useful and provably private synthetic data, even in applications with significant distribution shift between the pre-training and fine-tuning distributions.

1. Introduction

Delivering impactful ML-based solutions for real-world applications in domains like health care and recommendation systems requires access to sensitive personal data that cannot be readily used or shared without risk of introducing ethical and legal implications.
Replacing real sensitive data with private synthetic data following the same distribution is a clear pathway to mitigating these concerns (Chen et al., 2021b; Dankar and Ibrahim, 2021; Patki et al., 2016). However, despite their theoretical appeal, general-purpose methods for generating useful and provably private synthetic data remain a subject of active research (Dockhorn et al., 2022; McKenna et al., 2022; Torfi et al., 2022). The central challenge in this line of work is how to obtain truly privacy-preserving synthetic data free of the common pitfalls faced by classical anonymization approaches (Stadler et al., 2021), while at the same time ensuring the resulting datasets remain useful for a wide variety of downstream tasks, including statistical and exploratory analysis as well as machine learning model selection, training and testing. It is tempting to obtain synthetic data by training and then sampling from well-known generative models like variational auto-encoders (Kingma and Welling, 2013), generative adversarial nets (Goodfellow et al., 2020), and denoising diffusion probabilistic models (Ho et al., 2020; Song and Ermon, 2019). Unfortunately, it is well-known that out-of-the-box generative models can potentially memorise and regenerate their training data points1 and, thus, reveal private information. This holds for variational autoencoders (Hilprecht et al., 2019), generative adversarial nets (Hayes et al., 2017), and also diffusion models (Carlini et al., 2023; Duan et al., 2023; Hu and Pang, 2023; Somepalli et al., 2022). In particular, diffusion models have recently gained a lot of attention, with pre-trained models made available online (Dhariwal and Nichol, 2021; Rombach et al., 2022), and being fine-tuned on applications involving potentially sensitive data such as chest X-rays (Ali et al., 2022; Chambon et al., 2022a,b) and brain MRIs (Pinaya et al., 2022; Rouzrokh et al., 2022). 
1 A model does not contain its training data, but rather has “memorised” training data when the model is able to use the rules and attributes it has learned about the training data to generate elements of that training data.

Corresponding author(s): sahra.ghalebikesabi@univ.ox.ac.uk, {lberrada | bballe}@deepmind.com
© 2023 DeepMind. All rights reserved

Figure 1 | DP diffusion models are capable of producing high-quality images. More images can be found in Figures 5, 6, 7. (Panels: synthetic and real samples for Camelyon17 and CIFAR-10.)

Mitigating the privacy loss from sharing synthetic data produced by generative models trained on sensitive data is not straightforward. Differential privacy (DP) (Dwork et al., 2006) has emerged as the gold standard privacy mitigation when training ML models, and its application to generative models would provide guarantees on the information the model (and synthetic data sampled from it) can leak about individual training examples. Yet, scaling up DP training methods to modern large-scale models remains a significant challenge due to the slowdown incurred by DP-SGD (the standard workhorse of DP for deep learning) (Wang et al., 2017) and the stark utility degradation often observed when training large models from scratch with DP (Stadler and Troncoso, 2022; Stadler et al., 2021; Zhang et al., 2022). Most previous works on DP generative models worked around these issues by focusing on small models, low-complexity data (Harder et al., 2021; Torkzadehmahani et al., 2019; Xie et al., 2018) or using non-standard models (Harder et al., 2022). However, for DP applications to image classification it is known that using models pre-trained on public data is a method for attaining good utility which is compatible with large-scale models and complex datasets (Bu et al., 2022a,b; Cattan et al., 2022; De et al., 2022; Tramèr et al., 2022; Xu et al., 2022).

Contributions.
In this paper we demonstrate how to accurately train standard diffusion models with differential privacy. Despite the inherent difficulty of this task, we propose a simple methodology that allows us to generate high-quality synthetic image datasets that are useful for a variety of important downstream tasks. In particular, we privately train denoising diffusion probabilistic models (Ho et al., 2020) with more than 80M parameters on CIFAR-10 and Camelyon17 (Koh et al., 2021), and evaluate the usefulness of synthetic images for downstream model training and evaluation. Crucially, we show that by pre-training on publicly available data (i.e. ImageNet), it is possible to considerably outperform the results of extremely recent work on a similar topic (Dockhorn et al., 2022). With this method, we are able to accurately train models 45x larger than Dockhorn et al. (2022) and to achieve a high utility on datasets that are significantly more challenging (e.g. CIFAR-10 and a medical dataset – instead of MNIST). Please refer to Table 2 for a detailed comparison of our works. Our contributions can be summarized as follows: • We demonstrate that diffusion models can be trained with differential privacy to sufficient quality that we can create accurate classifiers based on the synthesized data only. To do so, we leverage pre-training, and we demonstrate large state-of-the-art improvements even when there exists a significant distribution shift between the pre-training and the fine-tuning data sets. • We propose simple and practical improvements over existing methods to further boost the performance of the model. Namely, we employ both image and timestep augmentations when using augmentation multiplicity, and we bias the diffusion timestep sampling so as to encourage learning of the most challenging phase of the diffusion process. 
• With this approach, we fine-tune a differentially private diffusion model with more than 80 million parameters on CIFAR-10, and beat the previous state-of-the-art by more than 50%, decreasing the Fréchet Inception Distance (FID) from 26.8 to 9.8. Furthermore, we privately fine-tune the same model on histopathological scans of lymph node tissue available in the Camelyon17 dataset and show that a classifier trained on synthetic data produced by this model achieves 91.1% accuracy (the highest accuracy reported on the WILDS leaderboard (Koh et al., 2021) is 96.5% for a non-private model trained on real data).
• We demonstrate that the accuracy of downstream classifiers can be further improved to a significant extent by leveraging larger synthetic datasets and ensembling, which comes at no additional privacy cost. Finally, we show that hyperparameter tuning downstream classifiers on the synthetic data reveals trends that are also reflected when tuning on the private data set directly.

Paper outline. We start by comparing to related work in section 2, before we provide a brief introduction to diffusion models, differential privacy, and DP-SGD in section 3. In section 4, we describe effective strategies to fine-tune DP diffusion models, and then present our results on CIFAR-10 and Camelyon17 in section 5. In section 6, we assess the utility of synthetic data for model selection.

2. Related Work

Differentially private synthetic image generation. DP image generation is an active area of research (Chen et al., 2022b; Croft et al., 2022; Fan, 2020). Most efforts have focused on applying a differentially private stochastic gradient procedure on popular generative models, i.e.
generative adversarial networks (Augenstein et al., 2019; Chen et al., 2020, 2021a; Jordon et al., 2018; Liu et al., 2019; Torkzadehmahani et al., 2019; Xie et al., 2018; Xu et al., 2019; Yoon et al., 2020), or variational autoencoders (Jiang et al., 2022; Pfitzner and Arnrich, 2022). Only one other work has so far analysed the application of differentially private gradient descent on diffusion models (Dockhorn et al., 2022) which we contrast against in Table 2. Others have instead proposed custom architectures (Cao et al., 2021; Chen et al., 2022a; Harder et al., 2021, 2022; Wang et al., 2021). Harder et al. (2022), for instance, pre-train a perceptual feature extractor using public data, then privatize the mean of the feature embeddings of the sensitive data records, and use the privatised mean to train a generative model. Limitations of previous work. DP image generation based on custom training pipelines and architectures that are not used outside of the DP literature do not profit from the constant research progress on public image generation. Other works that instead build upon popular public generative approaches have been shown to not be differentially private despite such claims. This could be either due to faulty implementations or proofs. See Stadler et al. (2021) for successful privacy attacks on DP GANs, or Appendix B of Dockhorn et al. (2022) for an illustration on why DPGEN (Chen et al., 2022b) does not actually satisfy DP guarantees. Limited success on natural images. DP synthesizers have found applications on tabular electronic healthcare records (Fang et al., 2022; Torfi et al., 2022; Yan et al., 2022; Zhang et al., 2021), mobility trajectories (AlatristaSalas et al., 2022) and network traffic data (Fan and Pokkunuru, 2021). In the space of image generation, positive results have only been reported on MNIST, FashionMNIST and CelebA (downsampled to 32 × 32) (Bie et al., 2022; Harder et al., 2021; Liew et al., 2021; Wang et al., 2021). 
These datasets are relatively easy to learn thanks to plain backgrounds, class separability, and repetitive image features within and even across classes. Meanwhile, CIFAR-10 has been established as a considerably harder generation task than MNIST, FashionMNIST or CelebA (Radiuk, 2017). The images are not only higher dimensional than MNIST and FashionMNIST (32 × 32 × 3 compared to 28 × 28 feature dimensions), but the dataset has wider diversity and complexity. This is reflected by more complex features and textures, such as lighting conditions, object orientations, and complex backgrounds (Radiuk, 2017). Moreover, MNIST and FashionMNIST are considerably lower dimensional than CIFAR-10 and Camelyon (28 × 28 vs 32 × 32 × 3 features), and CelebA is downsampled to the same feature dimensionality as CIFAR-10 but has more than 3 times as many samples as CIFAR-10 (162k vs 50k), which considerably reduces the information loss introduced by DP training. As far as we know, only two other concurrent works have attempted DP image generation on CIFAR-10. While Dockhorn et al. (2022) achieve a FID of only 97.7 by training a DP diffusion model from scratch, Harder et al. (2022) used pre-training on ImageNet and achieved the SOTA with a FID of 26.8, and a downstream accuracy of only 51%.

Limited targeted evaluation. The evaluation carried out on DP synthetic datasets is often not sufficiently targeted towards their utility in practice. The performance of DP image synthesizers is commonly evaluated on two types of metrics: 1) perceptual difference measures between the synthetic and real data distribution, such as FID, and 2) predictive performance of classifiers that are trained on a synthetic dataset of the size of the original training dataset and tested on the real test data.
The former metric is known to be easy to manipulate with factors not related to the image quality, such as the number of samples, or the version number of the inception network (Kynkäänniemi et al., 2022). At the same time, it is not obvious how to jointly incorporate the information from both metrics given that they may individually imply different conclusions. Dockhorn et al. (2022), for instance, identify different diffusion model samplers to minimise either the FID or the downstream test loss. Further, recent research has identified use cases where synthetic data is not able to capture important first or second order statistics despite reportedly scoring highly on those metrics (Stadler and Troncoso, 2022; Stadler et al., 2021). In this paper, we set out to provide examples of additional downstream evaluations. 3. Background 3.1. Denoising Diffusion Probabilistic Models Denoising diffusion models (Ho et al., 2020; Sohl-Dickstein et al., 2015; Song and Ermon, 2019) are a class of likelihood-based generative models that have recently established new SOTA results across diverse computer vision problems (Dhariwal and Nichol, 2021; Lugmayr et al., 2022; Rombach et al., 2022). Given a forward Markov chain that sequentially perturbs a real data sample 𝑥0 to obtain a pure noise distribution 𝑥𝑇 , diffusion models parameterize the transition kernels of a backward chain by deep neural networks to denoise 𝑥𝑇 back to 𝑥0 . Given 𝑥0 ∼ 𝑞 ( 𝑥0 ), one defines a forward process that generates gradually noisier samples 𝑥1 , ..., 𝑥𝑇 using a transition kernel 𝑞 ( 𝑥𝑡 | 𝑥𝑡−1 ) typically chosen as a Gaussian perturbation. At inference time, 𝑥𝑇 , an observation sampled from a noise distribution, is then gradually denoised 𝑥𝑇 −1 , 𝑥𝑇 −2 , ... until the final sample 𝑥0 is reached. Ho et al. (2020) parameterize this process by a function 𝜖𝜃 ( 𝑥𝑡 , 𝑡 ) which predicts the noise component 𝜖 of a noisy sample 𝑥𝑡 given timestep 𝑡 . 
They then propose a simplified training objective to learn $\theta$, namely

$$L(\theta) = \mathbb{E}_{t, x_0, \epsilon}\left[ \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2 \right], \tag{1}$$

with $t \sim \mathcal{U}[0, T]$, where $T$ is the pre-specified maximum timestep, and $\mathcal{U}[a, b]$ is the discrete uniform distribution bounded by $a$ and $b$. The noisy sample $x_t$ is computed by $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, where $\bar{\alpha}_t$ is defined such that $x_t$ follows the pre-specified forward process. Most importantly, $\bar{\alpha}_t$ is a decreasing function of timestep $t$. Thus, the larger $t$ is, the noisier $x_t$ will be.

3.2. Differential Privacy

Differential Privacy (DP) is a formal privacy notion that, in informal terms, bounds how much a single observation can change the output distribution of a randomised algorithm. More formally:

Definition 3.1 (Differential Privacy (Dwork et al., 2006)). Let $A$ be a randomized algorithm, and let $\varepsilon > 0$, $\delta \in [0, 1]$. We say that $A$ is $(\varepsilon, \delta)$-DP if for any two neighboring datasets $D, D'$ differing by a single element, we have that

$$\forall S \subset \mathcal{S}, \quad \mathbb{P}[A(D) \in S] \le \exp(\varepsilon)\, \mathbb{P}[A(D') \in S] + \delta,$$

where $\mathcal{S}$ denotes the support of $A$.

The privacy guarantee is thus controlled by two parameters, $\varepsilon$ and $\delta$. While $\varepsilon$ bounds the log-likelihood ratio of any particular output that can be obtained when running the algorithm on two datasets differing in a single data point, $\delta$ is a small probability which bounds the occurrence of infrequent outputs that violate this bound (typically $1/n$, where $n$ is the number of training examples). The smaller these two parameters get, the stronger the privacy guarantee. We therefore refer to the tuple $(\varepsilon, \delta)$ as the privacy budget.

3.3. Differentially Private Stochastic Gradient Descent

Neural networks are typically privatised with Differentially Private Stochastic Gradient Descent (DP-SGD) (Abadi et al., 2016), or alternatively a different DP optimizer like DP-Adam (McMahan et al., 2018).
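As a rough illustration of what one such privatised gradient computation involves, the following plain-Python sketch rescales each per-example gradient to an L2 norm of at most a clipping norm, sums the clipped gradients, adds Gaussian noise whose standard deviation is the noise multiplier times the clipping norm, and divides by the batch size. The function names `clip_to_norm` and `privatised_gradient` and the list-of-floats gradient representation are assumptions made for this example; this is not the implementation used in the experiments, and it performs no privacy accounting.

```python
import math
import random


def clip_to_norm(vector, max_norm):
    """Rescale `vector` so that its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(v * v for v in vector))
    # Vectors already inside the ball (or the zero vector) are left unchanged.
    scale = 1.0 if norm == 0.0 else min(1.0, max_norm / norm)
    return [scale * v for v in vector]


def privatised_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Return (sum of clipped per-example gradients + Gaussian noise) / batch size."""
    batch_size = len(per_example_grads)
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        for j, g in enumerate(clip_to_norm(grad, max_norm)):
            summed[j] += g
    # The noise standard deviation scales with the clipping norm.
    noise_std = noise_multiplier * max_norm
    return [(summed[j] + rng.gauss(0.0, noise_std)) / batch_size for j in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]  # hypothetical per-example gradients
    print(privatised_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng))
```

Clipping before summation is what bounds the influence of any single example on the update; the noise is added once per batch rather than once per example.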
At each training iteration, the mini-batch gradient is clipped per example, and Gaussian noise is added to it. More formally, let $l_i(w) := \mathcal{L}(w, x_i, y_i)$ denote the learning objective given model parameters $w \in \mathbb{R}^p$, input features $x_i$ and label $y_i$. Let $\mathrm{clip}_C(v) : v \in \mathbb{R}^p \mapsto \min\left\{1, \frac{C}{\lVert v \rVert_2}\right\} \cdot v \in \mathbb{R}^p$ denote the clipping function which re-scales its input to have a maximal $\ell_2$ norm of $C$. For a minibatch $\mathcal{B}$ with $|\mathcal{B}| = B$ samples, the "privatised" minibatch gradient $\hat{g}$ takes on the form

$$\hat{g} = \frac{1}{B} \sum_{i \in \mathcal{B}} \mathrm{clip}_C\left(\nabla l_i(w)\right) + \frac{\sigma C}{B}\, \xi,$$

with $\xi \sim \mathcal{N}(0, I_p)$ and $I_p \in \mathbb{R}^{p \times p}$ being the identity matrix. In practice, the choice of noise multiplier $\sigma > 0$, batch-size $B$ and maximum number of training iterations are constrained by the predetermined privacy budget $(\varepsilon, \delta)$. Crucially, the choice of hyper-parameters can have a large impact on the accuracy of the resulting model, and overall DP-SGD makes it challenging to accurately train deep neural networks. On CIFAR-10 for example, the highest reported test accuracy for a DP model trained with $\varepsilon = 8$ was 63.4% in 2021 (Yu et al., 2021). De et al. (2022) improved performance on image classification tasks and in particular obtained nearly 95% test accuracy for $\varepsilon = 1$ on CIFAR-10, using notably 1) pre-training, 2) large batch sizes, and 3) augmentation multiplicity. As part of this paper, we analyze to what extent these performance gains transfer from DP image classification to DP image generation.

Diffusion models are inherently different model architectures that exhibit different training dynamics than standard classifiers, which introduces additional difficulties in adapting DP training. First, diffusion models are significantly more computationally expensive to train. Indeed, they operate on higher dimensional representations than image classifiers, so that they can output full images instead of a single label.
This makes each update step much more computationally expensive for diffusion models than for classification ones. In addition, diffusion models also need more epochs to converge in public settings compared to classifiers. For example, for a batch size of 128 samples Ho et al. (2020) train a diffusion model for 800k steps on CIFAR-10, while Zagoruyko and Komodakis (2016) train a Wide ResNet for classification in less than 80k steps. This high computational cost of training a diffusion model makes it difficult to finetune the hyperparameters, which is known to be both challenging and crucial for good performance (De et al., 2022). Second, and related to sample inefficiency, the noise inherent to the training of diffusion models introduces an additional variance that compounds with the one injected by DP-SGD, which makes training all the more challenging. Thus overall, it is currently not obvious how to efficiently and accurately train diffusion models with differential privacy. 4. Improvements towards Fine-Tuning Diffusion Models with Privacy Recommendations from previous work. De et al. (2022) identify pre-training, large batch sizes, and augmentation multiplicity as effective strategies in DP image classification. We adopted their recommendations in the training of DP diffusion models, and confirmed the effectiveness of their strategies to the task of DP image generation. In contrast to the work of Dockhorn et al. (2022), where the batch-size is only scaled up to 2,048 samples, we implemented virtual batching which helps us to scale to up to 32,768 samples per batch. Pre-training. Pre-training is especially integral to generating realistic image datasets, even if there is a considerable distribution shift between the pre-training and fine-tuning distributions. Unless otherwise specified, we thus pre-train all of our models on ImageNet32 (Chrabaszcz et al., 2017). 
ImageNet has been a popular pre-training dataset used when little data is available (Raghu et al., 2019), to accelerate convergence times (Liu et al., 2020), or to increase robustness (Hendrycks et al., 2019).

Augmentation multiplicity with timesteps. De et al. (2022) observe that data augmentation, as it is commonly implemented in public training, has a detrimental effect on the accuracy in DP image classification. Instead they propose the use of augmentation multiplicity (Fort et al., 2021). In more detail, they augment each unique training observation within a batch, e.g. with horizontal flips or random crops, and average the gradients of the augmentations before clipping them. Similarly to Dockhorn et al. (2022), we extend augmentation multiplicity to also sample multiple timesteps $t$ for each mini-batch observation in the estimation of Equation 1, and average the corresponding gradients before clipping. In contrast to Dockhorn et al. (2022) where only timestep multiplicity is considered, we combine it with traditional augmentation methods, namely random crops and flipping. As a result, while Dockhorn et al. (2022) find that the FID plateaus at around 32 augmentations per image, we see increasing benefits the more augmentation samples are used (see Figure 9). For computational reasons, we limit augmentation multiplicity to 128 samples.

Modified timestep sampling. The training objective for diffusion models in Equation 1 samples the timestep $t$ uniformly from $[0, T]$ because the model must learn to de-noise images at every noise level. However, it is not obvious that uniform sampling is the best strategy, especially in the DP setting where the number of model updates is limited by the privacy budget.
In particular, in the fine-tuning scenario, a pre-trained model has already learned that at small timesteps the task is to remove small amounts of Gaussian noise from a natural-looking image, and at large timesteps the task is to project a completely noisy sample closer to the manifold of natural-looking images. The model behavior at small and large timesteps is thus more likely to transfer to different image distributions without further tuning. In contrast, for medium timesteps the model must be aware of the data distribution at hand in order to compose a natural-looking image. A similar observation has been recently made for membership inference attacks (Carlini et al., 2023; Hu and Pang, 2023): the adversary has been shown to be more likely to succeed in membership inference when it uses a diffusion model to denoise images with medium amounts of noise compared to high- or low-variance noised images. This motivates us to explore modifications of the training objective where the timestep sampling distribution is not uniform, and instead focuses on training the regimes that contribute more to modelling the key content of an image.

Motivated by this reasoning, we considered replacing the uniform timestep distribution with a mixture of uniform distributions

$$t \sim \sum_{i=1}^{K} w_i\, \mathcal{U}[l_i, u_i],$$

where $\sum_{i=1}^{K} w_i = 1$, $0 \le l_1$, $u_K \le T$ and $u_{k-1} \le l_k$ for $k \in \{2, ..., K\}$. On CIFAR-10, we found the best performance for a distribution with probability mass focused within $[30, 600]$ for $T = 1{,}000$, where timesteps outside this interval are assigned a lower probability than timesteps within this interval. We assume this is due to ImageNet-pre-trained diffusion models being able to denoise other (potentially unseen) natural images if only a small amount of noise is added. Training with privacy on small timesteps can then decrease performance, both because it diverts training budget away from the timesteps that are harder to learn and because of the noisy optimization procedure.
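A minimal sketch of such a mixture-of-uniforms timestep sampler is given here in plain Python. The helper name `sample_timestep` and the constant `CAMELYON17_MIXTURE` are hypothetical; the weights and interval bounds are the ones reported for Camelyon17 in Section 5.4. Treating each interval as half-open, $[l_i, u_i)$, so that shared endpoints are not counted twice, is an assumption of this example rather than something the text specifies.

```python
import random

# (weight, low, high) triples; weights and bounds as reported for Camelyon17.
# Intervals are treated as half-open [low, high), which is an assumption.
CAMELYON17_MIXTURE = [
    (0.015, 0, 30),
    (0.785, 30, 600),
    (0.2, 600, 1000),
]


def sample_timestep(components, rng):
    """Draw an integer timestep from a mixture of discrete uniform distributions."""
    weights = [weight for weight, _, _ in components]
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    # First pick a mixture component according to its weight ...
    _, low, high = rng.choices(components, weights=weights, k=1)[0]
    # ... then draw uniformly within that component's interval.
    return rng.randrange(low, high)


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_timestep(CAMELYON17_MIXTURE, rng) for _ in range(10_000)]
    share_mid = sum(30 <= t < 600 for t in draws) / len(draws)
    print(f"share of draws in [30, 600): {share_mid:.3f}")
```

Setting a single component `(1.0, 0, T)` recovers uniform sampling, so the same training loop can be used to compare both choices.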
Even when training from scratch on MNIST, we observe that focusing the limited training capacity on the harder-to-learn moderate time steps increases test performance.

5. Empirical Results on Differentially Private Image Generation and their Evaluation

5.1. Current Evaluation Framework for DP Image Generators

The FID (Heusel et al., 2017) has been the most widely used metric for assessing the similarity between artificially generated and real images (Dhariwal and Nichol, 2021), and has thus been widely applied in the DP image generation literature as the first point of comparison (Dockhorn et al., 2022). While the FID is the most popular metric, numerous other metrics have been proposed, including the Inception Score (Salimans et al., 2016), the Kernel Inception Distance (Bińkowski et al., 2018), and Precision and Recall (Sajjadi et al., 2018). For the calculation of these metrics, the synthetic and real images are typically fed through an inception network that has been trained for image classification on ImageNet, and a distance between the two data distributions is computed based on the feature embeddings of the final layer. Even though these metrics have been designed to correlate with the visual quality of the images, they can be misleading since they vary strongly with factors unrelated to image quality, such as the number of observations, or the version number of the inception network (Kynkäänniemi et al., 2022). They also reduce complex image data distributions to single values that might not capture what practitioners are interested in when dealing with DP synthetic data. Most importantly, they may not effectively capture the nuances in image quality the further apart the observed data distribution is from ImageNet. For example, CIFAR-10 images have to be upscaled significantly to be fed through the inception network, which will then capture undesirable artifacts introduced by upsampling.
In contrast, MNIST images are digit-based and thus exhibit other variations than natural images, further diminishing the effectiveness of the evaluation. An alternative way of comparing DP image generation models is by looking at the test performance of a downstream classifier trained on a synthetic dataset of the same size as the real dataset (Dockhorn et al., 2022; Xie et al., 2018), and tested on the real dataset. We argue that the way DP generative models are currently evaluated downstream, i.e. by evaluating a single model or metric on a limited data set, needs to be revisited. Instead, we propose to explore how synthetic data can be most effectively used for prediction model training and hyperparameter tuning. This line of thinking motivates the proposal of an evaluation framework that focuses on how DP generative models are used by practitioners. As such, our experiments focus on two specific use cases for private synthetic data: downstream prediction tasks, and model selection. Downstream prediction tasks include classification or regression models trained on synthetic data and evaluated on real samples. This corresponds to the setting where a data curator aims to build a production-ready model that achieves the highest possible performance at test time while preserving the privacy of the training samples. Model selection refers to a use case where the data curator shares the generative model with a third party that trains a series of models on the synthetic data with the goal to provide guidance on the model ranking when evaluated on the sensitive real data records. We hope that with these two experiments we cover the most important downstream tasks and set an example for future research on the development of DP generative models. 
After presenting our results within the evaluation framework that is commonly used in the current literature, we investigate the utility of the DP image diffusion model trained on CIFAR-10 for the aforementioned tasks in section 6. 5.2. Experimental Setup We now evaluate the empirical efficiency of our proposed methods on three image datasets: MNIST, CIFAR-10 and Camelyon17. Please refer to Table 1 for an overview on our main results. For CIFAR-10, we provide additional experiments to prove the utility of the synthetic data for model selection in section 6. We emphasize that while these benchmarks may be considered small-scale by modern non-private machine learning standards, they remain an outstanding challenge for image generation with differential privacy at this time. We train diffusion models with a U-Net architecture as used by Ho et al. (2020) and Dhariwal and Nichol (2021). In contrast to Dockhorn et al. (2022), we found that classifier guidance led to a drop in performance. Unless otherwise specified, all diffusion models are trained with a privatized version of Adam (Kingma and Ba, 2014; McMahan et al., 2018). The clipping norm is set to 10−3 , since we observed that the gradient norm for diffusion models is usually small, especially in the fine-tuning regime. The privacy budget is set to 𝜖 = 10, as commonly considered in the DP image generation literature. The same architecture is used for the diffusion model across all datasets. More specifically, the diffusion is performed over 1,000 timesteps with a linear noise Table 1 | A summary of the best results provided in this paper when training diffusion models with DP-SGD. We report the test accuracy of classifiers trained on different data sets. The highest reported current SOTA corresponds to classifiers trained on DP synthetic data, as reported in the literature. Here, [Do22] refers to Appendix F Rebuttal Discussions in Dockhorn et al. (2022), and [Ha22] to Harder et al. (2022). 
Our test accuracy (Ours) denotes the accuracy of a classifier trained on a synthetic dataset that was generated by a DP diffusion model and is of the same size as the original training data. Note that we also report the model size of our generative models (Diffusion M. Size). The Non-synth test accuracy corresponds to the test accuracy of a DP classifier trained on the real dataset, using the techniques introduced by De et al. (2022). ∗ [De22] This number is taken from De et al. (2022) for 𝜖 = 8. Dataset MNIST CIFAR-10 Camelyon17 Image Resolution Diffusion M. Size Pre-Training Data 28 × 28 32 × 32 × 3 32 × 32 × 3 4.2M 80.4M 80.4M – ImageNet32 ImageNet32 Test Accuracy (%) Privacy SOTA Ours Non-synth ( 𝜖, 𝛿) 98.1 [Do22] 51.0 [Ha22] - 98.6 88.0 91.1 99.1 96.6 ∗[De22] 90.5 (10, 10−5 ) (10, 10−5 ) (10, 3 · 10−6 ) 7 Differentially Private Diffusion Models Generate Useful Synthetic Images schedule, and the convolutional architecture employs the following details: a variable width with 2 residual blocks per resolution, 192 base channels, 1 attention head with 16 channels per head, attention at the 16x16 resolution, channel multipliers of (1,2,2,2), and adaptive group normalization layers for injecting timestep and class embeddings into residual blocks as introduced by Dhariwal and Nichol (2021). When fine-tuning, the model is pre-trained on ImageNet32 (Chrabaszcz et al., 2017) for 200,000 iterations. All hyperparameters can be found in Table 3. As a baseline, we also train DP classifiers directly on the sensitive data, using the training pipeline introduced by De et al. (2022), and additionally hyperparameter tuning the learning rate. Please refer to Table 5 for more details. It is not surprising that these results partly outperform the image generators as the training of the DP classifiers is targeted towards maximising downstream performance. 5.3. 
Training from Scratch (∅ → MNIST)

The MNIST dataset (LeCun, 1998) consists of 60,000 28 × 28 training images depicting the 10 digit classes in grayscale. Since it is the most commonly used dataset in the DP image generation literature, it is included here for the sake of completeness.

Method. We use a DP diffusion model of 4.2M parameters without any pre-training, with 64 base channels and channel multipliers (1,2,2). The diffusion model is trained for 4,000 iterations at a constant learning rate of 5 · 10^-4 with a batch size of 4,096, with augmult set to 128, and a noise multiplier of 2.852. The timesteps are sampled uniformly within [0, 200] with probability 0.05, within [200, 800] with probability 0.9, and within [800, 1000] with probability 0.05. To evaluate the quality of the images, we generate 50,000 samples from the diffusion model. Then we train a WRN-40-4 on these synthetic images (hyperparameters given in Table 4), and evaluate it on the MNIST test set.

Results. As reported in Table 1, this yields a state-of-the-art top-1 accuracy of 98.6%, compared to the 98.1% accuracy obtained by Dockhorn et al. (2022). Crucially, we find that to obtain this state-of-the-art result, it is important to bias the timestep sampling of the diffusion model at training time: this allows the model to get more training signal from the challenging phases of the generation process through diffusion. Without this biasing, we obtain a classification accuracy of only 98.2%.

5.4. Fine-tuning on a Medical Application (ImageNet32 → Camelyon17)

To show that fine-tuning works even in settings characterised by dataset shift from the pre-training distribution, we fine-tune a DP diffusion model on a medical dataset. Camelyon17 (Bandi et al., 2018; Koh et al., 2021) comprises 455,954 histopathological 96 × 96 image patches of lymph node tissue from five different hospitals.
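The biased timestep sampling described in the MNIST Method paragraph of Section 5.3 amounts to drawing from a mixture of uniform distributions over consecutive timestep intervals. The following is a minimal Python sketch of that idea, not the authors' implementation: the helper name `sample_timesteps` is hypothetical, and the intervals are treated as half-open integer ranges, which is an assumption because the text writes closed intervals that share their endpoints.

```python
import random


def sample_timesteps(n, bounds, weights, rng):
    """Draw n integer timesteps from a mixture of uniform distributions.

    bounds  : increasing interval edges, e.g. (0, 200, 800, 1000)
    weights : probability of each interval, e.g. (0.05, 0.9, 0.05)
    rng     : a random.Random instance (passed in for reproducibility)

    Assumption: interval i is the half-open range [bounds[i], bounds[i + 1]).
    """
    if len(weights) != len(bounds) - 1:
        raise ValueError("need exactly one weight per interval")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interval probabilities must sum to 1")
    # Pick an interval index for every sample according to the weights ...
    intervals = rng.choices(range(len(weights)), weights=weights, k=n)
    # ... then draw a timestep uniformly inside the chosen interval.
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in intervals]


# MNIST setting from Section 5.3: most of the mass sits on timesteps 200 to 800.
rng = random.Random(0)
timesteps = sample_timesteps(100_000, (0, 200, 800, 1000), (0.05, 0.9, 0.05), rng)
share_mid = sum(200 <= t < 800 for t in timesteps) / len(timesteps)
print(f"share of timesteps in [200, 800): {share_mid:.3f}")
```

The explicit check that the weights sum to one is a design choice of this sketch; it catches mistyped interval probabilities before any training step is run.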
The label signifies whether at least one pixel in the center 32 × 32 pixels has been identified as a tumor cell. Camelyon17 is also part of the WILDS (Koh et al., 2021) leaderboard as a domain generalization task: the training dataset contains 302,436 images from three different hospitals, whereas the validation and test sets contain 34,904 and 85,054 images from a fourth and fifth hospital, respectively. Since every hospital uses a different staining technique, it is easy to overfit to the hues of the training data with empirical risk minimisation. At the time of writing, the highest accuracy reported in the leaderboard of official submissions is 92.1% with a classifier that uses a special augmentation approach (Gao et al., 2022). The SOTA that does not fulfill the formal submission guidelines achieves up to 96.5% by pre-training on a large web image data set and finetuning only specific layers of the classification network (Kumar et al., 2022).

Method. First we pre-train an image diffusion model on ImageNet32, before fine-tuning it with (10, 3 · 10^-6)-DP on the training data, downsampled to 32 × 32, with a batch size of 16,384 for 200 steps. We tuned the hyperparameters to achieve the lowest FID on the training data, and used the out-of-distribution validation data to tune the downstream classifiers. The timestep is sampled with probability 0.015 from [0, 30], with probability 0.785 from [30, 600], and with probability 0.2 from [600, 1000]. Since the diffusion model is pre-trained on ImageNet, we assume that the data is also available for pre-training the classifier. The pre-trained classifier is then further fine-tuned on a synthetic dataset of the same size as the original training dataset, which we find to systematically improve results. The classifier is trained on augmented data, where the augmentations include flipping, rotation and color-jittering.

Results. We achieve close to state-of-the-art classification performance with 91.1% by training only on the synthetic data, whereas the best DP classifier we trained on the real dataset achieved only 90.5%. The synthetic dataset is thus not only useful for in-distribution classification; training on synthetic data is also an effective strategy to combat overfitting to domain artifacts and to generalise in out-of-distribution classification tasks, as noted elsewhere in the literature (Gheorghiță et al., 2022; Zhou et al., 2020).

5.5. Fine-tuning on Natural Images (ImageNet32 → CIFAR-10)

CIFAR-10 (Krizhevsky et al., 2009) is a natural image dataset of 10 different classes with 50,000 RGB training images of size 32 × 32.

Method. We use the same pre-trained model as for Camelyon17, that is, an image diffusion model with more than 80M parameters trained on ImageNet32. We tune the remaining hyperparameters by splitting the training data into a set of 45,000 images for training, and 5,000 images for assessing the validation performance based on FID. As for Camelyon17, we found that sampling the timestep with probability 0.015 in [0, 30], with 0.785 in [30, 600] and with 0.2 in [600, 1000] gives us the lowest FID. More details can be found in Table 3.

[Figure 2 | FID on CIFAR-10 for different privacy budgets. Our performance at 𝜖 = 5 beats the SOTA, when pre-training on ImageNet, for 𝜖 = ∞. Results for Harder et al. (2022) are taken from their paper. Legend: Ours; Harder et al., 2022. Vertical axis: FID.]

[Figure 3 | Downstream Top-1 accuracy of a CIFAR10 WRN-40-4 as function of the number of synthetic data samples used to train it. The accuracy increases considerably as a function of the dataset size. Legend: ensemble of 5 classifiers; single classifier. Horizontal axis: number of synthetic samples (200K to 1M). Vertical axis: test accuracy (%).]

Results.
We improve the SOTA FID with ImageNet pre-training (Harder et al., 2022) from 26.8 to 9.8, which is a reduction of more than 60%, and increase the downstream accuracy from 51.0% to 72.9% without pre-training the classifier. With pre-training the classifier on ImageNet32, we can achieve a classification accuracy of 86.6% with a single WRN. Modifying the timestep distribution led to a reduction in the FID from 11.6 to 9.8. Note that we achieved the results for different privacy levels by scaling the number of iterations linearly with 𝜖, and adjusting the noise level to the given privacy budget, keeping all other parameters the same. As detailed in Figure 2, we obtain state-of-the-art accuracy for a variety of privacy levels. Even for a budget as small as 𝜖 = 1, the FID obtained with our method is smaller than the current SOTA for 𝜖 = 10. These results can also be compared with the very recent work by Dockhorn et al. (2022), who report an FID of 97.7 when training diffusion models with differential privacy on CIFAR-10 without any pre-training. Due to the difficulty of learning the diffusion model with differential privacy from scratch, the model did not learn to generate CIFAR-10-like samples, and the generated images do not seem to display any clear CIFAR-10 class instances at all. We believe that such mixed results are a clear motivation for our proposed method of pre-training on public data, which makes the learning problem significantly more tractable and realistic, and allows us to obtain useful image generation. To further confirm that the diffusion model has correctly learned the distribution shift, we trained a ResNet18 model to discriminate images from CIFAR-10 and ImageNet32, achieving a test accuracy of 98.0% on that task. We then evaluated it on 50,000 synthetically generated images, of which 92% were classified as CIFAR-10 images.
This supports our hypothesis that the fine-tuned diffusion model does generate images that are more similar to CIFAR-10 than to the pre-training data of ImageNet.

5.6. Maximizing Downstream Prediction Performance by Sampling Arbitrarily Many Data Records (ImageNet32 → CIFAR-10)

Dataset sample size. One benefit of synthetic data generators is their ability to render infinitely many synthetic images. As such, there is no reason why the comparison of real and synthetic samples should be limited to predictive models trained on the same number of training samples. We, therefore, investigate whether the performance of a downstream predictor increases with more training images. In Figure 3, we observe that the downstream classification accuracy increases steadily as more synthetic training observations are generated. In particular, we increase the downstream classification accuracy from 72.9% to 86.0% by sampling 1M instead of 50K images, without pre-training the classifier. We note that this difference is much more significant on the more challenging dataset of CIFAR-10 than on e.g. MNIST, where we find that increasing the number of samples offers virtually no benefit in terms of downstream accuracy. We note that the classifier can also be pre-trained on the pre-training distribution, which increases performance for smaller data set sizes. On 50K samples, we achieve a classification accuracy of 86.6%. The predictive performance does not increase significantly when fine-tuning on more samples (see Figure 10 in the appendix). The benefit of pre-training the downstream classifier thus diminishes for 1M synthetic samples.

Ensembling. We observe that we can further improve the classification accuracy given by a single WRN classifier by instead ensembling five different networks that differ only in the subsampling of the minibatches. As reported in Table 1, we can achieve a test accuracy of 88% on CIFAR-10 using this approach (see Figure 3).

6.
Model Selection (ImageNet32 → CIFAR-10)

One important benefit of training a DP image generator over a DP classifier is the potential to use the generated data repeatedly for training a range of different prediction models and choosing the best one among them. Each experiment training a model on the data comes with a privacy cost, thus tuning a large number of DP classifiers increases the required privacy budget (Papernot and Steinke, 2021). In this section we consider how synthetic data can be used to gain initial insights on the choice of model, and to reduce the number of experiments run on the sensitive data records. The goal here is to identify the model that performs best on real data, while only having access to synthetic data. This becomes particularly relevant and useful when synthetic data needs to be released for research purposes (Chambon et al., 2022b), or data challenges (de Montjoye et al., 2014; Feuerverger et al., 2012; McFee et al., 2012). For this purpose, we train 3 different model architectures that are commonly employed on CIFAR-10 (Bu et al., 2022a; De et al., 2022): a WRN-28-8 (Zagoruyko and Komodakis, 2016), a ResNet50 (He et al., 2016), and a VGG (Simonyan and Zisserman, 2014). For each architecture, we sweep over combinations of three to five different learning rates and three different values of weight decay. Please refer to Table 6 for more details. We now assess the utility of synthetic data for model selection in two stages of increasing difficulty. First, we check whether models trained on the synthetic data rank similarly on the real and synthetic test data. This corresponds to the application setting where a third party tunes a model on the synthetic data, and releases a model that is trained on the same data.
Once we have established that the test performance on real and synthetic data is sufficiently correlated to ensure that a model ranks similarly no matter which data set it was evaluated on, we train each model separately on 50K real and on the same number of synthetic samples. In both cases, models are tested on the same source of data they have been learned on (with sources being real or synthetic here). We then assess whether the test performance on the real and synthetic data is still correlated between models of the same architecture and hyperparameter constellation. This corresponds to the setting where a third party is responsible for finding the best model pipeline, but the data curator trains the final model with the optimal hyperparameters for their own internal use.

[Figure 4 | We observe that models rank similarly when evaluated on synthetic and real data. This suggests that findings on hyperparameter selection made on synthetic data can be transferred to the private dataset. Panel (a): Test accuracy on synthetic data vs real data of models trained on synthetic data. Panel title: each dot is the same model, trained on synthetic samples, evaluated on different data sets. Axes: test accuracy on synthetic samples; test accuracy on real samples. Panel (b): Test accuracy of models trained and evaluated on synthetic data vs models trained and evaluated on real data. Panel title: each dot is a different model with the same hyperparameters, evaluated on real samples, trained on different data sets. Axes: test accuracy, trained on synthetic samples; test accuracy, trained on real samples.]

In Figure 4, we report our findings. We see that we can select the best or second best hyperparameter constellation in both application settings.
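One way to quantify whether models "rank similarly" on real and synthetic data is a rank correlation between the two sets of test accuracies, combined with a check of where the model picked on synthetic data lands in the real-data ranking. The sketch below is illustrative only: the paper does not state which statistic was used, Spearman's correlation is an assumption, the function names are hypothetical, and the accuracy values are made-up placeholders rather than results from the paper.

```python
def ranks(values):
    """Return 1-based ranks (1 = smallest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder accuracies for five hypothetical configurations (NOT from the paper).
synthetic_acc = [0.91, 0.88, 0.93, 0.85, 0.90]  # evaluated on synthetic test data
real_acc = [0.84, 0.82, 0.86, 0.78, 0.85]       # evaluated on real test data

rho = spearman(synthetic_acc, real_acc)
# Configuration that would be selected using synthetic data only ...
best_on_synthetic = max(range(len(synthetic_acc)), key=synthetic_acc.__getitem__)
# ... and its position (1 = best) in the ranking on real data.
real_rank_of_pick = sorted(real_acc, reverse=True).index(real_acc[best_on_synthetic]) + 1
print(f"Spearman rho = {rho:.2f}, real-data rank of selected model = {real_rank_of_pick}")
```

A high rank correlation together with a selected model that sits at or near the top of the real-data ranking corresponds to the "best or second best" outcome reported for Figure 4.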
More generally, we find that models rank similarly on both synthetic and real data, suggesting that findings with respect to relative model performance might transfer from the DP data to the original real data. However, we also notice that models overfit to the synthetic data distribution and that within one model group it is not obvious which hyperparameter constellation is the best. We therefore advise that synthetic data be used for a high-level orientation of the research direction. Note that the absolute test performance is not of interest here, only the correlation between the test performance on real and synthetic data, as the best model will be released externally.

7. Conclusion

DP image generation has long attracted interest as a way of sharing synthetic data sets in sensitive application domains. Because of the degradation in performance introduced by DP-SGD, successful results on DP image generation have been limited to small and low-complexity data sets, like MNIST. In this paper, we set out to scale DP image generation to 32 × 32 × 3 RGB image datasets. We proposed a methodology for DP diffusion models based on pre-training, timestep augmentation multiplicity, and a modified timestep sampling scheme. We are the first to train a DP image generator on a medical dataset, where we achieved a downstream classification accuracy of 91.1% that is close to the SOTA of 96.5% with training on the real data. What is more, we also increased the SOTA downstream classification accuracy on CIFAR-10 from 51.0% to 88.0%. Recently proposed methods like latent diffusion models (Rombach et al., 2022) constitute a promising model class for DP fine-tuning on higher dimensional datasets, and we hope that our findings can contribute to future research exploring this direction.
Finally, we questioned how DP synthetic data is currently evaluated in the literature, and proposed an evaluation framework that is more suited to the needs of practitioners who would use the DP synthetic data as a replacement for the private dataset. For this purpose, we first considered maximising the downstream prediction performance by generating up to 1M data samples, and training ensembles. Second, we showed that findings from hyperparameter tuning on synthetic data translate to the corresponding findings on the real data.

Acknowledgements

The authors would like to thank Sylvestre-Alvise Rebuffi for feedback on an earlier version of this manuscript; Zahra Ahmed for project management support; and Judy Shen, David Stutz, Isabela Albuquerque, Simon Geisler, Arnaud Doucet, Taylan Cemgil and Pushmeet Kohli for useful discussions throughout the project.

References

M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
H. Alatrista-Salas, P. Montalvo-Garcia, M. Nunez-del Prado, and J. Salas. Geolocated data generation and protection using generative adversarial networks. In International Conference on Modeling Decisions for Artificial Intelligence, pages 80–91. Springer, 2022.
H. Ali, S. Murad, and Z. Shah. Spot the fake lungs: Generating synthetic medical images using neural diffusion models. arXiv preprint arXiv:2211.00902, 2022.
S. Augenstein, H. B. McMahan, D. Ramage, S. Ramaswamy, P. Kairouz, M. Chen, R. Mathews, et al. Generative models for effective ml on private, decentralized datasets. arXiv preprint arXiv:1911.06679, 2019.
P. Bandi, O. Geessink, Q. Manson, M. Van Dijk, M. Balkenhol, M. Hermsen, B. E. Bejnordi, B. Lee, K. Paeng, A. Zhong, et al.
From detection of individual metastases to classification of lymph node status at the patient level: the camelyon17 challenge. IEEE Transactions on Medical Imaging, 2018.
A. Bie, G. Kamath, and G. Zhang. Private gans, revisited. In NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research, 2022.
M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.
Z. Bu, J. Mao, and S. Xu. Scalable and efficient training of large convolutional neural networks with differential privacy. arXiv preprint arXiv:2205.10683, 2022a.
Z. Bu, Y.-X. Wang, S. Zha, and G. Karypis. Automatic clipping: Differentially private deep learning made easier and stronger. arXiv preprint arXiv:2206.07136, 2022b.
T. Cao, A. Bie, A. Vahdat, S. Fidler, and K. Kreis. Don’t generate me: Training differentially private generative models with sinkhorn divergence. Advances in Neural Information Processing Systems, 34:12480–12492, 2021.
N. Carlini, J. Hayes, M. Nasr, M. Jagielski, V. Sehwag, F. Tramèr, B. Balle, D. Ippolito, and E. Wallace. Extracting training data from diffusion models, 2023. URL https://arxiv.org/abs/2301.13188.
Y. Cattan, C. A. Choquette-Choo, N. Papernot, and A. Thakurta. Fine-tuning with differential privacy necessitates an additional hyperparameter search. arXiv preprint arXiv:2210.02156, 2022.
P. Chambon, C. Bluethgen, J.-B. Delbrouck, R. Van der Sluijs, M. Połacin, J. M. Z. Chaves, T. M. Abraham, S. Purohit, C. P. Langlotz, and A. Chaudhari. Roentgen: Vision-language foundation model for chest x-ray generation. arXiv preprint arXiv:2211.12737, 2022a.
P. Chambon, C. Bluethgen, C. P. Langlotz, and A. Chaudhari. Adapting pretrained vision-language foundational models to medical imaging domains. arXiv preprint arXiv:2210.04133, 2022b.
D. Chen, T. Orekondy, and M. Fritz. Gs-wgan: A gradient-sanitized approach for learning differentially private generators. Advances in Neural Information Processing Systems, 33:12673–12684, 2020.
D. Chen, S.-c. S. Cheung, C.-N. Chuah, and S. Ozonoff. Differentially private generative adversarial networks with model inversion. In 2021 IEEE International Workshop on Information Forensics and Security (WIFS), pages 1–6. IEEE, 2021a.
D. Chen, R. Kerkouche, and M. Fritz. Private set generation with discriminative information. arXiv preprint arXiv:2211.04446, 2022a.
J.-W. Chen, C.-M. Yu, C.-C. Kao, T.-W. Pang, and C.-S. Lu. Dpgen: Differentially private generative energy-guided network for natural image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8387–8396, 2022b.
R. J. Chen, M. Y. Lu, T. Y. Chen, D. F. Williamson, and F. Mahmood. Synthetic data in machine learning for medicine and healthcare. Nature Biomedical Engineering, 5(6):493–497, 2021b.
P. Chrabaszcz, I. Loshchilov, and F. Hutter. A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819, 2017.
W. L. Croft, J.-R. Sack, and W. Shi. Differentially private facial obfuscation via generative adversarial networks. Future Generation Computer Systems, 129:358–379, 2022.
F. K. Dankar and M. Ibrahim. Fake it till you make it: Guidelines for effective synthetic data generation. Applied Sciences, 11(5):2158, 2021.
S. De, L. Berrada, J. Hayes, S. L. Smith, and B. Balle. Unlocking high-accuracy differentially private image classification through scale. arXiv preprint arXiv:2204.13650, 2022.
Y.-A. de Montjoye, Z. Smoreda, R. Trinquart, C. Ziemlicki, and V. D. Blondel. D4d-senegal: the second mobile phone data for development challenge. arXiv preprint arXiv:1407.4885, 2014.
P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
T. Dockhorn, T. Cao, A. Vahdat, and K. Kreis. Differentially private diffusion models. arXiv preprint arXiv:2210.09929, 2022. URL https://openreview.net/pdf?id=pX21pH4CsNB.
J. Duan, F. Kong, S. Wang, X. Shi, and K. Xu. Are diffusion models vulnerable to membership inference attacks?, 2023. URL https://arxiv.org/abs/2302.01316.
C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
L. Fan. A survey of differentially private generative adversarial networks. In The AAAI Workshop on Privacy-Preserving Artificial Intelligence, page 8, 2020.
L. Fan and A. Pokkunuru. Dpnet: Differentially private network traffic synthesis with generative adversarial networks. In IFIP Annual Conference on Data and Applications Security and Privacy, pages 3–21. Springer, 2021.
M. L. Fang, D. S. Dhami, and K. Kersting. Dp-ctgan: Differentially private medical data generation using ctgans. In International Conference on Artificial Intelligence in Medicine, pages 178–188. Springer, 2022.
A. Feuerverger, Y. He, and S. Khatri. Statistical significance of the netflix challenge. Statistical Science, 27(2):202–231, 2012.
S. Fort, A. Brock, R. Pascanu, S. De, and S. L. Smith. Drawing multiple augmentation samples per image during training efficiently decreases test error. arXiv preprint arXiv:2105.13343, 2021.
I. Gao, S. Sagawa, P. W. Koh, T. Hashimoto, and P. Liang. Out-of-distribution robustness via targeted augmentations. In NeurIPS 2022 Workshop on Distribution Shifts: Connecting Methods and Applications, 2022.
B. A. Gheorghiță, L. M. Itu, P. Sharma, C. Suciu, J. Wetzl, C. Geppert, M. A. A. Ali, A. M. Lee, S. K. Piechnik, S. Neubauer, et al. Improving robustness of automatic cardiac function quantification from cine magnetic resonance imaging using synthetic image data. Scientific reports, 12(1):1–12, 2022.
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
F. Harder, K. Adamczewski, and M. Park. Dp-merf: Differentially private mean embeddings with random features for practical privacy-preserving data generation. In International conference on artificial intelligence and statistics, pages 1819–1827. PMLR, 2021.
F. Harder, M. J. Asadabadi, D. J. Sutherland, and M. Park. Differentially private data generation needs better features. arXiv preprint arXiv:2205.12900, 2022.
J. Hayes, L. Melis, G. Danezis, and E. De Cristofaro. Logan: evaluating privacy leakage of generative models using generative adversarial networks. arXiv preprint arXiv:1705.07663, pages 506–519, 2017.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
D. Hendrycks, K. Lee, and M. Mazeika. Using pre-training can improve model robustness and uncertainty. In International Conference on Machine Learning, pages 2712–2721. PMLR, 2019.
M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
B. Hilprecht, M. Härterich, and D. Bernau. Monte carlo and reconstruction membership inference attacks against generative models. Proc. Priv. Enhancing Technol., 2019(4):232–249, 2019.
J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
H. Hu and J. Pang. Membership inference of diffusion models, 2023. URL https://arxiv.org/abs/2301.09956.
D. Jiang, G. Zhang, M. Karami, X. Chen, Y. Shao, and Y. Yu. Dp²-vae: Differentially private pre-trained variational autoencoders. arXiv preprint arXiv:2208.03409, 2022.
J. Jordon, J. Yoon, and M. Van Der Schaar. Pate-gan: Generating synthetic data with differential privacy guarantees. In International conference on learning representations, 2018.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Yasunaga, R. L. Phillips, I. Gao, T. Lee, E. David, I. Stavness, W. Guo, B. A. Earnshaw, I. S. Haque, S. Beery, J. Leskovec, A. Kundaje, E. Pierson, S. Levine, C. Finn, and P. Liang. WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning (ICML), 2021.
A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. Pre-print, 2009.
A. Kumar, R. Shen, S. Bubeck, and S. Gunasekar. How to fine-tune vision models with sgd. arXiv preprint arXiv:2211.09359, 2022.
T. Kynkäänniemi, T. Karras, M. Aittala, T. Aila, and J. Lehtinen. The role of imagenet classes in fréchet inception distance. arXiv preprint arXiv:2203.06026, 2022.
Y. LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
S. P. Liew, T. Takahashi, and M. Ueno. Pearl: Data synthesis via private embeddings and adversarial reconstruction learning. arXiv preprint arXiv:2106.04590, 2021.
B. Liu, Y. Zhu, K. Song, and A. Elgammal. Towards faster and stabilized gan training for high-fidelity few-shot image synthesis. In International Conference on Learning Representations, 2020.
Y. Liu, J. Peng, J. James, and Y. Wu. Ppgan: Privacy-preserving generative adversarial network. In 2019 IEEE 25th international conference on parallel and distributed systems (ICPADS), pages 985–989. IEEE, 2019.
A. Lugmayr, M. Danelljan, R. Timofte, K.-w. Kim, Y. Kim, J.-y. Lee, Z. Li, J. Pan, D. Shim, K.-U. Song, et al. Ntire 2022 challenge on learning the super-resolution space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 786–797, 2022.
B. McFee, T. Bertin-Mahieux, D. P. Ellis, and G. R. Lanckriet. The million song dataset challenge. In Proceedings of the 21st International Conference on World Wide Web, pages 909–916, 2012.
R. McKenna, B. Mullins, D. Sheldon, and G. Miklau. Aim: An adaptive and iterative mechanism for differentially private synthetic data. arXiv preprint arXiv:2201.12677, 2022.
H. B. McMahan, G. Andrew, U. Erlingsson, S. Chien, I. Mironov, N. Papernot, and P. Kairouz. A general approach to adding differential privacy to iterative training procedures. arXiv preprint arXiv:1812.06210, 2018.
N. Papernot and T. Steinke. Hyperparameter tuning with renyi differential privacy. arXiv preprint arXiv:2110.03620, 2021.
N. Patki, R. Wedge, and K. Veeramachaneni. The synthetic data vault. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 399–410. IEEE, 2016.
B. Pfitzner and B. Arnrich. Dpd-fvae: Synthetic data generation using federated variational autoencoders with differentially-private decoder. arXiv preprint arXiv:2211.11591, 2022.
W. H. Pinaya, P.-D. Tudosiu, J. Dafflon, P. F. Da Costa, V. Fernandez, P. Nachev, S. Ourselin, and M. J. Cardoso. Brain imaging generation with latent diffusion models. In MICCAI Workshop on Deep Generative Models, pages 117–126. Springer, 2022.
P. M. Radiuk. Impact of training set batch size on the performance of convolutional neural networks for diverse datasets. Information Technology and Management Science, 20(1):20–24, 2017.
M. Raghu, C. Zhang, J. Kleinberg, and S. Bengio. Transfusion: Understanding transfer learning for medical imaging. Advances in neural information processing systems, 32, 2019.
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
P. Rouzrokh, B. Khosravi, S. Faghani, M. Moassefi, S. Vahdati, and B. J. Erickson. Multitask brain tumor inpainting with diffusion models: A methodological report. arXiv preprint arXiv:2210.12113, 2022.
M. S. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly. Assessing generative models via precision and recall. Advances in neural information processing systems, 31, 2018.
T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
G. Somepalli, V. Singla, M. Goldblum, J. Geiping, and T. Goldstein. Diffusion art or digital forgery? investigating data replication in diffusion models. arXiv preprint arXiv:2212.03860, 2022.
Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
T. Stadler and C. Troncoso. Why the search for a privacy-preserving data sharing mechanism is failing. Nature Computational Science, 2(4):208–210, 2022.
T. Stadler, B. Oprisanu, and C. Troncoso. Synthetic data–anonymisation groundhog day. arXiv preprint arXiv:2011.07018, 2021.
A. Torfi, E. A. Fox, and C. K. Reddy. Differentially private synthetic medical data generation using convolutional gans. Information Sciences, 586:485–500, 2022.
R. Torkzadehmahani, P. Kairouz, and B. Paten. Dp-cgan: Differentially private synthetic data and label generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
F. Tramèr, G. Kamath, and N. Carlini. Considerations for differentially private learning with large-scale public pretraining. arXiv preprint arXiv:2212.06470, 2022.
B. Wang, F. Wu, Y. Long, L. Rimanic, C. Zhang, and B. Li. Datalens: Scalable privacy preserving training via gradient compression and aggregation. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pages 2146–2168, 2021.
D. Wang, M. Ye, and J. Xu. Differentially private empirical risk minimization revisited: Faster and more general. Advances in Neural Information Processing Systems, 30, 2017.
L. Xie, K. Lin, S. Wang, F. Wang, and J. Zhou. Differentially private generative adversarial network. arXiv preprint arXiv:1802.06739, 2018.
C. Xu, J. Ren, D. Zhang, Y. Zhang, Z. Qin, and K. Ren. Ganobfuscator: Mitigating information leakage under gan via differential privacy. IEEE Transactions on Information Forensics and Security, 14(9):2358–2371, 2019.
Z. Xu, M. Collins, Y. Wang, L. Panait, S. Oh, S. Augenstein, T. Liu, F. Schroff, and H. B. McMahan. Learning to generate image embeddings with user-level differential privacy. arXiv preprint arXiv:2211.10844, 2022.
C. Yan, Y. Yan, Z. Wan, Z. Zhang, L. Omberg, J. Guinney, S. D. Mooney, and B. A. Malin. A multifaceted benchmarking of synthetic electronic health record generation models. Nature Communications, 13(1):1–18, 2022.
J. Yoon, L. N. Drumright, and M. Van Der Schaar. Anonymization through data synthesis using generative adversarial networks (ads-gan). IEEE journal of biomedical and health informatics, 24(8):2378–2388, 2020.
D. Yu, H. Zhang, W. Chen, J. Yin, and T.-Y. Liu. Large scale private learning via low-rank reparametrization. In International Conference on Machine Learning, pages 12208–12218. PMLR, 2021.
S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
L.
Zhang, B. Shen, A. Barnawi, S. Xi, N. Kumar, and Y. Wu. Feddpgan: federated differentially private generative adversarial networks framework for the detection of covid-19 pneumonia. Information Systems Frontiers, 23 (6):1403–1415, 2021. X. Zhang, H. Gu, L. Fan, K. Chen, and Q. Yang. No free lunch theorem for security and utility in federated learning. arXiv preprint arXiv:2203.05816, 2022. K. Zhou, Y. Yang, T. Hospedales, and T. Xiang. Deep domain-adversarial image generation for domain generalisation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13025–13032, 2020. 16 Differentially Private Diffusion Models Generate Useful Synthetic Images Supplementary Materials A. Comparison to Related Work Please refer to Table 2 for a summary of details that differentiate our work from that of Dockhorn et al. (2022). Ours Dockhorn et al. (2022) Analysed for finetuning Maximal parameter count Time step scale Classifier guidance Modifications to time step sampling Augmentation strategies Number of samples for augmentation multiplicity Maximal batch size Evaluation False 1.8M discrete True False time step 32 2,048 True 80.4M continuous False True time step, random crop, flipping 128 16,384 FID, downstream accuracy, FID, downstream accuracy maximal downstream performance hyperparameter tuning Table 2 | Comparison of our work and Dockhorn et al. (2022). B. Additional Experimental Details and Results Here we present additional experimental details to back up the findings presented in the main paper. We present our hyperparameters in Tables 3, 4, 5 and 6. Some of our DP synthetic images can be found in Figures 5, 6 and 7. As a baseline, we also added samples from Dockhorn et al. (2022) when training from scratch on CIFAR-10 in Figure 8. This illustrates the importance of pre-training in the large scale DP image generation setting. Please refer to Figure 9 for an illustration of the utility of augmentation multiplicity. 
Finally, we show with class-wise metrics in Figure 11 that our model is not overfitting to a single class.

| | ImageNet32 | MNIST | CIFAR-10 | Camelyon17 |
| --- | --- | --- | --- | --- |
| Pre-training data set | | | ImageNet32 | ImageNet32 |
| Privacy budget $(\epsilon, \delta)$ | $\infty$ | $(10, 10^{-5})$ | (*varies, $10^{-5}$) | $(10, 10^{-5})$ |
| Iterations | 200k | 4,000 | 200 | 200 |
| Clipping norm | | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ |
| Noise schedule | linear | linear | linear | linear |
| Model size | 80.4M | 4.2M | 80.4M | 80.4M |
| Channels | 192 | 64 | 192 | 192 |
| Depth | 2 | 1 | 2 | 2 |
| Channels multiple | 1,2,2,2 | 1,2,2 | 1,2,2,2 | 1,2,2,2 |
| Heads channels | 64 | 64 | 64 | 64 |
| Attention resolution | 16 | 16 | 16 | 16 |
| Batch size | 1,024 | 4,096 | 16,384 | 16,384 |
| Learning rate | | $5 \times 10^{-4}$ | $10^{-3}$ | $10^{-3}$ |
| Optimizer | Adam | Adam | Adam | Adam |
| Scheduler | linear (from 0 to LR in 5K steps) | constant | constant | constant |
| $w_1, w_2, w_3$ | 0.03, 0.77, 0.2 | 0.05, 0.9, 0.05 | 0.015, 0.785, 0.2 | 0.015, 0.785, 0.2 |
| $l_1, u_1 = l_2, u_2 = l_3, u_3$ | 0, 30, 800, 1000 | 0, 200, 800, 1000 | 0, 30, 600, 1000 | 0, 30, 600, 1000 |
| # Augmentation samples | 0 | 128 (timestep) | 128 (timestep, flip) | 128 (timestep, flip) |
| Exponential moving average | 0.9999 | 0.9999 | 0.9999 | 0.9999 |

Table 3 | Hyperparameters for diffusion models. The scale of the gradient noise is adjusted to ensure the desired privacy budget. *We report results for $\epsilon \in \{1, 5, 10, 32\}$.

Figure 5 | Random samples drawn from a DP image diffusion model trained on MNIST.

Figure 6 | Random samples drawn from a DP image diffusion model trained on Camelyon17.

Figure 7 | Random samples drawn from a DP image diffusion model trained on CIFAR-10.

Figure 8 | Samples from CIFAR-10 as provided by Dockhorn et al. (2022) in Appendix F Rebuttal Discussions.
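The weights w1, w2, w3 and the boundaries l1, u1 = l2, u2 = l3, u3 in Table 3 can be read as the parameters of a mixture of uniform distributions over diffusion time steps. The following is a minimal sketch under that reading, using the values from the CIFAR-10 column. The function name `sample_timestep`, the half-open integer intervals and the use of Python's standard `random` module are assumptions made for illustration and are not taken from the paper's implementation.

```python
import random


def sample_timestep(weights, bounds, rng):
    """Draw one integer time step from a mixture of uniform distributions.

    weights: mixture weights (w1, w2, w3), expected to sum to 1.
    bounds:  increasing interval boundaries (l1, u1 = l2, u2 = l3, u3).
    rng:     a random.Random instance, so that sampling is reproducible.
    """
    if len(bounds) != len(weights) + 1:
        raise ValueError("need exactly one more boundary than weights")
    # Pick mixture component i with probability weights[i].
    i = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Assumption: component i is uniform on the integers in [bounds[i], bounds[i + 1]).
    return rng.randrange(bounds[i], bounds[i + 1])


# Values from the CIFAR-10 column of Table 3.
CIFAR10_WEIGHTS = (0.015, 0.785, 0.2)
CIFAR10_BOUNDS = (0, 30, 600, 1000)

rng = random.Random(0)
samples = [sample_timestep(CIFAR10_WEIGHTS, CIFAR10_BOUNDS, rng) for _ in range(20000)]
middle_fraction = sum(30 <= t < 600 for t in samples) / len(samples)
print(f"fraction of samples in [30, 600): {middle_fraction:.3f}")
```

Under this reading, most of the training signal for CIFAR-10 and Camelyon17 comes from the middle interval, while the first 30 time steps are sampled only rarely.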
Figure 9 | FID on CIFAR-10 for different numbers of augmentation multiplicity samples (x-axis: number of samples for augmentation multiplicity, 16 to 256; y-axis: FID). We see that the FID decreases for values up to 256.

Figure 10 | Downstream accuracy of an ImageNet32 pre-trained CIFAR-10 WRN-40-4 as a function of the number of synthetic data samples used to train it (x-axis: number of synthetic samples, 200K to 1M; y-axis: test accuracy (%); curves: ensemble of 5 classifiers (pre-trained) and single classifier (pre-trained)).

(a) Percentage of samples per class classified as CIFAR-10 images by a ResNet18 model trained to discriminate CIFAR-10 and ImageNet32 images.
(b) FID per class.
Figure 11 | Class-wise metrics on CIFAR-10 (classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck).

| | MNIST | CIFAR-10 | Camelyon17 |
| --- | --- | --- | --- |
| Pre-training data set | | ImageNet32 | |
| Iterations | 10,000 | 20,000 | 4,000 |
| Depth | 40 | 40 | 40 |
| Width | 4 | 4 | 4 |
| Dropout | 0.5 | 0.0 | 0.0 |
| Weight decay | $5 \cdot 10^{-4}$ | $5 \cdot 10^{-4}$ | $1 \cdot 10^{-5}$ |
| Label smoothing | 0.05 | 0.0 | 0.0 |
| Batch size | 64 | 4,096 | 512 |
| Learning rate | 0.1 | cosine decay (start at 0.1 with $\alpha = 0$, 1 decay step) | $10^{-4}$ |
| Optimizer | SGD | SGD | SGD |
| Nesterov's momentum | 0.9 | 0.9 | 0.9 |
| Samples augmentation multiplicity | 0 | 0 | 16 |
| Exponential moving average | 0.9999 | 0.9999 | 0.9999 |
| # Augmentation samples | 1 (crop) | 16 (crop, flip, color) | 16 (color, flip, crop) |

Table 4 | Hyperparameters for downstream classification WRNs trained on synthetic data.
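Augmentation multiplicity, which appears in Figure 9 and in the hyperparameter tables, is commonly implemented in DP training by averaging each example's gradient over several augmented views before clipping, so that each example still contributes at most the clipping norm. The sketch below illustrates that idea on plain Python lists. It is not the authors' code: the names `clip_by_l2_norm`, `dp_mean_gradient`, `grad_fn` and `augment_fn` are hypothetical, and the noise standard deviation is assumed to be expressed as a multiple of the clipping norm.

```python
import math
import random


def clip_by_l2_norm(vec, max_norm):
    """Rescale vec so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0.0 else 1.0
    return [v * scale for v in vec]


def dp_mean_gradient(examples, grad_fn, augment_fn, num_aug,
                     clip_norm, noise_multiplier, rng):
    """One noisy gradient estimate with augmentation multiplicity.

    grad_fn(x) returns the gradient for one (augmented) example as a list.
    augment_fn(x, rng) returns a randomly augmented copy of x.
    examples must be non-empty.
    """
    total = None
    for x in examples:
        # Average the gradient over num_aug augmented views of the same example.
        grads = [grad_fn(augment_fn(x, rng)) for _ in range(num_aug)]
        mean_grad = [sum(column) / num_aug for column in zip(*grads)]
        # Clip the averaged gradient, so each example contributes at most clip_norm.
        clipped = clip_by_l2_norm(mean_grad, clip_norm)
        total = clipped if total is None else [a + b for a, b in zip(total, clipped)]
    # Add Gaussian noise scaled to the clipping norm, then average over the batch.
    sigma = noise_multiplier * clip_norm
    return [(t + rng.gauss(0.0, sigma)) / len(examples) for t in total]


# Toy usage: the "gradient" of a scalar example x is [x, 2x]; augmentation adds jitter.
rng = random.Random(0)
noiseless = dp_mean_gradient(
    examples=[10.0, 3.0],
    grad_fn=lambda x: [x, 2.0 * x],
    augment_fn=lambda x, r: x + r.uniform(-0.1, 0.1),
    num_aug=16, clip_norm=1.0, noise_multiplier=0.0, rng=rng,
)
print(noiseless)
```

Because clipping is applied after averaging over the augmented views, increasing `num_aug` reduces the variance of each per-example gradient without changing the sensitivity that the noise has to cover.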
| | MNIST | Camelyon17 |
| --- | --- | --- |
| Privacy budget $(\epsilon, \delta)$ | $(10, 10^{-5})$ | $(10, 10^{-5})$ |
| Pre-training data set | | ImageNet32 |
| Clipping norm | 1 | 1 |
| Noise standard deviation | 3.0 | 4.0 |
| Depth | 16 | 40 |
| Width | 4 | 4 |
| Batch size | 16,384 | 4,096 |
| Learning rate | 4.0 | 0.5 |
| Optimizer | SGD | SGD |
| Samples augmentation multiplicity | 0 | 16 (color, flip, crop) |
| Exponential moving average | 0.9999 | 0.9999 |

Table 5 | Hyperparameters for DP WRN classifiers trained on the sensitive data records. Training was stopped once $\epsilon = 10$ was reached.

| | ResNet50 | VGG | WRN-16-4 |
| --- | --- | --- | --- |
| Learning rate | $5 \cdot 10^{-4}$, $2 \cdot 10^{-3}$, $4 \cdot 10^{-3}$ | 0.07, 0.04, 0.025, 0.01, $5 \cdot 10^{-3}$ | 0.01, 0.02, 0.03 |
| Weight decay | 0, 0.1, 0.01 | 0.0, $10^{-3}$, $5 \cdot 10^{-3}$ | 0.0, 0.1, 0.01 |
| Iterations | 15,000 | 15,000 | 15,000 |
| Label smoothing | 0.05 | 0.0 | 0.0 |
| Batch size | 128 | 128 | 128 |
| Optimizer | SGD | SGD | SGD |
| Momentum | 0.0 | 0.9 | 0.9 |
| Scheduler | cosine decay* | constant | cosine decay* |
| Samples augmentation multiplicity | 16 (flip, crop) | 0 | 16 (flip, crop) |
| Exponential moving average | 0.9999 | 0.9999 | 0.9999 |

Table 6 | Hyperparameters for classifiers for the model selection experiment. All combinations of the values displayed for learning rate and weight decay were trained. *Cosine decay schedule with the initial learning rate swept over, 1 decay step, and $\alpha = 0.0$.
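Table 6 states that every combination of learning rate and weight decay was trained for each architecture. As a rough illustration of the size of that sweep, the sketch below enumerates the combinations with `itertools.product`. The dictionary layout and the name `enumerate_configs` are hypothetical, and the values are transcribed from Table 6 in the order in which they appear there.

```python
from itertools import product

# Learning rates and weight decays transcribed from Table 6.
SWEEP = {
    "ResNet50": {
        "learning_rate": [5e-4, 2e-3, 4e-3],
        "weight_decay": [0.0, 0.1, 0.01],
    },
    "VGG": {
        "learning_rate": [0.07, 0.04, 0.025, 0.01, 5e-3],
        "weight_decay": [0.0, 1e-3, 5e-3],
    },
    "WRN-16-4": {
        "learning_rate": [0.01, 0.02, 0.03],
        "weight_decay": [0.0, 0.1, 0.01],
    },
}


def enumerate_configs(sweep):
    """Return one dict per (architecture, learning rate, weight decay) combination."""
    configs = []
    for arch, grid in sweep.items():
        for lr, wd in product(grid["learning_rate"], grid["weight_decay"]):
            configs.append({"arch": arch, "learning_rate": lr, "weight_decay": wd})
    return configs


configs = enumerate_configs(SWEEP)
print(len(configs))  # 9 + 15 + 9 = 33 classifier configurations
```

Each of these configurations would then be trained once on real data and once on synthetic data in order to compare the resulting model rankings.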