Unlocking Accuracy and Fairness in Differentially Private Image Classification

Leonard Berrada*,1, Soham De*,1, Judy Hanwen Shen*,2,†, Jamie Hayes1, Robert Stanforth1, David Stutz1, Pushmeet Kohli1, Samuel L. Smith1 and Borja Balle1
* Equal contributions, 1 Google DeepMind, London, UK, 2 Computer Science Department, Stanford University, Palo Alto, California, USA, † Work done while at Google DeepMind
Corresponding author(s): {lberrada | bballe}@google.com

Privacy-preserving machine learning aims to train models on private data without leaking sensitive information. Differential privacy (DP) is considered the gold standard framework for privacy-preserving training, as it provides formal privacy guarantees. However, compared to their non-private counterparts, models trained with DP often have significantly reduced accuracy. Private classifiers are also believed to exhibit larger performance disparities across subpopulations, raising fairness concerns. The poor performance of classifiers trained with DP has prevented the widespread adoption of privacy-preserving machine learning in industry. Here we show that pre-trained foundation models fine-tuned with DP can achieve similar accuracy to non-private classifiers, even in the presence of significant distribution shifts between pre-training data and downstream tasks. We achieve private accuracies within a few percent of the non-private state of the art across four datasets, including two medical imaging benchmarks. Furthermore, our private medical classifiers do not exhibit larger performance disparities across demographic groups than non-private models. This milestone towards making DP training a practical and reliable technology has the potential to enable machine learning practitioners to train safely on sensitive datasets while protecting individuals' privacy.

1. Introduction

Neural networks containing hundreds of millions or even billions of parameters achieve state-of-the-art performance across a wide range of machine learning tasks [71, 15, 124]. However, these large models are known to memorize parts of their training data [125], opening the door to leakage of sensitive information. Privacy attacks capable of extracting memorized training data have been demonstrated on language models [20], diffusion models [19] and image classification models [8], while membership attacks, which detect whether a datapoint was used to train a particular model, are successful on multiple architectures and data modalities [17]. Mitigating the leakage of sensitive training data is a critical concern in applications where access to private, confidential or proprietary data is key to improving machine learning capabilities (e.g. healthcare, recommendation systems, mobility). Models trained on such data cannot be safely deployed unless these concerns are addressed.

Designing effective mitigations to prevent leakage of sensitive training data is far from straightforward. The failures of classical anonymization techniques against re-identification attacks when adversaries have access to side knowledge [83] highlight the importance of considering exceedingly pessimistic threat models to ensure privacy mitigations are future-proof. Differential privacy (DP) [33] has emerged as the gold standard for protecting individual privacy in data processing algorithms, including machine learning methods.
The information-theoretic guarantee provided by DP limits the amount of information any attacker will be able to extract about individual input datapoints after observing the algorithm's output, regardless of what side knowledge about the data they obtain from other sources or how much computational power they have access to. The increasing adoption of DP across governmental [16] and industrial [130, 105, 77] applications underlines its versatility and stems, in part, from its ability to provide quantifiable worst-case privacy guarantees. However, there is no free lunch: models with strong DP guarantees achieve lower accuracy than their non-private counterparts. Unfortunately for deep learning, achieving strong privacy protections with DP is harder when models are larger and when the data dimension is higher: more information needs to be hidden and there are more parameters which might reveal this information. As a consequence, existing practical deployments of DP learning are currently restricted to relatively small models and simple tasks.

The large gap between the accuracy attainable with private and non-private learning is one of two main obstacles to unlocking the routine use of DP for training models on sensitive data. The second obstacle is the observation that DP training can exacerbate disparities in model accuracy across subpopulations. Challenges in data underrepresentation and quality are known to induce unjustified disparities in accuracy across certain subpopulations in non-private models, and researchers have expressed concerns that DP training can further exacerbate such disparities [6]. The disparities observed in models trained with DP highlight a potential, undesirable trade-off between privacy and fairness, especially in applications like healthcare where models inform consequential decisions and require access to private training data.

Our work takes a major step towards resolving these two problems in the context of image classification, a task where deep neural networks are ubiquitously used and often handle sensitive data (e.g. personal pictures, scanned documents or medical diagnostic images). To achieve this, we follow the dominant paradigm in the deep learning community consisting of fine-tuning networks pre-trained on large general-purpose datasets [12]. Our first contribution is a reliable and accurate method for DP fine-tuning of large vision models pre-trained on non-sensitive data, which we evaluate on four challenging image classification benchmarks. We demonstrate that our method achieves substantially better accuracy than previously thought possible, often reaching the practical performance of previously deployed (non-private) models. On all four tasks, we achieve private accuracies within a few percent of the non-private state of the art (Figure 1). Our evaluation includes two medical imaging classification benchmarks where we are not only able to substantially reduce the gap between the best private and non-private models, but also illustrate our second contribution: demonstrating that our highly accurate private models exhibit disparities across subpopulations which are no larger than those we observe in non-private models with comparable accuracy.
Moreover, our results on medical imaging also highlight the remarkable effectiveness of privately fine-tuning foundation models pre-trained on non-sensitive data from a distribution significantly different to that of the private data. This observation is critical, since there is often very little public data from a similar distribution to private data in practical applications [107].

Demonstrating the effectiveness of DP training techniques in image classification has far-reaching implications to ensure the benefits provided by models trained on sensitive data can be leveraged without compromising the privacy of the training data. Importantly, our approach aligns closely with standard practices used in industrial deep learning applications, including the use of models, algorithms and pre-training protocols which follow the same paradigms as standard deep learning frameworks [5, 86]. This ensures that our methods will remain relevant by being able to incorporate future advances in deep learning, including improved network architectures or pre-trained models. Altogether, this represents a significant milestone towards a new paradigm where deep learning practitioners can routinely leverage the formal privacy guarantees offered by DP to protect sensitive data when training machine learning models.

Figure 1 | Overview of differential privacy and our results. (Top/Left) Differential privacy protects individuals in the training dataset by enforcing training outputs to be indistinguishable (up to a maximum privacy loss parameter ε) on pairs of datasets differing in a single individual. (Top/Right) Effect of the DP protection on reconstruction attacks against a classification model trained with and without DP on medical chest X-ray images. (Middle) We demonstrate it is possible to obtain highly accurate models with DP (at ε=8) that closely match the accuracy of the best models trained without privacy constraint. Results evaluated on the standard test set for each task. (Bottom) Our most accurate private models for chest X-ray classification exhibit AUC disparities (i.e. differences between population and group AUC) across demographic attributes, such as sex and age range, comparable to those of non-private classifiers. Distributions over subgroups and training randomness (20 seeds), metrics evaluated on internal test set split. Details in Appendices.

2. Challenges of learning under differential privacy

The strength of the DP guarantee is controlled through a privacy parameter ε; the smaller the value of ε, the smaller the risk that information about a training example will be revealed or that it is even memorized.1 This parameter controls the worst-case privacy loss experienced by any individual in the training dataset; typical values used in practice are in the range 1 ≤ ε ≤ 10 [28].
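For reference, a brief sketch of the standard (ε, δ)-DP guarantee that the next paragraph alludes to (the paper's formal treatment is in its Appendices): a randomized training algorithm M is (ε, δ)-differentially private if, for every pair of datasets D and D′ differing in a single individual and every set S of possible output models,

    Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ,

where δ is a small slack probability, typically chosen much smaller than 1/N for a dataset of N individuals. Smaller ε therefore forces the distributions of models trained on the two datasets to be nearly indistinguishable.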
A DP guarantee can only be obtained if the training algorithm is stochastic, and the formal definition of the guarantee involves the log-likelihood ratio of any potential output model over pairs of datasets differing in a single individual, but more intuitive interpretations bounding the success of worst-case privacy attacks can also be obtained. We provide one such interpretation in Figure 2, which expresses the guarantee in terms of the strength of the side knowledge the adversary must obtain from other sources before being able to identify a training example (a detailed discussion of the privacy guarantee provided by DP is offered in the Appendices).

The most popular DP training technique in deep learning is differentially private stochastic gradient descent (DP-SGD) [1], an iterative gradient-based method where the gradients used to update the model parameters are privatized by obfuscating the contribution of each individual example in the mini-batch through clipping and noise addition (see Figure 2). The scale of the noise controls how much the contribution of each example to the mean gradient is obfuscated, and the strength of the privacy guarantee ε depends on this noise scale, the batch size, the number of training samples and the number of training iterations. Since DP-SGD is almost a drop-in replacement for SGD, it is currently the best candidate to train a wide range of machine learning models with meaningful privacy guarantees with minimal modifications to existing pipelines. However, it presents two major difficulties that have so far precluded its widespread adoption by practitioners.

The first difficulty is to attain high accuracy with models trained with differential privacy. Achieving the DP guarantee requires the injection of carefully crafted random noise into the training algorithm, and the maximum number of training iterations needs to be constrained. Training with noisy gradients in combination with a limited number of iterations makes optimization very challenging [106]. This noisy regime also affects the adequacy of hyper-parameter choices and other best practices that have been carefully selected by deep learning practitioners to train accurate models without privacy constraints [2]. Standard hyper-parameter choices need to be re-thought to obtain high accuracy under the challenging conditions of DP training. Furthermore, since the noise is added independently to each coordinate of the gradient, its Euclidean norm grows with the number of parameters in the model. As a result, privacy researchers believe that DP-SGD will perform increasingly poorly as the model size increases [106, 121]. This is a major challenge, since highly over-parameterized deep neural networks are currently dominant in the artificial intelligence community, achieving excellent performance across a wide range of tasks and data domains.

The second major difficulty is overcoming fairness issues associated with deep neural networks, one type of which is manifested through unjustified accuracy disparities across subpopulations. Even without differential privacy, machine learning models that automatically detect diseases in chest X-rays can underdiagnose underserved patient groups defined by race, age, sex, or insurance type [95]. In the context of private models, prior works have suggested that differential privacy mechanisms can incur a significant additional fairness penalty by increasing the accuracy disparity between different subgroups [6, 93].
While multiple hypotheses have been proposed to explain the precise cause of these disparities, unbalanced and complex subgroup data have been commonly highlighted as obstacles to achieving both privacy and fairness [35, 94]. This phenomenon is perhaps not surprising since DP limits the amount of information that can be extracted from individual data points while still enabling learning across larger populations; strong differential privacy guarantees may therefore be in tension with the ability to accurately learn from small subgroups [24]. In real world medical datasets where minority groups are often underrepresented and intersectional groups can be arbitrarily small, disparities that arise in non-private models may be exacerbated when using differentially private training [101].

1 The term "memorization" is used throughout as a shorthand convenience to describe a broad range of phenomena whereby the weights of a model capture enough information to enable inferences about some individual training data points. There is active discussion within the technical and legal communities about whether the presence of this type of "memorization" suggests that neural networks "contain" their training data.

Figure 2 | The DP-SGD algorithm and how to interpret its privacy guarantees. (Top) Differentially private SGD updates model parameters using a similar procedure to standard SGD training, with the difference that gradients are privatized through per-example clipping and the addition of isotropic Gaussian noise. These privatized gradients can also be used with other standard optimizers such as Adam. After training for T iterations with noise σ and batch-size B on a dataset with N individuals, DP-SGD provides a DP guarantee with ε ≈ B√T/(σN). (Bottom) We can interpret the privacy loss ε of DP-SGD as the amount of side knowledge required by an adversary whose goal is to successfully identify which individual from a number of (equally likely) choices was part of the training data. More side knowledge means the adversary managed to narrow down the choice to a smaller set of candidate individuals: we plot the minimal number of choices that provably prevent successful identification (with probability > 50%) as a function of ε. For example, at ε=8 a target individual in a dataset is protected as long as the adversary is unable to narrow down the individual's identity to a group with fewer than 18 individuals.
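To make the privatization step in Figure 2 concrete, the following is a minimal sketch of per-example clipping and Gaussian noising on a mini-batch of flattened per-example gradients. It is illustrative only, with hypothetical function and variable names in plain NumPy; it is not the paper's open-sourced implementation.

```python
import numpy as np

def privatize_gradients(per_example_grads, clip_norm, noise_multiplier, rng):
    """One DP-SGD privatization step (illustrative sketch, not the paper's code).

    per_example_grads: array of shape [batch_size, num_params], one flattened
    gradient per example in the mini-batch.
    """
    # 1. Clip each per-example gradient to L2 norm at most clip_norm, so no single
    #    example can contribute more than clip_norm to the summed gradient.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))

    # 2. Sum the clipped gradients and add isotropic Gaussian noise whose standard
    #    deviation is the noise multiplier times the clipping norm.
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=clipped.shape[1])

    # 3. Normalize by the batch size; the result can be fed to any standard
    #    optimizer (SGD, Adam, ...) in place of the usual mini-batch gradient.
    return noisy_sum / per_example_grads.shape[0]

# Example: a batch of 8 per-example gradients for a 10-parameter model.
rng = np.random.default_rng(0)
grads = rng.normal(size=(8, 10))
private_grad = privatize_gradients(grads, clip_norm=1.0, noise_multiplier=1.2, rng=rng)
```

The noise multiplier, batch size, dataset size and number of iterations together determine the resulting ε via a privacy accountant, consistent with the ε ≈ B√T/(σN) heuristic quoted in the Figure 2 caption.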
3. Training highly accurate models with differential privacy

Our first goal is to demonstrate that DP training can achieve practical levels of accuracy, i.e., accuracies close to the state-of-the-art achieved by non-private training. We identify four main elements that improve the accuracy of models trained with DP. In order of importance, these are (1) ensuring good signal propagation in models without batch-normalization [52] (a popular component of vision models which is incompatible with DP-SGD), (2) using significantly larger batch sizes than is standard in non-private training (as corroborated in other contexts by [74, 2]), (3) carefully tuning the noise multiplier hyperparameter in DP-SGD (but not the clipping norm), and (4) improving model convergence with parameter averaging (see details in Appendices).

We first demonstrate that our approach is able to learn highly accurate image classifiers on medical images with strong privacy guarantees (a domain where privacy concerns naturally arise). This landmark result shows the ability of DP training to mitigate privacy concerns while offering high levels of accuracy in a practical and challenging scenario. To that end, we pre-train an NFNet-F0 model (72M parameters) [14] on ImageNet-21K, a public superset of the popular ImageNet dataset containing 14 million images [90]. The NFNet model family is ideally suited to private training, since it was specifically designed to achieve high accuracies without batch-normalization. We fine-tune the model with DP-SGD on CheXpert [53] and MIMIC-CXR [60]. CheXpert is a public classification benchmark containing 224k chest X-rays, labeled with up to 14 clinically relevant observations. MIMIC-CXR is a similar dataset with 377k chest X-rays with the same label categories as CheXpert.

As shown in Figure 3, our fine-tuned image classifiers are able to reach high accuracies. Specifically, on CheXpert, our model fine-tuned with differential privacy at ε=8 obtains 89.24% AUC, to be compared to 89.97% for the pre-trained model fine-tuned without privacy, and to 93.0% for the state-of-the-art, achieved by ensembles of models trained without any privacy constraints [123]. Similar results also hold for MIMIC-CXR, where our private model obtains 79.53% AUC at ε=8, 81.14% without privacy, and for which the published state-of-the-art obtains 84.04% [63] by using a more complicated architecture that segments the images as part of its processing. Remarkably, when imposing stringent privacy guarantees, e.g. a privacy budget of only ε=1, it is still possible to obtain useful levels of accuracy, with an AUC of 86.34% on CheXpert and 76.40% on MIMIC-CXR. These results demonstrate that it is possible to achieve practical levels of accuracy with strong differential privacy guarantees on large-scale medical image classification tasks that are significantly more challenging than previous work [129, 94].
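As a point of reference for the metric reported above, the macro AUC of a multi-label chest X-ray classifier is conventionally obtained by averaging one-vs-rest AUCs over the label set. The sketch below uses scikit-learn with placeholder arrays; the paper's exact evaluation protocol (e.g. which labels are scored) is described in its Appendices.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Placeholder data: y_true holds binary labels for each of the 14 observations
# (e.g. the CheXpert label set) and y_score holds the model's predicted probabilities.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(1000, 14))
y_score = rng.random(size=(1000, 14))

# Macro AUC: compute a one-vs-rest AUC per observation, then average them with
# equal weight across observations.
macro_auc = roc_auc_score(y_true, y_score, average="macro")
print(f"macro AUC: {100 * macro_auc:.2f}%")
```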
To demonstrate that these results are not confined to the medical image domain, we further consider two popular academic benchmarks: ImageNet and Places-365. ImageNet is a dataset of 1.3 million images from 1000 classes, corresponding to simple objects like "car", "dolphin" or "ocean liner" [91]. It is widely recognized as one of the most important benchmarks in computer vision, providing extremely strong baselines for image classifiers trained without differential privacy against which we can compare our private models. Places-365 is a large-scale dataset for the recognition of 365 different scenes [127], which requires the model to identify diverse locations such as a science museum or a martial arts gym. We again use the NFNet family of deep convolutional networks [14].

We pre-train two versions of our models: one using ImageNet-21K as above, and another using JFT [100], a proprietary labeled dataset comprising 4 billion images collected from public internet pages. We then fine-tune these models with DP-SGD on Places-365 and ImageNet, obtaining formal privacy guarantees on those downstream datasets. When using the NFNet-F7+ model (947M parameters) pre-trained on JFT, we achieve a top-1 accuracy of 88.5% on ImageNet under a DP guarantee of ε=8, which is only 1.4% lower than the accuracy of this same pre-trained model when fine-tuned without differential privacy, and just 2.6% below the overall ImageNet state-of-the-art [23]. This also slightly exceeds the previously best existing results with DP [79], which reach 88% at ε=8. We also achieve high accuracy under stricter privacy guarantees (lower ε). For example, at ε=1, we achieve 86.8% top-1 accuracy. To put this in perspective, this is significantly higher than the 77% accuracy of the ResNet-50 architecture trained without any privacy guarantees [47], a popular model that has been widely used in computer vision systems. This is also significantly better than fine-tuning the smaller NFNet-F3 (255M parameters) with DP, which achieves 87% top-1 accuracy under ε=8, and thereby further demonstrates the value of using strong pre-trained foundation models.

We achieve similarly strong results on Places-365. Using the NFNet-F3 pre-trained on JFT, we achieve an accuracy of 58.2% with ε=8.0, within 3% of our non-private baseline of 60.8%, for which we fine-tune the same pre-trained model without privacy. This non-private baseline slightly outperforms the non-private state of the art on this task (60.7%). We also achieve 56.6% with ε=8.0 with the same model when pre-trained on the significantly smaller public dataset ImageNet-21K, demonstrating that strong private classifiers can be obtained without access to proprietary pre-training data.

Previous research only demonstrated that DP fine-tuning of image classifiers is effective on data that is very similar to the pre-training dataset [68, 79, 3]. In contrast, we attain very high accuracies on two chest X-ray datasets and a scene recognition dataset despite pre-training only on natural images of simple objects. This demonstrates the applicability of DP fine-tuning to situations where it is impossible to find public or non-sensitive pre-training data closely related to the private task.

Figure 3 | Performance of our approach across benchmarks and privacy levels. We achieve practical levels of performance, almost matching our non-private baselines, across several challenging image classification benchmarks and values for the privacy budget ε.

4. Accurate private training need not exacerbate disparities between subgroups

Our second goal is to illustrate that private models do not necessarily exacerbate existing disparities across subgroups. Focusing on the setting of models for X-ray diagnostics, we examine demographic intersectional subgroups defined on age, sex, and race in the MIMIC-CXR dataset. We follow prior work in measuring AUC disparity as the difference in AUC between the overall population and the subgroup [70, 126].
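A minimal sketch of this disparity measure follows (a hypothetical helper, not the paper's evaluation code): for a single binary finding, compute the AUC on the full evaluation set and subtract each subgroup's AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_disparities(y_true, y_score, group_labels):
    """Overall AUC minus per-subgroup AUC for one binary finding (illustrative)."""
    overall_auc = roc_auc_score(y_true, y_score)
    disparities = {}
    for group in np.unique(group_labels):
        mask = group_labels == group
        # Each subgroup must contain both positive and negative examples.
        disparities[group] = overall_auc - roc_auc_score(y_true[mask], y_score[mask])
    return disparities

# Example with placeholder arrays and two subgroups, "F" and "M".
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_score = rng.random(size=500)
groups = rng.choice(["F", "M"], size=500)
print(auc_disparities(y_true, y_score, groups))
```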
In Figure 4, we compare the disparities exhibited by our private models at ε=8 and the non-private baselines presented in the previous section. We observe that private and non-private models have a tendency to exhibit similar disparities across the considered subgroups, and that, overall, AUC disparities are not systematically worse for private than non-private models. Furthermore, while we observe that the differences in disparities between subgroups exhibit larger variation between the private and non-private models on smaller subgroups, the average difference in disparities is strongly concentrated around zero regardless of group size. We also observe a similar pattern of comparable disparity between private and non-private models in the CheXpert dataset [53], another commonly studied dataset for this task (details in Appendices).

In contrast to what prior works suggest, we observe that our accurate private models do not cause worse group fairness outcomes than non-private baselines. This result also extends to smaller values of ε as well as intersectional subgroups. Our experiments also demonstrate that this finding is stable across both fine-tuned medical imaging models presented in the main text as well as models trained from scratch on smaller datasets used in prior works (see details in Appendices). These results motivate our hypothesis that the substantially increased disparities observed in private models in prior works are not intrinsic to private models, especially not to those with accuracy comparable to strong non-private models. We believe that these results represent a significant advancement for differentially private training, since they open the door to providing strong privacy guarantees without further negative impacts on disparities between subgroups. We note, however, that fairness in medical decision-making remains a complex pursuit, since multiple different fairness metrics should be considered concurrently depending on how a system is deployed in practice [95].

5. Discussion

In this work, we trained large-scale image classifiers with strong privacy guarantees, while achieving accuracies competitive with state-of-the-art non-private models and observing levels of disparity across subpopulations comparable to those of accurate non-private models. By demonstrating such results on widely used benchmarks and public medical imaging datasets, we provide compelling evidence that differentially private training is a practical tool for deploying accurate machine learning models while providing strong privacy guarantees for their training data. Importantly, we show that differentially private fine-tuning for image classification can leverage large-scale foundation models pre-trained on public datasets, which has proved to be an effective recipe for practical advances in machine learning. These pre-trained models significantly aid private training, even when the private data and the pre-training dataset come from very different distributions. We believe our methodology is relevant to a large number of scenarios where task-specific sensitive datasets can be used to fine-tune readily available pre-trained models.

From the perspective of practitioners looking to adopt our methodology, there are two important considerations to be aware of.
First, the privacy guarantees only apply to the fine-tuning data: data used when pre-training the model is not covered by DP guarantees. This is not a major limitation in our view, since non-sensitive public datasets are widely available, and pre-training and fine-tuning data can come from different distributions. However, it implies that care needs to be taken when compiling pre-training datasets based on publicly available data. The second consideration is that our analysis on disparities focuses on pre-existing demographic subgroups from particular datasets. In general, a minimal subgroup size might be required to achieve similar disparities to non-private models, although such a threshold can also depend on the nature of the data and the similarities between individuals from different subpopulations. Our analysis suggests that accurate DP training will not necessarily exacerbate disparities, but we recommend that practitioners assess this issue on their particular application, taking into account the context and potential impact of such disparities.

Figure 4 | AUC disparities (i.e. population AUC - subgroup AUC) of private models trained on MIMIC-CXR are comparable to disparities observed on non-private models with comparable accuracy. (Top) For the private (ε=8) and non-private baseline models from Figure 3, comparing AUC disparities (averaged over 20 independent runs) by subgroup. Gray crosses represent standard deviation. (Bottom) Stratification of differences in disparities (averaged over 20 independent runs) by subgroup size. Blue line represents OLS regression predicting disparity gap as a function of group size (and 95% confidence interval based on 1000-fold bootstrap). The slope of the regression model lies in the 95% confidence interval [5.7 · 10^-7, 3.23 · 10^-6].

We hope our results open the door for impactful and responsible deployments of machine learning systems with rigorous privacy guarantees. We believe that DP is ready for widespread adoption in machine learning production systems and governmental applications. Our codebase (including models pre-trained on ImageNet-21K) has been open-sourced to enable reproducibility of our results [7], to verify the correctness of our DP-SGD implementation, and to help practitioners adopt our techniques in their machine learning pipelines.
Acknowledgments We would like to thank: Ira Ktena, Stevie Bergman, Jessica Schrouff and Andrew Trask for detailed feedback that helped improve the present manuscript; Matthias Bauer and Sven Gowal for insightful comments that helped improve an earlier version of the paper; Zahra Ahmed and Kitty Stacpoole for project management support; Danielle Belgrave, Sahra Ghalebikesabi, Alan Karthikesalingam, Nenad Tomaลกev and Thomas Steinke for insightful discussions; Rudy Bunel for code reviews; John Aslanides for code quality reviews and open-sourcing support; Alison Reid and Jon Small for support during the open-sourcing process; Abhradeep Thakurta, Florian Tramรจr and Harsh Mehta for discussions on related works; and Andrew Brock, Taylan Cemgil, Raia Hadsell, Koray Kavukcuoglu, Razvan Pascanu and Yee Whye Teh for advice throughout the project. We also want to thank the Stanford Center for Artificial Intelligence in Medicine and Imaging for providing access to CheXpert. References [1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, oct 2016. [2] Rohan Anil, Badih Ghazi, Vineet Gupta, Ravi Kumar, and Pasin Manurangsi. Large-scale differentially private BERT. Preprint arXiv:2108.01624 [cs.LG], 2021. [3] Soroosh Tayebi Arasteh, Mahshad Lotfinia, Teresa Nolte, Marwin Saehn, Peter Isfort, Christiane Kuhl, Sven Nebelung, Georgios Kaissis, and Daniel Truhn. Preserving privacy in domain transfer of medical ai models comes at no performance costs: The integral role of differential privacy. Preprint arXiv:2306.06503 [cs.LG], 2023. [4] Soroosh Tayebi Arasteh, Alexander Ziller, Christiane Kuhl, Marcus Makowski, Sven Nebelung, Rickmer Braren, Daniel Rueckert, Daniel Truhn, and Georgios Kaissis. Private, fair and accurate: Training large-scale, privacy-preserving ai models in medical imaging. Preprint arXiv:2302.01622 [eess.IV], 2023. [5] Igor Babuschkin, Kate Baumli, Alison Bell, Surya Bhupatiraju, Jake Bruce, Peter Buchlovsky, David Budden, Trevor Cai, Aidan Clark, Ivo Danihelka, Claudio Fantacci, Jonathan Godwin, Chris Jones, Tom Hennigan, Matteo Hessel, Steven Kapturowski, Thomas Keck, Iurii Kemaev, 10 Unlocking Accuracy and Fairness in Differentially Private Image Classification Michael King, Lena Martens, Vladimir Mikulik, Tamara Norman, John Quan, George Papamakarios, Roman Ring, Francisco Ruiz, Alvaro Sanchez, Rosalia Schneider, Eren Sezener, Stephen Spencer, Srivatsan Srinivasan, Wojciech Stokowiec, and Fabio Viola, The DeepMind JAX Ecosystem, GitHub, 2020; http://github.com/deepmind. [6] Eugene Bagdasaryan, Omid Poursaeed, and Vitaly Shmatikov. Differential privacy has disparate impact on model accuracy. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence dโ€™Alchรฉ-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 15453โ€“15462, 2019. [7] Borja Balle, Leonard Berrada, Soham De, Jamie Hayes, Samuel L Smith, and Robert Stanforth, JAX-Privacy: Algorithms for privacy-preserving machine learning in jax, 0.1.0, 2022; http: //github.com/deepmind/jax_privacy. [8] Borja Balle, Giovanni Cherubin, and Jamie Hayes. Reconstructing training data with informed adversaries. In Symposium on Security and Privacy (SP). IEEE, may 2022. 
[9] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning. fairmlbook.org, 2019. [10] Raef Bassily, Adam D. Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 55th Annual Symposium on Foundations of Computer Science. IEEE, oct 2014. [11] Philipp Benz, Chaoning Zhang, Adil Karjauv, and In So Kweon. Robustness may be at odds with fairness: An empirical study on class-wise accuracy. In Luca Bertinetto, Joรฃo F. Henriques, Samuel Albanie, Michela Paganini, and Gรผl Varol, editors, NeurIPS 2020 Workshop on Preregistration in Machine Learning, 11 December 2020, Virtual Event, volume 148 of Proceedings of Machine Learning Research, pages 325โ€“342. PMLR, 2020. [12] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Rob Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Rรฉ, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramรจr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. On the opportunities and risks of foundation models. Preprint arXiv:2108.07258 [cs.LG], 2022. 11 Unlocking Accuracy and Fairness in Differentially Private Image Classification [13] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang, JAX: composable transformations of Python+NumPy programs, GitHub, 2018; http://github.com/google/jax. [14] Andy Brock, Soham De, Samuel L. Smith, and Karen Simonyan. High-performance largescale image recognition without normalization. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 1059โ€“1071. PMLR, 2021. [15] Tom B. 
Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Hugo Larochelle, Marcโ€™Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. [16] U.S. Census Bureau. Disclosure avoidance for the 2020 census: An introduction. Technical report, U.S. Government Publishing Office, 2021; https://www2.census.gov/library/publications/decennial/2020/ 2020-census-disclosure-avoidance-handbook.pdf. [17] Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramรจr. Membership inference attacks from first principles. In 43rd IEEE Symposium on Security and Privacy, SP 2022, San Francisco, CA, USA, May 22-26, 2022, pages 1897โ€“1914. IEEE, 2022. [18] Nicholas Carlini, Ulfar Erlingsson, and Nicolas Papernot. Distribution density, tails, and outliers in machine learning: Metrics and applications. Preprint arXiv:1910.13427 [cs.LG], 2019. [19] Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramรจr, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting training data from diffusion models. Preprint arXiv:2301.13188 [cs.CR], 2023. [20] Nicholas Carlini, Florian Tramรจr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Song, รšlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models. In Michael Bailey and Rachel Greenstadt, editors, 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pages 2633โ€“2650. USENIX Association, 2021. [21] Yannis Cattan, Christopher A Choquette-Choo, Nicolas Papernot, and Abhradeep Thakurta. Fine-tuning with differential privacy necessitates an additional hyperparameter search. Preprint arXiv:2210.02156 [cs.LG], 2023. [22] Hongyan Chang and Reza Shokri. On the privacy risks of algorithmic fairness. In European Symposium on Security and Privacy (EuroS&P), pages 292โ€“303. IEEE, IEEE, sep 2021. [23] Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V. Le. Symbolic discovery of optimization algorithms. Preprint arXiv:2302.06675 [cs.LG], 2023. 12 Unlocking Accuracy and Fairness in Differentially Private Image Classification [24] Rachel Cummings, Varun Gupta, Dhamma Kimpara, and Jamie Morgenstern. On the compatibility of privacy and fairness. In Adjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization, pages 309โ€“315. ACM, jun 2019. [25] Soham De, Leonard Berrada, Jamie Hayes, Samuel L Smith, and Borja Balle. Unlocking highaccuracy differentially private image classification through scale. Preprint arXiv:2204.13650 [cs.LG], 2022. 
[26] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, and Neil Houlsby. Scaling vision transformers to 22 billion parameters. Preprint arXiv:2302.05442 [cs.CV], 2023. [27] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition. IEEE, jun 2009. [28] Damien Desfontaines. A list of real-world uses of differential privacy. Technical report, Personal Blog, 2021; https://desfontain.es/privacy/real-world-differential-privacy.html. [29] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019. [30] Jinshuo Dong, Aaron Roth, and Weijie J Su. Gaussian differential privacy. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):3–37, 2022. [31] Friedrich Dörmann, Osvald Frisk, Lars Nørvang Andersen, and Christian Fischer Pedersen. Not all noise is accounted equally: How differentially private learning benefits from large sampling rates. In 31st International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, oct 2021. [32] Vadym Doroshenko, Badih Ghazi, Pritish Kamath, Ravi Kumar, and Pasin Manurangsi. Connect the dots: Tighter discrete approximations of privacy loss distributions. In Proceedings on Privacy Enhancing Technologies, volume 2022, pages 552–570. Privacy Enhancing Technologies Symposium Advisory Board, oct 2022. [33] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, volume 7, pages 17–51. Journal of Privacy and Confidentiality, may 2006. [34] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3-4):211–407, 2014. [35] Tom Farrand, Fatemehsadat Mireshghallah, Sahib Singh, and Andrew Trask. Neither private nor fair: Impact of data imbalance on utility and fairness in differential privacy. In Proceedings of the 2020 Workshop on Privacy-Preserving Machine Learning in Practice, pages 15–19. ACM, nov 2020. [36] Vitaly Feldman. Does learning require memorization? a short tale about a long tail.
In Konstantin Makarychev, Yury Makarychev, Madhur Tulsiani, Gautam Kamath, and Julia Chuzhoy, editors, Proccedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, Chicago, IL, USA, June 22-26, 2020, pages 954โ€“959. ACM, 2020. [37] Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. In Hugo Larochelle, Marcโ€™Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. [38] Stanislav Fort, Andrew Brock, Razvan Pascanu, Soham De, and Samuel L. Smith. Drawing multiple augmentation samples per image during training efficiently decreases test error. Preprint arXiv:2105.13343 [cs.LG], 2021. [39] Arun Ganesh, Mahdi Haghifam, Milad Nasr, Sewoong Oh, Thomas Steinke, Om Thakkar, Abhradeep Thakurta, and Lun Wang. Why is public pretraining necessary for private model training? Preprint arXiv:2302.09483 [cs.LG], 2023. [40] Jonas Geiping, Hartmut Bauermeister, Hannah Drรถge, and Michael Moeller. Inverting gradients - how easy is it to break privacy in federated learning? In Hugo Larochelle, Marcโ€™Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. [41] Google, Differential privacy accounting library, GitHub, 2023; https://github.com/ google/differential-privacy. [42] Chuan Guo, Alexandre Sablayrolles, and Maziar Sanjabi. Analyzing privacy leakage in machine learning via multiple hypothesis testing: A lesson from fano. Preprint arXiv:2210.13662 [cs.LG], 2022. [43] Rob Hall, Alessandro Rinaldo, and Larry A. Wasserman. Differential privacy for functions and functional data. Journal of Machine Learning Research, 14:703โ€“727, 2013. [44] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3315โ€“3323, 2016. [45] Jamie Hayes, Saeed Mahloujifar, and Borja Balle. Bounding training data reconstruction in DP-SGD. Preprint arXiv:2302.07225 [cs.CR], 2023. [46] Jiyan He, Xuechen Li, Da Yu, Huishuai Zhang, Janardhan Kulkarni, Yin Tat Lee, Arturs Backurs, Nenghai Yu, and Jiang Bian. Exploring the limits of differentially private deep learning with group-wise clipping. Preprint arXiv:2212.01539 [cs.LG], 2022. 14 Unlocking Accuracy and Fairness in Differentially Private Image Classification [47] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, jun 2016. [48] Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, and Daniel Soudry. Augment your batch: Better training with larger batches. Preprint arXiv:1901.09335 [cs.LG], 2019. [49] Florian A. Hรถlzl, Daniel Rueckert, and Georgios Kaissis. Equivariant differentially private deep learning. Preprint arXiv:2301.13104 [cs.CV], 2023. [50] Yangsibo Huang, Samyak Gupta, Zhao Song, Kai Li, and Sanjeev Arora. 
Evaluating gradient inversion attacks and defenses in federated learning. In Marcโ€™Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 7232โ€“7241, 2021. [51] Thomas Humphries, Matthew Rafuse, Lindsey Tulloch, Simon Oya, Ian Goldberg, Urs Hengartner, and Florian Kerschbaum. Differentially private learning does not bound membership inference. Preprint arXiv:2010.12112 [cs.CR], 2020. [52] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis R. Bach and David M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 448โ€“456. JMLR.org, 2015. [53] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn L. Ball, Katie S. Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y. Ng. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In AAAI Conference on Artificial Intelligence, volume 33, pages 590โ€“597. Association for the Advancement of Artificial Intelligence (AAAI), jul 2019. [54] Matthew Jagielski, Jonathan R. Ullman, and Alina Oprea. Auditing differentially private machine learning: How private is private sgd? In Hugo Larochelle, Marcโ€™Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. [55] Jinwoo Jeon, Jaechang Kim, Kangwook Lee, Sewoong Oh, and Jungseul Ok. Gradient inversion with generative image prior. In Marcโ€™Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 29898โ€“29908, 2021. [56] Ziheng Jiang, Chiyuan Zhang, Kunal Talwar, and Michael C Mozer. Characterizing structural regularities of labeled data in overparameterized models. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5034โ€“5044. PMLR, 18โ€“24 Jul 2021. [57] Xiao Jin, Pin-Yu Chen, Chia-Yi Hsu, Chia-Mu Yu, and Tianyi Chen. Cafe: Catastrophic data leakage in vertical federated learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 994โ€“1006. Curran Associates, Inc., 2021. 15 Unlocking Accuracy and Fairness in Differentially Private Image Classification [58] A. Johnson, L. Bulgarelli, T. Pollard, S. Horng, L. A. Celi, and R. Mark, Mimic-iv, version 0.4, PhysioNet, 2020; https://doi.org/10.13026/a3wn-hq05. [59] A. Johnson, T. Pollard, R. Mark, S. Berkowitz, and S. Horng, Mimic-cxr database, version 2.0.0, PhysioNet, 2019; https://doi.org/10.13026/C2JT1Q. 
[60] Alistair E W Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-Ying Deng, Roger G Mark, and Steven Horng. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data, 6(1):317, December 2019. [61] Peter Kairouz, Mรณnica Ribero Diaz, Keith Rush, and Abhradeep Thakurta. (nearly) dimension independent private ERM with adagrad rates via publicly estimated subspaces. In Mikhail Belkin and Samory Kpotufe, editors, Conference on Learning Theory, COLT 2021, 15-19 August 2021, Boulder, Colorado, USA, volume 134 of Proceedings of Machine Learning Research, pages 2717โ€“2746. PMLR, 2021. [62] Georgios Kaissis, Alexander Ziller, Jonathan Passerat-Palmbach, Thรฉo Ryffel, Dmitrii Usynin, Andrew Trask, Ionรฉsio Lima, Jason Mancuso, Friederike Jungmann, Marc-Matthias Steinborn, Andreas Saleh, Marcus Makowski, Daniel Rueckert, and Rickmer Braren. End-to-end privacy preserving deep learning on multi-institutional medical imaging. Nature Machine Intelligence, 3(6):473โ€“484, may 2021. [63] Uday Kamal, Mohammad Zunaed, Nusrat Binta Nizam, and Taufiq Hasan. Anatomy-XNet: An anatomy aware convolutional neural network for thoracic disease classification in chest x-rays. IEEE Journal of Biomedical and Health Informatics, 26(11):5518โ€“5528, nov 2022. [64] Gal Kaplun, Nikhil Ghosh, Saurabh Garg, Boaz Barak, and Preetum Nakkiran. Deconstructing distributions: A pointwise framework of learning. Preprint arXiv:2202.09931 [cs.LG], 2022. [65] Helena Klause, Alexander Ziller, Daniel Rueckert, Kerstin Hammernik, and Georgios Kaissis. Differentially private training of residual networks with scale normalisation. Preprint arXiv:2203.00324 [cs.LG], 2022. [66] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. In European Conference on Computer Vision, pages 491โ€“507. Springer International Publishing, 2020. [67] Bogdan Kulynych, Mohammad Yaghini, Giovanni Cherubin, Michael Veale, and Carmela Troncoso. Disparate vulnerability to membership inference attacks. In Proc. Priv. Enhancing Technol., volume 2022, pages 460โ€“480, 2022. [68] Alexey Kurakin, Steve Chien, Shuang Song, Roxana Geambasu, Andreas Terzis, and Abhradeep Thakurta. Toward training at imagenet scale with differential privacy. Preprint arXiv:2201.12328 [cs.LG], 2022. [69] Kweku Kwegyir-Aggrey, Marissa Gerchick, Malika Mohan, Aaron Horowitz, and Suresh Venkatasubramanian. The misuse of auc: What high impact risk assessment gets wrong. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pages 1570โ€“1583, 2023. [70] Agostina J Larrazabal, Nicolรกs Nieto, Victoria Peterson, Diego H Milone, and Enzo Ferrante. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proceedings of the National Academy of Sciences, 117(23):12592โ€“12594, 2020. 16 Unlocking Accuracy and Fairness in Differentially Private Image Classification [71] Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. Deep learning. Nat., 521(7553):436โ€“444, 2015. [72] Yann LeCun, Lรฉon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278โ€“2324, 1998. [73] Xuechen Li, Daogao Liu, Tatsunori B. Hashimoto, Huseyin A. Inan, Janardhan Kulkarni, Yin-Tat Lee, and Abhradeep Guha Thakurta. 
When does differentially private learning not suffer in high dimensions? In NeurIPS, 2022. [74] Xuechen Li, Florian Tramรจr, Percy Liang, and Tatsunori Hashimoto. Large language models can be strong differentially private learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. [75] Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. [76] Zelun Luo, Daniel J. Wu, Ehsan Adeli, and Li Fei-Fei. Scalable differential privacy with sparse network finetuning. In Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, jun 2021. [77] Brendan McMahan and Abhradeep Thakurta. Federated learning with formal differential privacy guarantees. Technical report, Google Research Blog, 2022; https://ai.googleblog. com/2022/02/federated-learning-with-formal.html. [78] Harsh Mehta, Walid Krichene, Abhradeep Thakurta, Alexey Kurakin, and Ashok Cutkosky. Differentially private image classification from features. Preprint arXiv:2211.13403 [cs.LG], 2022. [79] Harsh Mehta, Abhradeep Thakurta, Alexey Kurakin, and Ashok Cutkosky. Large scale transfer learning for differentially private image classification. Preprint arXiv:2205.02973 [cs.LG], 2022. [80] Hussein Mozannar, Mesrob I. Ohannessian, and Nathan Srebro. Fair learning with private demographic data. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 7066โ€“7075. PMLR, 2020. [81] Milad Nasr, Shuang Song, Abhradeep Thakurta, Nicolas Papernot, and Nicholas Carliniz. Adversary instantiation: Lower bounds for differentially private machine learning. In Symposium on Security and Privacy (SP). IEEE, may 2021. [82] Frank Nielsen and Ke Sun. Guaranteed deterministic bounds on the total variation distance between univariate mixtures. In 28th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, sep 2018. [83] Paul Ohm. Broken promises of privacy: Responding to the surprising failure of anonymization. UCLA l. Rev., 57:1701, 2009. [84] Edouard Oyallon and Stรฉphane Mallat. Deep roto-translation scattering for object classification. In Conference on Computer Vision and Pattern Recognition. IEEE, jun 2015. 17 Unlocking Accuracy and Fairness in Differentially Private Image Classification [85] Nicolas Papernot, Abhradeep Thakurta, Shuang Song, Steve Chien, and รšlfar Erlingsson. Tempered sigmoid activations for deep learning with differential privacy. In AAAI Conference on Artificial Intelligence, volume 35, pages 9312โ€“9321. Association for the Advancement of Artificial Intelligence (AAAI), may 2021. [86] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kรถpf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, highperformance deep learning library. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence dโ€™Alchรฉ-Buc, Emily B. 
Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 8024–8035, 2019.
[87] Hieu Pham, Zihang Dai, Qizhe Xie, and Quoc V Le. Meta pseudo labels. In Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2021.
[88] Francesco Pinto, Yaxi Hu, Fanny Yang, and Amartya Sanyal. Pillar: How to make semi-private learning more effective. Preprint arXiv:2306.03962 [cs.LG], 2023.
[89] Bashir Rastegarpanah, Mark Crovella, and Krishna P Gummadi. Fair inputs and fair outputs: The incompatibility of fairness in privacy and accuracy. In Adjunct Publication of the 28th ACM Conference on User Modeling, Adaptation and Personalization, pages 260–267. ACM, July 2020.
[90] Tal Ridnik, Emanuel Ben Baruch, Asaf Noy, and Lihi Zelnik. Imagenet-21k pretraining for the masses. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021.
[91] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, April 2015.
[92] Tom Sander, Pierre Stock, and Alexandre Sablayrolles. Tan without a burn: Scaling laws of DP-SGD. Preprint arXiv:2210.03403 [cs.LG], 2022.
[93] Alexis R Santos-Lozada, Jeffrey T Howard, and Ashton M Verdery. How differential privacy will affect our understanding of health disparities in the united states. Proc. Natl. Acad. Sci. U. S. A., 117(24):13405–13412, June 2020.
[94] Amartya Sanyal, Yaxi Hu, and Fanny Yang. How unfair is private learning? In James Cussens and Kun Zhang, editors, Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, volume 180 of Proceedings of Machine Learning Research, pages 1738–1748. PMLR, 01–05 August 2022.
[95] Laleh Seyyed-Kalantari, Guanxiong Liu, Matthew B. A. McDermott, Irene Y. Chen, and Marzyeh Ghassemi. Chexclusion: Fairness gaps in deep chest x-ray classifiers. In Biocomputing 2021: Proceedings of the Pacific Symposium, Kohala Coast, Hawaii, USA, January 3-7, 2021. World Scientific, 2021.
[96] Laleh Seyyed-Kalantari, Haoran Zhang, Matthew McDermott, Irene Y Chen, and Marzyeh Ghassemi. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nature Medicine, 27(12):2176–2182, December 2021.
[97] Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, July 2019.
[98] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[99] Shuang Song, Thomas Steinke, Om Thakkar, and Abhradeep Thakurta. Evading the curse of dimensionality in unconstrained private glms.
In Arindam Banerjee and Kenji Fukumizu, editors, The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, volume 130 of Proceedings of Machine Learning Research, pages 2638–2646. PMLR, 2021.
[100] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In International Conference on Computer Vision (ICCV). IEEE, October 2017.
[101] Vinith M Suriyakumar, Nicolas Papernot, Anna Goldenberg, and Marzyeh Ghassemi. Chasing your long tails: Differentially private prediction in health care settings. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency. ACM, March 2021.
[102] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826. IEEE, June 2016.
[103] Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 6105–6114. PMLR, 2019.
[104] Xinyu Tang, Ashwinee Panda, Vikash Sehwag, and Prateek Mittal. Differentially private image classification by learning priors from random processes. Preprint arXiv:2306.06076 [cs.CV], 2023.
[105] Differential Privacy Team. Learning with privacy at scale. Technical report, Machine Learning Research at Apple, 2017; https://machinelearning.apple.com/research/learning-with-privacy-at-scale.
[106] Florian Tramèr and Dan Boneh. Differentially private learning needs better features (or much more data). In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
[107] Florian Tramèr, Gautam Kamath, and Nicholas Carlini. Considerations for differentially private learning with large-scale public pretraining. Preprint arXiv:2212.06470 [cs.LG], 2022.
[108] Florian Tramer, Andreas Terzis, Thomas Steinke, Shuang Song, Matthew Jagielski, and Nicholas Carlini. Debugging differential privacy: A case study for privacy auditing. Preprint arXiv:2202.12219 [cs.LG], 2022.
[109] Cuong Tran, My H Dinh, and Ferdinando Fioretto. Differentially private deep learning under the fairness lens. Preprint arXiv:2106.02674 [cs.LG], 2021.
[110] Archit Uniyal, Rakshit Naidu, Sasikanth Kotti, Sahib Singh, Patrik Joslin Kenfack, Fatemehsadat Mireshghallah, and Andrew Trask. DP-SGD vs PATE: Which has less disparate impact on model accuracy? Preprint arXiv:2106.12576 [cs.LG], 2021.
[111] Laurens van der Maaten and Awni Y. Hannun. The trade-offs of private prediction. Preprint arXiv:2007.05089 [cs.LG], 2020.
[112] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Conference on Computer Vision and Pattern Recognition. IEEE, June 2018.
[113] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N.
Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008, 2017.
[114] BP Welford. Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3):419–420, August 1962.
[115] Depeng Xu, Wei Du, and Xintao Wu. Removing disparate impact of differentially private stochastic gradient descent on model accuracy. Preprint arXiv:2003.03699 [cs.LG], 2020.
[116] Han Xu, Xiaorui Liu, Yaxin Li, Anil K. Jain, and Jiliang Tang. To be robust or to be fair: Towards fairness in adversarial training. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 11492–11501. PMLR, 2021.
[117] Hongxu Yin, Arun Mallya, Arash Vahdat, Jose M Alvarez, Jan Kautz, and Pavlo Molchanov. See through gradients: Image batch recovery via gradinversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16337–16346. IEEE, June 2021.
[118] Kaichao You, Mingsheng Long, Jianmin Wang, and Michael I Jordan. How does learning rate decay help modern neural networks? Preprint arXiv:1908.01878 [cs.LG], 2019.
[119] Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A. Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, and Huishuai Zhang. Differentially private fine-tuning of language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
[120] Da Yu, Huishuai Zhang, Wei Chen, and Tie-Yan Liu. Do not let privacy overbill utility: Gradient embedding perturbation for private learning. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
[121] Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, and Tie-Yan Liu. Large scale private learning via low-rank reparametrization. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 12208–12218. PMLR, 2021.
[122] Yaodong Yu, Maziar Sanjabi, Yi Ma, Kamalika Chaudhuri, and Chuan Guo. Vip: A differentially private foundation model for computer vision. Preprint arXiv:2306.08842 [cs.CV], 2023.
[123] Zhuoning Yuan, Yan Yan, Milan Sonka, and Tianbao Yang. Large-scale robust deep AUC maximization: A new surrogate loss and empirical studies on medical image classification. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 3020–3029. IEEE, 2021.
[124] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2022.
[125] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Commun. ACM, 64(3):107–115, 2021.
[126] Haoran Zhang, Natalie Dullerud, Karsten Roth, Lauren Oakden-Rayner, Stephen Pfohl, and Marzyeh Ghassemi. Improving the fairness of chest x-ray classifiers.
In Gerardo Flores, George H. Chen, Tom J. Pollard, Joyce C. Ho, and Tristan Naumann, editors, Conference on Health, Inference, and Learning, CHIL 2022, 7-8 April 2022, Virtual Event, volume 174 of Proceedings of Machine Learning Research, pages 204–233. PMLR, 2022.
[127] Bolei Zhou, Àgata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 40(6):1452–1464, 2018.
[128] Ligeng Zhu and Song Han. Deep Leakage from Gradients, pages 17–31. Springer International Publishing, 2020.
[129] Alexander Ziller, Dmitrii Usynin, Rickmer Braren, Marcus Makowski, Daniel Rueckert, and Georgios Kaissis. Medical imaging deep learning with differential privacy. Scientific Reports, 11(1):13524, 2021.
[130] Úlfar Erlingsson. Learning statistics with privacy, aided by the flip of a coin. Technical report, Google Research Blog, 2014; https://ai.googleblog.com/2014/10/learning-statistics-with-privacy-aided.html.

A. Background: Differential Privacy and DP-SGD

A.1. Differential Privacy

Differential privacy (DP) [33] is a formal privacy guarantee that applies to randomized data analysis algorithms. By construction, differentially private algorithms prevent an adversary that observes the output of a computation from inferring any property pertaining to individual data points in the input data used during the computation. The strength of this guarantee is controlled by two parameters: ε > 0 and δ ∈ [0, 1]. Intuitively, ε bounds the log-likelihood ratio of any particular output that can be obtained when running the algorithm on two datasets differing in a single data point, and δ is a small probability which bounds the occurrence of infrequent outputs that violate this bound. The privacy guarantee becomes stronger as both parameters get smaller. A standard rule of thumb states that, to obtain meaningful privacy, ε should be a small constant while δ should be smaller than 1/N, where N is the size of the input dataset. More formally, we have the following.

Differential Privacy. Let A : 𝒟 → 𝒮 be a randomized algorithm, and let ε > 0, δ ∈ [0, 1]. We say that A is (ε, δ)-DP if for any two neighboring datasets D, D′ ∈ 𝒟 differing by a single element, we have that

\[ \Pr[A(D) \in S] \;\le\; \exp(\varepsilon)\,\Pr[A(D') \in S] + \delta \qquad \text{for all } S \subseteq \mathcal{S}. \]

The privacy protection afforded by DP holds under an exceedingly strong threat model: inferences about individuals are protected even in the face of an adversary that has full knowledge of the DP algorithm, unbounded computational power, and arbitrary side knowledge about the input data. Broadly speaking, by observing the output the adversary cannot learn anything about the input data which they did not already know from their side knowledge. Furthermore, DP satisfies a number of appealing properties from the algorithm design standpoint, including preservation under post-processing and a smooth degradation with multiple accesses to the same data. These properties are exploited in the construction of complex DP algorithms based on the combination of small building blocks that inject carefully calibrated noise into operations that access the data.
The magnitude of the noise required to satisfy the privacy guarantee increases with the strength of the privacy parameters, leading to an unavoidable trade-off between utility and privacy (e.g. as illustrated by the Fundamental Law of Information Recovery [34]), implying it is impossible to fully close the performance gap between private and non-private learning.

A.2. Interpreting the Differential Privacy Guarantee

Threat model. The standard formulation of DP given above is based on a strong notion of (information-theoretic) indistinguishability between the probability distributions over outputs produced by A(D) and A(D′). Implicit in the definition is an adversary whose goal is to infer whether the individual in which the datasets D and D′ differ was actually part of the computation that produced the output observed by the adversary. The set S then corresponds to a test the adversary runs on the output – the adversary's decision is based on whether the output is included or not in S – and the quantification over all sets S is akin to assuming the adversary will use the best possible test. In particular, the adversary is free to design the test based on full knowledge of the algorithm A and the pair of datasets D and D′ under consideration. Such worst-case assumptions are what makes DP a robust and flexible notion of privacy, but at the same time they make the guarantee hard to interpret for non-experts. We now describe two alternative interpretations of the privacy afforded by DP in terms of resilience to specific privacy attacks.

Membership inference interpretation. Membership inference attacks are a class of privacy attacks where the adversary's goal is to determine whether the data of a particular individual was used in a certain computation (e.g. training a machine learning model) based on the computation's output. This goal is identical to the one which DP is designed to protect against, but the attack can be formalized in a number of threat models depending on the knowledge available to the adversary, e.g. about the other individuals in the training dataset. In the threat model of DP where the adversary has full knowledge, it is possible to directly translate the (ε, δ)-DP guarantee into protection against membership inference attacks by rephrasing the latter as a simple statistical hypothesis testing problem. Assuming that the adversary's null hypothesis is that a target individual was not included in the dataset and the alternative hypothesis is that the individual was included, the error rates of Type I (α) and Type II (β) correspond, respectively, to the adversary concluding the individual was included when they actually were not, and the adversary concluding the individual was not included when they actually were. The DP guarantee then implies a constraint that makes it impossible for the adversary to find an attack that simultaneously achieves small Type I and Type II error. More formally, any membership inference attack against an (ε, δ)-DP algorithm must satisfy

\[ \ln\!\left(\frac{1 - \beta - \delta}{\alpha}\right) \;\le\; \varepsilon. \]

One of the benefits of the strong threat model implicit in the DP definition is that this constraint automatically applies to any other type of adversary with less access to privileged information.
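To make this constraint concrete, the following minimal Python sketch computes the smallest Type II error compatible with (ε, δ)-DP at a given Type I error, using the standard hypothesis-testing characterization of DP; the function and variable names are ours and are purely illustrative.

```python
import numpy as np

def min_type2_error(alpha, eps, delta=1e-5):
    """Smallest Type II error compatible with (eps, delta)-DP at Type I error alpha,
    from the two standard constraints:
    alpha + exp(eps) * beta >= 1 - delta  and  exp(eps) * alpha + beta >= 1 - delta."""
    alpha = np.asarray(alpha, dtype=float)
    return np.maximum.reduce([
        np.zeros_like(alpha),
        1.0 - delta - np.exp(eps) * alpha,
        np.exp(-eps) * (1.0 - delta - alpha),
    ])

# Trade-off curves for the epsilon values shown in Figure 5.
alphas = np.linspace(0.0, 1.0, 101)
curves = {eps: min_type2_error(alphas, eps) for eps in (1.0, 2.0, 4.0, 8.0)}
```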
Figure 5 illustrates the regions of valid Type I and Type II error rates that are implied by this constraint for DP algorithms across a range of ε's.

[Figure 5: Type II error versus Type I error; shaded feasible regions for ε = 1.0, 2.0, 4.0 and 8.0.]

Figure 5 | Constraints on the error rates of membership inference attacks implied by (ε, 10⁻⁵)-DP algorithms. For each value of ε, the corresponding shaded region identifies values for the Type I and Type II error rates allowed under the DP constraint. An adversary would like to design an attack that is as close as possible to the lower left corner, and DP with small ε precludes that possibility, thus forcing the adversary to make frequent mistakes no matter how strong their membership inference attack is.

An important limitation of the connection between DP guarantees and resilience to membership inference is that the resulting guarantee becomes meaningless very quickly as ε grows – note, for example, that at ε = 8 the constraint illustrated in Figure 5 is nearly vacuous. This is not an artifact – it is known that such membership inference guarantees are tight [81] – but it does not mean that DP provides no protection at ε = 8 either. It is, in fact, a limitation of relying on a threat model where the adversary possesses enough side information to narrow down the choices of the privacy attack to two options: deciding whether the target individual was included or not, or, equivalently, deciding whether the data associated with the target individual is one of two possibilities. Note that this is an extremely strong threat model where the adversary already has enough side information to be 50% confident that the target individual was included in the training set before observing the trained model.

It is worth noting that some DP mechanisms satisfy stronger notions of privacy corresponding to multiple (ε, δ) pairs simultaneously. In such cases the trade-off curves from Figure 5 can be refined to take all pairs of parameters into account [30].

Multi-choice membership inference interpretation. To ascribe meaningful and quantifiable privacy protections to large values of ε one can consider a mild relaxation of the threat model where the goal of the adversary is to guess which of K > 2 equally likely choices corresponds to the target individual. Such a K-choice membership inference attack captures a scenario where, although the adversary knows the full details of the algorithm and all the dataset except one individual, they have not been able to collect enough side information to narrow down the data of the unknown target individual to only two choices. For example, in a setting where the dataset contains the birth date of individuals, the adversary might already know the year when the target individual was born, but without additional side knowledge they still have 365 equally likely choices for what the correct day and month are. Methods recently developed in [42, 45] can be used to obtain the probability that the adversary guesses which of the K choices is correct based on the output of a DP algorithm, as a function of its privacy parameters.
These are illustrated in Figure 6, where we observe that, for example, at ε = 8 the DP guarantee prevents the adversary from reliably guessing an individual's data correctly as soon as they have at least 18 equally likely options to choose from. Thus, a DP classifier at ε = 8 does provide meaningful protection, so long as the attacker is not able to obtain substantial side information from other sources.

[Figure 6: two panels as a function of ε; the left panel shows the reconstruction upper bound (maximum attack accuracy) for prior sizes 2, 8, 32 and 128, and the right panel shows the maximum prior size for which the attack can exceed 50% accuracy.]

Figure 6 | Constraints on the error rates of K-choice membership inference attacks implied by (ε, 10⁻⁵)-DP algorithms. The left plot shows the maximum accuracy of a membership inference attack for a range of ε values, which depends on the number of candidate images K. The attacker has to identify the target image used during training out of the K candidates, which are assumed to be equally likely under the attacker's side knowledge. The right plot shows the maximum number of points that can be contained in the prior for the attack to be more than 50% accurate. For instance at ε = 8, the membership attack will succeed with higher than 50% probability if the attacker obtains enough side knowledge to narrow the target image down to one of (at most) 18 equally likely candidates.

A.3. Differentially Private Stochastic Gradient Descent

In this work, we consider differentially private algorithms A for supervised deep learning. This means that A maps a training dataset D = {(x_i, y_i)}_{1≤i≤N} to a vector of learned neural network parameters w ∈ 𝒮 = ℝ^p. Let L(w, x, y) denote the learning objective (e.g., the cross-entropy loss), given the model parameters w, input example x and label y. For convenience, we use the shorthand notation l_i(w) = L(w, x_i, y_i).

In the non-private setting, Stochastic Gradient Descent (SGD) provides a standard iterative optimization approach to learning, whereby at each iteration t the algorithm draws B examples at random from the training dataset, and updates the model parameters according to:

\[ w^{(t+1)} \;=\; w^{(t)} - \eta_t \, \frac{1}{B} \sum_{i \in \mathcal{B}_t} \nabla l_i\big(w^{(t)}\big), \]

where η_t is the step-size for the t-th update, ∇ denotes the gradient operator, and ℬ_t represents the set of examples sampled at iteration t with |ℬ_t| = B.

In order to make this algorithm differentially private, we apply the following modifications. First, the gradient for each example in the mini-batch is clipped to a maximal norm C and normalized by C, and second, Gaussian noise with an appropriate standard deviation is added to the mean of the clipped gradients. Let clip_C(v) = min{1, C/‖v‖₂} · v denote the clipping function which re-scales its input so that the output has a maximal ℓ2 norm of C. The new update step is:

\[ w^{(t+1)} \;=\; w^{(t)} - \eta_t \left\{ \frac{1}{B} \sum_{i \in \mathcal{B}_t} \frac{1}{C}\, \mathrm{clip}_C\!\big( \nabla l_i(w^{(t)}) \big) \;+\; \frac{\sigma}{B}\, \xi \right\}, \]

where ξ ∼ N(0, I_p) is a standard p-dimensional Gaussian random variable and σ specifies the standard deviation of the added noise. The resulting algorithm is called Differentially Private Stochastic Gradient Descent (DP-SGD) [1].
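For concreteness, here is a minimal JAX sketch of a single DP-SGD update under this parameterization; it assumes the parameters are flattened into a single vector and that loss_fn(w, x, y) is a per-example loss, so it is an illustration rather than the exact training code accompanying the paper.

```python
import jax
import jax.numpy as jnp

def dp_sgd_step(w, x_batch, y_batch, key, loss_fn, lr=0.1, C=1.0, sigma=4.38):
    """One DP-SGD update: per-example clipping, normalization by C, and
    Gaussian noise of standard deviation sigma / B (illustrative values)."""
    B = x_batch.shape[0]
    # Per-example gradients, shape (B, num_params), assuming w is a flat vector.
    per_example_grads = jax.vmap(jax.grad(loss_fn), in_axes=(None, 0, 0))(w, x_batch, y_batch)
    # Clip each gradient to l2 norm at most C, then normalize by C (norm <= 1).
    norms = jnp.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * jnp.minimum(1.0, C / norms) / C
    # Average over the batch and add noise shared across the whole batch.
    noise = (sigma / B) * jax.random.normal(key, w.shape)
    g = clipped.mean(axis=0) + noise
    return w - lr * g
```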
We note that this is a minor re-parameterization of the algorithm in [1] in which the learning rate η_t absorbs a factor of C. This has no effect on the privacy guarantees, but ensures the clipping norm does not influence the scale of the update, which simplifies hyper-parameter tuning in practice.

Privacy accounting. Intuitively, performing a model update using DP-SGD provides differential privacy because adding Gaussian noise with standard deviation proportional to C is sufficient to mask the contribution of any single example whose clipped gradient has norm less than or equal to C. The total privacy guarantee of DP-SGD is determined by three parameters: the standard deviation σ, the sampling ratio q = B/N and the number of training iterations T. In practice, the privacy budget (ε, δ) is usually fixed, and these three hyper-parameters are chosen to provide the best possible performance within this budget. The privacy calibration process is performed using a privacy accountant: a numerical algorithm providing tight upper bounds for the privacy budget as a function of the hyper-parameters [1], which in turn can be combined with numerical optimization routines to optimize one hyper-parameter given the privacy budget and the other two hyper-parameters. The privacy accounting for DP-SGD relies on a "composition" analysis across iterations, which allows for the release not only of the final model, but also of every intermediate model obtained during training (under the same privacy budget).

A.4. DP-SGD: Implementation Details

All the experiments reported in the paper use a JAX [13] implementation of DP-SGD based on JAXline [5], a re-usable framework for distributed model training and evaluation. For privacy accounting, our implementation uses a method based on privacy loss distributions proposed in [32], and implemented in Google's DP library [41]. Besides enabling reproducibility of our results, another important reason for open sourcing our code is to allow the differential privacy community to verify the correctness of our implementation of DP-SGD.

To help navigate the code base, we provide an in-depth description of how our code is structured to parallelize the computation of privatized gradients in DP-SGD across many devices when using virtual batching and multiple augmentations (see [25]). Furthermore, we describe a collection of privacy auditing experiments aimed at obtaining empirical lower bounds for the privacy of our implementation by leveraging membership inference attacks.

Algorithm 1: Private gradient computation across multiple devices with virtual-batching, multiple augmentations, synchronized noise and gradient normalization.

Input: current model parameters w, clipping norm C, noise multiplier σ, device id d, per-device per-step batch size B_local, number of gradient accumulation steps N_acc, number of devices N_dev, number of per-example augmentations K, shared noise sample ξ ∼ N(0, I), training examples {(x_{d,s,i}, y_{d,s,i}) : s ∈ [N_acc], i ∈ [B_local]}.
B ← B_local · N_dev · N_acc
g ← 0
for s ∈ {1, . . . , N_acc} do
    for i ∈ {1, . . . , B_local} do
        g_{d,s,i} ← (1/C) clip_C( (1/K) Σ_{j=1}^{K} ∇L(w, augment(x_{d,s,i}), y_{d,s,i}) )
    g_{d,s} ← (1/B_local) Σ_i g_{d,s,i}        // average over the local mini-batch
    ĝ_{d,s} ← g_{d,s} + (σ/B) ξ                // add the same noise at each accumulation step on each device
    ḡ_s ← (1/N_dev) Σ_{d′} ĝ_{d′,s}            // synchronize the average gradient across devices
    g ← g + (ḡ_s − g)/s                        // numerically stable averaging across accumulation steps
return g                                       // each device gets the same gradient

Algorithm 1 provides a high-level description of how our DP-SGD implementation computes the privatized gradient used in each model update step. The structure of the code is informed by the way model training pipelines are implemented in JAXline. The implementation is parallelized across N_dev devices, where each device runs a copy of Algorithm 1. To extract the maximum possible throughput from the implementation, each device processes training examples in batches of size B_local, where this parameter is adjusted depending on the memory available in each device and the size of model gradients for the present architecture. To accommodate settings where the desired batch size for a single model update is larger than B_local · N_dev, our implementation incorporates gradient accumulation (i.e. virtual batching) where N_acc gradient accumulation steps are performed before each model update, giving a total batch size B = B_local · N_dev · N_acc.

As input to the gradient computation step, each device receives the current model parameters w (which are identical across devices), the desired clipping norm C and noise standard deviation σ, and their device identifier d ∈ {1, . . . , N_dev}. In addition, each device d has access to N_acc · B_local training examples {(x_{d,s,i}, y_{d,s,i}) : s ∈ [N_acc], i ∈ [B_local]}, and a copy of a sample ξ from a standard multivariate Gaussian distribution. Crucially, the noise is resampled independently each time Algorithm 1 is executed, but it is shared across devices and aggregation steps during a single execution. This is enforced in our implementation by broadcasting the same pseudo-random number generator key to all devices – this is preferred over having a different PRNG key per device because it makes the pipeline more reproducible across training infrastructures with different numbers of devices.

In the innermost loop of Algorithm 1, device d computes the individual contribution g_{d,s,i} of a single example x_{d,s,i} (where s indexes the accumulation step and i the position within the local batch). Our implementation enables the use of multiple random augmentations [38], with the K augmentations corresponding to a single data point x_{d,s,i}. These augmentations are returned by independent calls to the augment subroutine.
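As an illustration of this inner loop, the following JAX sketch computes the clipped, normalized contribution of one example by averaging gradients over its K augmentations; augment and loss_fn are hypothetical stand-ins for the corresponding components of the training pipeline, and parameters are assumed flattened into a single vector.

```python
import jax
import jax.numpy as jnp

def per_example_contribution(w, x, y, key, loss_fn, augment, C=1.0, K=4):
    """Clipped, normalized gradient contribution of a single example, averaged
    over K random augmentations (cf. the inner loop of Algorithm 1).
    `augment(key, x)` and `loss_fn(w, x, y)` are hypothetical stand-ins."""
    keys = jax.random.split(key, K)
    # Gradient of the loss for each augmentation of the same example.
    grads = jax.vmap(lambda k: jax.grad(loss_fn)(w, augment(k, x), y))(keys)
    g = grads.mean(axis=0)                      # average over the K augmentations
    norm = jnp.linalg.norm(g)
    return g * jnp.minimum(1.0, C / norm) / C   # clip to norm C, then normalize by C

# On each device this would in turn be vmapped over the B_local examples, e.g.
# g_ds = jax.vmap(lambda x, y, k: per_example_contribution(w, x, y, k, loss_fn, augment))(xs, ys, keys)
```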
The result of this average is then clipped to a maximal ℓ2 norm of C by the clip_C function and then normalized by C. This results in a contribution g_{d,s,i} to the model update with norm bounded by 1, which is then averaged locally over i on device d to produce g_{d,s}. We note that although the loop over i ∈ [B_local] is presented in Algorithm 1 as a sequential computation for convenience, in reality our implementation uses the hardware parallelism offered by modern accelerators – this means that as long as the device can process B_local · K examples in parallel, the cost of computing g_{d,s} is constant in these parameters.

After computing the average of clipped gradient contributions g_{d,s}, each device d adds appropriately calibrated Gaussian noise to obtain ĝ_{d,s} – we emphasize again that all devices share the same random noise sample, so the noise adding process is not independent across devices. At this point devices synchronize their updates by computing the average of ĝ_{d,s} over d ∈ [N_dev]. After this step, each device has the same averaged noisy gradient ḡ_s. Finally, each device updates its local copy of the accumulated gradient g by using Welford's incremental averaging algorithm for numerical stability [114].

That Algorithm 1 provides a correct implementation of the privatized gradients required by DP-SGD follows by comparing the following result to the formal definition of DP-SGD.

Correctness claim. Each device participating in Algorithm 1 obtains the same noisy gradient, given by

\[ g \;=\; \frac{1}{B} \sum_{d=1}^{N_{\mathrm{dev}}} \sum_{s=1}^{N_{\mathrm{acc}}} \sum_{i=1}^{B_{\mathrm{local}}} \left\{ \frac{1}{C}\, \mathrm{clip}_C\!\left( \frac{1}{K} \sum_{j=1}^{K} \nabla \mathcal{L}\big(w, \mathrm{augment}(x_{d,s,i}), y_{d,s,i}\big) \right) \right\} \;+\; \frac{\sigma}{B}\, \xi, \]

where ξ ∼ N(0, I).

Correctness proof. By construction it is clear that each device gets the same gradient. Now let g^{(s)} be the value of g (on an arbitrary device) at the end of iteration s of the outermost loop. By induction on s we can show that g^{(s)} = (1/s) Σ_{s′=1}^{s} ḡ_{s′}: it is clear that g^{(1)} = ḡ_1, and

\[ g^{(s+1)} \;=\; g^{(s)} + \frac{\bar g_{s+1} - g^{(s)}}{s+1} \;=\; \frac{s\, g^{(s)} + \bar g_{s+1}}{s+1} \;=\; \frac{1}{s+1} \sum_{s'=1}^{s+1} \bar g_{s'}, \tag{1} \]

where the last identity follows by the inductive hypothesis. Thus, at the end of the algorithm each device gets g = g^{(N_acc)} = (1/N_acc) Σ_{s=1}^{N_acc} ḡ_s. Unrolling the computations done at every accumulation step on every device we get:

\[ g \;=\; \frac{1}{N_{\mathrm{acc}}} \sum_{s=1}^{N_{\mathrm{acc}}} \bar g_s \;=\; \frac{1}{N_{\mathrm{acc}} N_{\mathrm{dev}}} \sum_{s=1}^{N_{\mathrm{acc}}} \sum_{d=1}^{N_{\mathrm{dev}}} \hat g_{d,s} \;=\; \frac{1}{N_{\mathrm{acc}} N_{\mathrm{dev}}} \sum_{s=1}^{N_{\mathrm{acc}}} \sum_{d=1}^{N_{\mathrm{dev}}} g_{d,s} \;+\; \frac{\sigma}{B}\, \xi \;=\; \frac{1}{N_{\mathrm{acc}} N_{\mathrm{dev}} B_{\mathrm{local}}} \sum_{s=1}^{N_{\mathrm{acc}}} \sum_{d=1}^{N_{\mathrm{dev}}} \sum_{i=1}^{B_{\mathrm{local}}} g_{d,s,i} \;+\; \frac{\sigma}{B}\, \xi. \]
The result now follows by observing that the first term in the sum above equals the first term in the claim's equation.

A.5. Computational Efficiency

We consistently find that private training requires significantly more compute than non-private training to achieve optimal performance. The computational cost of training with DP-SGD can be broken down into two components: the cost of performing a single parameter update given a batch size, and the number of parameter updates that need to be performed for the model to reach a high accuracy.

The cost of a single DP-SGD update is largely dominated by the cost of computing per-example gradients, which is slower than computing the averaged gradient and also requires more memory. While recent deep learning frameworks like JAX [13] have significantly reduced these overheads, DP-SGD remains slower than SGD in our experience. For example on CheXpert, we observe that the throughput (examples processed per second) of an NFNet-F0 on a TPUv3 is about 6.5× lower with our implementation of DP-SGD than with standard SGD. Furthermore, we found that DP-SGD also required more training epochs than non-private training to achieve optimal performance. This observation is exacerbated by the use of large batch sizes and/or large amounts of (within-batch) data augmentation, both of which further increase the computational cost of training. We provide additional discussions on this in an earlier version of this work [25].

When tuning hyper-parameters, this cost is further multiplied over the many experiments required. However, [92] demonstrate that this can be mitigated by exploring the values of hyper-parameters in a data-efficient regime and by extrapolating these values to a final expensive run that is optimized for best accuracy. Finally, we note that the computational cost is significantly reduced on tasks where fine-tuning only the last layer is sufficient to obtain high accuracy. In our experiments however, we typically found fine-tuning the whole model to be optimal when transferring between datasets other than from JFT-300M/4B to ImageNet.

A.6. Privacy Auditing Experiments

Membership inference attacks provide a powerful signal for auditing the correctness of differentially private learning algorithms [54, 81, 108]. The extent to which these attacks are successful can be used to provide (empirical) lower bounds on the privacy guarantees afforded by an implementation, which can then be compared with the nominal upper bound. This section reports the results of applying this methodology to our training pipeline. Note that this type of test is incomplete; a failure to find a violation of the nominal upper bound does not rule out the possibility that one exists. However, it provides a signal that our implementation of DP-SGD does not contain major issues, which we complement with independent code reviews and unit testing of critical components to ensure overall correctness.

DP lower bounds via hypothesis testing. Given two datasets, D and D′ = D ∪ {z}, that differ by a single data point z, a model trained with a DP algorithm reduces an adversary's ability to infer via a simple hypothesis test whether the model was trained on D or D′. More formally, if an algorithm is (ε, δ)-DP then it must satisfy ln((1 − β − δ)/α) ≤ ε [43], where α and β denote the Type I and Type II errors of the hypothesis testing procedure.
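For illustration, the following Python sketch turns observed attack outcomes into a conservative empirical ε lower bound of this form, using one-sided Clopper–Pearson bounds (as in the Phase II procedure described below); the function name and interface are ours and are not part of the released code.

```python
import numpy as np
from scipy.stats import beta

def empirical_eps_lower_bound(fp, n_out, tp, n_in, delta, confidence=0.999):
    """Conservative epsilon lower bound from membership-inference outcomes,
    via ln((1 - beta - delta) / alpha) <= eps. `fp` counts false positives over
    n_out models trained without z; `tp` counts true positives over n_in models
    trained with z. Assumes 0 < fp < n_out and 0 < tp <= n_in."""
    # One-sided Clopper-Pearson bounds: upper bound on alpha, lower bound on 1 - beta.
    alpha_upper = beta.ppf(confidence, fp + 1, n_out - fp)
    tpr_lower = beta.ppf(1.0 - confidence, tp, n_in - tp + 1)
    return float(np.log(max(tpr_lower - delta, 1e-12) / alpha_upper))
```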
If we can construct datasets D and D′ such that this bound is violated, we can be sure the algorithm does not provide (ε, δ)-DP; conversely, evaluating ln((1 − β − δ)/α) through a membership inference attack gives a valid lower bound on ε. In practice, the test relies on comparing the (distribution over) model losses of the point z under models trained with D and D′.

Overview of the attack. Note that DP is a worst-case guarantee, and so we are free to design our datasets D and D′ in any way we choose. In addition, our algorithm provides the same privacy guarantees for many settings of its hyper-parameters. Our attack operates in two phases: in phase I, we design a learning problem that we expect will maximize the ability to infer membership, while in phase II, we use the procedure set out by [108] to run a membership inference attack and find a (statistically valid) lower bound for ε. In phase I, we sweep over different hyper-parameter configurations and choices for z, to find a choice that is likely to maximize the ratio (1 − β − δ)/α when the model is trained with Algorithm 1. In phase II, we use the best configuration from phase I and train a large number of models to ensure the lower bound is statistically valid with enough confidence. For completeness, we repeat the same procedure with two dataset sizes (|D| ∈ {100, 60K}) and four settings of the nominal privacy parameters, ε ∈ {1, 2, 4, 8}, where we set δ = 1/|D|.

Experimental details. Throughout the auditing procedure, we use a simple LeNet classifier [72] as the trained model and take D from the MNIST dataset. The hyper-parameters and choices of z we evaluated in phase I are given in Table 1. Given a choice of (ε, δ), we find a choice of hyper-parameters that maximizes distinguishability between models trained on D and D′ as follows: for each hyper-parameter configuration, we train 1K models on D and 1K models on D′, where the only randomness originates from the noise added by DP-SGD. In particular, all models are initialized with the same parameters, as prior work has shown random initialization weakens privacy attacks [54, 8]. We then record the loss on z for each model trained on D and on D′, and fit two Gaussians over these histograms. After this, we record the total variation distance between these two Gaussians [82], and choose the hyper-parameter configuration that maximizes this distance. The choice of learning rate, clipping norm, and number of model updates that maximize our ability to infer if z was included in training varied for each ε ∈ {1, 2, 4, 8} and |D| ∈ {100, 60K}. However, we found that selecting z to be a blank image was a consistently better choice than using uniform noise or mislabeling an MNIST test set example.

In phase II, we repeat the procedure set out by [108] for each choice of privacy parameters, dataset size and best hyper-parameters from phase I. For both D and D′ we train 500K models, reserving the first 10K models in each setting as the set from which we find a threshold that maximizes the ratio between 1 − β and α; the remaining models are used to evaluate this chosen threshold and report the ε lower bound.
To gain statistical confidence in our ε lower bounds we compute a lower bound on the true positive rate, 1 − β, and an upper bound on the false positive rate, α, over the remaining 980K models using Clopper–Pearson confidence intervals. With an appropriate choice of significance level this gives us a probabilistic ε lower bound with 99.9% confidence.

Results. Our results are given in Table 2; we did not find any violation of our reported (ε, δ)-DP guarantees. Note that our lower bounds when using |D| = 100 are substantially tighter than when training with |D| = 60K – for example, at nominal ε > 1 training on the small dataset yields ε lower bounds which are over 2× stronger. We also provide membership inference AUC and advantage (1 − β − α) scores for inferring if z was or was not included in training over the 980K models, and give upper bounds on the membership advantage that can be derived in closed form from ε and δ (cf. [51]).

Hyperparameter      Values
z                   Uniform noise, Blank, Label 7 as 8, Label 6 as 7, Label 0 as 1 (candidate images omitted)
Learning rate       0.1, 0.5, 1.0
Clipping norm       0.1, 1.0, 10.0
|D|                 60K, 100
Batch size          4096 (accumulated over two steps of batch size 2048) for |D| = 60K; 64 for |D| = 100
Number of updates   500, 1K for |D| = 60K; 50, 100 for |D| = 100

Table 1 | Phase I of the ε lower bound experiment: Finding the best choice of hyper-parameters that distinguish models trained with and without z.

|D|    Nominal ε    ε lower bound    Membership AUC    Membership advantage    Membership advantage upper bound
60K    1            0.279            0.54              0.06                    0.46
60K    2            0.456            0.57              0.10                    0.76
60K    4            1.139            0.62              0.17                    0.96
60K    8            2.153            0.72              0.31                    1.00
100    1            0.361            0.59              0.13                    0.47
100    2            0.837            0.66              0.22                    0.76
100    4            2.461            0.79              0.43                    0.96
100    8            4.327            0.91              0.67                    1.00

Table 2 | Phase II of the ε lower bound experiment: Reporting probabilistic ε lower bounds with 99.9% confidence and the membership inference (inferring if z was or was not used in training) AUC and advantage (1 − β − α).

A.7. Reconstruction of CheXpert Training Data: Experimental Details

In Figure 1 we demonstrated that one can reconstruct training data given access to gradients from SGD, while the attack fails for DP-SGD. Here we give details of how we designed the attack; for simplicity, we train a small VGG-11 model [98] on the CheXpert training set. We assume the adversary can access intermediate model updates, as permitted under the DP threat model. For a target training point z, the attacker's goal is to output a reconstruction ẑ (that should be close to z for some distance metric) given access to g_z, where g_z is the gradient of the loss on z with respect to model parameters. Note that g_z could be privatized using DP, and so the goal of this experiment is to measure the difference in reconstruction quality given access to a privatized and a non-privatized gradient. Inspired by recent work on gradient-based data reconstruction attacks [117, 50, 55, 57, 128, 40], we implement an optimization-based attack by initializing ẑ to random noise and performing gradient descent on the loss ‖g_z − g_ẑ‖, where g_ẑ is the gradient of the loss on ẑ with respect to model parameters. We optimize this objective for 100,000 iterations, both when the gradient is privatized and when the gradient is non-private.
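A minimal JAX sketch of this optimization loop is shown below; the loss function, image shape, and the assumption that the target label is known to the attacker are illustrative simplifications rather than details of the exact attack code.

```python
import jax
import jax.numpy as jnp

def reconstruct(g_target, w, y, key, loss_fn, image_shape, steps=100_000, lr=0.1):
    """Gradient-matching reconstruction sketch: optimize a candidate input z_hat
    so that the gradient it induces matches the observed gradient g_target.
    `loss_fn(w, x, y)`, `image_shape` and the known label `y` are assumptions."""
    z_hat = jax.random.normal(key, image_shape)          # start from random noise

    def match_loss(z):
        g = jax.grad(loss_fn)(w, z, y)                   # gradient induced by the candidate
        return jnp.linalg.norm(g - g_target)             # distance to the observed gradient

    step = jax.jit(lambda z: z - lr * jax.grad(match_loss)(z))
    for _ in range(steps):
        z_hat = step(z_hat)
    return z_hat
```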
As seen from Figure 1, in the non-private case the reconstruction ẑ looks almost identical to z, while in the private setting ẑ resembles random noise even after the optimization process has terminated. We note that there is a choice over which update step to attack in the training process. We chose to attack the first update, as prior work has observed that it is easier to attack updates closer to initialization [55, 128].

B. Related Work

B.1. Differentially Private Training

Background. Differential Privacy (DP) was initially formalized in [33], and was first adapted to deep learning by [1], who showed how DP-SGD [10] could be operationalized to train neural networks with differential privacy guarantees. Since then, multiple threads of research have contributed to the ability to train machine learning models with DP, out of which three categories are relevant to situate our work.

The first relevant thread of research aims to provide tight upper bounds on the privacy loss of DP-SGD, so that, for example, as little noise as possible can be injected during training to fit a given privacy budget of (ε, δ). In our work, we rely on [32] to compute the (ε, δ)-guarantees of our differentially private training runs. This research workstream is complementary to our work: any improvement in that direction can be integrated into our experiments to further improve accuracy at a given privacy budget – or conversely to reduce the privacy budget at a given level of accuracy.

Another pertinent thread has been dedicated to theoretically understanding the effects of differential privacy on the learning ability of models. In particular, the highly influential work of [10] showed that there exist worst-case convex learning problems where the statistical error of any model trained with DP can be lower bounded by a quantity that grows linearly with the number of model parameters. This result has led many researchers to believe that DP training could not perform well in the high-dimensional regime of large-scale deep learning, and partially explains the focus of the empirical community on developing small models for DP training, as we detail further below. Further theoretical investigations have shown that by imposing some additional assumptions on the learning problem (e.g. generalized linear models, gradients in a constant rank subspace) it is possible to bypass these worst-case lower bounds and achieve error bounds which are independent of (or grow slowly with) the number of parameters [99, 61]. Other relevant theoretical investigations have focused on how public data can aid private training, either by working with models pre-trained on that data (as we do in our work), or by using the public data for other purposes (e.g. estimating subspaces on which to project gradients obtained during optimization) [73, 39].

Finally, the third and most relevant research thread is an empirical approach to DP training that aims to train models that are as private and as accurate as possible. This is the category most closely related to our work, thus we describe the prior state of the art in greater detail below.

Image Classification from Scratch. As mentioned earlier, a popular belief in the research community was that smaller models would perform better than large ones when trained with differential privacy.
For instance, examining recent work on the CIFAR-10 benchmark, which has been a popular yardstick to evaluate the progress of DP training in recent years: [68], [85] and [31] use variants of shallow VGG models [98], while [106] use ScatterNets [84] to train linear models on handcrafted features, achieving 69.3% test accuracy under a tight privacy budget of (3, 10⁻⁵)-DP. [65] achieved 71.7% test accuracy under (7.5, 10⁻⁵)-DP by training a shallow 9-layer residual network. Building on an earlier version of this manuscript, [49] obtained 81.6% at ε = 8 on CIFAR-10. In [104], the authors improved training accuracy by warm-starting the model parameters with pre-training on a small amount of synthetic data. Thanks to better scaling laws, [92] were able to scale up DP training from scratch and obtain 39.2% top-1 accuracy on ImageNet without additional data. Finally, [122] were able to train a foundational image classifier with DP-SGD, by pre-training on synthetic data and then training the model with differential privacy on a large-scale computer vision dataset.

Image Classification Fine-Tuning. In the very work introducing differentially private deep learning, [1] already considered the task of fine-tuning with differential privacy. Indeed, they used a model pre-trained on the CIFAR-100 dataset without privacy, and they fine-tuned it with differential privacy on CIFAR-10. As mentioned earlier in the manuscript, this approach assumes that the pre-training data is "public", and that only the fine-tuning dataset is "private" and should be protected by the guarantee of differential privacy. This approach has since been used in multiple works [106, 120, 21], with many results focusing on small-scale datasets such as CIFAR-10 or CIFAR-100.

Recently, DP fine-tuning of image classifiers at larger scale has attracted increasing interest. In particular, DP fine-tuning on ImageNet was first tackled in [68], where the authors obtained a top-1 accuracy of 47.8% at a privacy budget of (10, 10⁻⁶)-DP. This was the only existing result on ImageNet before our earlier version of this work [25] appeared, where we obtained a considerably improved 81.1% top-1 accuracy under the tighter privacy budget of (8, 8·10⁻⁷)-DP. Shortly after our work was initially released, [79] published a study on differentially private fine-tuning on ImageNet using a large ViT model pre-trained on JFT-4B [100], in which they achieved 81.1% top-1 accuracy under (1, 10⁻⁶)-DP and 81.7% under (4, 10⁻⁶)-DP. Subsequently, [78] further improved on these results by using a stronger pre-trained model and obtained 88% under (8, 8·10⁻⁷)-DP. Finally, in this manuscript we obtain 88.5% top-1 accuracy on ImageNet at the same privacy budget of (8, 8·10⁻⁷) with powerful NFNets pre-trained on JFT-4B.

We note however that some researchers have raised concerns about the relevance of these results for practical DP applications [107], since JFT and ImageNet images come from similar data distributions (both datasets were collected from natural images publicly available on the internet). Indeed, a crucial question is whether it is still possible to obtain strong results with DP-SGD when the pre-training data and fine-tuning data come from very different distributions. This situation is likely to be common in practical scenarios, since images similar in distribution to private images are often not publicly available.
Our work sets out to precisely resolve this issue, by showing that DP fine-tuning can produce highly accurate models even in the presence of a significant shift between the pre-training and the fine-tuning images. While previous results have been reported for DP image classification on medical images [129, 88, 4, 3], to the best of our knowledge, this work is the first to show high-accuracy results (close to externally validated SOTA) on standard medical imaging datasets using only a public dataset of natural images for pre-training.

Other Learning Tasks, Data Modalities and Privacy Threat Models. Differentially private training has also obtained promising results in natural language processing, with the notable successes of [2] training a BERT model [29], and [74, 119, 46] fine-tuning large-scale Transformer language models [113]. Small-scale investigations into federated learning approaches for medical imaging applications have been conducted in [62].

B.2. Disparities in Chest X-Ray Classification

While disparate outcomes based on demographic groups in healthcare have been highlighted for many years, the first systematic study of disparities in chest x-ray diagnosis models in terms of true positive rate (TPR) was presented in [95]. They find that TPR disparities across subgroups exist in state-of-the-art classifiers for multiple datasets and clinical tasks. In follow-up work, they focus on under-diagnosis (e.g. a patient with a condition not receiving a diagnosis) and find that patients under 20 years old, Black patients, Hispanic patients, and patients with Medicaid insurance receive higher rates of under-diagnosis by machine learning models [96]. The role of data imbalance across sex was studied in [70], where it was found that there can be AUC disparities when a minimum balance of data is not achieved. Further, [126] analyses different techniques for mitigating predictive disparities and uses a comprehensive set of metrics including TPR and AUC from prior works, as well as expected calibration error (ECE), cross-entropy loss, recall, and specificity across groups. In the domain of private chest x-ray classifiers in particular, [101] find that predictions made by private models are influenced by larger subgroups in the population, but they do not find systematic biases in classifier performance across sex. Since disparities have been studied and well documented in chest x-ray classification, we continue this line of inquiry when building privacy-preserving chest x-ray models.

Recent work has also highlighted that using AUC alone is not sufficient for evaluating the quality, and consequently the equity, of risk assessment tools [69]. Moreover, the cost of false positives and false negatives should be weighed in a context-specific manner when choosing a metric. However, threshold-based metrics such as TPR and under-diagnosis require an additional optimization step. To isolate the effect of the model training step, we focus on measuring AUC and AUC disparities in our work.

B.3. Privacy and Fairness

Privacy and fairness trade-offs. The trade-offs between privacy and fairness have been studied by a number of prior works in machine learning. Starting from differential privacy guarantees, [24] prove that exact fairness in terms of equal opportunity [44] is incompatible with exact differential privacy for a classifier of non-trivial accuracy.
Starting from fairness guarantees, [22] and [67] show that there are disparate privacy risks across populations in fair models, using membership inference attacks. These works study models that have been trained with the objective of fair accuracy. Adjacent work has studied the notion of fair privacy; [89] show that an optimal classifier cannot ensure individuals reveal the same amount of information (fair privacy) while achieving fair accuracy. Nevertheless, solutions for ameliorating fairness violations, such as incorporating fairness constraints [109] and post-processing approaches [80], have also been proposed. Finally, [94] show that there exist worst cases (in terms of label assignment or the number of minority subgroups) where a classifier cannot be simultaneously fair, differentially private and accurate.

Measuring disparities in private deep learning models. Empirical investigations measuring the fairness of deep learning models trained with differential privacy have revealed various levels of disparity; disparity is most often measured by the difference between the worst-off group performance and the overall performance. The performance measure could be one of a variety of metrics, including accuracy, AUC, true positive rate, and false negative rate, depending on the application. It was empirically demonstrated in [6] that for underrepresented subgroups in age and gender classification in facial recognition datasets, sentiment analysis of tweets, and classifying species in the wild, the accuracy gap between an underrepresented class and the majority class widens with private training. Further, in [35] it was shown that a 70/30 imbalance in CelebA male/female faces can cause large accuracy differences for the task of smile detection. Surprisingly, they also demonstrate that for their specific dataset and task there is a dataset split and clipping bound where a lower-epsilon model has a smaller disparity between the male and female groups. Further, with unbalanced MNIST classes it was observed in [110] that the accuracy of the underrepresented group is much worse in a model trained with DP-SGD than in one trained with PATE or without privacy. Table 3 summarises the existing literature demonstrating empirical disparities on image classification tasks.

Prior work    Datasets                                  Accuracy gap                      Attributed reason for disparity
[6]           MNIST, Diversity in Faces, iNaturalist    ∼12% (Faces), ∼2.5% (MNIST)       Clipping and noise addition
[115]         MNIST                                     ∼10%                              Larger gradients of minority groups
[35]          CelebA                                    ∼20%                              Class imbalance
[94]          CIFAR-10, CelebA                          ∼7% (CelebA), ∼10% (CIFAR-10)     Long-tail data (example difficulty)

Table 3 | Overview of prior work reporting disparity on image classification datasets.

C. Experimental Details

C.1. Additional Dataset Details

ImageNet-1k is a dataset of approximately 1.3M images, each labelled with one of 1000 mutually exclusive classes [91]. Due to its popularity in deep learning, it is an extremely competitive benchmark that allows us to compare the accuracy of our privately fine-tuned models against the most performant published image classification models to date without differential privacy [87, 14, 124]. When fine-tuning on ImageNet, we found that we could obtain performance comparable to full model fine-tuning
This allows us to pre-compute the feature vector corresponding to each image and to only learn a linear classifier layer, which significantly reduces the computational cost of training and enables us to evaluate our performance on the larger NFNet-F7+ pre-trained model. AGC is not used when only fine-tuning the last layer. We note, however, that it was necessary to fine-tune the full model to maximize performance on all the other datasets in this study. We speculate that this may indicate that the distribution shift between JFT and ImageNet is relatively small when compared to the other datasets we study.

Places-365 is a dataset of approximately 1.8 million images of scenes, labelled into 365 exclusive categories.

CheXpert is a dataset of chest X-ray images annotated with labels of potential diseases [53]. This allows us to evaluate our fine-tuning methods in a scenario where the images protected by DP-SGD during fine-tuning are very different in nature from the images used for pre-training. In addition, CheXpert is a public competition that has received many entries, which provides us with strong baselines for non-private training.

MIMIC-CXR is another large dataset of chest X-ray (grayscale) images annotated with labels of potential diseases, where the labels were automatically extracted from medical reports. MIMIC-CXR serves as an additional challenging benchmark to confirm that our findings obtained on CheXpert generalize to other datasets.

C.2. Non-Private Pre-Training

The pre-trained models were based on the NFNet architecture [14], and were pre-trained without differential privacy on two datasets, JFT-4B [100, 124] and ImageNet-21K [27]. The choice of NFNets was motivated by their high performance on image classification while avoiding the use of batch normalization [52], which is incompatible with DP-SGD. Three models were used in the experiments: F0, F3 and F7+. F0 and F3 are pre-activation residual networks [47], which share the same block design and have the same model width. F3 has double the depth of F1 (excluding input and output layers), which itself has double the depth of F0. F7+ has double the depth of F3, and also has slightly increased width [14].

JFT-4B is an internal proprietary dataset containing approximately 4 billion images, labelled into 30k (non-exclusive) classes. Networks pre-trained on JFT hold the state of the art for non-private training on a number of popular computer vision benchmarks [87, 14, 124, 26]. To preserve privacy guarantees when fine-tuning, a script was run to remove all images from JFT that were exact or near-duplicates of images in the ImageNet and Places-365 datasets across common data augmentations. To ensure the absence of subtle duplicates, the script also removed close semantic matches by comparing image embeddings under a pre-trained model [66]. The pre-training strategy follows [14] and uses an image resolution of 320x320. The NFNet-F0 and F3 were pre-trained for 2 epochs using the cross-entropy loss. The larger F7+ model was pre-trained for a total of 4 epochs. For all models, the learning rate was tuned on a logarithmic grid. No other hyper-parameters were tuned. As described in [14], our pre-training pipeline used SGD with adaptive gradient clipping (AGC). These pre-trained models perform significantly better if the same optimization algorithm is used during fine-tuning (both with and without DP).
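For reference, the following is a minimal NumPy sketch of unit-wise adaptive gradient clipping in the spirit of [14]; the clipping factor and epsilon below are illustrative placeholders rather than the exact values used in our pre-training pipeline.

```python
import numpy as np

def adaptive_gradient_clip(grad, param, clipping=0.01, eps=1e-3):
    """Clip each output unit's gradient so that its norm is at most
    `clipping` times the norm of the corresponding parameters (AGC-style)."""
    # Treat the leading axis as the unit axis and flatten the remaining axes.
    g = grad.reshape(grad.shape[0], -1)
    p = param.reshape(param.shape[0], -1)
    g_norm = np.linalg.norm(g, axis=1, keepdims=True)
    max_norm = clipping * np.maximum(np.linalg.norm(p, axis=1, keepdims=True), eps)
    scale = np.where(g_norm > max_norm, max_norm / (g_norm + 1e-6), 1.0)
    return (g * scale).reshape(grad.shape)

# Example: a [num_units, fan_in] weight matrix and its gradient.
rng = np.random.default_rng(0)
w, g = rng.normal(size=(64, 128)), rng.normal(size=(64, 128))
g_clipped = adaptive_gradient_clip(g, w)
```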
In order to provide results using smaller public, non-proprietary pre-training datasets, additional networks were pre-trained on the ImageNet-21k dataset, which comprises 14 million labelled images from 22k classes. ImageNet-1k (referred to as ‘ImageNet’ in the main text) is a subset of ImageNet-21k; consequently, models pre-trained on ImageNet-21k cannot be meaningfully evaluated by privately fine-tuning them on ImageNet-1k. Following an identical protocol to [14], the NFNets F0 and F3 were pre-trained for 80 epochs at resolution 224x224.

C.3. Fine-Tuning on ImageNet

Hyper-parameters were cross-validated on a validation set of 10k examples extracted from the official training set, and, as is standard practice, the evaluation accuracy was reported on the official validation set of 50k examples. Images were resized to 320x320 while preserving their aspect ratio, and standardized per channel (using pre-computed averages and standard deviations) before being fed to the model. On this dataset, only the last linear layer was fine-tuned. The privacy parameter 𝛿 was set to 8 · 10^−7, and a clipping norm 𝐶 = 1 was employed. The batch size was set to 𝐵 = 2^18 = 262,144. The model was trained for 1000 updates at 𝜀 ∈ {2, 4, 8} (with corresponding 𝜎 ∈ {14.75, 7.92, 4.38} to fit the privacy budget), for 750 updates at 𝜀 = 1 (𝜎 = 24.18), and for 500 updates at 𝜀 = 0.5 (𝜎 = 37.67). More precise tuning of the step budget or noise scale did not improve performance further in our experiments. The learning rate was tuned on a logarithmic grid spaced by powers of 3.3.

C.4. Fine-Tuning on Places-365

All images were resized to 256x256 while preserving their aspect ratio, and they were standardized per channel (using pre-computed averages and standard deviations). All parameters of the model were fine-tuned simultaneously using DP-SGD with AGC [14]. Results are provided when fine-tuning the NFNet-F3 pre-trained on either JFT-4B or ImageNet-21K. The privacy parameter 𝛿 was set to 5 · 10^−7. During training, label smoothing [102] was employed with value 0.1. Hyper-parameters were tuned on an internal validation set of 10k images extracted from the official training set. Final results were computed on the official validation set of 36.5k images. When fine-tuning the model pre-trained on JFT-4B, the batch size was set to 𝐵 = 2^17 = 131,072. The model was trained for 1000 updates at 𝜀 ∈ {4, 8} (𝜎 ∈ {2.70, 1.73}), 750 updates at 𝜀 = 2 (𝜎 = 4.72), 500 updates at 𝜀 = 1 (𝜎 = 7.25) and 250 updates at 𝜀 = 0.5 (𝜎 = 9.79). When fine-tuning the model pre-trained on ImageNet-21K at 𝜀 = 8, the batch size was set to 𝐵 = 2^17 and the model was trained for 1000 updates (𝜎 = 1.73). The learning rate was tuned on a logarithmic grid spaced by powers of 3.3.

C.5. Fine-Tuning on CheXpert

Following the standard methodology [53], multi-label binary classifiers were trained on all classes and evaluated on a subset of 5 classes: ‘Atelectasis’, ‘Cardiomegaly’, ‘Consolidation’, ‘Edema’ and ‘Pleural Effusion’ – the classes available in the official validation and test sets. The Area Under the Curve (AUC) was computed per class and then averaged over classes. For simplicity, all examples were treated independently, both at training and evaluation time, instead of aggregating them by patient or study.
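Throughout, the noise multiplier 𝜎 is chosen so that a DP-SGD privacy accountant certifies the target (𝜀, 𝛿) for the given sampling rate and number of updates. A minimal sketch of this calibration by binary search is shown below; it assumes only a monotonically decreasing accountant function, and the closed-form bound used here is a crude illustrative stand-in, not the accountant behind the 𝜎 values reported above.

```python
import math

def calibrate_sigma(epsilon_of, target_eps, lo=0.3, hi=100.0, tol=1e-3):
    """Binary search for the smallest noise multiplier sigma whose privacy cost,
    as reported by `epsilon_of(sigma)`, does not exceed `target_eps`.
    Assumes epsilon_of is monotonically decreasing in sigma."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if epsilon_of(mid) > target_eps:
            lo = mid   # too little noise: the budget is exceeded
        else:
            hi = mid   # budget satisfied: try less noise
    return hi

def toy_accountant(sigma, sampling_rate=2**18 / 1.28e6, steps=1000, delta=8e-7):
    # Crude strong-composition-style bound, used only to make the search runnable;
    # a real DP-SGD accountant gives tighter (and different) values.
    return sampling_rate * math.sqrt(2 * steps * math.log(1.25 / delta)) / sigma

print(f"noise multiplier: {calibrate_sigma(toy_accountant, target_eps=8.0):.2f}")
```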
Images were rescaled to have values within [−1, 1]. Following [53], during training, uncertain labels were mapped to positive or negative labels: for the classes ‘Atelectasis’, ‘Edema’ and ‘Pleural Effusion’, uncertain labels were treated as positive, while for the other classes they were treated as negative. At training time, label smoothing [102] was used with value 0.2 for the uncertain labels (and no label smoothing for positive/negative labels). The model was trained on all 223k training images. Hyper-parameters were cross-validated on the official validation set of 234 images, and the final performance was evaluated on the official test set of 668 images (as done in the CheXpert competition [53]). All layers of the model were fine-tuned, and parameters were accumulated using an Exponential Moving Average with parameter 0.9.

For all runs with differential privacy, the batch size was set to 4096, the clipping norm to 𝐶 = 10^−3, the privacy parameter to 𝛿 = 1/𝑁 (where 𝑁 is the number of training samples), and the learning rate to 2/𝐶. The noise multiplier 𝜎 was automatically adjusted to fit the privacy budget (𝜀, 𝛿), the number of steps 𝑇 and the batch size 𝐵. For 𝜀 = 0.5, the model was trained for 𝑇 = 188 updates (𝜎 = 2.11); for 𝜀 = 1.0, 𝑇 = 375 (𝜎 = 1.64); for 𝜀 = 2.0, 𝑇 = 750 (𝜎 = 1.30); for 𝜀 = 4.0, 𝑇 = 1500 (𝜎 = 1.07); and for 𝜀 = 8.0, 𝑇 = 3000 (𝜎 = 0.91). For the non-private baseline, the model was trained for 𝑇 = 2000 updates at a batch size of 1024 (training longer resulted in over-fitting), using a constant learning rate of 1.0 and a weight decay of 0.0001.

For both private and non-private runs, the final predictions were aggregated from multiple checkpoints stored along the course of a single training run, following [53]. For each run, ten evenly spaced checkpoints were created, each using the Exponential Moving Average version of the parameters. Neither the EMA nor the checkpoint aggregation incurs a cost in terms of differential privacy, since DP-SGD allows for the release of all intermediate checkpoints. The experiments on DP fine-tuning take from 3 TPUv3 hours (𝜀 = 0.5) to 42 TPUv3 hours (𝜀 = 8), while the non-private fine-tuning experiment takes approximately 1 TPUv3 hour.

C.6. Fine-Tuning on MIMIC-CXR

All classes were used for training and evaluation on MIMIC-CXR, following the standard protocol [95]. As in CheXpert, the Area Under the Curve (AUC) was computed per class and averaged over the classes. In order to compare with the results from [95, 63], an identical protocol was used to extract validation and test sets from the official training set, forming an 80-10-10 split for training-validation-testing (respectively 259k, 32k and 32k examples). For simplicity, all examples were treated independently (both at training and evaluation time), instead of aggregating them by patient or study. The images were down-sampled to 256x256 and rescaled to have values within [−1, 1]. All layers of the model were fine-tuned in this experiment, and the parameters were accumulated using an Exponential Moving Average with parameter 0.9. For all runs with differential privacy, the batch size was set to 4096, the clipping norm to 𝐶 = 10^−3, the privacy parameter to 𝛿 = 1/𝑁 (where 𝑁 is the number of training samples), and the learning rate to 2/𝐶.
The noise multiplier 𝜎 was automatically adjusted to fit the privacy budget (𝜀, 𝛿), the number of steps 𝑇 and the batch size 𝐵. At 𝜀 = 0.5, the model was trained for 𝑇 = 188 updates (𝜎 = 1.88); at 𝜀 = 1.0, for 𝑇 = 375 (𝜎 = 1.48); at 𝜀 = 2.0, for 𝑇 = 750 (𝜎 = 1.19); at 𝜀 = 4.0, for 𝑇 = 1500 (𝜎 = 0.99); and at 𝜀 = 8.0, for 𝑇 = 3000 (𝜎 = 0.85). For the non-private baseline, the model was trained for 𝑇 = 2000 updates at a batch size of 1024 (training longer resulted in over-fitting), using a constant learning rate of 0.25 and a weight decay of 0.0001. The experiments on DP fine-tuning take from 2.4 TPUv3 hours (𝜀 = 0.5) to 37 TPUv3 hours (𝜀 = 8), while the non-private fine-tuning experiment takes approximately 1 TPUv3 hour.

C.7. Analyzing Fairness Disparities in CXR Classification Models

Fairness is analyzed through the lens of accuracy parity [9], measuring disparities as additive differences in model performance on subgroups relative to the overall accuracy, as in [95]. The area under the ROC curve (AUC) was used as the measure of accuracy for CXR models. For an evaluation dataset 𝐷, a subgroup 𝐴 and a classifier 𝑓, the disparity in AUC of 𝑓 on 𝐴 is given by Disp_𝐴 = AUC(𝑓, 𝐷) − AUC(𝑓, 𝐷_𝐴), where 𝐷_𝐴 is the dataset obtained by taking all the records in 𝐷 corresponding to individuals in group 𝐴.

Disparities on MIMIC-CXR. Following [96], sub-groups were defined using demographic groups present in the dataset based on the patient's sex (M/F), age bracket (age discretized to one of 18-20, 20-40, 40-60, 60-80, 80+), race (american indian/alaska native, asian, black/african american, hispanic/latino, white, and other), and type of insurance (medicaid, medicare, other). In addition to sub-groups defined by single attributes, intersectional sub-groups based on (sex, race) and (sex, age bracket) were also considered. Demographic attributes for each individual were obtained by linking the patient ID in the MIMIC-CXR dataset [59] with the admissions table in the MIMIC-IV dataset [58]. In a small number of cases, patients had multiple admission records reporting different races or types of insurance – records from these patients were removed from the dataset before analysis to avoid confounders.

Disparities on CheXpert. A similar procedure to MIMIC-CXR was followed, although some modifications were required. First, the private models and non-private baselines were re-trained following the procedure described in Section C.5 but using a different dataset split: the official training dataset was partitioned into 3 parts using an 80-10-10 proportion for training, validation and testing, resulting in datasets of size 178k, 22k and 22k respectively. This was necessary because the official test set does not contain demographic attributes (while the official training set does). The splits were created by assigning the first 10% of patient IDs to the test set, the following 10% to the validation set and the remaining 80% to the training set. This process did not bias the data distribution of each split, because patient IDs were randomly generated when the CheXpert dataset was originally compiled, and at the same time it ensures that images from a single patient only occur in one of the splits.
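For reference, the AUC disparity defined in Section C.7 can be computed directly from per-example predictions and group labels. The sketch below handles a single binary label (in our evaluation the AUC is computed per class and then averaged) and uses scikit-learn's roc_auc_score on synthetic data; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_disparities(y_true, y_score, groups):
    """Disp_A = AUC(f, D) - AUC(f, D_A) for every sub-group A appearing in `groups`.
    Assumes each sub-group contains both positive and negative examples."""
    overall = roc_auc_score(y_true, y_score)
    return {a: overall - roc_auc_score(y_true[groups == a], y_score[groups == a])
            for a in np.unique(groups)}

# Toy example: random labels, a noisy score, and two sub-groups.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
score = y + 0.8 * rng.normal(size=1000)
sex = rng.choice(["F", "M"], size=1000)
print(auc_disparities(y, score, sex))
```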
When training private models on this split, the batch size and number of iterations remained identical to the experiments from Section C.5, but the noise multiplier was adapted to fit the privacy budget 𝜀 given the new number of training examples, as follows: 𝜎 = 2.52 for 𝜀 = 0.5, 𝜎 = 1.94 for 𝜀 = 1, 𝜎 = 1.52 for 𝜀 = 2, 𝜎 = 1.23 for 𝜀 = 4, and 𝜎 = 1.03 for 𝜀 = 8. Demographic sub-groups in CheXpert were based on the only two demographic attributes present: age range (discretized as above) and sex.

C.8. Tuning the Hyper-Parameters

In all our experiments when fine-tuning with DP-SGD, we use a constant learning rate throughout training. Although learning rate schedules significantly improve performance in non-private training [75, 118], we found that they did not improve performance for private training and required additional hyper-parameter tuning. We also use an Exponential Moving Average (EMA) of the model parameters with rate 0.9999 (unless otherwise stated), using the EMA warm-up scheme described in [103]. We found that EMA consistently improved model accuracy by increasing the convergence rate when training with the added noise required to achieve privacy guarantees with DP-SGD.

We use extremely large batch sizes, often only an order of magnitude smaller than the total dataset size. As also observed by other researchers [111, 76, 68, 121, 74, 2, 31], we found that this significantly improved performance when training with DP-SGD. When we use large batch sizes, we use gradient accumulation across multiple steps to avoid memory overflow. We note that our DP-SGD implementation handles this feature carefully, so as to ensure that the correct amount of noise is injected to satisfy the DP guarantees.

We found that the performance of DP-SGD did not depend strongly on the clipping norm 𝐶 (so long as the clipping norm was not too large). We therefore set this hyper-parameter to 𝐶 = 1 for all experiments on the ImageNet and Places-365 datasets. When fine-tuning on the CheXpert and MIMIC-CXR datasets, we employ a per-class loss function that results in gradients with smaller scale, so we use 𝐶 = 0.001 in these cases.

The main hyper-parameter specific to DP-SGD that we needed to tune carefully on some (though not all) datasets was the noise multiplier 𝜎. We found that performance usually improves as 𝜎 rises; however, this also increases the computational cost of training, since the number of training iterations allowed within a fixed privacy budget 𝜀 increases with 𝜎. We found that values of 𝜎 close to 2 usually achieve a good trade-off between high accuracy and a manageable computational budget [92].

Data augmentation as conventionally applied, drawing a single random sample from the augmentation procedure for each image in the current batch [72, 97], consistently reduced the performance of private training with DP-SGD. In order for private training to benefit from data augmentation, we found in [25] that one must apply augmentation multiplicity [48, 38], whereby the per-example gradient of each image in the batch is averaged over multiple independent random augmentations. This averaging is applied before the per-example gradients are clipped, so it does not increase the privacy cost. However, the downside of this scheme is that it increases the computational cost of training with DP-SGD substantially.
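To illustrate augmentation multiplicity, the following is a minimal NumPy sketch of one DP-SGD step on a toy logistic-regression model, in which each per-example gradient is averaged over several random augmentations before clipping; the augmentation, model and hyper-parameter values are placeholders and not our actual training pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def example_grad(w, x, y):
    # Gradient of the logistic loss for a single (x, y) pair under a linear model.
    p = 1.0 / (1.0 + np.exp(-x @ w))
    return (p - y) * x

def augment(x, rng):
    # Stand-in for a random image augmentation (e.g. a random crop or flip).
    return x + 0.01 * rng.normal(size=x.shape)

def dp_sgd_step(w, xs, ys, lr=0.1, clip=1.0, sigma=2.0, k_aug=4, rng=rng):
    """One DP-SGD step with augmentation multiplicity: averaging over k_aug
    augmentations happens before clipping, so the per-example sensitivity
    (and hence the noise required for a given privacy budget) is unchanged."""
    summed = np.zeros_like(w)
    for x, y in zip(xs, ys):
        g = np.mean([example_grad(w, augment(x, rng), y) for _ in range(k_aug)], axis=0)
        g *= min(1.0, clip / (np.linalg.norm(g) + 1e-12))   # per-example clipping
        summed += g
    noisy = summed + sigma * clip * rng.normal(size=w.shape)  # Gaussian mechanism
    return w - lr * noisy / len(xs)

# Toy usage on random data.
xs, ys = rng.normal(size=(32, 8)), rng.integers(0, 2, size=32)
w = dp_sgd_step(np.zeros(8), xs, ys)
```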
For the experiments provided in the main text we instead simply removed data augmentation from the training pipeline.

As stated above, we found that performance improved when the same optimization algorithm used for pre-training was also used during fine-tuning. We therefore applied DP-SGD with AGC [14], to match the pre-training framework described in Section C.2. On ImageNet, only the last linear layer was fine-tuned with differential privacy. On all other datasets, all layers were fine-tuned, which we find to significantly outperform tuning the last layer only. We postulate that this is due to the larger distribution shift between the pre-training data and Places-365 or the medical image classification tasks, compared to that between the pre-training data and ImageNet.

D. Disparities in Private Chest X-Ray Classification: Additional Results

D.1. MIMIC-CXR

To measure the fairness effects of DP training on the MIMIC-CXR dataset, we analyze the differences in AUC disparities between the private models and non-private baselines (cf. Section C.6). In Figure 4 we present the results of this evaluation for private models with 𝜀 = 8 compared to non-private models, using the average AUC across all 14 labels present in the dataset. Figures 7-12 present a more fine-grained view of these results using the AUC of individual labels and a range of 𝜀 values. To understand how the level of privacy guarantee (i.e. 𝜀) affects disparities, Figure 13 provides a view of the distribution (over random seeds) of the maximum AUC disparity across sub-groups defined by single demographic attributes.

D.2. CheXpert

Figure 14 presents the results of evaluating disparities between the private models and the non-private baseline using the same procedure as in Figure 4. We note that CheXpert only contains sex and age demographic attributes, so the evaluation is limited to intersectional groups based on these two attributes. Qualitatively, we observe the same behavior as in the MIMIC-CXR dataset.

Figure 7 | AUC disparities on MIMIC-CXR for the various age-sex subgroups, as a function of 𝜀. AUC disparities do not consistently increase as the model gets more private (i.e. as 𝜀 decreases).
Figure 8 | AUC disparities on MIMIC-CXR for the various age subgroups, as a function of 𝜀.

Figure 9 | AUC disparities on MIMIC-CXR for the various insurance subgroups, as a function of 𝜀.

Figure 10 | AUC disparities on MIMIC-CXR for the various race subgroups, as a function of 𝜀.
Figure 11 | AUC disparities on MIMIC-CXR for the various sex subgroups, as a function of 𝜀.

Figure 12 | AUC disparities on MIMIC-CXR for the various race-sex subgroups, as a function of 𝜀.

Figure 13 | Maximum AUC disparities across sub-groups of different types (age, insurance, race and sex) as a function of 𝜀 on MIMIC-CXR. The distribution shown is over 20 random seeds for the model training procedure. The maximum disparity reduces as 𝜀 grows from 0.5 to 8 – for public models (i.e. 𝜀 = ∞) disparities are higher than in private models for age and insurance groups, and smaller for race and sex groups.

Figure 14 | AUC disparities (i.e. population AUC - sub-group AUC) in private models are comparable to the disparities we observe on non-private models with comparable overall accuracy on CheXpert. (Top) For the private (𝜀 = 8) and non-private baseline models from Figure 3, comparing AUC disparities by sub-group between both models. Disparities are averaged over 20 independent runs, and gray crosses represent standard deviation.
(Bottom) Stratification of differences in disparities between private and non-private models (averaged over 20 independent runs) by sub-group size. We include an OLS regression line to predict the disparity gap as a function of group size (and a 95% confidence interval based on 1000 bootstrap resamples). The slope of the regression model lies in the 95% confidence interval [-6.63e-07, 6.37e-07].

Figure 15 | AUC disparities on CheXpert for the various age-sex subgroups, as a function of 𝜀.

Figure 16 | AUC disparities on CheXpert for the various age subgroups, as a function of 𝜀.

Figure 17 | AUC disparities on CheXpert for the various sex subgroups, as a function of 𝜀.

Figure 18 | Maximum AUC disparities across sub-groups of different types (age and sex) as a function of 𝜀 on CheXpert. The distribution shown is over 20 random seeds for the model training procedure. The maximum disparity reduces or stays roughly the same as 𝜀 grows from 0.5 to 8 – for public models (i.e. 𝜀 = ∞) disparities are higher than in private models.

E. Additional Results on Training From Scratch on CIFAR-10

E.1. Per-Class Disparities

While the prior sections focused on measuring disparities in models fine-tuned with differential privacy, this section is dedicated to understanding disparities in privacy-preserving models trained from scratch. We build on prior work examining private model disparities at lower overall accuracy by extending the analysis to higher model accuracy regimes. Our results also complement theoretical impossibility results about the limits of private learning. We look at a practical regime of model performance that does not yet require memorisation of individual examples [36]. The significance of our findings in this section lies in the re-framing of privacy-fairness trade-offs into privacy-accuracy trade-offs.
When private model accuracy is much worse than that of non-private models, closing the accuracy gap will also lead to narrowing the disparity gap.

Experimental setup. We train a Wide Residual Network (WRN) on CIFAR-10, where we use 45k examples for training, 5k for validation and 10k for testing. The input images are standardized per channel, using pre-computed averages and standard deviations aggregated over the dataset. For our two main models for comparison, we use the same architecture to train a non-private model that achieves 93% overall Top 1 accuracy², and a private model at 𝜀 = 8 trained with DP-SGD that achieves 81% overall accuracy, following techniques from [25]. We describe specific procedures for different experiments in the subsequent sections.

Defining subgroups. When examining accuracy disparities across classes for our CIFAR-10 dataset, we measure the difference in accuracy of one class compared to another; the subgroups of interest are defined by the true class label. We note that notions of fairness such as demographic parity (independence) have no meaning when subgroups are defined by the true label. However, we study this type of disparity across classes following a series of prior works measuring class disparities [6, 115]. This notion of disparity has also been used in adjacent areas such as robustness [11, 116]. Examining disparities between classes gives insight beyond overall accuracy into the strengths and weaknesses of a model. However, since class labels are not independent, a high accuracy for one class may arise from the model assigning too many examples to that class, consequently reducing the accuracy of another class.

Better measurement of private model disparities. One question that naturally arises is whether the per-class accuracy differences between models can be replicated across different random seeds. We would require disparity measurements to be stable across random seeds before concluding that private models have more disparity or a different type of disparity. To answer this question, we trained 50 different models at each 𝜀 over different random seeds and plot the spread of class-conditional accuracy observed. While all models, non-private and private, were within a narrow overall accuracy range at each privacy level (𝜀 = 1: 50% - 55% accuracy, 𝜀 = 8: 77% - 81% accuracy), Figure 19 (left) shows how much class-conditional disparity can vary from run to run across different random seeds. There is enough variation that it is not clear what the worst-performing class is. Furthermore, when we stop training a non-private model early so that its accuracy is similar to the 𝜀 = 8 model, the non-private models exhibit substantially less variation across random seeds than the private models. To explain this phenomenon, we can look at the precision-recall plot. In Figure 19 (right), there is a large spread in where each class lies in the precision-recall plot across random seeds.
For example, in some runs, the cat class has precision ≈ 0.82 and recall ≈ 0.45, which means that most images that are predicted to be cats are indeed cats, but there are also many cat images that are predicted to be another class. High precision and low recall cause under-prediction of a class, while low precision and high recall cause over-prediction of a class. Over-predicting one class implies that the model is also under-predicting another class; a model with high precision on the cat class will then necessarily have low precision on the dog class, since the two classes of images are often confused for one another. In Figure 19 (right), each model is illustrated by a set of 10 dependent plotted points, one for each class.

² While some non-private models can achieve up to 99.9% accuracy on CIFAR-10, our accuracy is a reasonable comparison to models without pre-training, architecture search, ensembling, or attention (https://benchmarks.ai/cifar-10).

Figure 19 | (Left) Box plot of class conditional accuracy shows significant variation across random seeds. (Right) Class-wise precision vs recall plot across random seeds. The variation in class conditional accuracy is due to private models over- and under-predicting certain classes.

This over- and under-prediction is not simply an artifact of a less accurate model. For an early-stopped non-private model with the same accuracy as the 𝜀 = 8 models, the precision-recall plot shows all classes to be on or near the diagonal across all runs. Class conditional accuracy of the last checkpoint of a model trained with DP-SGD does not provide stable estimates of disparity for multi-class classification. When subgroups defined by classes are not independent (e.g. CIFAR-10 / MNIST), the additional noise introduced by DP-SGD may cause over- and under-prediction of classes across runs. We observe that using EMA reduces the variance of accuracy under private training and directly mitigates the noisiness in these measurements, producing more robust disparity measurements because significant fluctuations as a function of the random seed are removed (Figure 21, right). Alternatively, we observe that using the F1 score as a measure of accuracy (without EMA) also helps reduce the variation of accuracy measurements across random seeds (Figure 21, left).

Results: Balanced Classes. To disentangle the roles of class imbalance and lower overall accuracy, we first compare private and non-private models trained with balanced data. When comparing an 𝜀 = 8 model to a non-private model that is ∼ 12% more accurate, we observe an absolute disparity gap between private and non-private models (Figure 20, left). However, it is unclear whether private training hurts specific classes, as suggested by previous work [6, 109], or whether these disparities arise from the model being overall less accurate. We compare models at 𝜀 = 8 (around ∼ 81% overall accuracy) with non-private models of the same architecture where training was stopped early (around ∼ 80% overall accuracy).
We see in Figure 20 (right) that the worst-class performance is comparable between private and non-private models. Beyond comparing two specific accuracy levels, we can also look at the progression of the worst-class accuracy as the overall accuracy increases during model training. Figure 22 shows the worst class (cat) accuracy as training progresses for 𝜀 = {8, 16, 50, ∞}. What would DP-SGD-induced disparity look like in this plot? We would expect to see private models diverging from the trajectory of non-private models by having a smaller slope. This would indicate that the worst group suffers from lower accuracy at the same level of overall accuracy. However, this is not what we observe. We observe private models matching or surpassing non-private models in terms of worst-group accuracy at the same overall accuracy. In the setting of balanced data for the CIFAR-10 dataset, we do not observe any evidence that DP-SGD exacerbates disparity in privacy-preserving models of similar accuracy.

Figure 20 | (Left) Class conditional accuracy of a non-private model (93% Top 1 accuracy) and an 𝜀 = 8 private model (81% Top 1 accuracy) trained on CIFAR-10 (balanced classes). Dotted lines represent the overall Top 1 accuracy. (Right) Comparison of the 𝜀 = 8 model with non-private models at a similar accuracy (EMA). We see that the per class disparities are similar.

Figure 21 | (Left) Using the F1 score instead of accuracy drastically reduces variation in per class performance for private models. (Right) Using checkpoint averaging like an Exponential Moving Average (EMA) also reduces variation.

Figure 22 | Overall accuracy (EMA) vs worst class accuracy (EMA) as private and non-private models are trained. We see that while private models stop at a lower overall accuracy, the worst class accuracy of private models is similar to or higher than that of the non-private model throughout the training trajectory. Each point corresponds to an (EMA) checkpoint in the training process.

Results: Imbalanced Classes. A crucial scenario we also consider is when certain classes in the training dataset are underrepresented. This is the case when [6] compare private and non-private model performance on a 60,000-image subset of iNaturalist [112]. The species classes are highly unbalanced, with the smallest class containing only 1,000 examples. The works [115] and [110] also create an artificially unbalanced dataset by only including 10% of class 8 images.
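A minimal sketch of how such an artificially imbalanced training set can be constructed by subsampling a single class; the class index, fraction and array shapes below are illustrative placeholders.

```python
import numpy as np

def subsample_class(images, labels, target_class, keep_fraction, seed=0):
    """Keep only a fraction of the examples of one class and all examples of
    the remaining classes, returning a shuffled, artificially imbalanced set."""
    rng = np.random.default_rng(seed)
    idx = np.arange(len(labels))
    cls_idx = idx[labels == target_class]
    kept_cls = rng.choice(cls_idx, size=int(keep_fraction * len(cls_idx)), replace=False)
    keep = np.concatenate([idx[labels != target_class], kept_cls])
    rng.shuffle(keep)
    return images[keep], labels[keep]

# e.g. keep only 10% of class 3 ("cat" in the standard CIFAR-10 label ordering).
images = np.zeros((50000, 32, 32, 3), dtype=np.uint8)
labels = np.repeat(np.arange(10), 5000)
x_small, y_small = subsample_class(images, labels, target_class=3, keep_fraction=0.1)
```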
In our experiments, we look at two classes: car and cat. We artificially create unbalanced training and validation sets with 50%, 20% and 10% of the cat class, while leaving the rest of the classes at 100% representation. We repeat the same setup with the car class. We keep the test set as the same 10,000 balanced examples. For the various levels of artificially unbalanced data, Figure 23 (left) shows the training progression of 𝜀 = 8 private models compared to non-private models. We see that in terms of overall accuracy, private models achieve about 10% lower accuracy. In terms of unbalanced class accuracy, we see that representation in the training set drastically affects the accuracy of the cat class. For 50% and 25% representation of cats in the training data, private models are comparable to non-private models in the regime where non-private model accuracy is below 75%. At 10% (∼500) cats in the training data, the private model achieves zero accuracy on cats despite a ∼ 75% overall accuracy. This is because the resulting model never predicts the cat class for any example. When a subclass is small enough, private models may not predict the label at all. In the limit, if only 1 example had the class label cat, then we would expect our DP models to essentially never predict that class. While differential privacy guarantees tell us what happens when there is only 1 example, we do not know what level of class under-representation causes a private model at a fixed 𝜀 to eschew predicting that class altogether in real-world datasets.

Figure 23 | (Left) Overall accuracy (EMA) vs cat class accuracy (EMA) at different levels of dataset imbalance for private and non-private models. (Right) Overall accuracy (EMA) vs car class accuracy (EMA) at different levels of dataset imbalance for private and non-private models. We see that while private models stop at a lower overall accuracy, the ratio (relative disparity) is similar throughout the training trajectory.

While the cat class is a difficult class for private and non-private models, the car class is an easy class. In Figure 20, we see that the car class consistently achieves a higher accuracy than the overall accuracy. To disentangle the effects of class difficulty and class representation in private training, we also train 𝜀 = 8 models on CIFAR-10 with the car class artificially under-sampled to 10%, 25% and 50%. In Figure 23 (right), the green cluster in the top right shows that non-private models achieve very similar final overall and car class accuracies for all the unbalanced datasets. The 𝜀 = 8 models at 25% and 50% imbalance have the same car (underrepresented) class accuracy as non-private models at the same overall accuracy. It is only at 10% under-representation that we observe that non-private models at the same overall accuracy have much higher accuracy on the underrepresented class.
This demonstrates that while DP-SGD does not necessarily imply worsened disparity when classes are underrepresented, DP-SGD can introduce worsened disparities at extreme levels of data imbalance. When measuring per-class disparities in private models, the mere presence of data imbalance does not dictate disparity. We compare two classes with different levels of “difficulty” and find that different levels of under-representation are required to observe differences between private and non-private model disparity. While our results here are not particularly surprising, they do motivate more systematic empirical analyses of the regimes where private models are comparable to non-private models in terms of disparity. The presence of data imbalance should not deter practitioners from applying privacy-preserving machine learning due to the risk of class disparities; the level of actual disparity depends on the dataset, the class, and the level of data imbalance.

E.2. Picking 𝜀: Rethinking Example Difficulty

We have focused primarily on 𝜀 = 8 CIFAR-10 models to make the case that training models with DP-SGD does not necessarily imply more disparity than non-private models. Different settings may require different levels of privacy guarantees. A natural question that arises is what kind of disparities stronger levels of privacy may produce. When comparing private models at different levels of privacy, Figure 24 illustrates that as 𝜀 increases from 1 to 8, the overall accuracy also increases. This is expected, since as 𝜀 approaches ∞ we would expect the same accuracy as a non-private model.

Figure 24 | Class conditional accuracy (EMA) and absolute disparity averaged across 5 random seeds for different values of 𝜀.

This gives further insight into Figure 22, where we only compare the worst class accuracy against the overall accuracy; in the CIFAR-10 dataset, cat is the worst class across different 𝜀. Suppose both the cat and bird classes were not part of the dataset; then it appears that the most disadvantaged class would depend on 𝜀. When 𝜀 ≤ 4, we see that deer is the worst class, while for 𝜀 ≥ 4, dog becomes the worst-performing class. There is a cross-over between the class conditional accuracy of deer and dog as 𝜀 increases. A naive practitioner may observe that deer is the worst class at a small 𝜀 and may then focus on improving the performance of that class, without realising that the ordering of example difficulty may change. The transition of the deer class being more difficult at small 𝜀 and easier at larger 𝜀 is consistent across random seeds and model architectures in our experiments. In contrast, during the training progression of non-private models, all classes progress in per-class accuracy without any crossover in ranking. In fact, prior work examining the difficulty of examples [18, 37, 56] relies on the assumption that example difficulty is not model/algorithm dependent in order to give useful score estimates of difficulty.
[64] observed one type of non-monotonic behaviour by finding examples in the CIFAR-10 dataset which correlate inversely with overall model performance for increasingly complex non-private models. However, the swapping of class difficulty we observe suggests that training private models at increasing 𝜀 may not behave the same as training non-private models of increasing complexity (in terms of parameters or training iterations). Future investigation into measuring and modelling changes in the relative difficulty of examples, and consequently classes, as the level of privacy changes is also crucial for understanding how to select privacy parameters.