Private Learning with Public Features

Walid Krichene, Nicolas Mayoraz, Steffen Rendle, Shuang Song, Abhradeep Thakurta, and Li Zhang*

Google

* Work done while at Google.

Abstract

We study a class of private learning problems in which the data is a join of private and public features. This is often the case in private personalization tasks such as recommendation or ad prediction, in which features related to individuals are sensitive, while features related to items (the movies or songs to be recommended, or the ads to be shown to users) are publicly available and do not require protection. A natural question is whether private algorithms can achieve higher utility in the presence of public features. We give a positive answer for multi-encoder models where one of the encoders operates on public features. We develop new algorithms that take advantage of this separation by only protecting certain sufficient statistics (instead of adding noise to the gradient). This method has a guaranteed utility improvement for linear regression, and importantly, achieves the state of the art on two standard private recommendation benchmarks, demonstrating the importance of methods that adapt to the private-public feature separation.

1 Introduction

Models trained on private user data risk leaking sensitive information (Korolova, 2010; Calandrino et al., 2011; Zhu et al., 2019), and Differential Privacy (DP) (Dwork et al., 2006) offers ways to quantify and limit this risk. One of the main challenges of DP training is that privacy guarantees come at the expense of losses in utility: on many benchmarks, state-of-the-art private models suffer significant quality losses compared to their non-private counterparts. To mitigate these losses, a large body of work (e.g., Bassily et al. (2020); Yu et al. (2021a); Zhou et al. (2021); Golatkar et al. (2022); Li et al. (2022a); Amid et al. (2022); Bassily et al. (2023); Ganesh et al. (2023), among others) has explored methods to improve utility by leveraging public data sources. For example, it was shown that in image classification and certain language tasks, pre-training models on large public data can significantly improve the utility of private models (Li et al., 2022b; De et al., 2022; Mehta et al., 2023; Ganesh et al., 2023).

The vast majority of these approaches assume access to public training examples. A different practical setting that remains under-explored, and which we propose to study, is access to public features. This is often the case in personalization problems such as recommendation or ad prediction, where the same training example contains sensitive features about an individual, as well as features about an item that the individual interacts with (e.g., the recommended movie or song, or the advertised product). These item features are public and don't need privacy protection; for example, the director of a movie, the genre of a song, or the brand of a product are all public information.

Formally, we assume access to a feature matrix X^pub that is public, and each training example contains some row of this matrix, together with private features and labels. Our goal is to design private algorithms adapted to this setting, and to study whether access to X^pub can improve privacy/utility trade-offs.

1.1 Contributions

1. We design new first-order algorithms for private learning with public features.
The main algorithmic novelty is that instead of protecting gradients (as one would do in noisy gradient descent (Bassily et al., 2014) and its variants), our algorithms work by computing and protecting certain sufficient statistics, which are then used to compute the final gradient. This has several practical advantages that we will discuss in detail; a notable one is that it allows us to preserve gradient sparsity, addressing a major practical issue of noisy gradient descent.

2. We give a utility analysis for linear regression (Theorem 3), where we compare the utility when some features are public to the utility when all features are private. We show that the excess empirical risk improves by a factor of √(m/p) (m is the number of items, and p is the number of public features). In other words, the typical √p scaling is replaced with a √m scaling. To the best of our knowledge, this is the first analysis that shows a utility improvement under public features.

3. Empirically, we evaluate our methods on two standard private recommendation benchmarks (one classification task and one regression task from the MovieLens data sets (Harper and Konstan, 2016)). Our method achieves the DP state of the art on both benchmarks, demonstrating the importance of algorithms that adapt to the private-public feature structure. Besides improving utility, our method is also more computationally efficient, achieving better quality than DP-SGD at a fraction of the computational cost.

We conclude with a discussion of limitations and open questions.

1.2 Related Work

Private learning with public data  Several approaches have been proposed to augment private training with public data. Existing methods broadly fall into two categories. The first is to use the public data to modify the training procedure: a common theme is to project gradients on a low-dimensional manifold estimated using public data (Kairouz et al., 2021; Yu et al., 2021b; Zhou et al., 2021), or to augment the private loss with terms that depend on the public samples (e.g. a regularization term as in (Golatkar et al., 2022), or a Bregman divergence term as in (Amid et al., 2022)). The second approach is to pre-train a model on public data, then fine-tune it on private data. This approach has proven successful in domains where public data is plentiful, as in image classification (De et al., 2022; Mehta et al., 2023) and natural language (Li et al., 2022b). For a comprehensive survey, see (Cummings et al., 2023). These works assume access to a separate public data set (often from the same or a similar distribution as the private data). Our setting is different, in that we assume that the training data includes some private and some public features. In this sense, our approach is orthogonal to, and can be combined with, the aforementioned methods.

Label DP  Perhaps the closest setting to private training in the presence of public features is the notion of label DP (Chaudhuri and Hsu, 2011; Ghazi et al., 2023, 2021). There, it is assumed that all features are public, and only labels are private. The setting of this paper bears some similarity with label DP, with two important distinctions: first, our setting aims to be more general, by allowing some but not all of the features to be public. Second, even when all features are public, our methods provide a stronger DP guarantee than label DP.
This will be discussed in more detail in Section 2, but in short, our methods protect which row of X^pub appears in a given training example (i.e., the correlation between the public features and the users), while label DP provides no such protection (it assumes that this is public knowledge).

Private model personalization  We will consider the class of multi-encoder models studied for example by Collins et al. (2021); Singhal et al. (2021); Jain et al. (2021); Shen et al. (2023). These models consist of learning a shared item encoder (that learns representations of items), together with a personalized user encoder (that learns individual preferences), as will be detailed in Section 2.2. Prior work has only considered the case where all features are private. We study the case where the item features are public, while user features and labels are private, and show that by designing algorithms that adapt to this separation, one can achieve better utility.

Finally, in recent work, Curmei et al. (2023) developed a method for private matrix completion with public features, a special case of our setting. They show strong empirical results (achieving the current SOTA on the DP MovieLens benchmarks), though their method is limited to matrix completion, and they do not provide a utility analysis. Our methods apply to a broader class of models, come with utility guarantees (albeit in a simplified case), and show substantial improvements on the same benchmarks.

1.3 Notation

Throughout the paper, m, n will denote the number of items and users respectively. For a vector x, we will write ∥x∥₂ for the Euclidean norm of x. For a matrix X, we will denote by ∥X∥_F its Frobenius norm, and by ∥X∥₂ its induced norm. We will denote by N^d a multivariate normal vector, and by N^{d×d} a symmetric matrix whose upper triangle entries are i.i.d. normal. Additional notation is summarized in Appendix A.

2 Problem Formulation

2.1 DP with Public Features

First, we formalize the notion of private learning with public features. Let X^pub ∈ R^{m×p} be a public feature matrix. To give a concrete example, one can think of rows of X^pub as representing items (for example movies to recommend, or ads to show to a user), and columns of X^pub as representing features (for example, all possible movie genres and cast, or all possible ad features such as brand and price). Note that practical applications tend to have a large number of categorical features, and X^pub tends to be sparse.

We assume that each training example is of the form (x^priv, y, x_j^pub): it contains a private feature vector and a private label, together with some row x_j^pub of X^pub. For instance, in a movie recommendation task, a training example might consist of a user's features x^priv, the features of a movie that this user watched (x_j^pub), and a rating y that the user assigned to the movie.

Remark 1. One important detail is that we want to protect which row of the public matrix appears in each training example. In the above scenario, even though the features of a given movie are public, we want to protect which movies were seen by a user, as this is sensitive information. We formalize this by saying that X^pub is public, but the row index j is private.

Formally, we will define the private dataset as follows:

  D = {(x_i^priv, y_i, j_i, k_i) ∈ R^q × R × [m] × [n]}_{i∈{1,...,D}},

where q is the dimension of private features, m is the number of items, and n is the number of users. Here j_i represents an item index (it indexes into X^pub), and k_i represents a user index (its only purpose is to allow for user-level privacy accounting).
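As a minimal sketch of this data layout (illustrative shapes and names in numpy, not the paper's code):

import numpy as np

rng = np.random.default_rng(0)
m, p = 1000, 50          # number of items, public feature dimension
q, n, D = 16, 200, 5000  # private feature dim, number of users, number of examples

# X^pub is known to everyone and does not change with the private dataset.
X_pub = rng.random((m, p))

# Each example stores the private features, the label, and the indices j_i, k_i.
# Both indices are private; x_{j_i}^pub is recovered by a lookup into X^pub.
dataset = [
    dict(x_priv=rng.normal(size=q),   # x_i^priv: private feature vector
         y=rng.normal(),              # y_i: private label
         j=int(rng.integers(m)),      # j_i: item index (sensitive!)
         k=int(rng.integers(n)))      # k_i: user index, for user-level accounting
    for _ in range(D)
]

x_pub_0 = X_pub[dataset[0]["j"]]  # the public row joined to example 0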
Given the dataset definition above, we adhere to the usual notion of (approximate) differential privacy:

Definition 1. A randomized algorithm A with output space 𝒮 is said to be (ε, δ)-DP if, for all neighboring data sets D, D′, and all measurable S ⊆ 𝒮,

  Pr(A(D) ∈ S) ≤ e^ε Pr(A(D′) ∈ S) + δ.

In example-level privacy, D, D′ are said to be neighboring data sets if one is obtained from the other by removing a single example. In user-level privacy, they are said to be neighboring if one is obtained from the other by removing all examples belonging to a user.

Remark 2 (Interpretation of DP with Public Features). Note that we use the standard DP definition. The only new assumption we make is that the private data includes an index j_i ∈ [m], used to look up a feature vector from a public matrix X^pub. The fact that X^pub is public means in particular that it doesn't change with the dataset D.

Remark 3 (Comparison to Label DP). Similar to our setting, label DP (Chaudhuri and Hsu, 2011; Ghazi et al., 2021) also defines a privacy guarantee where only part of the data needs protection. There are two distinctions with our setting: first, label DP assumes all features to be public, while DP with public features (our setting) allows for some features to be private. Second, even in the special case where all features are public, our setting provides a stronger notion of privacy. This is due to the definition of neighboring data sets: in label DP, neighboring data sets are only allowed to differ in a single label, and feature vectors are not allowed to change. As a consequence, participation in the process is not protected. In our setting, participation is protected (since completely removing the example yields a valid neighboring dataset). Another consequence is that label DP provides no protection of which public item appears in an example (again, since the feature vector is not allowed to change), while our setting does (the index j_i can change).

2.2 Multi-Encoder Models and Alternating Minimization

We consider multi-encoder models, studied for example by Agarwal and Chen (2009); Jain et al. (2021); Shen et al. (2023), and commonly used in practice in advertising (Agarwal and Chen, 2009), recommender systems (Covington et al., 2016; Volkovs et al., 2017), and similarity learning (Chopra et al., 2005; Schroff et al., 2015; Dong and Shen, 2018). In our case, given a training example (x^priv, x^pub), the model's prediction is given by

  f_{θ_u,θ_v}(x^priv, x^pub) = u_{θ_u}(x^priv) · v_{θ_v}(x^pub),    (1)

where v_{θ_v} : R^p → R^d is an encoder that maps the public features to an embedding in R^d, u_{θ_u} : R^q → R^d is a second encoder that maps the private features to another embedding in R^d, and the model's prediction is the dot product of the two embeddings. Finally, θ_v ∈ R^{d_v}, θ_u ∈ R^{d_u} are the parameters of the two encoders. We will refer to v_{θ_v}(·) as the public encoder (since it operates on public features) and to u_{θ_u}(·) as the private encoder.
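To fix ideas, here is a minimal sketch of the prediction (1) with two small, hypothetical encoders (the exact architectures and names are ours, not the paper's):

import numpy as np

rng = np.random.default_rng(1)
p, q, d = 50, 16, 8  # public / private feature dims, embedding dimension

# Hypothetical parameters: a one-hidden-layer public encoder, a linear private encoder.
theta_v = dict(W1=rng.normal(size=(p, 32)) / np.sqrt(p),
               W2=rng.normal(size=(32, d)) / np.sqrt(32))
theta_u = dict(W=rng.normal(size=(q, d)) / np.sqrt(q))

def v(x_pub, theta_v):   # public encoder v_{θ_v}: R^p -> R^d
    return np.tanh(x_pub @ theta_v["W1"]) @ theta_v["W2"]

def u(x_priv, theta_u):  # private encoder u_{θ_u}: R^q -> R^d
    return x_priv @ theta_u["W"]

def predict(x_priv, x_pub):
    # Eq. (1): dot product of the two embeddings.
    return u(x_priv, theta_u) @ v(x_pub, theta_v)

In Algorithm 1 below, UserUpdate would update theta_u with the public encoder frozen, and ItemUpdate would update theta_v with the private encoder frozen.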
This architecture is well-suited to our problem, as it creates a separation between the private and public features, and as we shall see, our algorithms will take advantage of this separation.

Remark 4 (Personalized model interpretation). One interpretation of this model (see Jain et al. (2021); Shen et al. (2023)) is that the prediction task is decomposed into training a shared item encoder (v_{θ_v}), which learns a user-independent representation of items, together with a personalized classifier (u_{θ_u}), which captures each user's preferences. This point of view is also common in the collaborative filtering literature (Hu et al., 2008).

The goal is to minimize the empirical risk

  Σ_{i=1}^D ℓ(u_{θ_u}(x_i^priv) · v_{θ_v}(x_{j_i}^pub), y_i),    (2)

where ℓ is a loss function. A popular approach to solve this problem is the Alternating Minimization (AM) procedure described in Algorithm 1, where one alternates between optimizing one encoder (while the other is frozen) and vice-versa.

Algorithm 1: Alternating Minimization
 1 Inputs: Training data D = {(x_i^priv, y_i, j_i, k_i)}_{i∈{1,...,D}}, number of outer steps S^outer
 2 for 1 ≤ s ≤ S^outer do
 3   θ_u ← UserUpdate(θ_u, θ_v, D)
 4   θ_v ← ItemUpdate(θ_u, θ_v, D)
 5 return θ_u, θ_v

The recent works of Chien et al. (2021); Jain et al. (2021); Shen et al. (2023); Curmei et al. (2023) all fall in this family of algorithms, and they only differ in how the encoders are optimized (i.e. in the sub-routines UserUpdate and ItemUpdate). None of them, however, considers the problem of training with public features. Our focus will be to design an algorithm for optimizing the public encoder. For the private encoder, any DP training procedure such as DP-SGD can be used.

3 Private Training of the Public Encoder

In this section, we design and analyze algorithms for optimizing the public encoder under a quadratic loss, i.e., ℓ(ŷ, y) = ½(ŷ − y)². We discuss in Appendix D the more general case where ℓ is convex. For ease of notation, we will denote u_i = u_{θ_u}(x_i^priv). We can then write the loss (2) in a more concise form:

  L(θ_v) = ½ Σ_{i=1}^D (u_i · v_{θ_v}(x_{j_i}^pub) − y_i)².    (3)

Note that since the private encoder is frozen, we only keep the explicit dependence on the public encoder parameters θ_v.

3.1 Gradient Decomposition under Public Features

The main observation is that we can factorize the gradient into terms that only depend on public features, and terms that only depend on private data. Let Ω_j = {i ∈ [D] : j_i = j} (in other words, Ω_j is the set of examples which contain item j). Then we have the following decomposition:

Proposition 1. The gradient of the loss (3) is equal to

  ∇L(θ_v) = Σ_{j=1}^m (∂v_{θ_v}(x_j^pub)/∂θ_v) [A_j v_{θ_v}(x_j^pub) − b_j],    (4)

where, for all j,

  A_j = Σ_{i∈Ω_j} u_i u_i^⊤,   b_j = Σ_{i∈Ω_j} y_i u_i.    (5)

Observe that in (4), both of the terms v_{θ_v}(x_j^pub) ∈ R^d and ∂v_{θ_v}(x_j^pub)/∂θ_v ∈ R^{d_v×d} only depend on public features (these terms correspond respectively to the public encoder's forward pass, and its Jacobian), while the terms A_j ∈ R^{d×d}, b_j ∈ R^d depend on private data (y_i is a private label, and u_i = u_{θ_u}(x_i^priv) is a function of the private features). We will refer to A_j, b_j as the sufficient statistics for item j. A_j can be viewed as a partial private covariance matrix (partial in the sense that it's the covariance of just the examples that involve the j-th row of the public matrix).
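The decomposition is easy to verify numerically. Here is a minimal sketch for a linear public encoder, so the Jacobian factor is just x_j^pub (all names and numbers are ours):

import numpy as np

rng = np.random.default_rng(2)
m, p, d, D = 20, 10, 4, 500
X_pub = rng.normal(size=(m, p))
theta_v = rng.normal(size=(p, d))   # linear encoder: v_j = θ_v^⊤ x_j^pub
u = rng.normal(size=(D, d))         # frozen private embeddings u_i
y = rng.normal(size=D)
j_idx = rng.integers(m, size=D)     # item index j_i of each example

# Sufficient statistics, eq. (5): A_j = Σ_{i∈Ω_j} u_i u_i^⊤, b_j = Σ_{i∈Ω_j} y_i u_i.
A = np.zeros((m, d, d)); b = np.zeros((m, d))
for i in range(D):
    A[j_idx[i]] += np.outer(u[i], u[i])
    b[j_idx[i]] += y[i] * u[i]

# Gradient via the decomposition (4).
grad = np.zeros_like(theta_v)
for j in range(m):
    v_j = theta_v.T @ X_pub[j]
    grad += np.outer(X_pub[j], A[j] @ v_j - b[j])

# Direct gradient of (3): ½ Σ_i (u_i · v_{j_i} − y_i)².
direct = np.zeros_like(theta_v)
for i in range(D):
    r = u[i] @ (theta_v.T @ X_pub[j_idx[i]]) - y[i]
    direct += r * np.outer(X_pub[j_idx[i]], u[i])
assert np.allclose(grad, direct)

Note that the statistics are computed in one pass over the data and do not depend on θ_v, which is exactly what the algorithms below exploit.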
3.2 Sufficient Statistics Perturbation Algorithms

The gradient decomposition in Proposition 1 suggests that one can achieve privacy protection by adding noise to the terms A_j, b_j, instead of the final gradient. We propose two algorithms based on this approach; see Algorithms 2 and 3. Both operate by first computing the sufficient statistics for all items, then taking T steps of gradient descent, where the gradient is computed based on the noised statistics. They differ in how the noise is added (the differences are highlighted in blue). The first algorithm samples independent noise at each iteration, while the second only adds noise once, and reuses the noisy statistics across all iterations. Notice that to achieve the same privacy, the noise standard deviations σ needed in the two algorithms are different.

Algorithm 2: SSP1: Sufficient Statistics Perturbation with Independent Noise
 1 Inputs: Public features X^pub, training data D = {(x_i^priv, y_i, j_i, k_i)}_{i∈{1,...,D}}, optional weights {w_i}¹, number of steps T, clipping parameters Γ_y, Γ_u, noise standard deviation σ, learning rates η_t, initial parameters θ_v^0, θ_u^0.
 2 Let ū_i = Clip(u_{θ_u}(x_i^priv), Γ_u), ȳ_i = Clip(y_i, Γ_y)
 3 for 1 ≤ j ≤ m do
 4   A_j ← Σ_{i∈Ω_j} w_i ū_i ū_i^⊤
 5   b_j ← Σ_{i∈Ω_j} w_i ȳ_i ū_i
 6 for 0 ≤ t ≤ T − 1 do
 7   for 1 ≤ j ≤ m do
 8     Â_j^t = A_j + σΓ_u² N^{d×d}
 9     b̂_j^t = b_j + σΓ_yΓ_u N^d
10   Ĝ^t ← Σ_{j=1}^m (∂v_{θ_v^t}(x_j^pub)/∂θ_v) [Â_j^t v_{θ_v^t}(x_j^pub) − b̂_j^t]
11   θ_v^{t+1} ← θ_v^t − η_t Ĝ^t
12 return θ_v^T

Algorithm 3: SSP2: Sufficient Statistics Perturbation with Correlated Noise
 1 Inputs: Public features X^pub, training data D = {(x_i^priv, y_i, j_i, k_i)}_{i∈{1,...,D}}, optional weights {w_i}, number of steps T, clipping parameters Γ_y, Γ_u, noise standard deviation σ, learning rates η_t, initial parameters θ_v^0, θ_u^0.
 2 Let ū_i = Clip(u_{θ_u}(x_i^priv), Γ_u), ȳ_i = Clip(y_i, Γ_y)
 3 for 1 ≤ j ≤ m do
 4   Â_j ← Σ_{i∈Ω_j} w_i ū_i ū_i^⊤ + σΓ_u² N^{d×d}
 5   b̂_j ← Σ_{i∈Ω_j} w_i ȳ_i ū_i + σΓ_yΓ_u N^d
 6 for 0 ≤ t ≤ T − 1 do
 7   Ĝ^t ← Σ_{j=1}^m (∂v_{θ_v^t}(x_j^pub)/∂θ_v) [Â_j v_{θ_v^t}(x_j^pub) − b̂_j]
 8   θ_v^{t+1} ← θ_v^t − η_t Ĝ^t
 9 return θ_v^T

¹ The optional weights w_i (used as weights in the sufficient statistics, see Lines 4-5) are useful in practice for user-level privacy: for example, by reducing the weights of a user who has many examples, this allows us to control the worst-case user sensitivity. See Theorem 2 for a precise statement on how weights affect the DP guarantee.

Remark 5. Reusing the noisy statistics is only possible because A_j, b_j do not depend on the iterate θ_v^t (see eq. (5)), so they don't need to be recomputed. One consequence is that the scale of the noise in Algorithm 3 is independent of the number of steps T (as will be apparent in the privacy guarantee). However, this makes the noise in the gradient estimates correlated across iterations, which makes utility analysis difficult. For this reason, our utility analysis will be for Algorithm 2 (the independent noise version). Empirically, we evaluate both algorithms, and we find that the correlated version (Algorithm 3) performs better.

To give some intuition for how this approach improves upon noisy GD, observe that the gradient is a vector in R^{d_v} (where d_v is the number of parameters of the encoder, which can be very large), while the right factor in the decomposition (4) (the term A_j v_{θ_v}(x_j^pub) − b_j) is a vector in R^d, where d is the output dimension of the encoder, and typically much smaller than d_v. Protecting the lower dimensional object is more efficient, as will become clear in the analysis.
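A minimal sketch of the one-shot noising step of Algorithm 3 (SSP2), with N^{d×d} symmetrized as in Section 1.3 (function and variable names are ours):

import numpy as np

def noisy_statistics(A, b, sigma, gamma_u, gamma_y, rng):
    # A: (m, d, d) stacked A_j; b: (m, d) stacked b_j (already clipped/weighted).
    m, d, _ = A.shape
    A_hat, b_hat = A.copy(), b.copy()
    for j in range(m):
        Z = rng.normal(size=(d, d))
        Z = np.triu(Z) + np.triu(Z, 1).T    # symmetric, i.i.d. upper-triangle entries
        A_hat[j] += sigma * gamma_u**2 * Z  # Â_j = A_j + σΓ_u² N^{d×d}
        b_hat[j] += sigma * gamma_y * gamma_u * rng.normal(size=d)  # b̂_j = b_j + σΓ_yΓ_u N^d
    return A_hat, b_hat  # noised once, then reused at every gradient step

In SSP1, the same two noise additions would instead be executed inside the loop over t, with fresh noise at each step.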
Preserving Gradient Sparsity  Another practical advantage of adding noise to the sufficient statistics is that it allows us to preserve gradient sparsity. Sparse gradients are very common in models that use a large number of categorical features, which is the case in most personalization models (Cheng et al., 2016). Typically, only few features are active for a given item (and inactive features have 0 gradients). This sparsity will manifest in the Jacobian term in eq. (4): indeed, if the encoder is of the form v(x_j^pub) = ν(θ_in^⊤ x_j^pub) (θ_in represents an input embedding layer, and ν is the rest of the encoder), then its Jacobian w.r.t. θ_in is of the form x_j^pub ρ_j^⊤ for some vector ρ_j, meaning that for any feature that is not active (x_{jl}^pub = 0), the corresponding l-th row in the Jacobian (and hence in the gradient) is also 0. See Appendix E for an illustration of the sparsity in our experiments. In noisy gradient descent and its variants, since noise is added to the final gradient, this destroys its sparsity; this was identified as one of the challenges of applying DP-SGD in practice (Zhang et al., 2021). In our SSP algorithms, since no noise is added to the Jacobian, its sparsity (and that of the gradient) are preserved.

3.3 Privacy Guarantees

We now state the privacy guarantee for both algorithms. All proofs are deferred to the appendix.

Theorem 1 (Example-level Privacy Guarantee). Let w_i = 1 for all i. Let ε, δ > 0 with ε < log(1/δ). Then Algorithm 2 (resp. Algorithm 3) with standard deviation σ = √(8T log(1/δ))/ε (resp. σ = √(8 log(1/δ))/ε) is (ε, δ)-DP.

The main difference between the two algorithms is that the noise scales with √T in the independent noise version. To state the user-level guarantee, it will be useful to define the set Ω^k = {i : k_i = k} (in other words, these are the indices of examples that belong to user k).

Theorem 2 (User-level Privacy Guarantee). Let w̄² = max_k Σ_{i∈Ω^k} w_i². Let ε, δ > 0 with ε < log(1/δ). Then Algorithm 2 (resp. Algorithm 3) with noise standard deviation σ = w̄√(8T log(1/δ))/ε (resp. σ = w̄√(8 log(1/δ))/ε) is (ε, δ)-DP.

3.4 Utility Analysis for Linear Encoders

Our goal in this section is to identify settings in which access to public features improves utility compared to the same problem where all features are private. We will consider linear encoders, i.e. v_{θ_v}(x_j^pub) = θ_v^⊤ x_j^pub where θ_v ∈ R^{p×d}. The problem is to minimize L(θ_v) = ½ Σ_{i=1}^D (x_{j_i}^{pub⊤} θ_v u_i − y_i)², and the gradient factorization in Proposition 1 simplifies to

  ∇L(θ_v) = Σ_{j=1}^m x_j^pub [A_j θ_v^⊤ x_j^pub − b_j]^⊤ ∈ R^{p×d}.

Assumption 1. In this section, we assume that there exist Γ_x, Γ_y, Γ_u such that for all j, ∥x_j^pub∥₂ ≤ Γ_x, and for all i, ∥u_i∥₂ ≤ Γ_u and |y_i| ≤ Γ_y. Furthermore, we will optimize the problem over a bounded set Θ = {θ ∈ R^{p×d} : max_j ∥θ^⊤ x_j^pub∥₂ ≤ Γ_y/Γ_u}.

The boundedness assumptions on y, u are standard in private linear regression (Wang, 2018). The definition of the feasible set Θ simply requires that the model's output (x_j^{pub⊤} θ u_i) be of the same scale as the labels; this is often enforced in practice by normalizing the output of the encoder. This condition is for convenience (so that certain constants simplify). We have the following bound on the excess empirical risk:

Theorem 3 (Utility Guarantee of Algorithm 2). Suppose Assumption 1 holds. Let Γ = Γ_x Γ_y Γ_u. Let θ_v* = argmin_{θ_v∈Θ} L(θ_v), and let θ̂_v be the output of projected SSP1 run with σ = ρ√T, for T = D²/(mdρ²) steps and with learning rates η_t = |Θ|/(ΓD√(8t)). Then

  E[L(θ̂_v)] − L(θ_v*) = Õ(|Θ| Γ ρ √(md)),

where Õ hides poly-log factors. Furthermore, setting ρ = √(8 log(1/δ))/ε guarantees that the algorithm is (ε, δ)-DP.
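A small helper for the σ (and ρ) calibration in Theorems 1-2 (a sketch; the function name is ours):

import math

def ssp_noise_std(eps, delta, T, independent_noise, w_bar=1.0):
    # Theorems 1-2 require ε < log(1/δ). w_bar=1 gives the example-level guarantee;
    # independent_noise=True is SSP1 (Algorithm 2), False is SSP2 (Algorithm 3).
    assert 0 < eps < math.log(1 / delta)
    steps = T if independent_noise else 1
    return w_bar * math.sqrt(8 * steps * math.log(1 / delta)) / eps

# For example, with ε = 1, δ = 1e-5 and T = 100 steps:
#   SSP1 needs σ ≈ 95.97, while SSP2 needs σ ≈ 9.60 (a factor √T smaller).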
To highlight the implications of Theorem 3, we compare the bound to (projected) noisy gradient descent (Bassily et al., 2014), where at each iteration

  θ_v^{t+1} = Π_Θ [θ_v^t − η (Σ_{i=1}^D Clip(g_i(θ_v^t), Γ) + σΓ N^{p×d})],

where g_i(θ_v) = x_{j_i}^pub [u_i u_i^⊤ θ_v^⊤ x_{j_i}^pub − y_i u_i]^⊤ (note that under Assumption 1, we have the bound ∥g_i∥₂ ≤ Γ = Γ_x Γ_u Γ_y, hence we use Γ as the clipping constant). A standard analysis (Bassily et al., 2014) shows that the excess empirical risk for noisy GD is bounded by Õ(|Θ|Γρ√(pd)). The main difference in Theorem 3 is that the dimension dependence √(pd) is replaced with √(md). Recall that d is the output dimension of the encoder, m is the number of public items, and p is the dimension of public features. For data with many categorical features, p can be orders of magnitude larger than m, especially when one uses feature crosses. In such cases, the bound of Theorem 3 suggests potentially large utility improvements.

3.5 Other Practical Considerations

Mini-batching  Our algorithms were described in the full-batch setting for simplicity. In this section, we discuss their mini-batched variants. In SSP1, mini-batching can be achieved by replacing Lines 7-9 with the following: uniformly sample a batch of examples B ⊂ [D], then compute the sufficient statistics over the batch; see Algorithm 4 in the appendix. In SSP2, mini-batching can be achieved by replacing Line 7 with the following: uniformly sample a batch B ⊂ [m] of items, then let

  Ĝ^t = Σ_{j∈B} (∂v_{θ_v^t}(x_j^pub)/∂θ_v) (Â_j v_{θ_v^t}(x_j^pub) − b̂_j);

see Algorithm 5 in the appendix. Note that in this case, we sample items instead of training examples, which in many cases has a computational advantage that we discuss below. This per-item sampling is not recommended, however, in DP-SGD or SSP1, because sampling per item precludes amplification by sampling (Bassily et al., 2014; Abadi et al., 2016), which would lead to worse privacy guarantees. Per-item sampling is possible in SSP2 because we only add noise once (so there is no need for amplification).

Computational Complexity  Per-item sampling in SSP2 has important implications on computational cost, which we now discuss. First, notice that the gradient of an example i is J_{j_i}[u_i u_i^⊤ v_{j_i} − y_i u_i] (where we write v_j = v_{θ_v}(x_j^pub) for the encoder's output, and J_j := ∂v_{θ_v}(x_j^pub)/∂θ_v for the Jacobian of item j). This involves computing a forward pass v_j and a partial back-propagation J_j. Under per-record sampling (DP-SGD or SSP1), an item j is revisited many times during one epoch (since it appears in all records in Ω_j), so the forward/backward passes for this item are recomputed at each visit. Whereas under per-item sampling (SSP2), the forward/backward passes are computed exactly once per item per epoch, reducing the cost of gradient computation. At the same time, SSP has the additional overhead of computing the sufficient statistics, but since this is done once, one can hope to amortize it across iterations.
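A minimal sketch of one per-item mini-batch step of SSP2; the `forward` and `jacobian` callables are hypothetical stand-ins for the encoder's forward pass v_j(θ_v) ∈ R^d and Jacobian J_j ∈ R^{d_v×d}:

import numpy as np

def ssp2_minibatch_grad(items_B, A_hat, b_hat, forward, jacobian, theta_v):
    grad = None
    for j in items_B:               # B ⊂ [m]: sampled items, not examples
        v_j = forward(j, theta_v)   # one forward pass per item per batch
        J_j = jacobian(j, theta_v)  # one partial backward pass per item per batch
        g = J_j @ (A_hat[j] @ v_j - b_hat[j])
        grad = g if grad is None else grad + g
    return grad

Because the statistics Â_j, b̂_j are fixed, each item in the batch costs one forward/backward pass plus a d × d matrix-vector product, which is the O(c + d²) per-item cost appearing in Proposition 2 below.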
The following proposition compares the computational cost of SSP2 and DP-SGD; more precisely, we will compare to an optimized version of DP-SGD (that avoids recomputing the forward/backward passes if the same item appears more than once in the batch; see Algorithm 6 in the appendix).

Proposition 2. Let c be the average cost of computing one forward/backward pass of the public encoder. Let β_j be the expected number of batches in which item j appears per epoch. Then the total expected cost over e epochs is bounded as follows: the cost of DP-SGD (Algorithm 6) is Ω(ce Σ_{j=1}^m β_j), and the cost of SSP2 (Algorithm 5) is O(Dd² + (c + d²)em).

We give a detailed comparison in Appendix C under different regimes. Here, we discuss a simplified case. Suppose that e ≥ D/m; then the cost of SSP2 becomes O((c + d²)em), and we can more easily compare it to DP-SGD. The ratio (cost of SSP2 divided by cost of DP-SGD) is O(m(c + d²)/(cβ)) = O((m/β)(1 + d²/c)), where β = Σ_{j=1}^m β_j. The first term m/β is smaller than 1 (typically orders of magnitude smaller): indeed, the quantity β is at least m (the full-batch case) and at most D (when the batch size is 1). It can be estimated precisely if one knows the item counts (we show examples in Appendix C). Finally, to bound the term d²/c, notice that if the encoder has at least one hidden layer of width ≥ d (a reasonable assumption, since d is the encoder's output dimension), then c ≥ d², and the ratio simplifies to O(m/β), which, as argued above, can be very small. Our experiments fall under this setting, and the improvement in our case is approximately two orders of magnitude (see next section).

4 Experimental Results

We evaluate our methods on two standard private recommendation benchmarks (Jain et al., 2018; Chien et al., 2021; Krichene et al., 2023; Curmei et al., 2023), based on the MovieLens data sets (Harper and Konstan, 2016). The first is a regression task on MovieLens 10M (abbreviated as ML10M in the results), and the second is a classification task on MovieLens 20M (abbreviated as ML20M).

4.1 Experimental Setup

Data  Each training example is a tuple (x_i^priv, x_{j_i}^pub, y_i), where x_i^priv is a user's id, x_{j_i}^pub is the feature vector of a movie, and y_i is a rating that the user assigned to the movie. The public features X^pub consist of metadata contained in the original MovieLens data (movie genre and release year), along with a richer set of metadata extracted from Wikidata by Curmei et al. (2023), containing information such as movie cast and fine-grained genres. Statistics about the data and features are detailed in Appendix E.1.

[Figure 1: Privacy-RMSE trade-off in ML10M (lower is better). Test RMSE vs. ε for DP-SGD, DP-ALS (id only), DP-CMF, AM-DPSGD, AM-SSP, and the non-private baseline.]

[Figure 2: Privacy-Recall trade-off in ML20M (higher is better). Test Recall@20 vs. ε for DP-CMF, AM-DPSGD, AM-SSP, the id-only variants of AM-DPSGD and AM-SSP, and the non-private baseline.]

Models and Algorithms  The current SOTA on these DP benchmarks consists of matrix completion methods: the DP-ALS algorithm of Chien et al. (2021) (which does not use features, only movie ids), and the DP-CMF algorithm of Curmei et al. (2023) (which factorizes a concatenation of the rating matrix and the public feature matrix). Matrix completion is a special case of the multi-encoder setting (1), where both encoders are linear and the feature vectors are one-hot. In our experiments, we keep the same id-based user encoder, but train a more general item encoder, consisting of a wide embedding layer (∈ R^{p×d}, which embeds the categorical features) followed by a dense layer (∈ R^{d×d}). The model is trained using alternating minimization (AM, see Algorithm 1), where UserUpdate is the same as in DP-ALS/DP-CMF, and we evaluate different methods for the ItemUpdate². We consider two variants: AM-DPSGD uses DP-SGD for the ItemUpdate, and AM-SSP uses SSP2 (Algorithm 3) for the ItemUpdate (we also compare the SSP1 and SSP2 variants).

² Since we design algorithms for training the public encoder, our comparison uses the same user encoder and the same UserUpdate, so quality differences can be attributed to the public encoder and the ItemUpdate.
Evaluation Protocol  We follow the exact protocol used in prior work on these benchmarks (Jain et al., 2018; Chien et al., 2021; Krichene et al., 2023; Curmei et al., 2023) and use their code for data processing and privacy accounting. Each data set is split into training/validation/test; we tune hyper-parameters on the validation data and report metrics on test data (the shaded region on each plot shows the standard deviation across ten runs). Utility is measured using RMSE on ML10M and Recall@20 on ML20M. We report user-level (ε, δ) privacy guarantees, where δ = 10^{−5} for ML10M and 8·10^{−6} for ML20M. A fully detailed account of the setup (and additional experiments) can be found in Appendix E.

4.2 Privacy Utility Trade-off

We compare the privacy/utility trade-offs of the different algorithms. The results for ML10M are reported in Figure 1. We start by observing that feature-based methods (DP-CMF, AM-DPSGD and AM-SSP) all improve upon the DP-ALS baseline (which uses only item ids). Our method (AM-SSP) significantly improves upon the prior SOTA (DP-CMF), bringing utility much closer to the non-private baseline. The absolute improvement ranges from 0.025 (at ε = 1) to 0.0123 (at ε = 20).³

³ To put this into perspective, an absolute improvement of 0.004 on the ML10M benchmark is considered significant, and some works have reported even smaller improvements; see (Rendle et al., 2019) for a survey.

Turning to the ML20M benchmark (Figure 2), we see a more modest (though non-trivial) improvement compared to DP-CMF. One interesting observation, however, is that AM-DPSGD performs quite poorly on this benchmark: observe the large gap between AM-SSP and AM-DPSGD. To further investigate this performance gap, we apply the same methods to an id-only model that doesn't use any of the item features (the results are in dashed lines on the same figure). Remarkably, adding features improves quality under AM-SSP, but it degrades it under AM-DPSGD. Recall that SSP takes explicit advantage of the fact that movie features are public, while DP-SGD doesn't. This is an example where adapting to the public-private feature separation leads to substantial gains.

4.3 Computational Cost

Besides quality improvements, we also highlight the computational advantage of AM-SSP. We plot, in Figure 3, the time-evolution of the test RMSE for AM-SSP and AM-DPSGD (with the optimal hyper-parameters tuned separately for each method).

[Figure 3: Time-evolution of test RMSE on ML10M. Test RMSE vs. training time (in seconds, log scale) for AM-DPSGD and AM-SSP at ε ∈ {1, 5, 20}.]

The results show approximately two orders of magnitude difference in training time between the two methods. This illustrates the complexity advantage of SSP discussed in Section 3.5. The extent of the difference will generally depend on the problem parameters (such as the number of items, sparsity, and batch size). This example shows that the improvement can be quite significant in practice (similar results are shown in the appendix for ML20M).

4.4 Comparing SSP1 and SSP2

Finally, we assess the empirical difference between the SSP1 and SSP2 variants (Algorithms 2-3); see Figure 4. In this experiment, we only optimize the public encoder (i.e., without alternating minimization). Recall that SSP1 resamples noise at every step, while SSP2 adds noise once and reuses the statistics across iterations.

[Figure 4: Comparison of SSP1 and SSP2 on ML10M. Test RMSE vs. the number of noise resamples, for ε ∈ {1, 5, 20}.]
Here we experiment with varying the number of times noise is resampled, so resamples=1 corresponds to SSP2, while resamples=16 (16 is the total number of steps) corresponds to SSP1. We find that as we increase the resampling frequency, quality degrades, and this appears to be more significant for lower ε (similar results hold on ML20M).

5 Discussion

The AM-SSP method has several advantages in practice. It is specifically designed to take advantage of public features, it is computationally cheaper in many cases, it preserves gradient sparsity, and the fact that we can reuse noised statistics (see Remark 5) seems to further improve quality.

One limitation of the SSP approach is that the number of statistics to protect increases with the number of items. The method will generally work better when there are fewer items (and more examples per item). When too many fine-grained public features are used, this may lead to an increase in the number of unique items, and a degradation of quality. To give a hypothetical example, suppose font size is used as an ad feature. Then ads that are otherwise identical except for the font size will be treated as different items. We recommend excluding such fine-grained features from the public encoder to control the number of unique items (these excluded features can be incorporated in the private encoder). An interesting open question is to develop a more systematic approach to learning items that are similar but not identical (in the sense that their public feature vectors are close in some metric). One promising observation is that since the feature matrix X^pub is public, one can compute such feature-based similarity without paying a privacy cost.

Another direction is motivated by our empirical observation that the correlated noise version of SSP performs better in practice than the independent noise version. Developing an analytical understanding of this variant, and conditions under which it provably improves utility, is an interesting open question.

References

M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS '16, pages 308–318, New York, NY, USA, 2016. Association for Computing Machinery.

D. Agarwal and B.-C. Chen. Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pages 19–28, New York, NY, USA, 2009. Association for Computing Machinery.

E. Amid, A. Ganesh, R. Mathews, S. Ramaswamy, S. Song, T. Steinke, V. M. Suriyakumar, O. Thakkar, and A. Thakurta. Public data-assisted mirror descent for private model training. In International Conference on Machine Learning, pages 517–535. PMLR, 2022.

R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464–473, 2014.

R. Bassily, S. Moran, and A. Nandi. Learning from mixtures of private and public populations. Advances in Neural Information Processing Systems, 33:2947–2957, 2020.

R. Bassily, M. Mohri, and A. T. Suresh. Principled approaches for private adaptation from a public source.
In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 8405–8432. PMLR, 2023.

K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS '17, pages 1175–1191, New York, NY, USA, 2017. Association for Computing Machinery.

J. A. Calandrino, A. Kilzer, A. Narayanan, E. W. Felten, and V. Shmatikov. "You might also like:" privacy risks of collaborative filtering. In 2011 IEEE Symposium on Security and Privacy, pages 231–246. IEEE, 2011.

K. Chaudhuri and D. Hsu. Sample complexity bounds for differentially private learning. In Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pages 155–186. PMLR, 2011.

H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS 2016, pages 7–10, New York, NY, USA, 2016.

S. Chien, P. Jain, W. Krichene, S. Rendle, S. Song, A. Thakurta, and L. Zhang. Private alternating least squares: Practical private matrix completion with tighter rates. In International Conference on Machine Learning, pages 1877–1887. PMLR, 2021.

S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 539–546, 2005.

L. Collins, H. Hassani, A. Mokhtari, and S. Shakkottai. Exploiting shared representations for personalized federated learning. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 2089–2099. PMLR, 2021.

P. Covington, J. Adams, and E. Sargin. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys '16, pages 191–198, New York, NY, USA, 2016. Association for Computing Machinery.

R. Cummings, D. Desfontaines, D. Evans, R. Geambasu, M. Jagielski, Y. Huang, P. Kairouz, G. Kamath, S. Oh, O. Ohrimenko, et al. Challenges towards the next frontier in privacy. arXiv preprint arXiv:2304.06929, 2023.

M. Curmei, W. Krichene, L. Zhang, and M. Sundararajan. Private matrix factorization with public item features. In Proceedings of the 17th ACM Conference on Recommender Systems, RecSys '23, pages 805–812, New York, NY, USA, 2023. Association for Computing Machinery.

S. De, L. Berrada, J. Hayes, S. L. Smith, and B. Balle. Unlocking high-accuracy differentially private image classification through scale. CoRR, abs/2204.13650, 2022.

X. Dong and J. Shen. Triplet loss in Siamese network for object tracking. In Computer Vision – ECCV 2018, pages 472–488, Cham, 2018. Springer International Publishing.

C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC'06, pages 265–284, Berlin, Heidelberg, 2006. Springer-Verlag.

A. Ganesh, M. Haghifam, M. Nasr, S. Oh, T. Steinke, O. Thakkar, A. Guha Thakurta, and L. Wang.
Why is public pretraining necessary for private model training? In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 10611–10627. PMLR, 2023.

B. Ghazi, N. Golowich, R. Kumar, P. Manurangsi, and C. Zhang. Deep learning with label differential privacy. In Advances in Neural Information Processing Systems, 2021.

B. Ghazi, P. Kamath, R. Kumar, E. Leeman, P. Manurangsi, A. Varadarajan, and C. Zhang. Regression with label differential privacy. In The Eleventh International Conference on Learning Representations, 2023.

A. Golatkar, A. Achille, Y.-X. Wang, A. Roth, M. Kearns, and S. Soatto. Mixed differential privacy in computer vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8376–8386, 2022.

I. Goodfellow. Efficient per-example gradient computations. CoRR, abs/1510.01799, 2015.

F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.

Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM '08, pages 263–272, 2008.

P. Jain, O. D. Thakkar, and A. Thakurta. Differentially private matrix completion revisited. In International Conference on Machine Learning, pages 2215–2224. PMLR, 2018.

P. Jain, J. Rush, A. Smith, S. Song, and A. Guha Thakurta. Differentially private model personalization. In Advances in Neural Information Processing Systems, volume 34, pages 29723–29735. Curran Associates, Inc., 2021.

P. Kairouz, M. R. Diaz, K. Rush, and A. Thakurta. (Nearly) dimension independent private ERM with AdaGrad rates via publicly estimated subspaces. In Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 2717–2746. PMLR, 2021.

A. Korolova. Privacy violations using microtargeted ads: A case study. In 2010 IEEE International Conference on Data Mining Workshops, pages 474–482. IEEE, 2010.

W. Krichene, P. Jain, S. Song, M. Sundararajan, A. Guha Thakurta, and L. Zhang. Multi-task differential privacy under distribution skew. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 17784–17807. PMLR, 2023.

J. Lee, S. Kim, G. Lebanon, and Y. Singer. Local low-rank matrix approximation. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML'13, pages II-82–II-90, 2013.

T. Li, M. Zaheer, S. Reddi, and V. Smith. Private adaptive optimization with side information. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 13086–13105. PMLR, 2022a.

X. Li, F. Tramer, P. Liang, and T. Hashimoto. Large language models can be strong differentially private learners. In International Conference on Learning Representations, 2022b.

D. Liang, R. G. Krishnan, M. D. Hoffman, and T. Jebara. Variational autoencoders for collaborative filtering. WWW '18, pages 689–698, 2018.

S. Liu and Y. Zheng. Long-Tail Session-Based Recommendation, pages 509–514. Association for Computing Machinery, New York, NY, USA, 2020.

H. Mehta, A. G. Thakurta, A. Kurakin, and A. Cutkosky. Towards large scale transfer learning for differentially private image classification. Transactions on Machine Learning Research, 2023.
I. Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275. IEEE, 2017.

S. Rendle, L. Zhang, and Y. Koren. On the difficulty of evaluating baselines: A study on recommender systems. CoRR, abs/1905.01395, 2019.

F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 815–823. IEEE Computer Society, 2015.

Z. Shen, J. Ye, A. Kang, H. Hassani, and R. Shokri. Share your representation only: Guaranteed improvement of the privacy-utility tradeoff in federated learning. In The Eleventh International Conference on Learning Representations, 2023.

K. Singhal, H. Sidahmed, Z. Garrett, S. Wu, J. K. Rush, and S. Prakash. Federated reconstruction: Partially local federated learning. In Advances in Neural Information Processing Systems, 2021.

M. Volkovs, G. Yu, and T. Poutanen. DropoutNet: Addressing cold start in recommender systems. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

Y. Wang. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI 2018, pages 93–103. AUAI Press, 2018.

H. Yin, B. Cui, J. Li, J. Yao, and C. Chen. Challenging the long tail recommendation. Proc. VLDB Endow., 5(9):896–907, 2012.

D. Yu, S. Naik, A. Backurs, S. Gopi, H. A. Inan, G. Kamath, J. Kulkarni, Y. T. Lee, A. Manoel, L. Wutschitz, et al. Differentially private fine-tuning of language models. arXiv preprint arXiv:2110.06500, 2021a.

D. Yu, H. Zhang, W. Chen, J. Yin, and T.-Y. Liu. Large scale private learning via low-rank reparametrization. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12208–12218. PMLR, 2021b.

H. Zhang, I. Mironov, and M. Hejazinia. Wide network learning with differential privacy. CoRR, abs/2103.01294, 2021.

Y. Zhou, S. Wu, and A. Banerjee. Bypassing the ambient dimension: Private SGD with gradient subspace identification. In International Conference on Learning Representations, 2021.

L. Zhu, Z. Liu, and S. Han. Deep leakage from gradients. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

A Notation

We summarize notation used throughout the paper in this section for easy reference. For a positive integer n, we denote by [n] the set {1, . . . , n}. Given a convex set C, we denote by Π_C the Euclidean projection on C, and by ∥C∥ the diameter of C. For a vector x, we denote by Clip(x, Γ) the Euclidean projection of x on the L2 ball of radius Γ.

The data is given by

  D = {(x_i^priv, y_i, j_i, k_i) ∈ R^q × R × [m] × [n]}_{i∈{1,...,D}},

and we will always use i to index training examples, j to index items (i.e. rows in the public matrix X^pub), and k to index users (for user-level DP). We also use the following partitions of the data set: D = ⊔_{j=1}^m Ω_j = ⊔_{k=1}^n Ω^k, where Ω_j = {i : j_i = j} is the set of examples that involve public item j, and Ω^k = {i : k_i = k} is the set of examples that belong to user k.

The model is given by

  f_{θ_u,θ_v}(x_i^priv, x_j^pub) = u_{θ_u}(x_i^priv) · v_{θ_v}(x_j^pub),

and we use the shorthands u_i(θ_u) = u_{θ_u}(x_i^priv) and v_j(θ_v) = v_{θ_v}(x_j^pub). This allows us to emphasize the dependence on the encoders' parameters.

Finally, the dimensions of the problem are summarized in the table below.

  m         number of items
  n         number of users
  D         number of examples
  p         dimension of public features (X^pub ∈ R^{m×p})
  d         encoders' output dimension (v_j(θ_v) ∈ R^d)
  d_v, d_u  encoders' parameters (θ_v ∈ R^{d_v})

Table 1: Dimension Parameters
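The Clip operator above is just a projection onto an L2 ball; a one-line sketch (our own helper, not the paper's code):

import numpy as np

def clip(x, gamma):
    # Clip(x, Γ): project x onto the L2 ball of radius Γ (works for vectors,
    # and for matrices via the Frobenius norm, since numpy flattens both).
    norm = np.linalg.norm(x)
    return x if norm <= gamma else (gamma / norm) * x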
B Proofs

B.1 Proof of Theorem 1

Proof. The result is an application of the Gaussian mechanism. Let D, D′ be two neighboring data sets that differ by a single example (x_i^priv, y_i, j_i, k_i). This example only affects the statistics of item j_i (all other statistics remain unchanged). Denote by A_{j_i}, b_{j_i} (resp. A′_{j_i}, b′_{j_i}) the statistics computed on data set D (resp. D′). Then

  ∥A_{j_i} − A′_{j_i}∥_F² = ∥ū_i ū_i^⊤∥_F² ≤ Γ_u⁴,

thus releasing A_{j_i} + σΓ_u² N^{d×d} is (α, α/(2σ²))-RDP for all α > 1. Similarly, we have ∥b_{j_i} − b′_{j_i}∥₂² = ∥ȳ_i ū_i∥₂² ≤ Γ_u² Γ_y², so releasing b_{j_i} + σΓ_u Γ_y N^d is (α, α/(2σ²))-RDP.

By RDP composition, we have that Algorithm 2 is (α, αT/σ²)-RDP (we release 2T noisy statistics), and Algorithm 3 is (α, α/σ²)-RDP (we release only 2 such statistics). Translation from RDP to DP is standard, see for example (Mironov, 2017, Proposition 3): if a process is (α, αβ)-RDP for all α, and β ≤ ε²/(8 log(1/δ)), then the process is (ε, δ)-DP. The result follows from setting β = T/σ² for Algorithm 2, and β = 1/σ² for Algorithm 3.

B.2 Proof of Theorem 2

Proof. Let D, D′ be two neighboring data sets that differ by a single user k. The user contributes to the sufficient statistics of all items {j_i : i ∈ Ω^k} (by definition of Ω^k). Define A to be the matrix [A_1 | . . . | A_m] ∈ R^{d×md}, and let A′ be the same matrix computed on D′ instead of D. Then

  ∥A − A′∥_F² = Σ_{i∈Ω^k} ∥w_i ū_i ū_i^⊤∥_F² ≤ Σ_{i∈Ω^k} w_i² Γ_u⁴ ≤ w̄² Γ_u⁴,

where we used the definition of w̄² = max_k Σ_{i∈Ω^k} w_i². Similarly, defining b = [b_1, . . . , b_m] ∈ R^{d×m}, we have that

  ∥b − b′∥_F² = Σ_{i∈Ω^k} ∥w_i ū_i y_i∥₂² ≤ Σ_{i∈Ω^k} w_i² Γ_u² Γ_y² ≤ w̄² Γ_u² Γ_y².

Therefore, releasing A + σΓ_u² N^{d×md} is (α, αw̄²/(2σ²))-RDP, and so is releasing b + σΓ_u Γ_y N^{d×m}. The rest of the proof proceeds as in the proof of Theorem 1.
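A quick numeric check of this calibration (our own sanity check, not from the paper): with Theorem 1's σ = √(8T log(1/δ))/ε, the composed RDP parameter β = T/σ² lands exactly at the ε²/(8 log(1/δ)) threshold required by the RDP-to-DP conversion:

import math

eps, delta, T = 1.0, 1e-5, 100
sigma = math.sqrt(8 * T * math.log(1 / delta)) / eps
beta = T / sigma**2  # composed RDP parameter of Algorithm 2
assert math.isclose(beta, eps**2 / (8 * math.log(1 / delta)))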
B.3 Proof of Theorem 3

Proof. We will use the following standard lemma, which can be found for example in (Bassily et al., 2014, Lemma 2.5).

Lemma 1. Let L be a convex function defined on a bounded domain Θ, and let θ_v* = argmin_{θ_v∈Θ} L(θ_v). Consider the SGD algorithm

  θ_v^{t+1} = Π_Θ(θ_v^t − η_t ĝ^t),

where the ĝ^t satisfy the following: E[ĝ^t] = ∇L(θ_v^t), and E∥ĝ^t∥₂² ≤ G². Let η_t = |Θ|/(G√t). Then for all t,

  E[L(θ_v^t)] − L(θ_v*) = O(|Θ| G log t / √t).

To apply the lemma, we will show that Ĝ^t (Line 10 in Algorithm 2) is an unbiased estimate of the gradient and bound its second moment. To make the notation concise, we will write v_j^t = v_{θ_v^t}(x_j^pub). Recall that the true gradient is

  ∇L(θ_v) = Σ_{j=1}^m x_j^pub (A_j v_j − b_j)^⊤ = Σ_{j=1}^m x_j^pub ρ_j^⊤,

where we defined ρ_j = A_j v_j − b_j. And by definition of the SSP1 algorithm, the gradient estimate is

  Ĝ^t = Σ_{j=1}^m x_j^pub [Â_j^t v_j^t − b̂_j^t]^⊤
      = Σ_{j=1}^m x_j^pub [(A_j + σΓ_u² N^{d×d}) v_j^t − (b_j + σΓ_yΓ_u N^d)]^⊤
      = ∇L(θ_v^t) + σ Σ_{j=1}^m x_j^pub [Γ_u² N^{d×d} v_j^t − Γ_yΓ_u N^d]^⊤,

where, importantly, the noise samples N^{d×d}, N^d are conditionally independent of the trajectory up to step t. Thus, it follows by independence that E[Ĝ^t] = ∇L(θ_v^t), and we can bound the second moment

  E[∥Ĝ^t∥₂²] = E[∥∇L(θ_v^t)∥₂²] + σ² Σ_{j=1}^m E[∥x_j^pub [Γ_u² N^{d×d} v_j^t − Γ_yΓ_u N^d]^⊤∥₂²]    (6)

(the cross-terms vanish by independence). It remains to bound the individual terms.

First, observe that ∥ρ_j∥₂ = ∥A_j v_j − b_j∥₂ ≤ ∥A_j∥₂ ∥v_j∥₂ + ∥b_j∥₂, where ∥A_j∥₂ = ∥Σ_{i∈Ω_j} u_i u_i^⊤∥₂ ≤ |Ω_j|Γ_u², ∥v_j∥₂ ≤ Γ_y/Γ_u (by Assumption 1), and ∥b_j∥₂ = ∥Σ_{i∈Ω_j} y_i u_i∥₂ ≤ |Ω_j|Γ_yΓ_u. Thus ∥ρ_j∥₂ ≤ 2|Ω_j|Γ_yΓ_u, and

  ∥∇L(θ_v^t)∥₂ ≤ Σ_{j=1}^m ∥x_j^pub ρ_j^⊤∥₂ ≤ Σ_{j=1}^m ∥x_j^pub∥₂ ∥ρ_j∥₂ ≤ Σ_{j=1}^m 2Γ_x |Ω_j| Γ_yΓ_u = 2DΓ_xΓ_yΓ_u,    (7)

where we used that Σ_j |Ω_j| = D. This gives a bound on the first term in (6). Turning to the second term in (6), we have that E[∥N^{d×d}∥₂] = O(√d) (induced norm) and E[∥N^d∥₂] ≤ √d (Euclidean norm), thus

  E[∥x_j^pub [Γ_u² N^{d×d} v_j^t − Γ_yΓ_u N^d]^⊤∥_F] ≤ 2Γ_xΓ_yΓ_u √d.    (8)

Combining bounds (7)-(8) into (6), we get E[∥Ĝ^t∥₂²] ≤ [2Γ_xΓ_yΓ_u]² (D² + σ²md). Next, taking σ² = ρ²T and applying Lemma 1 with G = 2Γ_xΓ_yΓ_u √(D² + mdTρ²), we get

  E[L(θ_v^T)] − L(θ_v*) = Õ(|Θ| 2Γ_xΓ_yΓ_u √(D²/T + mdρ²)).

Finally we choose T to equalize the two terms, T = D²/(mdρ²) (under this choice of T, G = 2√2 Γ_xΓ_yΓ_u D), to obtain the final desired bound O(2|Θ|Γ_xΓ_yΓ_u ρ √(2md)).

C Batched Algorithms and Complexity Analysis

In this section, we give additional details related to the batched variants of our algorithms, and discuss computational complexity. The mini-batched versions of SSP1 and SSP2 are given respectively in Algorithms 4 and 5, where we highlight differences compared to the full-batch case in blue. For mini-batch DP-SGD, notice that if several examples in a batch have the same feature vector x_j^pub, then the forward/backward passes v_j = v_{θ_v}(x_j^pub) and J_j = ∂v_{θ_v}(x_j^pub)/∂θ_v only have to be computed once. This is summarized in Algorithm 6.

Algorithm 4: Mini-batch SSP1
 1 Inputs: Public features X^pub, training data D = {(x_i^priv, y_i, j_i, k_i)}_{i∈{1,...,D}}, optional weights {w_i}, number of steps T, clipping parameters Γ_y, Γ_u, noise standard deviation σ, learning rates η_t, initial parameters θ_v^0, θ_u^0.
 2 Let ū_i = Clip(u_{θ_u}(x_i^priv), Γ_u), ȳ_i = Clip(y_i, Γ_y)
 3 for 0 ≤ t ≤ T − 1 do
 4   Uniformly sample B ⊂ [D]
 5   for 1 ≤ j ≤ m do
 6     Â_j^t = Σ_{i∈Ω_j∩B} w_i ū_i ū_i^⊤ + σΓ_u² N^{d×d}
 7     b̂_j^t = Σ_{i∈Ω_j∩B} w_i ȳ_i ū_i + σΓ_yΓ_u N^d
 8   Ĝ^t ← Σ_{j=1}^m (∂v_{θ_v^t}(x_j^pub)/∂θ_v) (Â_j^t v_{θ_v^t}(x_j^pub) − b̂_j^t)
 9   θ_v^{t+1} ← θ_v^t − η_t Ĝ^t
10 return θ_v^T

Algorithm 5: Mini-batch SSP2
 1 Inputs: Public features X^pub, training data D = {(x_i^priv, y_i, j_i, k_i)}_{i∈{1,...,D}}, optional weights {w_i}, number of steps T, clipping parameters Γ_y, Γ_u, noise standard deviation σ, learning rates η_t, initial parameters θ_v^0, θ_u^0.
 2 Let ū_i = Clip(u_{θ_u}(x_i^priv), Γ_u), ȳ_i = Clip(y_i, Γ_y)
 3 for 1 ≤ j ≤ m do
 4   Â_j ← Σ_{i∈Ω_j} w_i ū_i ū_i^⊤ + σΓ_u² N^{d×d}
 5   b̂_j ← Σ_{i∈Ω_j} w_i ȳ_i ū_i + σΓ_yΓ_u N^d
 6 for 0 ≤ t ≤ T − 1 do
 7   Uniformly sample B ⊂ [m]
 8   Ĝ^t ← Σ_{j∈B} (∂v_{θ_v^t}(x_j^pub)/∂θ_v) (Â_j v_{θ_v^t}(x_j^pub) − b̂_j)
 9   θ_v^{t+1} ← θ_v^t − η_t Ĝ^t
10 return θ_v^T

Algorithm 6: Mini-batch DP-SGD (adapted to multi-encoder models for improved efficiency)
 1 Inputs: Public features X^pub, training data D = {(x_i^priv, y_i, j_i, k_i)}_{i∈{1,...,D}}, optional weights {w_i}, number of steps T, clipping parameter Γ_g, noise standard deviation σ, learning rates η_t, initial parameters θ_v^0, θ_u^0.
 2 Let u_i = u_{θ_u^0}(x_i^priv)
 3 for 0 ≤ t ≤ T − 1 do
 4   Uniformly sample a batch B ⊂ [D]
 5   for j ∈ {j_i : i ∈ B} do
 6     v_j ← v_{θ_v}(x_j^pub)
 7     J_j ← ∂v_{θ_v}(x_j^pub)/∂θ_v
 8   Ĝ^t ← Σ_{i∈B} Clip(J_{j_i}(u_i^⊤ v_{j_i} − y_i) u_i, Γ_g) + σΓ_g N^{d_v}
 9   θ_v^{t+1} ← θ_v^t − η_t Ĝ^t
10 return θ_v^T
Proof of Proposition 2

We now turn to the complexity analysis of mini-batch SSP2 and DP-SGD.

Proof. Let c be the cost of computing one forward and one backward pass, i.e. the cost of computing v_j and J_j for one item j. We also recall the following quantities: β_j is the expected number of batches that contain item j in one epoch, and β = Σ_{j=1}^m β_j.

First, consider DP-SGD. For some item j, every time the item is visited, we compute one forward/backward pass for that item (Lines 6-7), for a cost of c Σ_{j=1}^m β_j = cβ per epoch; hence the cost of DP-SGD over e epochs is at least Ω(ceβ). Note that there is the additional cost of computing and clipping the gradients (Line 8), but for the purposes of our analysis, we only need a lower bound on the cost of DP-SGD (we will compare an upper bound on the cost of SSP to this lower bound on the cost of DP-SGD).

We now consider SSP2. First, there is the cost of computing the sufficient statistics (Lines 3-5). This can be done by iterating over all examples in D and accumulating the statistics. This costs O(D(d² + d)) (d² for accumulating the Â_j's, and d for accumulating the b̂_j's). Second, for each batch, we compute the gradient Ĝ^t (Line 8). This consists of the following operations for each j ∈ B: computing the forward/backward passes v_j, J_j (a cost of c), then computing the vector Â_j v_j − b̂_j (a cost of d²), and finally multiplying this vector by J_j (a cost of at most c, because the cost of computing the Jacobian is greater than the cost of multiplying by the Jacobian). The total cost for item j is therefore O(c + d²). Finally, the total cost over e epochs is O(Dd² + em(c + d²)) (the first term is the cost of computing statistics, and the second is the cost of computing gradients from the statistics). This completes the proof.

Comparison of Computational Complexity

We expand on the complexity discussion of Section 3.5. The ratio between the cost of SSP2 and the cost of DP-SGD (see Proposition 2) is bounded by

  O((Dd² + em(c + d²))/(ceβ)) = O((m/β)(1 + d²/c) + Dd²/(cem)).    (9)

This depends on several problem parameters:

• β/m: this represents the average number of visits per item: the larger this number is, the bigger the advantage of SSP.
• d²/c: this term depends on the encoder's architecture. c is the cost of one forward/backward pass, and d is the output dimension of the encoder.
• the last term Dd²/(cem) represents the relative overhead of computing statistics in SSP, and decreases with the number of epochs e (if the number of epochs is large enough, the overhead is amortized).

First, consider the term β/m. Assuming that examples are sampled independently and with replacement, we can get a precise estimate of β: let p_j = |Ω_j|/D be the probability of sampling item j. For a batch size B, the probability that item j appears (at least once) in the batch is q_j = 1 − (1 − p_j)^B. The number of batches in which item j appears is a Binomial distribution with probability q_j, and since there are D/B batches in one epoch, the expected number of batches containing j in one epoch is β_j = (D/B)(1 − (1 − p_j)^B).
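This estimate is straightforward to compute from item counts; a sketch with illustrative power-law counts (the numbers are ours, in the spirit of Figure 5):

import numpy as np

def expected_visits(counts, B):
    # β_j = (D/B) · (1 − (1 − p_j)^B), with p_j = |Ω_j| / D.
    D = counts.sum()
    p = counts / D
    return (D / B) * (1.0 - (1.0 - p) ** B)

m, alpha = 1000, 1.1
raw = np.arange(1, m + 1) ** -alpha
counts = 1e6 * raw / raw.sum()           # power-law item counts, D ≈ 10^6
beta = expected_visits(counts, B=1000).sum()
ratio = (m / beta) * (1 + 1.0)           # eq. (9) with e ≥ D/m and c = d² (so d²/c = 1)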
The quantity $\beta$ is thus entirely determined by the item frequencies $p_j$, the data size $D$, and the batch size $B$. We plot a few examples in Figure 5, where the item count distribution follows a power law, as is commonly encountered in practice in recommender systems and ads applications (Yin et al., 2012; Liu and Zheng, 2020). The plot shows that as the batch size increases, the total number of visits $\beta$ decreases. When $B = 1$, $\beta = D$, and when $B = D$ (full batch), $\beta/m \approx 1$ (if the batch covered all of $\mathcal{D}$, then $\beta = m$; but since we are sampling with replacement, some items may not be sampled at all, so $\beta$ can be slightly lower than $m$, as can be seen on the plot). In the example, we take $D = 10^6$ and $m = 10^3$. Notice that $\beta/m$ remains large (more than a hundred) for batch sizes up to a few thousand.

Once $\beta/m$ is estimated, we can easily compute the ratio (9). In Figure 6, we plot the ratio (9) as we vary the batch size and the number of epochs. We fix the data size $D = 10^6$ (changing this parameter changes the scale of the graph, but the trends are similar). As discussed above, the ratio $d^2/c$ depends on the model architecture; we consider two representative cases. The first is when the encoder contains at least one hidden layer of width larger than $d$, in which case we have $c \geq d^2$ (whenever the encoder has a hidden layer, its width is usually larger than the width of the last layer). The second case is when the encoder is a single linear layer, in which case $c = kd$, where $k$ is the number of non-zero features per item. In the simplest case (the most favorable to DP-SGD), there is a single active feature per item and $c = d$.

[Figure 6: Complexity ratio (cost of SSP divided by cost of DP-SGD) as we vary the batch size and number of epochs. In this example, $D = 10^6$, $m = 10^3$. Two representative cases are shown: $c = d^2$, corresponding to encoders with at least one hidden layer (left), and $c = d$, corresponding to id-only linear encoders (right). The solid line shows the case where both algorithms have comparable cost.]

In the case $c = d^2$, there is a large region in which SSP has a computational advantage, and the advantage increases as the number of epochs increases and as the batch size decreases. The advantage can easily be several orders of magnitude in the small-batch regime (batch sizes of 100-1000). The case $c = d$ is intuitively the least favorable to SSP, since the cost of computing the statistics ($d^2$) is much higher than the cost of the encoder ($c = d$), so it takes longer to amortize the cost of computing the statistics. In this case, we expect SSP to be more expensive, unless the number of epochs is very large.
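Combining the estimate of $\beta$ with (9) gives the kind of grid shown in Figure 6. Below is a minimal sketch under the same illustrative constants; the dimension $d = 64$ and the power-law exponent are hypothetical choices.

```python
# Sketch of the complexity ratio (9) on a grid of batch sizes and epochs,
# in the spirit of Figure 6. D and m follow the illustrative example in
# the text; d and alpha are hypothetical.
import numpy as np

D, m, d, alpha = 10**6, 10**3, 64, 0.3
p = np.arange(1, m + 1, dtype=float) ** -alpha
p /= p.sum()                                  # sampling probabilities p_j

def cost_ratio(B, e, c):
    beta = ((D / B) * (1.0 - (1.0 - p) ** B)).sum()
    return (m / beta) * (1.0 + d**2 / c + D * d**2 / (c * e * m))

for c, label in [(d**2, "c = d^2 (hidden layer)"), (d, "c = d (id-only)")]:
    print(label)
    for B in [100, 1000, 10000]:
        for e in [1, 10, 100]:
            print(f"  B={B:>6}, epochs={e:>4}: "
                  f"SSP/DP-SGD ~ {cost_ratio(B, e, c):.3f}")
```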
D SSP Algorithm with Non-Quadratic Losses

In this section, we discuss a generalization of our algorithms from the quadratic to the general convex loss case. The main idea is to compute and optimize successive quadratic approximations of the loss, much like in second-order methods. The main difference is that instead of computing the approximation in the $\theta_v$ space (which would be intractable in most practical settings), we compute it in the $v$ space, i.e., a quadratic approximation w.r.t. the encoder's output.

More precisely, consider the loss (2), reproduced below:

$$L(\theta_v) = \sum_{i=1}^D \ell\big(u_i \cdot v_{j_i}(\theta_v), y_i\big),$$

where $\ell : \mathbb{R}\times\mathbb{R} \to \mathbb{R}$ is convex (note however that $L$ is not, since the item encoder $v_{\theta_v}$ is typically non-linear). Writing $\ell_i(v) = \ell(u_i \cdot v, y_i)$, the loss is the composition $L(\theta_v) = \sum_{i=1}^D \ell_i(v_{j_i}(\theta_v))$.

By the chain rule, the gradient and Hessian of $\ell_i$ at $v^0$ are given respectively by $\nabla \ell_i(v^0) = \ell'(u_i\cdot v^0, y_i)\, u_i$ and $\nabla^2 \ell_i(v^0) = \ell''(u_i\cdot v^0, y_i)\, u_i u_i^\top$, where $\ell'$ and $\ell''$ denote the first and second derivatives of $\ell$ w.r.t. its first argument. To simplify notation, let us write $\ell_i(v^0) = \ell(u_i\cdot v^0, y_i)$, $d\ell_i(v^0) = \ell'(u_i\cdot v^0, y_i)$, and $d^2\ell_i(v^0) = \ell''(u_i\cdot v^0, y_i)$. A Taylor expansion of $\ell_i$ around $v^0$ then yields the quadratic approximation

$$\tilde\ell_i(v) = \ell_i(v^0) + d\ell_i(v^0)\, u_i\cdot(v - v^0) + \frac{1}{2}\, d^2\ell_i(v^0)\, (v - v^0)^\top u_i u_i^\top (v - v^0).$$

Denoting $v_j^0 = v_j(\theta_v^0)$, we obtain the following approximation of $L$ around $\theta_v^0$:

$$\tilde L(\theta_v) = \sum_{j=1}^m c_j + g_j^\top \big(v_j(\theta_v) - v_j^0\big) + \frac{1}{2}\big(v_j(\theta_v) - v_j^0\big)^\top H_j \big(v_j(\theta_v) - v_j^0\big), \tag{10}$$

where

$$c_j = \sum_{i\in\Omega_j} \ell_i(v_j^0), \qquad g_j = \sum_{i\in\Omega_j} d\ell_i(v_j^0)\, u_i, \qquad H_j = \sum_{i\in\Omega_j} d^2\ell_i(v_j^0)\, u_i u_i^\top. \tag{11}$$

Since this is a quadratic function of $v$, we obtain a similar gradient decomposition, which we state below.

Proposition 3. The gradient of the quadratic approximation (10) is given by

$$\nabla \tilde L(\theta_v) = \sum_{j=1}^m J_j(\theta_v)\big[H_j\big(v_j(\theta_v) - v_j^0\big) + g_j\big], \tag{12}$$

where we use the shorthands $J_j(\theta_v) = \frac{\partial v_{\theta_v}(x_j^{\text{pub}})}{\partial \theta_v}$ and $v_j(\theta_v) = v_{\theta_v}(x_j^{\text{pub}})$.

As in Proposition 1, the terms $J_j$ and $v_j$ only depend on public data and don't need protection. The only difference is in the sufficient statistics: instead of $A_j, b_j$, the decomposition now involves $H_j, g_j$. This motivates Algorithm 7. The main difference with Algorithm 2 is that $H_j, g_j$ depend on where we take the approximation (note the dependence on $v^0$ in (11)), so they need to be periodically recomputed. Algorithm 7 consists of multiple calls to SSP2 (Algorithm 3), each applied to the quadratic approximation around the current iterate $v^t$. Note that the main iterates are denoted by $\theta_v^t$, while the inner loop uses $\theta_v^{(\tau)}$.

The privacy guarantee is an immediate extension of Theorems 1-2: to guarantee $(\epsilon, \delta)$-DP, it suffices to take $\sigma = \frac{\sqrt{8T\log(1/\delta)}}{\epsilon}$ for example-level DP, and $\sigma = \frac{\bar w\sqrt{8T\log(1/\delta)}}{\epsilon}$ for user-level DP. Notice that both scale with $\sqrt{T}$ (since the sufficient statistics are recomputed at each step). Empirical evaluation of Algorithm 7 is outside the scope of this paper and is left as future work.

Algorithm 7: SSP-convex: Sufficient Statistics Perturbation for Convex Losses (with correlated noise)
1  Inputs: Public features $X^{\text{pub}}$, training data $\mathcal{D} = \{(x_i^{\text{priv}}, y_i, j_i, k_i)\}_{i\in\{1,\dots,D\}}$, optional weights $\{w_i\}$, number of steps $T$, number of inner iterations $\tau^{\max}$, clipping parameters $\Gamma_H, \Gamma_g$, noise standard deviation $\sigma$, learning rate $\eta$, initial parameters $\theta_v^0, \theta_u^0$.
2  Let $\bar u_i = \text{Clip}(u_{\theta_u}(x_i^{\text{priv}}), \Gamma_u)$, $\bar y_i = \text{Clip}(y_i, \Gamma_y)$
3  for $0 \leq t \leq T - 1$ do
4    for $1 \leq j \leq m$ do   /* Compute statistics for the quadratic approximation around $v^t$ */
5      $\hat H_j^t \leftarrow \sum_{i\in\Omega_j} w_i \,\text{Clip}\big(d^2\ell_i(v_j^t)\, \bar u_i \bar u_i^\top, \Gamma_H\big) + \sigma\Gamma_H N^{d\times d}$
6      $\hat g_j^t \leftarrow \sum_{i\in\Omega_j} w_i \,\text{Clip}\big(d\ell_i(v_j^t)\, \bar u_i, \Gamma_g\big) + \sigma\Gamma_g N^d$
7    $\theta_v^{(0)} \leftarrow \theta_v^t$
8    for $0 \leq \tau \leq \tau^{\max} - 1$ do   /* Optimize the quadratic approximation */
9      $\hat G^{(\tau)} \leftarrow \sum_{j=1}^m J_j(\theta_v^{(\tau)})\big[\hat H_j^t \big(v_j(\theta_v^{(\tau)}) - v_j(\theta_v^t)\big) + \hat g_j^t\big]$
10     $\theta_v^{(\tau+1)} \leftarrow \theta_v^{(\tau)} - \eta\, \hat G^{(\tau)}$
11   $\theta_v^{t+1} \leftarrow \theta_v^{(\tau^{\max})}$
12 return $\theta_v^T$
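As an illustration of the statistics (11) and the clipping in Algorithm 7, the following sketch instantiates $d\ell_i$ and $d^2\ell_i$ for the logistic loss $\ell(s, y) = \log(1 + e^{-ys})$ with labels $y \in \{-1, +1\}$; the choice of loss is made purely for illustration, and all data and constants are synthetic.

```python
# Sketch of the per-item statistics (11) for the logistic loss; dl and d2l
# are its first/second derivatives in the score s = u_i . v_j.
import numpy as np

rng = np.random.default_rng(0)
D, m, d = 500, 20, 8
u = rng.normal(size=(D, d))            # private encoder outputs u_i
y = rng.choice([-1.0, 1.0], size=D)    # binary labels
j = rng.integers(0, m, size=D)         # item id j_i
v = rng.normal(size=(m, d))            # current item embeddings v_j^t
w = np.ones(D)
Gamma_H, Gamma_g, sigma = 1.0, 1.0, 0.5

def l2_clip(x, c):
    n = np.linalg.norm(x)
    return x if n <= c else x * (c / n)

H_hat = np.zeros((m, d, d))
g_hat = np.zeros((m, d))
for i in range(D):
    s = u[i] @ v[j[i]]
    sig = 1.0 / (1.0 + np.exp(y[i] * s))   # sigmoid(-y*s)
    dl = -y[i] * sig                       # l'(s, y)
    d2l = sig * (1.0 - sig)                # l''(s, y) >= 0, since l is convex
    H_hat[j[i]] += w[i] * l2_clip(d2l * np.outer(u[i], u[i]), Gamma_H)
    g_hat[j[i]] += w[i] * l2_clip(dl * u[i], Gamma_g)
H_hat += sigma * Gamma_H * rng.normal(size=(m, d, d))
g_hat += sigma * Gamma_g * rng.normal(size=(m, d))
# H_hat, g_hat then play the roles of H_j, g_j in the inner-loop gradient (12).
```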
E Additional Experiments

In this section, we provide additional details regarding the exact experimental setup, along with additional results.

E.1 Experimental Setup

Data Sets and Public Features We use two benchmarks based on MovieLens data⁴. The first is a regression task proposed by Lee et al. (2013), and the second is a classification task proposed by Liang et al. (2018). The MovieLens data include movie features consisting of the release year and 19 movie genres. Curmei et al. (2023) expanded the movie features by extracting additional metadata from the Wikidata website; the additional features consist of a finer set of 310 genres, as well as 58,139 persons covering 55 roles. While the release year is a univalent feature, all other features (genres, persons, and roles) are multivalent. Furthermore, roles and persons are paired: one person may have multiple roles in a movie, and a role, such as "actor", may be associated with multiple persons. In our experiments, we simply treat each feature category as a bag of words (where we count repetitions, so if a feature value such as "actor" is repeated, this is treated as a weight on that feature value). An alternative would be to use a feature cross between person and role, so that each unique pair defines a feature value.

Statistics about the two data sets are summarized in Table 2. Following previous work (Jain et al., 2018; Chien et al., 2021; Krichene et al., 2023), ML10M uses the entire data, while in ML20M, training is restricted to the top 10% most frequent movies (but evaluation is always done on the entire set of movies).

data set   n (users)   m (items)   p (features)
ML10M      69,878      10,677      58,619
ML20M      136,677     2,010       25,805

Table 2: Statistics of the MovieLens data sets

The histograms of the number of features active for each item are reported in Figure 7. The difference between the two distributions is explained by the fact that ML20M is restricted to the most frequent movies, for which metadata tends to be more comprehensive.

[Figure 7: Density of features across movies: number of items (log scale) against the number of active features per item, for ML10M and ML20M.]

Model Architecture and Training The public encoder consists of a two-layer model. The first layer is an embedding layer where each of the 5 feature categories (year, original genre, extended genre, person, role) is mapped to an embedding vector⁵ in $\mathbb{R}^d$ (multivalent features are treated as a bag of words and their embeddings are averaged). These 5 embeddings are concatenated to form the hidden layer in $\mathbb{R}^{5d}$, which is followed by a dense layer⁶ with output dimension $d$. In our experiments, the dimension $d$ is treated as a hyper-parameter, and we consider dimensions up to $d = 64$, following the DP-ALS and DP-CMF baselines we compare to. We observe that higher quality can be achieved on ML20M with a higher dimension (but to make a fair comparison to the baselines, we restrict to $d \leq 64$). As for the user encoder, we follow the same setup as (Jain et al., 2018; Chien et al., 2021; Krichene et al., 2023; Curmei et al., 2023).

⁴ The license information can be found on the GroupLens website at the following URLs: https://files.grouplens.org/datasets/movielens/ml-10m-README.html and https://files.grouplens.org/datasets/movielens/ml-20m-README.html
⁵ For simplicity, we use the same embedding dimension $d$ for all features, and we also use $d$ as the output dimension of the second layer.
⁶ Note that for this architecture, the cost $c$ in our complexity analysis would be approximately $5d^2$, since the dense layer is a matrix in $\mathbb{R}^{5d\times d}$.
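For concreteness, here is a sketch of a forward pass through such an encoder in NumPy. The vocabulary sizes echo the feature statistics above, but the weights, the input format (lists of (id, weight) pairs per category), the year vocabulary size, and the example item are all hypothetical.

```python
# Sketch of the public item encoder: per-category embeddings averaged as a
# bag of words, concatenated into R^{5d}, then a dense layer to R^d.
import numpy as np

rng = np.random.default_rng(0)
d = 64
vocab = {"year": 100, "genre": 19, "ext_genre": 310,
         "person": 58139, "role": 55}
emb = {name: 0.01 * rng.normal(size=(V, d)) for name, V in vocab.items()}
W = 0.01 * rng.normal(size=(5 * d, d))   # dense layer: cost ~ 5 d^2
b = np.zeros(d)

def encode_item(features):
    """features: dict mapping category -> list of (feature id, weight)."""
    parts = []
    for name in vocab:
        pairs = features.get(name, [])
        if pairs:
            ids = [fid for fid, _ in pairs]
            ws = np.array([wt for _, wt in pairs], dtype=float)
            parts.append((ws[:, None] * emb[name][ids]).sum(0) / ws.sum())
        else:
            parts.append(np.zeros(d))            # missing category
    return np.concatenate(parts) @ W + b         # v_theta(x^pub) in R^d

# Hypothetical item: one year, two genres, one person appearing in two roles
# (hence weight 2.0 on that person, per the bag-of-words convention above).
v = encode_item({"year": [(42, 1.0)],
                 "genre": [(3, 1.0), (7, 1.0)],
                 "person": [(123, 2.0)],
                 "role": [(4, 1.0), (9, 1.0)]})
```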
The user encoder is an embedding lookup, i.e., each user $k$ is mapped to a unique vector $u_k$ (there are no user features other than the id). The model's output is therefore $u_{k_i} \cdot v_{\theta_v}(x_{j_i}^{\text{pub}})$ ($k_i$ is the user id, $j_i$ is the movie id). Each model is trained on an NVIDIA Tesla P100 GPU, by minimizing the regularized quadratic loss

$$L(\theta_v) = \frac{1}{2}\sum_{i=1}^D \big(u_i \cdot v_{j_i}(\theta_v) - y_i\big)^2 + \lambda_u \|u_i\|_2^2 + \lambda_v \|v_{j_i}(\theta_v)\|_2^2.$$

We use the same UserUpdate for all methods (see footnote 2). During the UserUpdate, since we learn an embedding per user (with no shared parameters across users), the problem reduces to $n$ decoupled least-squares problems (one per user):

$$u_k^* = \operatorname{argmin}_u \frac{1}{2}\sum_{i\in\Omega_k} \big(u \cdot v_{j_i}(\theta_v) - y_i\big)^2 + \lambda_u \|u\|_2^2,$$

for which we use the closed-form solution $u_k^* = \big(\sum_{i\in\Omega_k} v_{j_i} v_{j_i}^\top + \lambda_u I\big)^{-1}\big(\sum_{i\in\Omega_k} y_i v_{j_i}\big)$.

Evaluation We follow the protocol of the original benchmarks (Lee et al., 2013; Liang et al., 2018), which we summarize below. Each data set is split into training/validation/test. In ML10M, the set of ratings is split at random, and utility is measured using the RMSE, defined as

$$\text{RMSE} = \sqrt{\frac{\sum_{i\in\mathcal{D}_{\text{test}}} (u_i \cdot v_{j_i} - y_i)^2}{|\mathcal{D}_{\text{test}}|}},$$

where $\mathcal{D}_{\text{test}}$ is the set of test ratings. In ML20M, the validation and test splits consist of held-out users. Since validation/test users are not present in the training data, at evaluation time one first needs to compute an embedding $u_k$ for those users in order to generate predictions. For this purpose, the benchmark also splits the examples of each validation/test user $k$ into an 80-20 split $\Omega_k^{\text{history}} \sqcup \Omega_k^{\text{target}}$, where the first part is used to compute the user embedding (by running UserUpdate on $\Omega_k^{\text{history}}$), and the second is used as the target labels. Utility is measured using Recall@20, defined as follows. A prediction score is computed for all movies except those in $\Omega_k^{\text{history}}$ (to avoid penalizing a model that recommends items from the user's history). Let $\hat\Omega_k$ be the set of top-20 predictions; then

$$\text{Recall@20} = \frac{1}{n}\sum_{k=1}^n \frac{|\Omega_k^{\text{target}} \cap \hat\Omega_k|}{\min(20, |\Omega_k^{\text{target}}|)}.$$

Hyper-parameters are tuned on the validation data and metrics are reported on the test data. Plots in Figures 1, 2, and 4 report the mean metric value across ten runs, and the standard deviation of the metric is depicted as a shaded region (small standard deviations are barely visible in some instances).

Privacy Budget Allocation All algorithms (including the DP-ALS and DP-CMF baselines) use the budget allocation method of Krichene et al. (2023). The method consists of computing example weights $w_i$ (these are the input weights $w_i$ in Algorithms 2-3) that are designed to control the sensitivity of each example. By allocating a higher weight (hence sensitivity) to tail items, it was found that overall utility can improve. The weights are computed following the method of (Krichene et al., 2023, equations (11)-(12)), which we summarize here. First, compute a DP estimate $\hat n_j$ of the item counts, then define the weights

$$w_i = \bar w\, \frac{\hat n_{j_i}^{-1/4}}{\sqrt{\sum_{i'\in\Omega_{k_i}} \hat n_{j_{i'}}^{-1/2}}},$$

where $\bar w$ is a parameter that controls the privacy guarantee. By definition, we have $\sum_{i\in\Omega_k} w_i^2 = \bar w^2$, so this parameter corresponds to the $\bar w^2$ in Theorem 2.
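The closed-form UserUpdate above is a standard ridge-regression solve. A minimal sketch, with hypothetical inputs:

```python
# Sketch of the closed-form UserUpdate: one regularized least-squares
# solve per user, from the item embeddings v_{j_i} of the user's examples.
import numpy as np

def user_update(V_user, y_user, lam_u):
    """V_user: (n_k, d) matrix with rows v_{j_i}; y_user: (n_k,) labels."""
    d = V_user.shape[1]
    A = V_user.T @ V_user + lam_u * np.eye(d)   # sum_i v v^T + lambda_u I
    b = V_user.T @ y_user                       # sum_i y_i v
    return np.linalg.solve(A, b)                # u_k*

# Hypothetical usage: a user with 3 rated items in dimension d = 4.
rng = np.random.default_rng(0)
u_k = user_update(rng.normal(size=(3, 4)), rng.normal(size=3), lam_u=0.1)
```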
Finally, privacy accounting is done via RDP composition (Mironov, 2017), where we also account for the estimation of the item counts described above. The accounting is identical across all the algorithms we evaluated.

The DP-SGD baseline In addition to the SOTA baselines (DP-ALS and DP-CMF), we also compare to plain DP-SGD, where all model parameters are optimized jointly (without alternating minimization). This can be considered a weaker baseline, since it is not adapted to the problem's structure. The DP-SGD result is reported in Figure 1. It achieves a lower utility than all other methods; for example, its RMSE at $\epsilon = 20$ is worse than the RMSE of competing methods at $\epsilon = 5$. Similarly on ML20M, the utility is much worse than that of the other methods: we measured a Recall@20 of 0.206 at $\epsilon = 20$, while all other methods achieve a better Recall@20 even at $\epsilon = 1$.

E.2 Computational Cost on ML20M

Figure 8 compares training a recall model with AM-DPSGD and AM-SSP on the ML20M data set. The overall observation is similar to the one drawn from Figure 4; in this case, there is approximately a three-orders-of-magnitude difference in training time between the two methods. It is worth mentioning that we use an efficient implementation of DP-SGD (using the TensorFlow Privacy library⁷, which implements fast per-example gradient clipping (Goodfellow, 2015)). We acknowledge that run times depend on the implementation, though we believe a 2-3 orders of magnitude improvement to be significant, and the improvement should persist despite variations due to implementation.

[Figure 8: Comparison of AM-DPSGD and AM-SSP on ML20M: test Recall@20 against training time (s), at varying privacy budgets $\epsilon$.]

⁷ www.tensorflow.org/responsible_ai/privacy

E.3 Quality on the Long Tail

Prior work (Chien et al., 2021; Krichene et al., 2023; Curmei et al., 2023) reported that tail movies, i.e., those with fewer training examples, are more susceptible to the noise added to guarantee DP, and this manifests in larger quality losses on the long tail. To assess the impact of our methods on tail quality, we report in Figure 9 utility metrics sliced by movie count, for different values of $\epsilon$.

[Figure 9: Sliced quality metrics on (a) ML10M (RMSE) and (b) ML20M (Recall@20), for DP-ALS (id only), AM-SSP, and AM-DPSGD at $\epsilon \in \{1, 5, 20\}$. The movies are sorted by increasing count and partitioned into five buckets of equal size, so bucket 0 corresponds to the rarest 20% of the movies, while bucket 4 corresponds to the most frequent 20%.]

On ML10M, we observe that the AM methods perform much better on the tail than DP-ALS (they also improve on the top slice, but the improvement is more significant on the tail). Recall that both AM methods train a feature-based encoder, while DP-ALS trains an id-based encoder. This improvement on the tail perhaps indicates that using (public) features helps learn better representations of the long tail. Comparing AM-DPSGD and AM-SSP, it appears that AM-SSP generally performs better on the head slices and worse on the tail slices. On ML20M, AM-DPSGD performs worse on all slices (consistent with Figure 2), while comparing AM-SSP to DP-ALS, we do observe more improvements on the tail slices.
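The slicing in Figure 9 amounts to bucketing movies by training count. A small sketch with synthetic counts and a synthetic per-movie metric:

```python
# Sketch of the count-bucket slicing of Figure 9: sort movies by training
# count, split into five equal-size buckets, and average a per-movie metric
# in each bucket. Counts and metric values here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
m = 10_000
counts = rng.zipf(2.0, size=m)       # synthetic long-tailed item counts
metric = rng.uniform(size=m)         # synthetic per-movie quality metric

order = np.argsort(counts)           # bucket 0 = rarest 20% of movies
buckets = np.array_split(order, 5)
for k, idx in enumerate(buckets):
    print(f"bucket {k}: mean metric = {metric[idx].mean():.3f}")
```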
F AM-SSP in Federated Learning

Alternating minimization has also been studied in the Federated Learning (FL) literature (Singhal et al., 2021; Collins et al., 2021), where the goal is to perform distributed training while keeping each user's data on the user's device. The class of multi-encoder models is particularly well-suited to the federated setting, as the model reflects the natural separation between the user's data (to be protected) and the public features. In this section, we discuss implications of our algorithms for the FL setting.

Figure 10 is an illustration of a federated implementation of the SSP2 algorithm. The system consists of the following components:

• A server that holds the public feature matrix $X^{\text{pub}}$. The server is assumed untrusted (so any data or model parameters sent to the server need to be DP protected).
• $n$ client devices (one per user). Client $k$ holds the training data of user $k$, consisting of the private feature vector $x_k^{\text{priv}}$, together with the items and labels of the user, $\{(j_i, y_i)\}_{i\in\Omega_k}$.
• A secure distributor, which simply broadcasts model parameters to the $n$ clients.
• A secure aggregator, responsible for computing and protecting the sufficient statistics. Techniques for secure aggregation have been studied, for example, by Bonawitz et al. (2017).

The training of the public encoder is described in Algorithm 8. Note that the steps are identical to Algorithm 5, except that we additionally specify on which component each operation is performed.

Algorithm 8: Federated SSP2
1  Distributor
2    Send: $\theta_u$ to the clients.
3  Client $k$
4    Input: $x_k^{\text{priv}}$, $\mathcal{D}^k = \{(y_i, j_i)\}_{i\in\Omega_k}$
5    $u_k \leftarrow u_{\theta_u}(x_k^{\text{priv}})$
6    Send: $u_k$, $\mathcal{D}^k$ to the aggregator.
7  Aggregator
8    Input: Clipping parameters $\Gamma_y, \Gamma_u$, noise standard deviation $\sigma$, optional weights $\{w_i\}$
9    Let $\bar u_i = \text{Clip}(u_{\theta_u}(x_i^{\text{priv}}), \Gamma_u)$, $\bar y_i = \text{Clip}(y_i, \Gamma_y)$
10   for $1 \leq j \leq m$ do
11     $\hat A_j \leftarrow \sum_{i\in\Omega_j} w_i \bar u_i \bar u_i^\top + \sigma\Gamma_u^2 N^{d\times d}$
12     $\hat b_j \leftarrow \sum_{i\in\Omega_j} w_i \bar y_i \bar u_i + \sigma\Gamma_y\Gamma_u N^d$
13   Send: $\{\hat A_j, \hat b_j\}_{1\leq j\leq m}$ to the server.
14 Server
15   Input: $X^{\text{pub}}$, number of steps $T_I$, learning rate $\eta_t$, statistics $\{\hat A_j, \hat b_j\}_{1\leq j\leq m}$
16   for $0 \leq t \leq T_I - 1$ do
17     Uniformly sample $B \subset [m]$
18     $\hat G^t \leftarrow \sum_{j\in B} \frac{\partial v_{\theta_v^t}(x_j^{\text{pub}})}{\partial \theta_v} \big(\hat A_j v_{\theta_v^t}(x_j^{\text{pub}}) - \hat b_j\big)$
19     $\theta_v^{t+1} \leftarrow \theta_v^t - \eta_t \hat G^t$
20   return $\theta_v^{T_I}$

[Figure 10: Illustration of Federated SSP2.]

Advantages of SSP in the Federated Setting Besides the advantages that we discussed in the general case (preserving gradient sparsity, noise that does not scale with the number of steps, etc.), SSP2 has additional advantages in the federated setting compared to federated DP-SGD.

First, since we only need to add noise to the sufficient statistics once (see Remark 5), the clients only need to be involved in one communication round: they send data to the aggregator once, then the server takes several gradient steps. This is in contrast to federated DP-SGD, where a new communication round is initiated after each update to the encoder's parameters (the new parameters need to be broadcast to the clients so that gradients can be computed on the clients and aggregated). Reducing the number of synchronization barriers (waiting for clients to become available) may have a significant impact on training time in practice.
Second, the communication cost is drastically reduced. Each client only needs to send to the aggregator the output of the private encoder ($u_k$) and its items and labels $\{(j_i, y_i)\}_{i\in\Omega_k}$. Contrast this with federated DP-SGD, where clients need to send model gradients (which can be significantly larger), and do so at each round. In SSP2, the communication cost scales neither with the number of rounds nor with the public encoder's intermediate sizes, such as the feature vocabulary or hidden layers (as long as its output dimension is fixed).

Finally, gradient computation can happen on the untrusted server, since the dependence on sensitive data is isolated in the protected sufficient statistics. This makes it possible to (i) reduce the computational burden on client devices (each client only needs to compute a single forward pass), and (ii) avoid having to send the public item features $X^{\text{pub}}$ to the client devices. This further reduces the communication cost.
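As a back-of-envelope illustration of this communication argument, the following sketch compares the number of floats uploaded by one client under the two protocols. Every constant in it (encoder output dimension, public-encoder parameter count, examples per client, number of rounds) is hypothetical.

```python
# Back-of-envelope comparison of per-client upload sizes (in floats).
# All constants below are hypothetical and for illustration only.
d = 64                                  # private encoder output dimension
n_k = 200                               # examples held by client k
theta_v_size = 58_139 * d + 5 * d * d   # rough public-encoder size
rounds = 1_000                          # federated DP-SGD rounds

ssp2_upload = d + 2 * n_k               # u_k plus (item id, label) pairs, once
dpsgd_upload = rounds * theta_v_size    # one gradient per communication round
print(f"Federated SSP2  : {ssp2_upload:,} floats, in a single round")
print(f"Federated DP-SGD: {dpsgd_upload:,} floats over {rounds:,} rounds")
```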