Fast Federated Machine Unlearning with Nonlinear Functional Theory

Tianshi Che 1  Yang Zhou 1  Zijie Zhang 1  Lingjuan Lyu 2  Ji Liu 3  Da Yan 4  Dejing Dou 5  Jun Huan 6

1 Auburn University, USA  2 Sony AI, Japan  3 Baidu Research, China  4 University of Alabama at Birmingham, USA  5 Boston Consulting Group, USA  6 AWS AI Labs, USA (work done before joining AWS). Correspondence to: Yang Zhou.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

Federated machine unlearning (FMU) aims to remove the influence of a specified subset of training data upon request from a trained federated learning model. Despite achieving remarkable performance, existing FMU techniques suffer from inefficiency due to two sequential operations of training and retraining/unlearning on large-scale datasets. Our prior study, PCMU, was proposed to improve the efficiency of centralized machine unlearning (CMU) with certified guarantees, by simultaneously executing the training and unlearning operations. This paper proposes a fast FMU algorithm, FFMU, for improving the FMU efficiency while maintaining the unlearning quality. The PCMU method is leveraged to train a local machine unlearning (MU) model on each edge device. We propose to employ nonlinear functional analysis techniques to reformulate the local MU models as output functions of a Nemytskii operator. We conduct theoretical analysis to derive that the Nemytskii operator has a global Lipschitz constant, which allows us to bound the difference between two MU models regarding the distance between their gradients. Based on the Nemytskii operator and average smooth local gradients, the global MU model on the server is guaranteed to achieve close performance to each local MU model with the certified guarantees.

1. Introduction

Machine unlearning (MU) aims to give data holders the right to remove the influence of a certain subset of data from a trained machine learning (ML) model, while maintaining the accuracy of the ML model on the remaining data (Cao & Yang, 2015; Golatkar et al., 2020a; Shibata et al., 2021; Ginart et al., 2019; Guo et al., 2020; Garg et al., 2020; Gupta et al., 2021; Wu et al., 2022b; Nguyen et al., 2022). Existing research efforts on machine unlearning can be divided into two groups: (1) centralized machine unlearning (CMU), where all holders' training data are collected by the server to unlearn centralized ML models (Golatkar et al., 2020a;b; Guo et al., 2020; Wu et al., 2020b; Nguyen et al., 2020; Izzo et al., 2021; Neel et al., 2021; Khan & Swaroop, 2021; Bourtoule et al., 2021; Ullah et al., 2021; Chen et al., 2021; 2022; Fu et al., 2022; Lu et al., 2022; Setlur et al., 2022; Suriyakumar & Wilson, 2022; Chundawat et al., 2022; Zeng et al., 2022; Liu et al., 2022b; Chourasia et al., 2023; Warnecke et al., 2023; Jagielski et al., 2023; Pawelczyk et al., 2023; Jang et al., 2023), and (2) federated machine unlearning (FMU), which forgets the target data from federated learning (FL) models when full access to all training data is unavailable (Liu et al., 2020; 2021a; Gong et al., 2021a; Liu et al., 2021b; Gong et al., 2021b; Yuan et al., 2022; Pan et al., 2022; Fraboni et al., 2022; Wang et al., 2022a; Liu et al., 2022c; Wu et al., 2022c;a; Gao et al., 2022; Halimi et al., 2022; Cao et al., 2023).

To the best of our knowledge, a common property of the above methods in either CMU or FMU settings is that they sequentially perform two expensive operations: training an ML model on the whole dataset and producing an unlearning model, by either retraining a new ML model on the remaining data or directly unlearning the original ML model. This strategy of sequential execution is computationally expensive when training complex models over large datasets. The efficiency issue becomes much worse in the FMU setting, since edge devices in the FMU often have limited computational resources (Dhar et al., 2021), such as smartphones, image sensors, Internet-of-Things devices, and wearable devices. The combination of high-dimensional models and constrained edge devices drastically limits the applicability of FMU models in the real world.

Our prior study, PCMU (Zhang et al., 2022b), presented a prompt certified MU approach based on randomized gradient smoothing and quantization. It simultaneously executes the training and unlearning operations to improve the CMU efficiency. The PCMU method leverages randomized smoothing (RS) for certified robustness (CR) on classification (Cohen et al., 2019) to perform RS for CMU on gradient quantization. The authors treat data removals in the CMU as perturbations on the whole dataset. Analogously, they consider adversarial attacks in the CR as perturbations on the data samples. In addition, they analogize output quantized gradients in the CMU to output discrete class labels in the CR. Since the output class labels in the CR with RS remain unchanged against adversarial attacks within a certified radius, the output quantized gradients in the CMU with RS remain unchanged against data removals within a certified budget. This implies that the CMU model with RS directly trained on the whole dataset shares the same gradients (and parameters) with the one retrained on only the remaining data. The authors also derive the certified radius regarding the gradient change before and after data removals and the certified budget of data removals (i.e., the maximally allowed amount of escaped data samples).

It is challenging to directly extend the idea of PCMU to the FMU setting. Due to FL's privacy requirement, the server's access to local training data on the edge devices is unavailable, so it is impossible to utilize the PCMU method to train a CMU model on the server.
On the other hand, if local MU models are trained on the edge devices and standard aggregation methods (e.g., FedAvg (McMahan et al., 2017)) are used to generate a global MU model on the server, then it is difficult to guarantee the certified radius and budget of data removals of the global MU model.

This work aims to extend the PCMU technique to the FMU setting for improving the FMU efficiency while maintaining the unlearning quality, by leveraging the theory of nonlinear functional analysis, including the Nemytskii operator and Fréchet differentiable smooth manifolds. First, given a specific ML task (e.g., image classification), a local ML model f(x; Ḡ_k) (x ∈ R^n) is trained on each edge device k with only its local data, where x is a data sample and Ḡ_k ∈ R^T is the gradient of the local ML model. The PCMU method (Zhang et al., 2022b) is leveraged to transform the original gradient Ḡ_k into its smooth and quantized version Q(Ḡ_k + ε) for the purpose of CMU, where Q is a gradient quantizer that maps the continuous gradients onto a discrete three-class space {−1, 0, 1} and ε ∼ N(0, σ²I) is isotropic Gaussian noise (see Eqs. (20)-(21) in the PCMU paper). Thus, a local MU model g(x; Q(Ḡ_k)) = E_ε[f(x; Q(Ḡ_k + ε))], ε ∼ N(0, σ²I), is generated on device k. For ease of presentation, we use the symbol q_k to replace Q(Ḡ_k).

Second, by leveraging the theory of nonlinear functional analysis, the local MU models g(x; q_k) are reformulated as output functions of a Nemytskii operator O(q_k)(x), which maps the gradient space to a function space. The theoretical analysis demonstrates that the Nemytskii operator O(q_k)(·) is smooth and induces a Fréchet differentiable smooth manifold O(R³). Further analysis shows that the smooth manifold has a global Lipschitz constant C/(√(2π)σ), i.e., ‖O(q_k)(x) − O(q_l)(x)‖ ≤ ‖q_k − q_l‖₂/(√(2π)σ) ≤ C‖Ḡ_k − Ḡ_l‖₂/(√(2π)σ) for any Ḡ_k, Ḡ_l ∈ R^T, where C is the Lipschitz constant of Q. This global Lipschitz property of O(q_k) allows us to bound the difference between two MU models g(·; q_k) and g(·; q_l) regarding the distance ‖Ḡ_k − Ḡ_l‖₂ between their gradients.

Last but not least, a global gradient Ḡ is aggregated on the server by averaging the gradients of all the local ML models, i.e., Ḡ = (Ḡ_1 + ··· + Ḡ_K)/K, where K is the number of edge devices in the FMU. Thus, the global MU model O(q)(·) on the server is parameterized with a smooth gradient Q(Ḡ + ε), where q = Q(Ḡ). Based on the global Lipschitz property of O(q), we theoretically derive that the global MU model O(q)(x) on the server has a certified guarantee of (K−1)Cd/(√(2π)Kσ), where d = max_{1≤k,l≤K} ‖Ḡ_k − Ḡ_l‖. Namely, for k = 1, ···, K, it holds that |O(q)(x) − O(q_k)(x)| ≤ (K−1)Cd/(√(2π)Kσ) for any input sample x. This certified guarantee of O(q)(x) ensures that the global MU model O(q)(x) is close to each local MU model O(q_k)(x) on the edge devices within distance (K−1)Cd/(√(2π)Kσ), which implies that O(q)(x) with bounded errors can maintain the certified radius and budget of data removals of the local MU models to a certain degree.

Our FMU method based on the theory of nonlinear functional analysis exhibits three compelling strengths against the existing FMU techniques: (1) it inherits the superior efficiency of the PCMU method by simultaneously executing the training and unlearning operations; (2) the global MU model on the server is guaranteed to achieve close performance to each local MU model with the certified CMU guarantees; and (3) the MU training is conducted only on the edge devices, which satisfies FL's privacy requirement. Empirical evaluation on real datasets demonstrates the superior performance of our FMU model against several state-of-the-art techniques.

2. Preliminaries

2.1. Federated Machine Unlearning

First, given an ML task (e.g., image classification), K edge devices with their local training data D = {D_1, ···, D_K}, and a server, federated learning (FL) aims to learn a global ML model on the server by optimizing the problem below:

min_{W ∈ R^d} L(W) = Σ_{k=1}^{K} (N_k/N) L_k(W),  where  L_k(W) = (1/N_k) Σ_{{x_i, y_i} ∈ D_k} l_i(W),    (1)

where l_i(W) = l(x_i, y_i; W) denotes the loss function of the prediction on data example {x_i, y_i} ∈ D_k made with the global model parameter W, N_k = |D_k| denotes the size of local dataset D_k, and N is the size of the total training data D, i.e., N = N_1 + ··· + N_K. In the FL, the global model parameter W is iteratively updated with the aggregation of all local model parameters W_1, ···, W_K on the K devices in each round, i.e., W = Σ_{k=1}^{K} (N_k/N) W_k.
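For illustration, the weighted objective in Eq. (1) and the parameter aggregation above can be realized by a FedAvg-style weighted average of local parameters. The following minimal NumPy sketch (function and variable names are ours, not part of the paper) shows the computation.

```python
import numpy as np

def fedavg_aggregate(local_params, local_sizes):
    """Weighted average of local parameter vectors W_k with weights N_k / N,
    mirroring W = sum_k (N_k / N) W_k in Section 2.1 (illustrative sketch)."""
    local_params = [np.asarray(W_k, dtype=float) for W_k in local_params]
    sizes = np.asarray(local_sizes, dtype=float)
    weights = sizes / sizes.sum()                     # N_k / N
    return sum(a * W_k for a, W_k in zip(weights, local_params))

# Toy usage: three devices holding different amounts of local data.
W1, W2, W3 = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])
W = fedavg_aggregate([W1, W2, W3], local_sizes=[100, 300, 600])
print(W)  # closer to W3, which holds the most data
```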
Second, the devices submit data removal requests at a certain time. The complete training data D is partitioned into two subsets: D^f ⊆ D denoting the data which we wish the ML model to forget, and D^r ⊆ D specifying the data which we want the model to remember (D = D^f ∪ D^r). The goal of federated machine unlearning (FMU) is to unlearn the forgotten data D^f, i.e., eliminate the influence of D^f from W. A straightforward solution is to use the remembered data D^r as the training data to retrain new local models on the edge devices with the data removal requests from scratch and to produce a new global model W^r on the server. However, this naive method is often time-consuming over large-scale datasets. An efficient FMU algorithm is to directly generate a sanitized model W̃^r based on the deployed model W, D, and D^f to approximate W^r, i.e., W̃^r ≈ W^r.

2.2. Randomized Gradient Smoothing and Quantization for Centralized Machine Unlearning

Let Ḡ = (1/N) Σ_{{x_i, y_i} ∈ D} G(x_i, y_i) be the gradient average over all data samples, where D is the training data, N is the size of the training data D, i.e., N = |D|, G(x_i, y_i) ∈ R^T is the gradient of an ML model on a data sample {x_i, y_i} ∈ D, and T is the dimension of the gradient G(x_i, y_i).

The randomized gradient smoothing for centralized machine unlearning (CMU) on gradient quantization in the PCMU method (Zhang et al., 2022b) is given as follows:

S(Ḡ) = argmax_{c ∈ {−1, 0, 1}} P_{ε∼D}(Q(Ḡ + ε) = c),    (2)

where D = N(0, σ²I) is a Gaussian distribution. Q is a gradient quantizer that maps each dimension of the continuous gradient G(x, y) ∈ R^T onto a discrete three-class space {−1, 0, 1}, for mimicking the classification in the randomized smoothing for certified robustness. S is a smooth version of Q: S returns whichever gradient class Q is most likely to return when Ḡ is perturbed by the noise ε.

The authors theoretically derive the certified radius R regarding the data change and the certified radius R′ about the gradient change of the MU model before and after data removals. The following theorem analyzes the correlation between the two types of certified radii and the certified budget of data removals (Zhang et al., 2022b).

Theorem 1. Let L be the Lipschitz constant of the gradient G(x, y) ∈ R^T. Then

R ≥ √T R′ / L.    (3)

The certified budget B′ of data removal derived from R′ is

B′ ≤ N − 36dL² / (T (Φ⁻¹(p′_A) − Φ⁻¹(p′_B))²),    (4)

where Φ⁻¹ is the inverse of the standard Gaussian CDF. Let p_c(Ḡ) be the output probability of Q over gradient class c, i.e., p_c(Ḡ) = P_{ε∼D}(Q(Ḡ + ε) = c); p′_A and p′_B are the probabilities on the most probable class c_A and the runner-up class c_B, respectively.

The above theorem shows that the smooth gradient quantizer S can always output the correct and unchanged quantized gradients as long as the number of data removals B′ (i.e., the number of escaped data samples in D) is within the certified budget of N − 36dL²/(T(Φ⁻¹(p′_A) − Φ⁻¹(p′_B))²). This implies that the CMU model with randomized gradient smoothing and quantization directly trained on the whole dataset D shares the same gradients (and parameters) with the one retrained on only the remembered data D^r, which is the gold standard for evaluating the MU performance.
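For illustration only, the smooth gradient quantizer S in Eq. (2) can be approximated by Monte Carlo sampling, and the certified budget in Eq. (4) can then be evaluated from estimated or assumed class probabilities. The sketch below uses a simple per-dimension threshold quantizer as a stand-in for PCMU's quantizer Q (defined in Eqs. (20)-(21) of the PCMU paper and not reproduced here); all names and constants are assumptions for the example, and the budget formula follows our reconstruction of Eq. (4).

```python
import numpy as np
from scipy.stats import norm

def quantize(g, tau=0.01):
    """Stand-in per-dimension quantizer Q: R^T -> {-1, 0, 1} (not PCMU's exact Q)."""
    return np.where(g > tau, 1, np.where(g < -tau, -1, 0))

def smooth_quantizer(g_bar, sigma, n_samples=2000, seed=0):
    """Monte Carlo version of S(G) = argmax_c P(Q(G + eps) = c) from Eq. (2),
    returning the smoothed quantized gradient and the top-2 class probabilities."""
    rng = np.random.default_rng(seed)
    T = g_bar.shape[0]
    counts = np.zeros((T, 3))                      # classes -1, 0, 1 -> columns 0, 1, 2
    for _ in range(n_samples):
        q = quantize(g_bar + rng.normal(0.0, sigma, size=T))
        counts[np.arange(T), q + 1] += 1
    probs = counts / n_samples
    top2 = np.sort(probs, axis=1)[:, ::-1][:, :2]  # per-dimension (p_A, p_B)
    s = np.argmax(probs, axis=1) - 1               # smoothed quantized gradient
    return s, top2

def certified_budget(p_a, p_b, N, L, T, d):
    """Certified budget of data removals, following the reconstructed Eq. (4):
    B' <= N - 36 d L^2 / (T (Phi^{-1}(p_A) - Phi^{-1}(p_B))^2). Illustrative only."""
    gap = norm.ppf(p_a) - norm.ppf(p_b)
    return N - 36.0 * d * L**2 / (T * gap**2)

# Toy usage with assumed values.
g_bar = np.array([0.08, -0.05, 0.001, 0.2])
s, top2 = smooth_quantizer(g_bar, sigma=0.05)
print("smoothed quantized gradient:", s)
print("budget:", certified_budget(p_a=0.9, p_b=0.05, N=50_000, L=1.0, T=g_bar.size, d=1.0))
```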
3. Fast Federated Machine Unlearning

In this work, we first train a local ML model f(x; Ḡ_k) (x ∈ R^n) on each edge device k and leverage the PCMU method (Zhang et al., 2022b) to transform the original model gradient Ḡ_k into its smooth and quantized version Q(Ḡ_k + ε) for fast MU on the edge devices. We further generate a local MU model g(x; Q(Ḡ_k)) = E_ε[f(x; Q(Ḡ_k + ε))], ε ∼ N(0, σ²I), on device k. By leveraging the theory of nonlinear functional analysis, we reformulate g(x; q_k) as output functions of a Nemytskii operator O(q_k)(x), where q_k = Q(Ḡ_k). We theoretically prove that the Nemytskii operator O(q_k)(·) is smooth and induces a Fréchet differentiable smooth manifold O(R³). Further analysis shows that the smooth manifold has a global Lipschitz constant C/(√(2π)σ). Based on the global Lipschitz property of O(R³), we derive the bounded difference between the global MU model O(q)(x) on the server and the local MU model O(q_k)(x) on each device k, where q = Q(Ḡ). Thus, the global MU model on the server is guaranteed to achieve close performance to each local MU model with the certified CMU guarantees. This helps O(q)(x) maintain the certified radius and budget of data removals of the local MU models based on the PCMU method to a certain degree.
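As an illustration of the device-side computation, the smoothed local MU model g(x; q_k) = E_ε[f(x; Q(Ḡ_k + ε))] can be estimated by Monte Carlo averaging. The sketch below uses a placeholder linear model and the same stand-in quantizer as above, which are our simplifications rather than the architectures used in FFMU.

```python
import numpy as np

def quantize(g, tau=0.01):
    # Stand-in per-dimension quantizer Q: R^T -> {-1, 0, 1} (not PCMU's exact Q).
    return np.where(g > tau, 1, np.where(g < -tau, -1, 0))

def f(x, q):
    # Placeholder model: a linear scorer whose parameters are the quantized
    # gradient vector q. Purely illustrative; FFMU uses CNN/LeNet/ResNet models.
    return x @ q

def local_mu_model(x, g_bar_k, sigma, n_samples=1000, seed=0):
    """Monte Carlo estimate of g(x; q_k) = E_eps[ f(x; Q(G_k + eps)) ],
    eps ~ N(0, sigma^2 I), i.e., the smoothed local MU model on device k."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.normal(0.0, sigma, size=g_bar_k.shape)
        total += f(x, quantize(g_bar_k + eps))
    return total / n_samples

x = np.array([0.5, -1.0, 2.0])
g_bar_k = np.array([0.1, -0.02, 0.03])
print(local_mu_model(x, g_bar_k, sigma=0.05))
```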
Nonlinear functional analysis is a branch of mathematical analysis that deals with nonlinear mappings (i.e., nonlinear operators) between infinite-dimensional vector spaces and certain classes of nonlinear spaces (Riesz & Sz.-Nagy, 1955). The following definitions describe the differentiability properties of operators between Banach spaces and the Nemytskii operator.

In functional analysis, a Banach space is a complete normed vector space. It is often used for the computation of vector length and the distance between vectors.

Definition 1. A Banach space is a complete normed space (X, ‖·‖), where it is complete if any Cauchy sequence in X has a limit. Namely, for every Cauchy sequence x_1, x_2, ··· ∈ X, there exists some x ∈ X such that the sequence's convergence to x can be expressed as lim_{n→∞} ‖x_n − x‖ = 0. The norm ‖·‖ of a normed space (X, ‖·‖) is a complete norm if (X, ‖·‖) is a Banach space.

L^p spaces are function spaces defined using a natural generalization of the p-norm for finite-dimensional vector spaces. Specifically, (R^n, ‖·‖_{L^p}) is the vector space R^n equipped with the L^p-norm, i.e.,

‖(x_1, x_2, ···, x_n)‖_{L^p} = (Σ_{i=1}^{n} |x_i|^p)^{1/p} for 1 ≤ p < ∞;  ‖(x_1, x_2, ···, x_n)‖_{L^∞} = sup_{1≤i≤n} |x_i|.    (5)

The L^p function spaces are defined using a natural generalization of the L^p-norm for finite-dimensional vector spaces. Next, we define the L^p(Ω) space. When 1 ≤ p < ∞,

L^p(Ω) = {h(x) : Ω ⊂ R^n → R, ∫_Ω |h(x)|^p dx < ∞},    (6)

equipped with the norm ‖h(x)‖_{L^p} = (∫_Ω |h(x)|^p dx)^{1/p}. When p = ∞,

L^∞(Ω) = {h(x) : Ω ⊂ R^n → R, sup_{x∈Ω} |h(x)| < ∞},    (7)

equipped with the norm ‖h(x)‖_{L^∞} = sup_{x∈Ω} |h(x)|.

Definition 2. Let X, Y be two Banach spaces. The operator L : X → Y is linear if and only if L(αx_1 + βx_2) = αL(x_1) + βL(x_2) for any α, β ∈ R and x_1, x_2 ∈ X. The operator norm of L is defined by ‖L‖_op = sup_{x∈X, x≠0} ‖L(x)‖_Y / ‖x‖_X.

Definition 3. Let Ω ⊂ R^n be a domain and Y be a Banach space. Given a functional F : Ω × Y → R, for any y ∈ Y a new functional O(y) : Ω → R is defined as O(y)(x) = F(x; y). The operator O is a Nemytskii operator.

In order to better understand the concept of Nemytskii operators, we use linear functions as an example. Linear functions are essentially linear functionals from Banach spaces to R, and they induce a Nemytskii operator from certain Banach spaces to the space of bounded linear functionals from L^p to R, i.e., B(L^p). Consider a linear function h(x) = qx, where q, x ∈ R^n. If the domain of x is a Banach space (R^n, ‖·‖_{L^p}), then h(x) can be treated as a functional from L^p to R. Based on the Hölder inequality, we can get |h(x)| = |qx| ≤ ‖q‖_{L^{p′}} ‖x‖_{L^p}, where p′ is the conjugate of p satisfying 1/p + 1/p′ = 1. This implies that the operator norm ‖h‖_op of the linear functional h(x) is ‖h‖_op = sup_{x∈R^n, x≠0} |h(x)|/‖x‖_p = ‖q‖_{p′}. Thus, h(x) is a bounded linear functional from L^p to R for any q ∈ L^{p′}, i.e., h(x) ∈ B(L^p) for any q ∈ L^{p′}. If we further consider q as a vector, the mapping P(q) : q → h(x) ∈ B(L^p), i.e., P(q)(x) = h(x), can be viewed as an operator from the Banach space L^{p′} to the Banach space B(L^p). The operator P(q) : L^{p′} → B(L^p) is a Nemytskii operator.
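The operator-norm claim in the example above can be checked numerically: for h(x) = ⟨q, x⟩, the supremum of |h(x)|/‖x‖_p is attained at x_i = sign(q_i)|q_i|^{p′−1} and equals ‖q‖_{p′}. A short NumPy check (ours, for illustration):

```python
import numpy as np

def operator_norm_check(q, p):
    """Check numerically that sup_x |<q, x>| / ||x||_p equals ||q||_{p'} (Hölder),
    using the known maximizer x_i = sign(q_i) * |q_i|^(p'-1)."""
    p_conj = p / (p - 1.0)                       # 1/p + 1/p' = 1
    x_star = np.sign(q) * np.abs(q) ** (p_conj - 1.0)
    attained = abs(q @ x_star) / np.linalg.norm(x_star, ord=p)
    return attained, np.linalg.norm(q, ord=p_conj)

q = np.array([0.3, -1.2, 0.7, 2.0])
for p in (1.5, 2.0, 3.0):
    attained, dual_norm = operator_norm_check(q, p)
    print(p, attained, dual_norm)   # the two values coincide for each p
```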
In terms of the above analysis, we define a hypothesis space H (H = {f(x; Q(Ḡ)) | ∀Q(Ḡ)}) as a Banach manifold in L^p space. We reformulate the non-smooth models f(x; Q(Ḡ)) as output functions of a Nemytskii operator P : R³ → L^p with P(q)(x) = f(x; Q(Ḡ)). We define another Nemytskii operator O : R³ → L^p with the randomized gradient smoothing and quantization as follows:

O(q)(·) = E_ε[f(·; Q(Ḡ + ε))], ε ∼ N(0, σ²I).    (8)

Namely, we have

O(q)(·) = O(Q(Ḡ))(·) = ∫_{R³} P(Q(Ḡ + ε))(·) (1/(2πσ²)^{3/2}) e^{−‖ε‖₂²/(2σ²)} dε.    (9)

Definition 4. Let X and Y be two Banach spaces, and let o be an operator from X to Y. The operator o is called Fréchet differentiable at x ∈ X if there exists a bounded linear operator L : X → Y such that

lim_{‖∆‖_X → 0} ‖o(x + ∆) − o(x) − L(x)∆‖_Y / ‖∆‖_X = 0.    (10)

The linear operator L(x) is called the Fréchet derivative of o at x.

The following theory demonstrates that the Nemytskii operator O(q)(·) is smooth and induces a Fréchet differentiable smooth manifold O(R³).

Theorem 2. O is Fréchet differentiable and O(R³) ⊂ L^p is a smooth manifold.

Please refer to Appendix A.1 for the detailed proof of Theorem 2.

Now, we analyze the advantage of the randomized gradient smoothing techniques and the reason why we choose the Nemytskii operator O(q)(x) or O(q_k)(x) as the global model or the local model in the FMU setting. Since O(R³)(·) is a smooth manifold, it guarantees that the gradient quantizers q_k are trainable. Most importantly, the smooth manifold O(R³)(·) has a global Lipschitz constant which is independent of the input data, as demonstrated in Theorem 3. This independence property ensures that the certified MU guarantees of the smooth local models can be maintained to a certain degree against any data removals within the certified budget. On the other hand, the manifold P(R³)(·) without the randomized gradient smoothing is only locally Lipschitz with respect to q_k, and the Lipschitz constant is determined by the input data. In addition, the Lipschitz constant regarding q_k could be rather large, since it is hard to control the amplification of differences through propagation over neural networks, and thus the Lipschitz constant keeps increasing with the number of layers, which prevents the global model from preserving the certified MU guarantees of the local models in the FMU. In the PCMU method (Zhang et al., 2022b), the randomized gradient smoothing is a necessary step to provide the certified MU guarantees (i.e., the certified radius regarding the gradient change and the certified budget of data removals) of the smooth models. Without the randomized gradient smoothing, P(q)(x) fails to provide any certified guarantees for the FMU task.

Theorem 3. Let Ḡ_k ∈ R^T be the gradient of the local ML model on edge device k, Q be the gradient quantizer, C be the Lipschitz constant of Q, q_k = Q(Ḡ_k), and O be the smooth Nemytskii operator. If ‖Ḡ_k − Ḡ_l‖₂ ≤ d for any q_k, q_l ∈ R³, then we have

|Ω|⁻¹ ‖O(q_k)(x) − O(q_l)(x)‖_{L^p(Ω)} ≤ Cd/(√(2π)σ) if 1 ≤ p < ∞,
‖O(q_k)(x) − O(q_l)(x)‖_{L^p(Ω)} ≤ Cd/(√(2π)σ) if p = ∞,    (11)

for any Ω ⊂ R^n.

Please refer to Appendix A.1 for the detailed proof of Theorem 3.
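Note that the Theorem 3 bound Cd/(√(2π)σ) depends only on the quantizer's Lipschitz constant C, the gradient distance d, and the noise scale σ, not on the input data. A small computation with assumed values (the formula follows our reconstruction of Eq. (11)) illustrates how the bound shrinks as σ grows:

```python
import math

def theorem3_bound(C, d, sigma):
    """Data-independent bound C*d / (sqrt(2*pi) * sigma) on the difference
    between two smoothed MU models, as reconstructed from Theorem 3."""
    return C * d / (math.sqrt(2.0 * math.pi) * sigma)

# Assumed values: C = 1 (1-Lipschitz quantizer), gradient distance d = 0.1.
for sigma in (0.025, 0.05, 0.1):
    print(sigma, theorem3_bound(C=1.0, d=0.1, sigma=sigma))
# A larger sigma tightens the bound, matching the sigma trade-off discussed below.
```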
According to Theorem 3, the following theorem demonstrates that the smooth global model O(q)(x) on the server is close to each smooth local model O(q_k)(x) on the edge devices within distance (K−1)Cd/(√(2π)Kσ), for the preservation of the MU certificates of the smooth local models based on the PCMU method to a certain degree.

Theorem 4. Let Ḡ_k ∈ R^T be the gradient of the local ML model on edge device k, Q be the gradient quantizer, C be the Lipschitz constant of Q, q_k = Q(Ḡ_k), O be the smooth Nemytskii operator, and O(q_k)(x) be the smooth local model on device k. Let Ḡ = (Ḡ_1 + ··· + Ḡ_K)/K be the gradient of the global ML model on the server obtained by averaging the gradients of all the local ML models, q = Q(Ḡ), and O(q)(x) be the smooth global model on the server. If max_{1≤k,l≤K} ‖Ḡ_k − Ḡ_l‖₂ = d, then we have

|Ω|⁻¹ ‖O(q)(x) − O(q_k)(x)‖_{L^p(Ω)} ≤ (K−1)Cd/(√(2π)Kσ) if 1 ≤ p < ∞,
‖O(q)(x) − O(q_k)(x)‖_{L^p(Ω)} ≤ (K−1)Cd/(√(2π)Kσ) if p = ∞,    (12)

for any Ω ⊂ R^n.

Please refer to Appendix A.1 for the detailed proof of Theorem 4.

Therefore, O(q)(x) on the server has a certified guarantee of (K−1)Cd/(√(2π)Kσ) in L^p space for any 1 ≤ p ≤ ∞. Namely, this certified guarantee of O(q)(x) ensures that the smooth global model O(q)(x) is close to each smooth local model O(q_k)(x) on the edge devices within distance (K−1)Cd/(√(2π)Kσ). Notice that Theorem 4 is satisfied for any Ω ⊂ R^n, that is, the certified guarantee is independent of the input data. Therefore, it assures the closeness between the smooth global and local models on any data sample x ∈ R^n.

The standard deviation σ of the Gaussian noise in the randomized gradient smoothing serves as a tradeoff hyperparameter to balance the MU performance and the prediction accuracy achieved by the smooth global model. A larger σ results in higher closeness between the smooth global and local models and thus better preservation of the certified MU guarantees by the smooth local models, while a smaller σ leads to better prediction accuracy. In particular, when σ → 0, the smooth global model O(q)(x) on the server converges to the non-smooth one P(q)(x), which is validated by Theorem 5.

Theorem 5. Let Ḡ be the gradient of the global ML model on the server, Q be the gradient quantizer, q = Q(Ḡ), and O and P be the smooth and non-smooth Nemytskii operators, respectively. Then O(q)(x) → P(q)(x) for any data sample x ∈ R^n if σ → 0.

Please refer to Appendix A.1 for the detailed proof of Theorem 5.

FFMU model training. On the device side, a local ML model f(x; Ḡ_k) (x ∈ R^n) is trained on each edge device k with only its local data. The PCMU method (Zhang et al., 2022b) is leveraged to transform the original gradient Ḡ_k into its smooth and quantized version Q(Ḡ_k + ε). The local MU model is reformulated as an output function of a smooth Nemytskii operator O(q_k)(x):

O(q_k)(x) = E_ε[f(x; Q(Ḡ_k + ε))], ε ∼ N(0, σ²I).    (13)

On the server side, the gradient Ḡ of the global ML model on the server is aggregated by averaging the gradients of all the local ML models:

Ḡ = (Ḡ_1 + ··· + Ḡ_K)/K.    (14)

The global gradient Ḡ is converted to its smooth and quantized version Q(Ḡ + ε). The global MU model is reformulated as an output function of a smooth Nemytskii operator O(q)(x):

O(q)(x) = E_ε[f(x; Q(Ḡ + ε))], ε ∼ N(0, σ²I).    (15)

After the model training, O(q)(x) will be the output of our FFMU algorithm for fast federated machine unlearning.
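The FFMU training flow in Eqs. (13)-(15) can be summarized by the following minimal sketch, which reuses the stand-in quantizer and placeholder model from the earlier examples (names and models are ours, not the paper's implementation): each device forms a smoothed local MU model, the server averages the local gradients, and the smoothed global MU model is built from the averaged gradient.

```python
import numpy as np

def quantize(g, tau=0.01):
    # Stand-in quantizer Q: R^T -> {-1, 0, 1}; PCMU's exact Q is not reproduced here.
    return np.where(g > tau, 1, np.where(g < -tau, -1, 0))

def smoothed_model(x, g_bar, sigma, f, n_samples=500, seed=0):
    # Monte Carlo estimate of E_eps[ f(x; Q(g_bar + eps)) ], eps ~ N(0, sigma^2 I).
    rng = np.random.default_rng(seed)
    samples = [f(x, quantize(g_bar + rng.normal(0.0, sigma, size=g_bar.shape)))
               for _ in range(n_samples)]
    return float(np.mean(samples))

def ffmu_round(x, local_gradients, sigma, f):
    """One FFMU flow: local MU models on the devices (Eq. (13)), gradient averaging
    on the server (Eq. (14)), and the global MU model O(q)(x) (Eq. (15))."""
    local_outputs = [smoothed_model(x, g_k, sigma, f) for g_k in local_gradients]
    g_global = np.mean(local_gradients, axis=0)          # (G_1 + ... + G_K) / K
    global_output = smoothed_model(x, g_global, sigma, f)
    return local_outputs, global_output

f = lambda x, q: float(x @ q)     # placeholder model, as in the earlier sketches
x = np.array([0.5, -1.0, 2.0])
local_gradients = [np.array([0.10, -0.02, 0.03]),
                   np.array([0.08, -0.05, 0.06]),
                   np.array([0.12,  0.01, 0.02])]
locals_, global_ = ffmu_round(x, local_gradients, sigma=0.05, f=f)
print(locals_, global_)   # the global output stays close to each local output
```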
4. Experiments

In this section, we have evaluated the effectiveness of our FFMU model and other comparison methods for federated machine unlearning over three popular image classification datasets: Fashion-MNIST (Xiao et al., 2017; Gupta et al., 2021; Fu et al., 2022), CIFAR-10 (Krizhevsky, 2009; Golatkar et al., 2020a;b; Thudi et al., 2021a; Gupta et al., 2021; Fu et al., 2022), and SVHN (Netzer et al., 2011; Guo et al., 2020; Bourtoule et al., 2021). We train the classifiers on the training set and test them on the test set for the three datasets. We train a convolutional neural network (CNN) on Fashion-MNIST for clothing classification. We train LeNet over CIFAR-10 for image classification. We apply the ResNet-18 architecture on SVHN for street view house number identification. We evaluate the performance of various machine unlearning methods on the three datasets with the ratio of data removal between 5% and 20%.

Variants of FFMU model. We evaluate two versions of FFMU to show the strengths of different techniques. FFMU-N uses the FedAvg method to aggregate the local MU models on the edge devices into a global MU model on the server. FFMU performs the aggregation process with the Nemytskii operator transformation and average smooth local gradients. FFMU can guarantee that the global MU model on the server achieves close performance to each local MU model with the certified guarantees.

Baselines. We compare the FFMU model with ten state-of-the-art federated machine unlearning models. Knowledge Distillation is a federated unlearning method that eliminates a client's contribution by subtracting the accumulated historical updates from the model and leveraging knowledge distillation to restore the model's performance without using any data from the clients (Wu et al., 2022a). Rapid Retraining is a rapid retraining approach to fully erase data samples from a trained FL model (Liu et al., 2022c). MacForget introduces a mask gradient generator that continuously generates mask gradients and applies them to the neurons of the neural network to stimulate them to unlearn the memorization of the given samples (Liu et al., 2020). FedEraser is a federated unlearning methodology that can eliminate the influence of a federated client's data on the global federated learning (FL) model while significantly reducing the time used for constructing the unlearned FL model (Liu et al., 2021a). VeriFi is a unified framework integrating federated unlearning and verification that allows systematic analysis of the unlearning and quantification of its effect, with different combinations of multiple unlearning and verification methods (Gao et al., 2022). Class-Discriminative Pruning utilizes CNN channel pruning to guide the federated machine unlearning process (Wang et al., 2022b). UN performs unlearning at the client (to be erased) by reversing the learning process, i.e., training a model to maximize the local empirical loss (Halimi et al., 2022). RCAD can unlearn spurious features in the training data by increasing entropy only on examples generated along the adversarial direction (Setlur et al., 2022). IJ is an online unlearning algorithm that is both computationally and memory efficient by leveraging the infinitesimal jackknife (Suriyakumar & Wilson, 2022). Noisy-GD provides a robust data-deletion guarantee that satisfies differential privacy to ensure true data deletion (Chourasia et al., 2023).

Evaluation metrics. By following the same settings as representative machine unlearning models (Golatkar et al., 2020a;b; Thudi et al., 2021a; Fu et al., 2022), we use four popular measures in machine unlearning to verify the performance of different methods: Accuracy, Error^f (classification errors on the forgotten data D^f), Error^r (errors on the remembered data D^r), and Error^t (errors on the test data). Since the model W^r (Retrain) that uses only the remembered data D^r as the training data retrained from scratch has never seen the forgotten data D^f, it is often used as the gold standard for evaluating the unlearning performance (Golatkar et al., 2020a; Fu et al., 2022). Ideally, the accuracy and the three errors of the unlearning models W̃^r should match those of the retrained model W^r.
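For concreteness, the four metrics above (Accuracy and the error rates on the test data, the forgotten data D^f, and the remembered data D^r) can be computed as in the following sketch (function names are ours):

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of misclassified examples (used for Error_f, Error_r, Error_t)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

def unlearning_metrics(pred_test, y_test, pred_forget, y_forget, pred_remember, y_remember):
    """Accuracy / Error_t on the test data, Error_f on the forgotten data D^f,
    and Error_r on the remembered data D^r, as reported in Tables 1-4."""
    err_t = error_rate(y_test, pred_test)
    return {
        "Accuracy": 100.0 * (1.0 - err_t),
        "Error_t": 100.0 * err_t,
        "Error_f": 100.0 * error_rate(y_forget, pred_forget),
        "Error_r": 100.0 * error_rate(y_remember, pred_remember),
    }

# Toy usage with made-up labels/predictions.
print(unlearning_metrics(pred_test=[0, 1, 1, 0], y_test=[0, 1, 0, 0],
                         pred_forget=[1, 0], y_forget=[1, 1],
                         pred_remember=[0, 1, 1], y_remember=[0, 1, 1]))
```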
Table 1: Performance with 10% data removal and CNN on Fashion-MNIST

Method | Accuracy | Error^t | Error^r | Error^f | Training (s) | Unlearning (s) | Total (s)
Retrain | 84.40 | 15.60 | 14.70 | 15.75 | 211 | 1,636 | 1,847
Knowledge Distillation | 81.05 | 18.95 | 17.51 | 20.28 | 213 | 2,734 | 2,947
Rapid Retraining | 81.86 | 18.14 | 17.56 | 17.50 | 198 | 1,850 | 2,048
MacForget | 82.36 | 17.64 | 16.82 | 16.81 | 227 | 1,830 | 2,057
FedEraser | 81.64 | 18.36 | 17.70 | 16.72 | 369 | 3,610 | 3,979
VeriFi | 77.31 | 22.69 | 22.29 | 19.89 | 234 | 1,943 | 2,177
Class-Discriminative Pruning | 81.10 | 18.90 | 17.61 | 19.03 | 232 | 1,719 | 1,951
UN | 80.87 | 19.13 | 17.90 | 16.69 | 234 | 1,965 | 2,199
RCAD | 79.44 | 20.56 | 19.92 | 18.97 | 493 | 5,865 | 6,358
IJ | 82.47 | 17.53 | 17.22 | 13.24 | 232 | 1,846 | 2,078
Noisy-GD | 83.56 | 16.44 | 16.01 | 14.37 | 232 | 1,866 | 2,098
FFMU | 84.55 | 15.45 | 14.88 | 15.07 | 1,288 | 0 | 1,288

Table 2: Performance with 20% data removal and CNN on Fashion-MNIST

Method | Accuracy | Error^t | Error^r | Error^f | Training (s) | Unlearning (s) | Total (s)
Retrain | 84.39 | 15.61 | 14.74 | 15.12 | 198 | 2,340 | 2,538
Knowledge Distillation | 81.58 | 18.42 | 17.63 | 19.69 | 227 | 5,459 | 5,686
Rapid Retraining | 82.34 | 17.66 | 16.49 | 16.71 | 187 | 3,658 | 3,845
MacForget | 81.02 | 18.98 | 18.37 | 16.75 | 213 | 3,608 | 3,821
FedEraser | 79.51 | 20.49 | 19.33 | 18.65 | 369 | 3,610 | 3,979
VeriFi | 81.65 | 18.35 | 16.53 | 19.82 | 219 | 3,908 | 4,127
Class-Discriminative Pruning | 81.89 | 18.11 | 16.93 | 18.59 | 240 | 3,409 | 3,649
UN | 82.20 | 17.80 | 16.43 | 18.85 | 221 | 3,935 | 4,156
RCAD | 79.27 | 20.73 | 20.40 | 18.38 | 346 | 7,091 | 7,437
IJ | 82.17 | 17.83 | 16.51 | 19.00 | 232 | 3,622 | 3,854
Noisy-GD | 81.98 | 18.02 | 17.07 | 17.68 | 244 | 3,751 | 3,995
FFMU | 84.55 | 15.45 | 14.83 | 15.19 | 1,288 | 0 | 1,288

Table 3: Performance with 10% data removal and LeNet on CIFAR-10

Method | Accuracy | Error^t | Error^r | Error^f | Training (s) | Unlearning (s) | Total (s)
Retrain | 53.66 | 46.34 | 42.67 | 47.76 | 165 | 1,666 | 1,831
Knowledge Distillation | 52.32 | 47.68 | 47.09 | 49.66 | 143 | 1,999 | 2,142
Rapid Retraining | 49.68 | 50.32 | 47.82 | 49.56 | 144 | 1,345 | 1,489
MacForget | 51.98 | 48.02 | 47.75 | 48.21 | 151 | 1,358 | 1,509
FedEraser | 52.84 | 47.16 | 45.89 | 46.28 | 375 | 4,160 | 4,543
VeriFi | 49.56 | 50.44 | 49.05 | 51.88 | 144 | 1,429 | 1,573
Class-Discriminative Pruning | 50.07 | 49.93 | 48.05 | 53.84 | 194 | 1,123 | 1,317
UN | 52.13 | 47.87 | 46.21 | 48.50 | 187 | 1,441 | 1,628
RCAD | 51.74 | 48.26 | 47.10 | 46.76 | 311 | 6,142 | 6,453
IJ | 52.01 | 47.99 | 47.33 | 46.75 | 172 | 1,358 | 1,530
Noisy-GD | 50.78 | 49.22 | 47.18 | 49.96 | 145 | 1,467 | 1,612
FFMU | 54.84 | 45.16 | 42.78 | 47.68 | 986 | 0 | 986

Table 4: Performance with 20% data removal and LeNet on CIFAR-10

Method | Accuracy | Error^t | Error^r | Error^f | Training (s) | Unlearning (s) | Total (s)
Retrain | 54.97 | 45.03 | 41.20 | 45.86 | 166 | 3,152 | 3,318
Knowledge Distillation | 51.62 | 48.38 | 47.53 | 48.46 | 145 | 3,869 | 4,014
Rapid Retraining | 51.67 | 48.33 | 46.79 | 44.97 | 141 | 2,664 | 2,805
MacForget | 50.71 | 49.29 | 48.89 | 48.71 | 148 | 2,725 | 2,873
FedEraser | 52.53 | 47.47 | 45.51 | 47.54 | 367 | 6,993 | 7,360
VeriFi | 50.57 | 49.43 | 48.73 | 49.92 | 149 | 2,812 | 2,961
Class-Discriminative Pruning | 52.27 | 47.73 | 46.46 | 47.88 | 165 | 2,297 | 2,462
UN | 53.51 | 46.49 | 43.81 | 46.44 | 184 | 2,919 | 3,103
RCAD | 49.60 | 50.40 | 49.73 | 47.84 | 313 | 7,890 | 8,203
IJ | 51.39 | 48.61 | 47.60 | 46.90 | 183 | 2,825 | 3,008
Noisy-GD | 51.77 | 48.23 | 46.75 | 47.50 | 145 | 2,960 | 3,105
FFMU | 54.84 | 45.16 | 42.59 | 46.00 | 986 | 0 | 986

Machine unlearning accuracy with varying ratios of data removal. Tables 1-4 exhibit the accuracy obtained by twelve federated machine unlearning approaches when varying the ratio of unlearning requests / data removal between 10% and 20%. Retrain represents the model retrained on only the remembered data D^r from scratch, without the knowledge of the forgotten data D^f. A federated machine unlearning algorithm whose performance is more similar to the Retrain model achieves a better unlearning result. It is observed that, among the eleven approaches except the Retrain model, no matter how large the ratios of data removal are, the FFMU method achieves the closest accuracy to the Retrain model in all tests, showing the effectiveness of FFMU for federated machine unlearning. Compared to the absolute performance difference between other baselines and the Retrain model, FFMU, on average, achieves at least 22.09% and 9.43% improvement of the absolute accuracy difference on Fashion-MNIST and CIFAR-10, respectively. In addition, the promising performance of FFMU over Fashion-MNIST and CIFAR-10 implies that FFMU has great potential as a general federated machine unlearning solution for other image datasets, which is desirable in practice.

Figure 1: Performance of FFMU variants with 20% data removal: (a) Accuracy, (b) Error^r, (c) Error^f.

Machine unlearning error with varying ratios of data removal. Tables 1-4 also show the classification errors on the deleted data D^f (Error^f), the errors on the remembered data D^r (Error^r), and the errors on the test data (Error^t) by the twelve federated machine unlearning methods, respectively.
We have observed that the performance of our FFMU method behaves similarly and achieves at least a 20.35% and 14.98% decrease of the absolute error difference on the two datasets, respectively. FFMU substantially outperforms the other baselines in most tests, especially on Fashion-MNIST. In addition, the errors by our FFMU are not sensitive to the ratio of data removals. This is because our FFMU method performs a one-time operation of simultaneous training and unlearning when addressing a series of machine unlearning requests, as long as the ratio of actual data removals is below the certified budget of data removals in our FFMU. However, other baselines need to sequentially handle these unlearning requests one by one.

Figure 2: Performance with varying parameters: (a) standard deviation, (b) data removal (%).

Ablation study. Figure 1 exhibits the unlearning performance of the Retrain model and the two variants of FFMU on the three datasets. We have observed that FFMU achieves the closest accuracy and errors to the Retrain model over the three datasets, which are obviously better than FFMU-N. A reasonable explanation is that FFMU leverages the PCMU method to utilize the randomized gradient smoothing and quantization for supporting certified MU on the edge devices. The global Lipschitz property of the Nemytskii operator ensures that the global MU model achieves close performance to each local MU model, which implies that the certified guarantees on the devices are maintained in the global MU model.

Impact of standard deviation. Figure 2 (a) measures the performance effect of the standard deviation of the Gaussian distribution in the randomized smoothing for machine unlearning by varying σ from 0.025 to 0.03. Notice that the Retrain and FFMU-N models do not contain the module of randomized smoothing; thus, their accuracy scores keep unchanged with varying σ. We have witnessed that the performance curves by FFMU initially increase quickly and then become stable or even drop slightly when σ continuously increases. Initially, a large σ can help utilize the strength of randomized gradient smoothing and quantization for directly training a machine unlearning model in advance. Later on, when σ continues to increase and goes beyond some threshold, the performance curves become stable. A rational guess is that, after the data removals have already been certified at a certain threshold and considered in the training of the machine unlearning models, our FFMU model is able to generate a good machine unlearning result. When σ continuously increases, this does not affect the performance of federated machine unlearning any more.

Impact of data removal ratio. Figure 2 (b) evaluates the accuracy impact of data removal ratios varying from 5% to 20% on the three datasets of Fashion-MNIST, CIFAR-10, and SVHN. It is observed that, when changing data removal ratios, the accuracy by our FFMU model matches well with that of the model retrained from scratch. The performance by our FFMU model keeps relatively stable, since our method directly trains an unlearning model based on the certified budget of data removals in advance and performs a one-time operation of simultaneous training and unlearning, as long as the ratio of actual data removals is below the certified budget of data removals. Thus, we do not need to re-unlearn the model when a new unlearning request comes. This shows the potential of our FFMU model in addressing a series of federated machine unlearning requests.
Running time. Tables 1-4 report the running time achieved by all comparison methods over the two datasets to produce machine unlearning results, respectively. We observe that FFMU scales well with deep neural network architectures over different image datasets and shows good efficiency for federated machine unlearning. Our FFMU method achieves better efficiency than the other baseline methods in most experiments. In addition, our FFMU method performs a one-time operation of simultaneous training and unlearning when addressing a series of machine unlearning requests. However, other baseline methods need to sequentially handle these machine unlearning requests one by one. This is clearly a computationally expensive process when the number of machine unlearning requests is huge.

5. Related Work

(1) Centralized Machine Unlearning. Trustworthy machine learning has attracted active research in recent years (Palanisamy et al., 2018; Zhou et al., 2020b; Zhang et al., 2020; Zhou et al., 2021; Zhao et al., 2021; Ren et al., 2021; Zhang et al., 2021c;a; Zhou et al., 2022b; Jin et al., 2022b; Zhang et al., 2022b; Zhou et al., 2010; 2009; Cheng et al., 2011; Zhou & Liu, 2012; Cheng et al., 2012; Zhou & Liu, 2013; Su et al., 2013; Zhou & Liu, 2014; Su et al., 2015; Zhou & Liu, 2015; Zhou et al., 2015a; 2016; 2018b;a; Ren et al., 2019; Zhou et al., 2019b;a;c; Zhou & Liu, 2019; Wu et al., 2020a; 2021a; Zhou et al., 2020c;a; Jin et al., 2021; Wu et al., 2021b). Machine unlearning, one of the important research topics in trustworthy machine learning, has been gaining attention in recent years (Cao & Yang, 2015; Ginart et al., 2019; Guo et al., 2020; Golatkar et al., 2020a; Garg et al., 2020; Shibata et al., 2021; Gupta et al., 2021; Wu et al., 2022b; Nguyen et al., 2022). Machine unlearning can be broadly grouped into two categories: exact unlearning and approximate unlearning methods. In exact unlearning, the impact of the data to be forgotten is excluded from the model, as if retraining the model on the remaining data from scratch (Cauwenberghs & Poggio, 2000; Karasuyama & Takeuchi, 2009; Cao & Yang, 2015; Ginart et al., 2019; Chen et al., 2019; Schelter, 2020; Li et al., 2021; Mahadevan & Mathioudakis, 2021; Brophy & Lowd, 2021; Schelter et al., 2021; Chen et al., 2022). In approximate unlearning, the methods aim to approximate the parameters that would have been obtained if the model had been trained without using the data to be unlearned (Baumhauer et al., 2020; Graves et al., 2021; Golatkar et al., 2021; Thudi et al., 2021b; Liu et al., 2021a; Marchant et al., 2022; Zeng et al., 2022; Liu et al., 2022b; Chourasia et al., 2023).
Certified Machine Unlearning. Subsequent works follow approximate definitions similar to (Ginart et al., 2019) to provide certified unlearning guarantees for strongly convex learning problems (Guo et al., 2020; Neel et al., 2021; Sekhari et al., 2021). Certified removal is a certified-removal mechanism that applies a Newton step on the model parameters to largely remove the influence of the deleted data points (Guo et al., 2020). PCMU executes a one-time operation of simultaneous training and unlearning in advance for a series of machine unlearning requests, without the knowledge of the forgotten data (Zhang et al., 2022b).

(2) Federated Machine Unlearning. Parallel, distributed, and federated learning have been extensively studied in recent years (Lee et al., 2019; Wu et al., 2021a; Goswami et al., 2020; Zhang et al., 2021b; Zhou et al., 2022a; Guo et al., 2022; Jin et al., 2022a; Zhang et al., 2022a; Che et al., 2022; Yan et al., 2022a; Liu et al., 2022a; Yan et al., 2022b;c; Liu et al., 2023; Li et al., 2022; Liu et al., 20213; Zhou & Liu, 2013; Zhou et al., 2014; Bao et al., 2015; Zhou et al., 2015b;c; Lee et al., 2015; Zhou, 2017). Although centralized machine unlearning techniques dominate the existing research efforts, federated machine unlearning has attracted active research in recent years (Liu et al., 2020; 2021a; Gong et al., 2021b; Liu et al., 2021b; Yuan et al., 2022; Pan et al., 2022; Fraboni et al., 2022). Wang et al. proposed to utilize CNN channel pruning to guide the federated machine unlearning process (Wang et al., 2022b). Liu et al. proposed a rapid retraining approach to fully erase data samples from a trained FL model (Liu et al., 2022c). Wu et al. developed a general pipeline for simultaneously handling three common types of federated unlearning requests: class unlearning, client unlearning, and sample unlearning (Wu et al., 2022c). Wu et al. proposed a novel federated unlearning method to eliminate a client's contribution by subtracting the accumulated historical updates from the model and leveraging the knowledge distillation method to restore the model's performance without using any data from the clients (Wu et al., 2022a). VeriFi is a unified framework integrating federated unlearning and verification that allows systematic analysis of the unlearning and quantification of its effect, with different combinations of multiple unlearning and verification methods (Gao et al., 2022). UN performs unlearning at the client (to be erased) by reversing the learning process, i.e., training a model to maximize the local empirical loss (Halimi et al., 2022). FedRecover can recover an accurate global model from poisoning attacks with small cost for the clients, by using the server to estimate the clients' model updates (Cao et al., 2023). PCMU is the only method that simultaneously executes the training and unlearning operations for dramatically improving the unlearning efficiency in the centralized setting (Zhang et al., 2022b). To the best of our knowledge, a common property of the other machine unlearning methods in either centralized or federated settings is that they sequentially perform two expensive operations: training an ML model on the whole dataset and producing an unlearning model, by either retraining a new ML model on the remaining data or directly unlearning the original ML model. This strategy of sequential execution is computationally expensive when training complex models over large datasets. Motivated by the idea of PCMU, this work is the first to simultaneously execute the training and unlearning operations for improving the FMU efficiency while maintaining the unlearning quality, by leveraging the theory of nonlinear functional analysis, including the Nemytskii operator and Fréchet differentiable smooth manifolds.

6. Conclusions

In this work, we proposed a fast FMU algorithm for improving the FMU efficiency while maintaining the unlearning quality. First, the PCMU method is leveraged to train a local MU model on each edge device. Second, the local MU models are reformulated as output functions of a Nemytskii operator.
Based on the Nemytskii operator and average smooth local gradients, the global MU model on the server is guaranteed to achieve close performance to each local MU model with the certified guarantees. Finally, the theoretical analysis is conducted to bound the difference between two MU models regarding the distance between their gradients. 9 Fast Federated Machine Unlearning with Nonlinear Functional Theory References privacy. In Kim, Y., Kim, J., Vigna, G., and Shi, E. (eds.), CCS ’21: 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, Republic of Korea, November 15 - 19, 2021, pp. 896–911. ACM, 2021. Bao, X., Liu, L., Xiao, N., Zhou, Y., and Zhang, Q. Policydriven autonomic configuration management for nosql. In Proceedings of the 2015 IEEE International Conference on Cloud Computing (CLOUD’15), pp. 245–252, New York, NY, June 27-July 2 2015. Chen, Y., Xiong, J., Xu, W., and Zuo, J. A novel online incremental and decremental learning algorithm based on variable support vector machine. Clust. Comput., 22 (Supplement):7435–7445, 2019. Baumhauer, T., Schöttle, P., and Zeppelzauer, M. Machine unlearning: Linear filtration for logit-based classifiers. CoRR, abs/2002.02730, 2020. Cheng, H., Zhou, Y., and Yu, J. X. Clustering large attributed graphs: A balance between structural and attribute similarities. ACM Transactions on Knowledge Discovery from Data (TKDD), 5(2):1–33, 2011. Bourtoule, L., Chandrasekaran, V., Choquette-Choo, C., Jia, H., Travers, A., Zhang, B., Lie, D., and Papernot, N. Machine unlearning. In Proceedings of the 42nd IEEE Symposium on Security and Privacy, 2021. Cheng, H., Zhou, Y., Huang, X., and Yu, J. X. Clustering large attributed information networks: An efficient incremental computing approach. Data Mining and Knowledge Discovery (DMKD), 25(3):450–477, 2012. Brophy, J. and Lowd, D. Machine unlearning for random forests. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 1092– 1104. PMLR, 2021. Chourasia, R., Shah, N., and Shokri, R. Forget unlearning: Towards true data-deletion in machine learning. In 11th International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023, Conference Track Proceedings, 2023. Cao, X., Jia, J., Zhang, Z., and Gong, N. Z. Fedrecover: Recovering from poisoning attacks in federated learning using historical information. In 44th IEEE Symposium on Security and Privacy, SP 2023, San Francisco, CA, USA, May 22-25, 2023. IEEE, 2023. Chundawat, V. S., Tarun, A. K., Mandal, M., and Kankanhalli, M. S. Zero-shot machine unlearning. CoRR, abs/2201.05629, 2022. Cao, Y. and Yang, J. Towards making systems forget with machine unlearning. In 2015 IEEE Symposium on Security and Privacy, SP 2015, San Jose, CA, USA, May 17-21, 2015, pp. 463–480. IEEE Computer Society, 2015. Cohen, J. M., Rosenfeld, E., and Kolter, J. Z. Certified adversarial robustness via randomized smoothing. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 1310–1320, 2019. Cauwenberghs, G. and Poggio, T. A. Incremental and decremental support vector machine learning. In Leen, T. K., Dietterich, T. G., and Tresp, V. (eds.), Advances in Neural Information Processing Systems 13, Papers from Neural Information Processing Systems (NIPS) 2000, Denver, CO, USA, pp. 409–415. MIT Press, 2000. 
Dhar, S., Guo, J., Liu, J. J., Tripathi, S., Kurup, U., and Shah, M. A survey of on-device machine learning: An algorithms and learning theory perspective. ACM Trans. Internet Things, 2(3):15:1–15:49, 2021. Che, T., Zhang, Z., Zhou, Y., Zhao, X., Liu, J., Jiang, Z., Yan, D., Jin, R., and Dou, D. Federated fingerprint learning with heterogeneous architectures. In Proceedings of the 22nd IEEE International Conference on Data Mining (ICDM’22), pp. 31–40, Orlando, FL, November 28December 1 2022. Fraboni, Y., Vidal, R., Kameni, L., and Lorenzi, M. Sequential informed federated unlearning: Efficient and provable client unlearning in federated optimization. CoRR, abs/2211.11656, 2022. Fredrikson, M., Jha, S., and Ristenpart, T. Model inversion attacks that exploit confidence information and basic countermeasures. In Ray, I., Li, N., and Kruegel, C. (eds.), Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, Denver, CO, USA, October 12-16, 2015, pp. 1322–1333. ACM, 2015. Chen, C., Sun, F., Zhang, M., and Ding, B. Recommendation unlearning. In Laforest, F., Troncy, R., Simperl, E., Agarwal, D., Gionis, A., Herman, I., and Médini, L. (eds.), WWW ’22: The ACM Web Conference 2022, Virtual Event, Lyon, France, April 25 - 29, 2022, pp. 2768–2777. ACM, 2022. Fu, S., He, F., and Tao, D. Knowledge removal in samplingbased bayesian inference. In 10th International Conference on Learning Representations, ICLR 2022, Online, April 25-29, 2022, Conference Track Proceedings, 2022. Chen, M., Zhang, Z., Wang, T., Backes, M., Humbert, M., and Zhang, Y. When machine unlearning jeopardizes 10 Fast Federated Machine Unlearning with Nonlinear Functional Theory Gao, X., Ma, X., Wang, J., Sun, Y., Li, B., Ji, S., Cheng, P., and Chen, J. Verifi: Towards verifiable federated unlearning. CoRR, abs/2205.12709, 2022. using nosql. The Journal of Supercomputing (TJSC), 76 (9):6619–6647, 2020. Graves, L., Nagisetty, V., and Ganesh, V. Amnesiac machine learning. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pp. 11516–11524. AAAI Press, 2021. Garg, S., Goldwasser, S., and Vasudevan, P. N. Formalizing data deletion in the context of the right to be forgotten. In Canteaut, A. and Ishai, Y. (eds.), Advances in Cryptology - EUROCRYPT 2020 - 39th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Zagreb, Croatia, May 10-14, 2020, Proceedings, Part II, volume 12106 of Lecture Notes in Computer Science, pp. 373–402. Springer, 2020. Guo, C., Goldstein, T., Hannun, A. Y., and van der Maaten, L. Certified data removal from machine learning models. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 3832–3842. PMLR, 2020. Ginart, A., Guan, M. Y., Valiant, G., and Zou, J. Making AI forget you: Data deletion in machine learning. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’AlchéBuc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 3513–3526, 2019. Guo, G., Yan, D., Yuan, L., Khalil, J., Long, C., Jiang, Z., and Zhou, Y. 
Maximal directed quasi-clique mining. In Proceedings of the 38th IEEE International Conference on Data Engineering (ICDE’22), pp. 1900–1913, Kuala Lumpur, Malaysia, May 9-12 2022. Golatkar, A., Achille, A., and Soatto, S. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 9301–9309. Computer Vision Foundation / IEEE, 2020a. Gupta, V., Jung, C., Neel, S., Roth, A., Sharifi-Malvajerdi, S., and Waites, C. Adaptive machine unlearning. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp. 16319–16330, 2021. Golatkar, A., Achille, A., and Soatto, S. Forgetting outside the box: Scrubbing deep networks of information accessible from input-output observations. In Vedaldi, A., Bischof, H., Brox, T., and Frahm, J. (eds.), Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXIX, volume 12374 of Lecture Notes in Computer Science, pp. 383–398. Springer, 2020b. Halimi, A., Kadhe, S., Rawat, A., and Baracaldo, N. Federated unlearning: How to efficiently erase a client in fl? CoRR, abs/2207.05521, 2022. doi: 10.48550/arXiv. 2207.05521. URL https://doi.org/10.48550/ arXiv.2207.05521. Golatkar, A., Achille, A., Ravichandran, A., Polito, M., and Soatto, S. Mixed-privacy forgetting in deep networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pp. 792–801. Computer Vision Foundation / IEEE, 2021. He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 1026–1034, 2015. Gong, J., Simeone, O., and Kang, J. Bayesian variational federated learning and unlearning in decentralized networks. In 22nd IEEE International Workshop on Signal Processing Advances in Wireless Communications, SPAWC 2021, Lucca, Italy, September 27-30, 2021, pp. 216–220. IEEE, 2021a. Izzo, Z., Smart, M. A., Chaudhuri, K., and Zou, J. Approximate data deletion from machine learning models. In Banerjee, A. and Fukumizu, K. (eds.), The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, volume 130 of Proceedings of Machine Learning Research, pp. 2008–2016. PMLR, 2021. Gong, J., Simeone, O., Kassab, R., and Kang, J. Forget-svgd: Particle-based bayesian federated unlearning. CoRR, abs/2111.12056, 2021b. Jagielski, M., Thakkar, O., Tramèr, F., Ippolito, D., Lee, K., Carlini, N., Wallace, E., Song, S., Thakurta, A., Papernot, N., and Zhang, C. Measuring forgetting of memorized training examples. In 11th International Conference on Goswami, S., Pokhrel, A., Lee, K., Liu, L., Zhang, Q., and Zhou, Y. Graphmap: Scalable iterative graph processing 11 Fast Federated Machine Unlearning with Nonlinear Functional Theory Lee, K., Liu, L., Ganti, R. L., Srivatsa, M., Zhang, Q., Zhou, Y., and Wang, Q. Lightwieight indexing and querying services for big spatial data. IEEE Transactions on Services Computing (TSC), 12(3):343–355, 2019. Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023, Conference Track Proceedings, 2023. 
Jang, J., Yoon, D., Yang, S., Cha, S., Lee, M., Logeswaran, L., and Seo, M. Knowledge unlearning for mitigating privacy risks in language models. In 11th International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023, Conference Track Proceedings, 2023. Legislature, C. S. California consumer privacy act of 2018. cal. civ. code §1798.100, 2018. Li, G., Hu, Y., Zhang, M., Liu, J., Yin, Q., Peng, Y., and Dou, D. Fedhisyn: A hierarchical synchronous federated learning framework for resource and data heterogeneity. In Int. Conf. on Parallel Processing (ICPP), pp. 1–11, 2022. Jin, J., Ren, J., Zhou, Y., Lv, L., Liu, J., and Dou, D. Accelerated federated learning with decoupled adaptive optimization. In Proceedings of the 39th International Conference on Machine Learning (ICML’22), pp. 10298–10322, Baltimore, MD, July 17-23 2022a. Li, Y., Wang, C., and Cheng, G. Online forgetting process for linear regression models. In Banerjee, A. and Fukumizu, K. (eds.), The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, volume 130 of Proceedings of Machine Learning Research, pp. 217–225. PMLR, 2021. Jin, J., Zhang, Z., Zhou, Y., and Wu, L. Input-agnostic certified group fairness via gaussian parameter smoothing. In Proceedings of the 39th International Conference on Machine Learning (ICML’22), pp. 10340–10361, Baltimore, MD, July 17-23 2022b. Liu, G., Ma, X., Yang, Y., Wang, C., and Liu, J. Federaser: Enabling efficient client-level data removal from federated learning models. In 29th IEEE/ACM International Symposium on Quality of Service, IWQOS 2021, Tokyo, Japan, June 25-28, 2021, pp. 1–10. IEEE, 2021a. Jin, R., Li, D., Gao, J., Liu, Z., Chen, L., and Zhou, Y. Towards a better understanding of linear models for recommendation. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’21), pp. 776–785, Virtual Event, Singapore, August 14-18 2021. Liu, J., Wu, Z., Yu, D., Ma, Y., Feng, D., Zhang, M., Wu, X., Yao, X., and Dou, D. Heterps: Distributed deep learning with reinforcement learning based scheduling in heterogeneous environments. Future Generation Computer Systems, 20213. To appear. Karasuyama, M. and Takeuchi, I. Multiple incremental decremental learning of support vector machines. In Bengio, Y., Schuurmans, D., Lafferty, J. D., Williams, C. K. I., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada, pp. 907–915. Curran Associates, Inc., 2009. Liu, J., Huang, J., Zhou, Y., Li, X., Ji, S., Xiong, H., and Dou, D. From distributed machine learning to federated learning: A survey. Knowledge and Information Systems (KAIS), 64(4):885–917, 2022a. Liu, J., Jia, J., Ma, B., Zhou, C., Zhou, J., Zhou, Y., Dai, H., and Dou, D. Multi-job intelligent scheduling with cross-device federated learning. IEEE Transactions on Parallel and Distributed Systems (TPDS), 34(2):535–551, 2023. Khan, M. E. and Swaroop, S. Knowledge-adaptation priors. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp. 19757–19770, 2021. Liu, Y., Ma, Z., Liu, X., Liu, J., Jiang, Z., Ma, J., Yu, P., and Ren, K. 
Learn to forget: Machine unlearning via neuron masking. CoRR, abs/2003.10933, 2020. Krizhevsky, A. Learning multiple layers of features from tiny images. Technical Report, 2009. Liu, Y., Ma, Z., Yang, Y., Liu, X., Ma, J., and Ren, K. Revfrf: Enabling cross-domain random forest training with revocable federated learning. IEEE Transactions on Dependable and Secure Computing, 2021b. Lee, K., Liu, L., Schwan, K., Pu, C., Zhang, Q., Zhou, Y., Yigitoglu, E., and Yuan, P. Scaling iterative graph computations with graphmap. In Proceedings of the 27th IEEE international conference for High Performance Computing, Networking, Storage and Analysis (SC’15), pp. 57:1– 57:12, Austin, TX, November 15-20 2015. Liu, Y., Fan, M., Chen, C., Liu, X., Ma, Z., Wang, L., and Ma, J. Backdoor defense with machine unlearning. In 41st IEEE Conference on Computer Communications, INFOCOM 2022, Virtual, May 2-5, 2022. IEEE, 2022b. 12 Fast Federated Machine Unlearning with Nonlinear Functional Theory Liu, Y., Xu, L., Yuan, X., Wang, C., and Li, B. The right to be forgotten in federated learning: An efficient realization with rapid retraining. In 41st IEEE Conference on Computer Communications, INFOCOM 2022, Virtual, May 2-5, 2022. IEEE, 2022c. of the EU, C. Regulation (eu) 2016/679 of the european parliament and of the council of 27 april 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing directive 95/46/ec (general data protection regulation), 2016. Lu, X., Welleck, S., Jiang, L., Hessel, J., Qin, L., West, P., Ammanabrolu, P., and Choi, Y. Quark: Controllable text generation with reinforced unlearning. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022 (NeurIPS’22), New Orleans, LA, November 28December 9 2022. Palanisamy, B., Liu, L., Zhou, Y., and Wang, Q. Privacypreserving publishing of multilevel utility-controlled graph datasets. ACM Transactions on Internet Technology (TOIT), 18(2):24:1–24:21, 2018. Pan, C., Sima, J., Prakash, S., Rana, V., and Milenkovic, O. Machine unlearning of federated clusters. CoRR, abs/2210.16424, 2022. Mahadevan, A. and Mathioudakis, M. Certifiable machine unlearning for linear models. CoRR, abs/2106.15093, 2021. Pawelczyk, M., Leemann, T., Biega, A., and Kasneci, G. On the trade-off between actionable explanations and the right to be forgotten. In 11th International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023, Conference Track Proceedings, 2023. Marchant, N. G., Rubinstein, B. I. P., and Alfeld, S. Hard to forget: Poisoning attacks on certified machine unlearning. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, (AAAI’22), February 22-March 1, 2022, Vancouver, Canada, 2022. Ren, J., Zhou, Y., Jin, R., Zhang, Z., Dou, D., and Wang, P. Dual adversarial learning based network alignment. In Proceedings of the 19th IEEE International Conference on Data Mining (ICDM’19), pp. 1288–1293, Beijing, China, November 8-11 2019. McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Singh, A. and Zhu, X. J. (eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, volume 54 of Proceedings of Machine Learning Research, pp. 1273–1282. PMLR, 2017. 
Ren, J., Zhang, Z., Jin, J., Zhao, X., Wu, S., Zhou, Y., Shen, Y., Che, T., Jin, R., and Dou, D. Integrated defense for resilient graph matching. In Proceedings of the 38th International Conference on Machine Learning (ICML’21), pp. 8982–8997, Virtual Event, July 18-24 2021. Neel, S., Roth, A., and Sharifi-Malvajerdi, S. Descent-todelete: Gradient-based methods for machine unlearning. In Feldman, V., Ligett, K., and Sabato, S. (eds.), Algorithmic Learning Theory, 16-19 March 2021, Virtual Conference, Worldwide, volume 132 of Proceedings of Machine Learning Research, pp. 931–962. PMLR, 2021. Riesz, F. and Sz.-Nagy, B. Functional Analysis. Dover Publications, 1955. Schelter, S. ”amnesia” - machine learning models that can forget user data very fast. In 10th Conference on Innovative Data Systems Research, CIDR 2020, Amsterdam, The Netherlands, January 12-15, 2020, Online Proceedings. www.cidrdb.org, 2020. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011, 2011. Schelter, S., Grafberger, S., and Dunning, T. Hedgecut: Maintaining randomised trees for low-latency machine unlearning. In Li, G., Li, Z., Idreos, S., and Srivastava, D. (eds.), SIGMOD ’21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021, pp. 1545–1557. ACM, 2021. Nguyen, Q. P., Low, B. K. H., and Jaillet, P. Variational bayesian unlearning. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. Sekhari, A., Acharya, J., Kamath, G., and Suresh, A. T. Remember what you want to forget: Algorithms for machine unlearning. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp. 18075–18086, 2021. Nguyen, Q. P., Oikawa, R., Divakaran, D. M., Chan, M. C., and Low, B. K. H. Markov chain monte carlo-based machine unlearning: Unlearning what needs to be forgotten. In Proceedings of the 2022 ACM Asia Conference on Computer and Communications Security (ASIA CCS ’22), May 30-June 3, 2022, Nagasaki, Japan, 2022. 13 Fast Federated Machine Unlearning with Nonlinear Functional Theory Setlur, A., Eysenbach, B., Smith, V., and Levine, S. Adversarial unlearning: Reducing confidence along adversarial directions. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022 (NeurIPS’22), New Orleans, LA, November 28-December 9 2022. Wang, J., Guo, S., Xie, X., and Qi, H. Protect privacy from gradient leakage attack in federated learning. In IEEE INFOCOM 2022 - IEEE Conference on Computer Communications, London, United Kingdom, May 2-5, 2022, pp. 580–589. IEEE, 2022a. Wang, J., Guo, S., Xie, X., and Qi, H. Federated unlearning via class-discriminative pruning. In Laforest, F., Troncy, R., Simperl, E., Agarwal, D., Gionis, A., Herman, I., and Médini, L. (eds.), WWW ’22: The ACM Web Conference 2022, Virtual Event, Lyon, France, April 25 - 29, 2022, pp. 622–632. ACM, 2022b. Shibata, T., Irie, G., Ikami, D., and Mitsuzumi, Y. Learning with selective forgetting. In Zhou, Z. 
(ed.), Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, pp. 989–996. ijcai.org, 2021. Warnecke, A., Pirch, L., Wressnegger, C., and Rieck, K. Get a model! model hijacking attack against machine learning models. In 30th Annual Network and Distributed System Security Symposium, NDSS 2022, San Diego, California, USA, February 28- March 4, 2023. The Internet Society, 2023. Shokri, R., Stronati, M., Song, C., and Shmatikov, V. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22-26, 2017, pp. 3–18. IEEE Computer Society, 2017. Su, Z., Liu, L., Li, M., Fan, X., and Zhou, Y. Servicetrust: Trust management in service provision networks. In Proceedings of the 10th IEEE International Conference on Services Computing (SCC’13), pp. 272–279, Santa Clara, CA, June 27-July 2 2013. Wu, C., Zhu, S., and Mitra, P. Federated unlearning with knowledge distillation. CoRR, abs/2201.09441, 2022a. Wu, G., Hashemi, M., and Srinivasa, C. PUMA: performance unchanged model augmentation for training data removal. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, (AAAI’22), February 22-March 1, 2022, Vancouver, Canada, 2022b. Su, Z., Liu, L., Li, M., Fan, X., and Zhou, Y. Reliable and resilient trust management in distributed service provision networks. ACM Transactions on the Web (TWEB), 9(3): 1–37, 2015. Wu, L., Guo, S., Wang, J., Hong, Z., Zhang, J., and Ding, Y. Federated unlearning: Guarantee the right of clients to forget. IEEE Netw., 36(5):129–135, 2022c. Suriyakumar, V. M. and Wilson, A. C. Algorithms that approximate data removal: New results and limitations. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022 (NeurIPS’22), New Orleans, LA, November 28-December 9 2022. Wu, S., Li, Y., Zhang, D., Zhou, Y., and Wu, Z. Diverse and informative dialogue generation with context-specific commonsense knowledge awareness. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20), pp. 5811–5820, Online, July 5-10 2020a. Thudi, A., Deza, G., Chandrasekaran, V., and Papernot, N. Unrolling SGD: understanding factors influencing machine unlearning. CoRR, abs/2109.13398, 2021a. Wu, S., Li, Y., Zhang, D., Zhou, Y., and Wu, Z. Topicka: Generating commonsense knowledge-aware dialogue responses towards the recommended topic fact. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI’20), pp. 3766–3772, Online, January 7-15 2021a. Thudi, A., Jia, H., Shumailov, I., and Papernot, N. On the necessity of auditable algorithmic definitions for machine unlearning. CoRR, abs/2110.11891, 2021b. Ullah, E., Mai, T., Rao, A., Rossi, R. A., and Arora, R. Machine unlearning via algorithmic stability. In Belkin, M. and Kpotufe, S. (eds.), Conference on Learning Theory, COLT 2021, 15-19 August 2021, Boulder, Colorado, USA, volume 134 of Proceedings of Machine Learning Research, pp. 4126–4142. PMLR, 2021. Wu, S., Wang, M., Zhang, D., Zhou, Y., Li, Y., and Wu, Z. Knowledge-aware dialogue generation via hierarchical infobox accessing and infobox-dialogue interaction graph network. In Proceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI’21), pp. 3964–3970, Virtual Event / Montreal, Canada, August 19-27 2021b. Veale, M., Binns, R., and Edwards, L. 
Algorithms that remember: Model inversion attacks and data protection law. CoRR, abs/1807.04644, 2018. Wu, Y., Dobriban, E., and Davidson, S. B. Deltagrad: Rapid retraining of machine learning models. In Proceedings 14 Fast Federated Machine Unlearning with Nonlinear Functional Theory Zhang, Z., Zhang, Z., Zhou, Y., Wu, L., Wu, S., Han, X., Dou, D., Che, T., and Yan, D. Adversarial attack against cross-lingual knowledge graph alignment. In Proceedings of the 26th Conference on Empirical Methods in Natural Language Processing (EMNLP’21), pp. 5320–5337, Virtual Event / Punta Cana, Dominican Republic, November 7-11 2021c. of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 10355–10366. PMLR, 2020b. Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017. Zhang, Z., Zhou, Y., Zhao, X., Che, T., and Lyu, L. Prompt certified machine unlearning with randomized gradient smoothing and quantization. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022 (NeurIPS’22), New Orleans, LA, November 28December 9 2022b. Yan, D., Qu, W., Guo, G., Wang, X., and Zhou, Y. Prefixfpm: A parallel framework for general-purpose mining of frequent and closed patterns. The VLDB Journal (VLDBJ), 31(2):253–286, 2022a. Yan, D., Zhou, Y., and Guo, G. Think-like-a-task programming model. In Albert Zomaya, J. T. and Sakr, S. (eds.), Encyclopedia of Big Data Technologies. Springer, 2022b. Zhao, X., Zhang, Z., Zhang, Z., Wu, L., Jin, J., Zhou, Y., Jin, R., Dou, D., and Yan, D. Expressive 1-lipschitz neural networks for robust multiple graph learning against adversarial attacks. In Proceedings of the 38th International Conference on Machine Learning (ICML’21), pp. 12719–12735, Virtual Event, July 18-24 2021. Yan, D., Zhou, Y., Guo, G., and Liu, H. Parallel graph processing. In Albert Zomaya, J. T. and Sakr, S. (eds.), Encyclopedia of Big Data Technologies. Springer, 2022c. Yuan, W., Yin, H., Wu, F., Zhang, S., He, T., and Wang, H. Federated unlearning for on-device recommendation. CoRR, abs/2210.10958, 2022. Zhou, C., Liu, J., Jia, J., Zhou, J., Zhou, Y., Dai, H., and Dou, D. Efficient device scheduling with multi-job federated learning. In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI’22), pp. 9971–9979, Vancouver, Canada, February 22-March 1 2022a. Zeng, Y., Chen, S., Park, W., Mao, Z., Jin, M., and Jia, R. Adversarial unlearning of backdoors via implicit hypergradient. In 10th International Conference on Learning Representations, ICLR 2022, Online, April 25-29, 2022, Conference Track Proceedings, 2022. Zhou, Y. Innovative Mining, Processing, and Application of Big Graphs. PhD thesis, Georgia Institute of Technology, Atlanta, GA, USA, 2017. Zhang, G., Zhou, Y., Wu, S., Zhang, Z., and Dou, D. Cross-lingual entity alignment with adversarial kernel embedding and adversarial knowledge translation. CoRR, abs/2104.07837, 2021a. Zhou, Y. and Liu, L. Clustering analysis in large graphs with rich attributes. In Holmes, D. E. and Jain, L. C. (eds.), Data Mining: Foundations and Intelligent Paradigms: Volume 1: Clustering, Association and Classification. Springer, 2012. Zhang, H., Liu, J., Jia, J., Zhou, Y., Dai, H., and Dou, D. Fedduap: Federated learning with dynamic update and adaptive pruning using shared data on the server. 
In Proceedings of the 31st International Joint Conference on Artificial Intelligence (IJCAI’22), pp. 2776–2782, Messe Wien, Vienna, Austria, July 23-29 2022a. Zhou, Y. and Liu, L. Social influence based clustering of heterogeneous information networks. In Proceedings of the 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’13), pp. 338–346, Chicago, IL, August 11-14 2013. Zhang, Z., Zhang, Z., Zhou, Y., Shen, Y., Jin, R., and Dou, D. Adversarial attacks on deep graph matching. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020 (NeurIPS’20), Virtual, December 6-12 2020. Zhou, Y. and Liu, L. Activity-edge centric multi-label classification for mining heterogeneous information networks. In Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’14), pp. 1276–1285, New York, NY, August 24-27 2014. Zhang, Z., Jin, J., Zhang, Z., Zhou, Y., Zhao, X., Ren, J., Liu, J., Wu, L., Jin, R., and Dou, D. Validating the lottery ticket hypothesis with inertial manifold theory. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021 (NeurIPS’21), pp. 30196–30210, Virtual, December 6-14 2021b. Zhou, Y. and Liu, L. Social influence based clustering and optimization over heterogeneous information networks. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(1):1–53, 2015. 15 Fast Federated Machine Unlearning with Nonlinear Functional Theory Zhou, Y. and Liu, L. Approximate deep network embedding for mining large-scale graphs. In Proceedings of the 2019 IEEE International Conference on Cognitive Machine Intelligence (CogMI’19), pp. 53–60, Los Angeles, CA, December 12-14 2019. International Conference on Data Mining (ICDM’18), pp. 1464–1469, Singapore, November 17-20 2018b. Zhou, Y., Jiang, C., Zhang, Z., Dou, D., Jin, R., and Wang, P. Integrating local vertex/edge embedding via deep matrix fusion and siamese multi-label classification. In Proceedings of the 2019 IEEE International Conference on Big Data (BigData’19), pp. 1018–1027, Los Angeles, CA, December 9-12 2019a. Zhou, Y., Cheng, H., and Yu, J. X. Graph clustering based on structural/attribute similarities. Proceedings of the VLDB Endowment (PVLDB), 2(1):718–729, 2009. Zhou, Y., Ling Liu, Qi Zhang, K. L., and Palanisamy, B. Enhancing collaborative filtering with multi-label classification. In Proceedings of the 2019 International Conference on Computational Data and Social Networks (CSoNet’19), pp. 323–338, Ho Chi Minh City, Vietnam, November 18-20 2019b. Zhou, Y., Cheng, H., and Yu, J. X. Clustering large attributed graphs: An efficient incremental approach. In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM’10), pp. 689–698, Sydney, Australia, December 14-17 2010. Zhou, Y., Seshadri, S., Chiu, L., and Liu, L. Graphlens: Mining enterprise storage workloads using graph analytics. In Proceedings of the 2014 IEEE International Congress on Big Data (BigData’14), pp. 1–8, Anchorage, AK, June 27-July 2 2014. Zhou, Y., Ren, J., Wu, S., Dou, D., Jin, R., Zhang, Z., and Wang, P. Semi-supervised classification-based local vertex ranking via dual generative adversarial nets. In Proceedings of the 2019 IEEE International Conference on Big Data (BigData’19), pp. 1267–1273, Los Angeles, CA, December 9-12 2019c. Zhou, Y., Liu, L., and Buttler, D. Integrating vertex-centric clustering with edge-centric clustering for meta path graph analysis. 
In Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’15), pp. 1563–1572, Sydney, Australia, August 10-13 2015a. Zhou, Y., Liu, L., Lee, K., Palanisamy, B., and Zhang, Q. Improving collaborative filtering with social influence over heterogeneous information networks. ACM Transactions on Internet Technology (TOIT), 20(4):36:1–36:29, 2020a. Zhou, Y., Liu, L., Lee, K., Pu, C., and Zhang, Q. Fast iterative graph computation with resource aware graph parallel abstractions. In Proceedings of the 24th ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC’15), pp. 179–190, Portland, OR, June 15-19 2015b. Zhou, Y., Ren, J., Dou, D., Jin, R., Zheng, J., and Lee, K. Robust meta network embedding against adversarial attacks. In Proceedings of the 20th IEEE International Conference on Data Mining (ICDM’20), pp. 1448–1453, Sorrento, Italy, November 17-20 2020b. Zhou, Y., Ren, J., Jin, R., Zhang, Z., Dou, D., and Yan, D. Unsupervised multiple network alignment with multinominal gan and variational inference. In Proceedings of the 2020 IEEE International Conference on Big Data (BigData’20), pp. 868–877, Atlanta, GA, December 10-13 2020c. Zhou, Y., Liu, L., Lee, K., and Zhang, Q. Graphtwist: Fast iterative graph computation with two-tier optimizations. Proceedings of the VLDB Endowment (PVLDB), 8(11): 1262–1273, 2015c. Zhou, Y., Liu, L., Seshadri, S., and Chiu, L. Analyzing enterprise storage workloads with graph modeling and clustering. IEEE Journal on Selected Areas in Communications (JSAC), 34(3):551–574, 2016. Zhou, Y., Zhang, Z., Wu, S., Sheng, V., Han, X., Zhang, Z., and Jin, R. Robust network alignment via attack signal scaling and adversarial perturbation elimination. In Proceedings of the 30th Web Conference (WWW’21), pp. 3884–3895, Virtual Event / Ljubljana, Slovenia, April 19-23 2021. Zhou, Y., Amimeur, A., Jiang, C., Dou, D., Jin, R., and Wang, P. Density-aware local siamese autoencoder network embedding with autoencoder graph clustering. In Proceedings of the 2018 IEEE International Conference on Big Data (BigData’18), pp. 1162–1167, Seattle, WA, December 10-13 2018a. Zhou, Y., Ren, J., Jin, R., Zhang, Z., Zheng, J., Jiang, Z., Yan, D., and Dou, D. Unsupervised adversarial network alignment with reinforcement learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 16(3): 50:1–50:29, 2022b. Zhou, Y., Wu, S., Jiang, C., Zhang, Z., Dou, D., Jin, R., and Wang, P. Density-adaptive local edge representation learning with generative adversarial network multi-label edge classification. In Proceedings of the 18th IEEE 16 Fast Federated Machine Unlearning with Nonlinear Functional Theory A. Appendix A.1. Theoretical Proof Theorem 2. O is Fréchet differentiable and O(R3 ) ⊂ Lp is a smooth manifold. Proof. According to Definition 4, in order to prove that an operator o is Fréchet differentiable, we first need to find a bounded linear operator L(x) and then demonstrate that the limit condition in the definition is satisfied, i.e., Y → 0. lim ko(x+h)−o(x)−L(x)∆k k∆kX k∆kX →0 Without loss of generality, we prove this theorem when σ = 1. As N (0, σ 2 I) = σ −3 N (0, I) in R3 , it is easy to extend the proof process to other σ values by following the same strategy. For ease of presentation, we rewrite O(q) as follows. O(q)(·) = Z 1 (2π)3/2 P (ε)(·)e− kq−εk2 2 2 dε (16) R3 A linear operator L(q)() : R3 → Lp defined as follows. 
$$L(q)(\Delta)(\cdot) = -\frac{1}{(2\pi)^{3/2}} \int_{\mathbb{R}^3} P(\varepsilon)(\cdot)\, e^{-\frac{\|q-\varepsilon\|_2^2}{2}}\, (q-\varepsilon)^\top \Delta \, d\varepsilon \tag{17}$$

where $L(q)$ is the bounded linear operator serving as the Fréchet derivative of $O$ at $q$ and $\Delta$ denotes a perturbation of $q$. We compute

$$\frac{d}{ds}\, e^{-\frac{\|q+s\Delta-\varepsilon\|_2^2}{2}} = -e^{-\frac{\|q+s\Delta-\varepsilon\|_2^2}{2}}\, (q+s\Delta-\varepsilon)^\top \Delta \tag{18}$$

By integrating Eq. (18) from 0 to 1, we get

$$e^{-\frac{\|q+\Delta-\varepsilon\|_2^2}{2}} - e^{-\frac{\|q-\varepsilon\|_2^2}{2}} = -\int_0^1 e^{-\frac{\|q+s\Delta-\varepsilon\|_2^2}{2}}\, (q+s\Delta-\varepsilon)^\top \Delta \, ds \tag{19}$$

Next, we compute

$$\|O(q+\Delta)(\cdot) - O(q)(\cdot) - L(q)(\Delta)(\cdot)\|_{L^p(\Omega)}
= \bigg\| \frac{1}{(2\pi)^{3/2}} \int_{\mathbb{R}^3} P(\varepsilon)(\cdot)\Big( e^{-\frac{\|q+\Delta-\varepsilon\|_2^2}{2}} - e^{-\frac{\|q-\varepsilon\|_2^2}{2}} + e^{-\frac{\|q-\varepsilon\|_2^2}{2}}\,(q-\varepsilon)^\top \Delta \Big) d\varepsilon \bigg\|_{L^p(\Omega)} \tag{20}$$

By combining Eq. (19) and Eq. (20), we obtain

$$\begin{aligned}
&\|O(q+\Delta)(\cdot) - O(q)(\cdot) - L(q)(\Delta)(\cdot)\|_{L^p(\Omega)} \\
&= \bigg\| \frac{1}{(2\pi)^{3/2}} \int_{\mathbb{R}^3} P(\varepsilon)(\cdot) \int_0^1 \Big( e^{-\frac{\|q-\varepsilon\|_2^2}{2}}\,(q-\varepsilon) - e^{-\frac{\|q+s\Delta-\varepsilon\|_2^2}{2}}\,(q+s\Delta-\varepsilon) \Big)^{\!\top} \Delta \, ds \, d\varepsilon \bigg\|_{L^p(\Omega)} \\
&= \bigg\| \frac{1}{(2\pi)^{3/2}} \int_{\mathbb{R}^3} P(\varepsilon)(\cdot) \int_0^1 \Big( \big( e^{-\frac{\|q+s\Delta-\varepsilon\|_2^2}{2}} - e^{-\frac{\|q-\varepsilon\|_2^2}{2}} \big)(q-\varepsilon) + e^{-\frac{\|q+s\Delta-\varepsilon\|_2^2}{2}}\, s\Delta \Big)^{\!\top} \Delta \, ds \, d\varepsilon \bigg\|_{L^p(\Omega)}
\end{aligned} \tag{21}$$

In terms of Fubini's theorem, we have

$$\|O(q+\Delta)(\cdot) - O(q)(\cdot) - L(q)(\Delta)(\cdot)\|_{L^p(\Omega)}
= \bigg\| \frac{1}{(2\pi)^{3/2}} \int_0^1 \int_{\mathbb{R}^3} P(\varepsilon)(\cdot) \Big( \big( e^{-\frac{\|q+s\Delta-\varepsilon\|_2^2}{2}} - e^{-\frac{\|q-\varepsilon\|_2^2}{2}} \big)(q-\varepsilon) + e^{-\frac{\|q+s\Delta-\varepsilon\|_2^2}{2}}\, s\Delta \Big)^{\!\top} \Delta \, d\varepsilon \, ds \bigg\|_{L^p(\Omega)} \tag{22}$$

Notice that

$$\frac{1}{(2\pi)^{3/2}} \int_{\mathbb{R}^3} e^{-\frac{\|q+s\Delta-\varepsilon\|_2^2}{2}}\, \|\Delta\|_2^2 \, d\varepsilon
= \frac{1}{(2\pi)^{3/2}} \int_{\mathbb{R}^3} e^{-\frac{\|\varepsilon\|_2^2}{2}}\, \|\Delta\|_2^2 \, d\varepsilon = \|\Delta\|_2^2 \tag{23}$$

It is obvious that $|P(\varepsilon)(x)| \le 1$ for any $\varepsilon$ and $x$. Thus, we have

$$\bigg\| \frac{1}{(2\pi)^{3/2}} \int_0^1 \int_{\mathbb{R}^3} P(\varepsilon)(\cdot)\, e^{-\frac{\|q+s\Delta-\varepsilon\|_2^2}{2}}\, s\,\Delta^\top \Delta \, d\varepsilon \, ds \bigg\|_{L^p(\Omega)} \le \|1\|_{L^p(\Omega)}\, \|\Delta\|_2^2 \tag{24}$$

Similar to Eq. (19), we derive

$$e^{-\frac{\|q+s\Delta-\varepsilon\|_2^2}{2}} - e^{-\frac{\|q-\varepsilon\|_2^2}{2}} = -\int_0^1 e^{-\frac{\|q+st\Delta-\varepsilon\|_2^2}{2}}\, (q+st\Delta-\varepsilon)^\top s\Delta \, dt \tag{25}$$

In addition, we calculate

$$\begin{aligned}
&\bigg\| \frac{1}{(2\pi)^{3/2}} \int_0^1 \int_{\mathbb{R}^3} P(\varepsilon)(\cdot)\big( e^{-\frac{\|q+s\Delta-\varepsilon\|_2^2}{2}} - e^{-\frac{\|q-\varepsilon\|_2^2}{2}} \big)(q-\varepsilon)^\top \Delta \, d\varepsilon \, ds \bigg\|_{L^p(\Omega)} \\
&= \bigg\| \frac{1}{(2\pi)^{3/2}} \int_0^1 \int_0^1 \int_{\mathbb{R}^3} P(\varepsilon)(\cdot)\, e^{-\frac{\|q+st\Delta-\varepsilon\|_2^2}{2}}\, \big( (q+st\Delta-\varepsilon)^\top s\Delta \big)\big( (q-\varepsilon)^\top \Delta \big) \, d\varepsilon \, ds \, dt \bigg\|_{L^p(\Omega)} \\
&\le \bigg\| \frac{1}{(2\pi)^{3/2}} \int_0^1 \int_0^1 \int_{\mathbb{R}^3} e^{-\frac{\|q+st\Delta-\varepsilon\|_2^2}{2}}\, \|q+st\Delta-\varepsilon\|_2\, \|q-\varepsilon\|_2\, \|\Delta\|_2^2 \, d\varepsilon \, ds \, dt \bigg\|_{L^p(\Omega)}
\end{aligned} \tag{26}$$

According to the dominated convergence theorem, when $\Delta \to 0$, we have

$$\int_{\mathbb{R}^3} e^{-\frac{\|q+st\Delta-\varepsilon\|_2^2}{2}}\, \|q+st\Delta-\varepsilon\|_2\, \|q-\varepsilon\|_2 \, d\varepsilon \to \int_{\mathbb{R}^3} e^{-\frac{\|q-\varepsilon\|_2^2}{2}}\, \|q-\varepsilon\|_2^2 \, d\varepsilon \tag{27}$$

By plugging Eq. (27) into Eq. (26), when $\Delta \to 0$, we derive

$$\bigg\| \frac{1}{(2\pi)^{3/2}} \int_0^1 \int_0^1 \int_{\mathbb{R}^3} P(\varepsilon)(\cdot)\big( e^{-\frac{\|q+s\Delta-\varepsilon\|_2^2}{2}} - e^{-\frac{\|q-\varepsilon\|_2^2}{2}} \big)(q-\varepsilon)^\top \Delta \, d\varepsilon \, ds \, dt \bigg\|_{L^p(\Omega)}
\le \bigg\| \frac{1}{(2\pi)^{3/2}} \int_{\mathbb{R}^3} e^{-\frac{\|q'\|_2^2}{2}}\, \|q'\|_2^2 \, dq' \bigg\|_{L^p(\Omega)} \|\Delta\|_2^2 \tag{28}$$

By integrating Eq. (24) and Eq. (28), we get

$$\|O(q+\Delta)(\cdot) - O(q)(\cdot) - L(q)(\Delta)(\cdot)\|_{L^p(\Omega)} = O(\|\Delta\|_2^2) \tag{29}$$

When $\Delta \to 0$, we therefore have $\|O(q+\Delta)(\cdot) - O(q)(\cdot) - L(q)(\Delta)(\cdot)\|_{L^p(\Omega)} / \|\Delta\|_2 = O(\|\Delta\|_2) \to 0$. This implies that the limit condition in Definition 4 is satisfied and thus $L(q)$ is the Fréchet derivative of $O$. By following a similar strategy, we can derive that $O(q)$ is infinitely Fréchet differentiable. Notice that $O(q)$ can be treated as a global chart for the manifold $O(\mathbb{R}^3)$. Thus, $O(\mathbb{R}^3)$ is a smooth manifold.

Theorem 3. Let $\bar{G}_k \in \mathbb{R}^T$ be the gradient of the local ML model on edge device $k$, $Q$ be the gradient quantizer, $C$ be the Lipschitz constant of $Q$, $q_k = Q(\bar{G}_k)$, and $O$ be the smooth Nemytskii operator. If $\|\bar{G}_k - \bar{G}_l\|_2 \le d$ for any $q_k, q_l \in \mathbb{R}^3$, then we have

$$|\Omega|^{-1}\, \|O(q_k)(x) - O(q_l)(x)\|_{L^p(\Omega)} \le \frac{Cd}{\sqrt{2\pi}\,\sigma} \;\text{ if } 1 \le p < \infty, \qquad
\|O(q_k)(x) - O(q_l)(x)\|_{L^p(\Omega)} \le \frac{Cd}{\sqrt{2\pi}\,\sigma} \;\text{ if } p = \infty \tag{30}$$

for any $\Omega \subset \mathbb{R}^n$.

Proof. Again, without loss of generality, we prove this theorem when $\sigma = 1$. By following the same strategy, it is easy to extend the proof process to other $\sigma$ values. For ease of presentation, we rewrite $O(q)$ as follows.
$$O(q)(\cdot) = \frac{1}{(2\pi)^{3/2}} \int_{\mathbb{R}^3} P(\varepsilon)(\cdot)\, e^{-\frac{\|q-\varepsilon\|_2^2}{2}} \, d\varepsilon \tag{31}$$

We compute

$$\|O(q_k)(\cdot) - O(q_l)(\cdot)\|_{L^p(\Omega)} = \bigg\| \frac{1}{(2\pi)^{3/2}} \int_{\mathbb{R}^3} P(\varepsilon)(\cdot)\Big( e^{-\frac{\|q_k-\varepsilon\|_2^2}{2}} - e^{-\frac{\|q_l-\varepsilon\|_2^2}{2}} \Big) d\varepsilon \bigg\|_{L^p(\Omega)} \tag{32}$$

Notice that

$$\frac{d}{ds}\, e^{-\frac{\|q_l-\varepsilon+s(q_k-q_l)\|_2^2}{2}} = -e^{-\frac{\|q_l-\varepsilon+s(q_k-q_l)\|_2^2}{2}}\, \big(q_l-\varepsilon+s(q_k-q_l)\big)^\top (q_k-q_l) \tag{33}$$

By integrating Eq. (33) from 0 to 1, we get

$$e^{-\frac{\|q_k-\varepsilon\|_2^2}{2}} - e^{-\frac{\|q_l-\varepsilon\|_2^2}{2}} = -\int_0^1 e^{-\frac{\|q_l-\varepsilon+s(q_k-q_l)\|_2^2}{2}}\, \big(q_l-\varepsilon+s(q_k-q_l)\big)^\top (q_k-q_l) \, ds \tag{34}$$

By combining Eq. (32) and Eq. (34), we obtain

$$\|O(q_k)(\cdot) - O(q_l)(\cdot)\|_{L^p(\Omega)} = \bigg\| \frac{1}{(2\pi)^{3/2}} \int_{\mathbb{R}^3} P(\varepsilon)(\cdot) \int_0^1 e^{-\frac{\|q_l-\varepsilon+s(q_k-q_l)\|_2^2}{2}}\, \big(q_l-\varepsilon+s(q_k-q_l)\big)^\top (q_k-q_l) \, ds \, d\varepsilon \bigg\|_{L^p(\Omega)} \tag{35}$$

In terms of Fubini's theorem, we have

$$\|O(q_k)(\cdot) - O(q_l)(\cdot)\|_{L^p(\Omega)} = \bigg\| \frac{1}{(2\pi)^{3/2}} \int_0^1 \int_{\mathbb{R}^3} P(\varepsilon)(\cdot)\, e^{-\frac{\|q_l-\varepsilon+s(q_k-q_l)\|_2^2}{2}}\, \big(q_l-\varepsilon+s(q_k-q_l)\big)^\top (q_k-q_l) \, d\varepsilon \, ds \bigg\|_{L^p(\Omega)} \tag{36}$$

For ease of presentation, let $U = q_l + s(q_k-q_l)$ and $V = q_k - q_l$. We rewrite Eq. (36) as follows.

$$\|O(q_k)(\cdot) - O(q_l)(\cdot)\|_{L^p(\Omega)}
= \bigg\| \frac{1}{(2\pi)^{3/2}} \int_0^1 \int_{\mathbb{R}^3} P(\varepsilon)(\cdot)\, e^{-\frac{\|U-\varepsilon\|_2^2}{2}}\, (U-\varepsilon)^\top V \, d\varepsilon \, ds \bigg\|_{L^p(\Omega)}
= \bigg\| \frac{1}{(2\pi)^{3/2}} \int_0^1 \int_{\mathbb{R}^3} P(\varepsilon)(\cdot)\, e^{-\frac{\|U-\varepsilon\|_2^2}{2}}\, (U-\varepsilon)^\top \frac{V}{\|V\|_2} \, d\varepsilon \, ds \bigg\|_{L^p(\Omega)} \|V\|_2 \tag{37}$$

Notice that $0 \le P(q)(x) \le 1$ for any $q$ and $x$ and that $\frac{V}{\|V\|_2}$ is a unit vector. By utilizing standard Gaussian integrals, for any unit vector $u$, we have

$$\bigg\| \frac{1}{(2\pi)^{3/2}} \int_0^1 \int_{\mathbb{R}^3} P(\varepsilon)(\cdot)\, e^{-\frac{\|U-\varepsilon\|_2^2}{2}}\, (U-\varepsilon)^\top u \, d\varepsilon \, ds \bigg\|_{L^p(\Omega)}
\le \bigg\| \frac{1}{(2\pi)^{3/2}} \int_0^1 \int_{\mathbb{R}^3} e^{-\frac{\|U-\varepsilon\|_2^2}{2}}\, |(U-\varepsilon)^\top u| \, d\varepsilon \, ds \bigg\|_{L^p(\Omega)}
= \bigg\| \frac{1}{(2\pi)^{3/2}} \int_0^1 \int_{\mathbb{R}^3} e^{-\frac{\|\varepsilon\|_2^2}{2}}\, |\varepsilon^\top u| \, d\varepsilon \, ds \bigg\|_{L^p(\Omega)} \tag{38}$$

We rotate the coordinate system such that the unit vector $u$ coincides with a coordinate axis. Let $u, v_1, v_2$ be the standard basis of the rotated coordinate system. In other words, for any $\varepsilon \in \mathbb{R}^3$, we decompose $\varepsilon$ into

$$\varepsilon = (\varepsilon^\top u)\, u + \sum_{l=1}^{2} (\varepsilon^\top v_l)\, v_l \tag{39}$$

As the Jacobian of the rotation is 1, by employing standard Gaussian integrals, we obtain

$$\int_{\mathbb{R}^3} e^{-\frac{\|\varepsilon\|_2^2}{2}}\, |\varepsilon^\top u| \, d\varepsilon
= \int_{\mathbb{R}^3} e^{-\frac{|\varepsilon^\top u|^2}{2}}\, e^{-\frac{\sum_{l=1}^{2}|\varepsilon^\top v_l|^2}{2}}\, |\varepsilon^\top u| \, d\varepsilon
= \int_{\mathbb{R}^3} e^{-\frac{\tau_1^2}{2}}\, e^{-\frac{\tau_2^2+\tau_3^2}{2}}\, |\tau_1| \, d\tau_1\, d\tau_2\, d\tau_3 = 2\,(2\pi)^{2/2} \tag{40}$$

Accordingly, we derive

$$\bigg\| \frac{1}{(2\pi)^{3/2}} \int_0^1 \int_{\mathbb{R}^3} P(\varepsilon)(\cdot)\, e^{-\frac{\|U-\varepsilon\|_2^2}{2}}\, (U-\varepsilon)^\top u \, d\varepsilon \, ds \bigg\|_{L^p(\Omega)} = \frac{\|1\|_{L^p(\Omega)}}{(2\pi)^{1/2}} \tag{41}$$

Thus, when $1 \le p < \infty$, we obtain

$$\|O(q_k)(\cdot) - O(q_l)(\cdot)\|_{L^p(\Omega)} \le \frac{1}{(2\pi)^{1/2}}\, |\Omega|\, \|q_k - q_l\|_2 = \frac{1}{\sqrt{2\pi}}\, |\Omega|\, \|Q(\bar{G}_k) - Q(\bar{G}_l)\|_2 \le \frac{C}{\sqrt{2\pi}}\, |\Omega|\, \|\bar{G}_k - \bar{G}_l\|_2 \le \frac{Cd}{\sqrt{2\pi}}\, |\Omega| \tag{42}$$

When $p = \infty$, we get

$$\|O(q_k)(\cdot) - O(q_l)(\cdot)\|_{L^\infty(\Omega)} \le \frac{1}{(2\pi)^{1/2}}\, \|q_k - q_l\|_2 \tag{43}$$

and hence

$$\|O(q_k)(\cdot) - O(q_l)(\cdot)\|_{L^\infty(\Omega)} \le \frac{1}{\sqrt{2\pi}}\, \|Q(\bar{G}_k) - Q(\bar{G}_l)\|_2 \le \frac{C}{\sqrt{2\pi}}\, \|\bar{G}_k - \bar{G}_l\|_2 \le \frac{Cd}{\sqrt{2\pi}} \tag{44}$$

Therefore, the proof is concluded.

Theorem 4. Let $\bar{G}_k \in \mathbb{R}^T$ be the gradient of the local ML model on edge device $k$, $Q$ be the gradient quantizer, $C$ be the Lipschitz constant of $Q$, $q_k = Q(\bar{G}_k)$, $O$ be the smooth Nemytskii operator, and $O(q_k)(x)$ be the smooth local model on device $k$. Let $\bar{G} = \frac{\bar{G}_1 + \cdots + \bar{G}_K}{K}$ be the gradient of the global ML model on the server obtained by averaging the gradients of all the local ML models, $q = Q(\bar{G})$, and $O(q)(x)$ be the smooth global model on the server. If $\max_{1 \le k, l \le K} \|\bar{G}_k - \bar{G}_l\|_2 = d$, then we have

$$|\Omega|^{-1}\, \|O(q)(x) - O(q_k)(x)\|_{L^p(\Omega)} \le \frac{(K-1)\,Cd}{\sqrt{2\pi}\,K\sigma} \;\text{ if } 1 \le p < \infty, \qquad
\|O(q)(x) - O(q_k)(x)\|_{L^p(\Omega)} \le \frac{(K-1)\,Cd}{\sqrt{2\pi}\,K\sigma} \;\text{ if } p = \infty \tag{45}$$

for any $\Omega \subset \mathbb{R}^n$.

Proof.
Similar to the conclusion of Theorem 3, we have

$$|\Omega|^{-1}\, \|O(q)(x) - O(q_k)(x)\|_{L^p(\Omega)} \le \frac{C}{\sqrt{2\pi}\,\sigma}\, \|\bar{G} - \bar{G}_k\|_2 \;\text{ if } 1 \le p < \infty, \qquad
\|O(q)(x) - O(q_k)(x)\|_{L^p(\Omega)} \le \frac{C}{\sqrt{2\pi}\,\sigma}\, \|\bar{G} - \bar{G}_k\|_2 \;\text{ if } p = \infty \tag{46}$$

According to the definition of $\bar{G} = \frac{\bar{G}_1 + \cdots + \bar{G}_K}{K}$, for any $k \in \{1, \cdots, K\}$ we have $\bar{G} - \bar{G}_k = \sum_{l=1, l \ne k}^{K} \frac{\bar{G}_l - \bar{G}_k}{K}$. Thus, based on the triangle inequality of the norm, we obtain

$$\|\bar{G} - \bar{G}_k\|_2 = \bigg\| \sum_{l=1, l \ne k}^{K} \frac{\bar{G}_l - \bar{G}_k}{K} \bigg\|_2 \le \sum_{l=1, l \ne k}^{K} \frac{\|\bar{G}_l - \bar{G}_k\|_2}{K} \le \frac{(K-1)\,d}{K} \tag{47}$$

Therefore, the proof is concluded.

Theorem 5. Let $\bar{G}$ be the gradient of the global ML model on the server, $Q$ be the gradient quantizer, $q = Q(\bar{G})$, and $O$ and $P$ be the smooth and non-smooth Nemytskii operators, respectively. Then $O(q)(x) \to P(q)(x)$ for any data sample $x \in \mathbb{R}^n$ as $\sigma \to 0$.

Proof. For ease of presentation, let $\varepsilon = \sigma\tau$. We rewrite $O(q)$ as follows.

$$O(q)(x) = \frac{1}{(2\pi)^{3/2}} \int_{\mathbb{R}^3} P(q + \sigma\tau)(x)\, e^{-\frac{\|\tau\|_2^2}{2}} \, d\tau \tag{48}$$

It is obvious that $|P(q+\sigma\tau)(x)\, e^{-\frac{\|\tau\|_2^2}{2}}| \le e^{-\frac{\|\tau\|_2^2}{2}}$ and that $e^{-\frac{\|\tau\|_2^2}{2}}$ is integrable. According to the continuity of $P(q+\sigma\tau)(x)$, we obtain $P(q+\sigma\tau)(x) \to P(q)(x)$ as $\sigma \to 0$. In terms of the dominated convergence theorem, we have the following conclusion.

$$\frac{1}{(2\pi)^{3/2}} \int_{\mathbb{R}^3} P(q + \sigma\tau)(x)\, e^{-\frac{\|\tau\|_2^2}{2}} \, d\tau \to \frac{1}{(2\pi)^{3/2}} \int_{\mathbb{R}^3} P(q)(x)\, e^{-\frac{\|\tau\|_2^2}{2}} \, d\tau = P(q)(x) \quad \text{as } \sigma \to 0 \tag{49}$$

Therefore, the proof is concluded.

A.2. Additional Experiments

Machine unlearning performance and running time with varying ratios of data removal. Tables 5-15 report the classification accuracy, errors, training time, and unlearning time obtained by twelve federated machine unlearning approaches when the ratio of unlearning requests (data removals) varies between 2% and 20% on the Fashion-MNIST, CIFAR-10, and SVHN datasets, respectively. Similar trends are observed for the comparison of federated machine unlearning effectiveness and efficiency in these tables: our FFMU method achieves the smallest absolute performance difference from the Retrain model, with gaps in Accuracy (<3%), Errort (<3%), Errorr (<3%), and Errorf (<2%) on the three datasets, respectively. Our FFMU method also achieves better efficiency than most baseline methods in most experiments, since it performs a one-time operation of simultaneous training and unlearning when addressing a series of federated machine unlearning requests. These results demonstrate that FFMU is both effective and efficient for the federated machine unlearning problem. This advantage is important for entitling data owners to the right to have their private data removed from trained complex models at their request in a timely and cost-efficient manner, especially in privacy-critical applications that usually require near-zero tolerance of data leaking.
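As a concrete illustration of the smoothing operator analyzed above, the following minimal sketch (our own illustration, not the released FFMU code) estimates $O(q)(x) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, \sigma^2 I)}[P(q+\varepsilon)(x)]$ by Monte Carlo sampling and prints the output gap between two quantized gradients next to the certified constant from Eqs. (43)-(44). The base map `P`, the hard-threshold quantizer, and all numeric values are illustrative assumptions.

```python
# A minimal, self-contained sketch (our illustration, not the released FFMU code):
# it estimates the Gaussian-smoothed Nemytskii operator
#     O(q)(x) = E_{eps ~ N(0, sigma^2 I)}[ P(q + eps)(x) ]
# by Monte Carlo and prints the output gap between two quantized gradients q_k, q_l
# next to the certified constant ||q_k - q_l||_2 / (sqrt(2*pi) * sigma) from Eqs. (43)-(44).
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1           # smoothing standard deviation (a sweep value from A.3)
lam = sigma ** 2      # quantization threshold lambda (default value in Table 16)

def quantize(g, lam):
    """Toy hard-threshold quantizer Q: maps each coordinate to {-lam, 0, +lam}."""
    return lam * np.sign(g) * (np.abs(g) > lam)

def P(q, x):
    """Stand-in non-smooth base map P(q)(x) with outputs in [0, 1] (illustrative only)."""
    return float(np.heaviside(np.dot(q, x[: q.size]), 0.5))

def O(q, x, eps):
    """Monte Carlo estimate of the smoothed operator O(q)(x) with shared noise samples."""
    return np.mean([P(q + e, x) for e in eps])

# Two nearby local gradients (e.g., from two edge devices) and a probe input x.
g_k = rng.standard_normal(3)
g_l = g_k + 0.2 * rng.standard_normal(3)
x = rng.standard_normal(8)

q_k, q_l = quantize(g_k, lam), quantize(g_l, lam)
eps = sigma * rng.standard_normal((20000, 3))   # common random numbers reduce MC variance
gap = abs(O(q_k, x, eps) - O(q_l, x, eps))
certified = np.linalg.norm(q_k - q_l) / (np.sqrt(2 * np.pi) * sigma)
print(f"empirical gap {gap:.4f}  vs  certified constant {certified:.4f}")
```

Sampling error is not accounted for here, so the printed gap is only indicative; the point of the sketch is that Gaussian smoothing of the quantized gradients turns a hard, non-smooth base map into a function whose output changes can be bounded in terms of $\|q_k - q_l\|_2$.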
21 Fast Federated Machine Unlearning with Nonlinear Functional Theory Table 5: Performance with 5% data removal and CNN on Fashion-MNIST Metric Retrain Knowledge Distillation Rapid Retraining MacForget FedEraser VeriFi Class-Discriminative Pruning UN RCAD IJ Noisy-GD FFMU Accuracy 84.49 81.61 81.70 80.52 82.07 82.40 80.54 80.35 80.92 82.76 83.00 84.55 Performance Errort Errorr 15.51 14.71 18.39 16.98 18.30 17.99 19.48 17.88 17.93 17.11 17.60 16.76 19.46 18.27 19.65 19.07 19.08 18.77 17.24 16.65 17.00 15.53 15.45 14.87 Errorf 16.10 21.27 14.34 20.37 20.83 19.15 19.30 23.19 17.66 14.48 14.36 15.37 Training 193 212 230 227 361 212 215 205 331 232 229 1,288 Runtime (s) Unlearning 920 1,442 954 1,025 1,766 1,036 871 1,053 3,192 964 959 0 Total 1,113 1,654 1,184 1,252 2,127 1,248 1,032 1,258 3,523 1,196 1,188 1,288 Table 6: Performance with 8% data removal and CNN on Fashion-MNIST Metric Retrain Knowledge Distillation Rapid Retraining MacForget FedEraser VeriFi Class-Discriminative Pruning UN RCAD IJ Noisy-GD FFMU Accuracy 84.75 81.65 82.78 82.09 83.78 82.10 81.66 80.38 78.86 81.56 80.35 84.55 Performance Errort Errorr 15.25 14.45 18.35 17.18 17.22 15.57 17.91 17.03 16.22 15.66 17.90 16.85 18.34 17.42 19.62 18.82 21.14 20.50 18.44 17.95 19.65 18.61 15.45 14.86 Errorf 15.58 19.24 17.71 20.09 16.84 16.47 19.04 18.00 20.92 17.08 19.40 15.38 Training 198 218 201 222 358 205 240 210 341 228 206 1,288 Runtime (s) Unlearning 1,416 2,201 1,487 1,592 2,828 1,629 1,353 1,618 5,047 1,536 1,547 0 Total 1,614 2,419 1,688 1,814 3,186 1,834 1,593 1,828 5,388 1,764 1,753 1,288 Table 7: Performance with 15% data removal and CNN on Fashion-MNIST Metric Retrain Knowledge Distillation Rapid Retraining MacForget FedEraser VeriFi Class-Discriminative Pruning UN RCAD IJ Noisy-GD FFMU Accuracy 84.66 82.35 83.23 81.83 83.77 82.59 82.63 82.06 80.31 82.26 82.38 84.55 Performance Errort Errorr 15.34 14.84 17.65 16.20 16.77 15.32 18.17 16.46 16.23 15.94 17.41 15.94 17.37 16.23 17.94 16.69 19.69 18.65 17.74 16.99 17.62 16.68 15.45 14.84 22 Errorf 14.84 18.67 18.66 18.26 15.95 16.87 16.69 16.67 19.14 18.03 16.10 14.76 Training 188 220 191 237 364 232 236 217 344 228 222 1,288 Runtime (s) Unlearning 2,378 4,098 2,790 2,722 5,349 2,989 2,553 2,963 5,312 2,742 2,833 0 Total 2,566 4,318 2,981 2,959 5,713 3,221 2,789 3,180 5,656 2,970 3,055 1,288 Fast Federated Machine Unlearning with Nonlinear Functional Theory Table 8: Performance with 5% data removal and LeNet on CIFAR-10 Metric Retrain Knowledge Distillation Rapid Retraining MacForget FedEraser VeriFi Class-Discriminative Pruning UN RCAD IJ Noisy-GD FFMU Accuracy 54.77 52.60 53.74 51.51 52.75 49.36 51.33 53.79 51.13 51.41 52.84 54.84 Performance Errort Errorr 45.23 41.56 47.40 46.35 46.26 44.65 48.49 47.88 47.25 45.85 50.64 49.61 48.67 47.50 46.21 43.89 48.87 48.45 48.59 47.28 47.16 45.40 45.16 42.99 Errorf 48.04 50.31 47.48 44.93 49.43 44.30 53.32 42.98 42.27 47.20 44.18 48.60 Training 169 144 141 140 489 145 174 170 312 184 145 986 Runtime (s) Unlearning 846 991 697 719 1,988 735 565 721 3,346 697 740 0 Total 1,015 1,135 838 859 2,477 880 739 891 3,658 881 885 986 Table 9: Performance with 8% data removal and LeNet on CIFAR-10 Metric Retrain Knowledge Distillation Rapid Retraining MacForget FedEraser VeriFi Class-Discriminative Pruning UN RCAD IJ Noisy-GD FFMU Accuracy 53.87 51.23 49.49 50.60 51.97 49.15 51.40 51.99 50.27 52.59 52.26 54.84 Performance Errort Errorr 46.13 42.41 48.77 47.73 50.51 46.59 49.40 45.75 48.03 46.56 50.85 50.06 48.60 46.98 48.01 46.36 49.73 49.78 
47.41 47.67 47.74 45.82 45.16 42.86 Errorf 48.12 49.38 50.43 51.82 45.34 46.36 44.52 51.07 46.06 45.97 43.27 48.02 Training 168 145 141 145 480 140 165 181 313 176 148 986 Runtime (s) Unlearning 1,336 1,600 1,086 1,158 3,088 1,176 905 1,179 5,427 1,132 1,207 0 Total 1,504 1,745 1,227 1,303 3,568 1,316 1,070 1,360 5,740 1,308 1,355 986 Table 10: Performance with 15% data removal and LeNet on CIFAR-10 Metric Retrain Knowledge Distillation Rapid Retraining MacForget FedEraser VeriFi Class-Discriminative Pruning UN RCAD IJ Noisy-GD FFMU Accuracy 54.49 50.98 51.34 50.94 53.38 52.03 50.89 52.53 49.77 50.96 52.94 54.84 Performance Errort Errorr 45.51 42.57 49.02 48.59 48.66 44.10 49.06 47.65 46.62 44.74 47.97 47.02 49.11 47.28 47.47 44.43 50.23 49.52 49.04 48.65 47.06 46.08 45.16 42.67 23 Errorf 45.92 47.26 47.20 48.23 47.80 47.06 48.74 48.88 48.32 48.50 46.90 46.34 Training 169 144 142 142 476 141 165 187 334 178 150 986 Runtime (s) Unlearning 2,399 2,950 2,044 2,016 5,660 2,127 1,653 2,156 5,638 2,066 2,217 0 Total 2,568 3,094 2,186 2,158 6,136 2,268 1,818 2,343 5,972 2,244 2,367 986 Fast Federated Machine Unlearning with Nonlinear Functional Theory Table 11: Performance with 5% data removal and ResNet-18 on SVHN Metric Retrain Knowledge Distillation Rapid Retraining MacForget FedEraser VeriFi Class-Discriminative Pruning UN RCAD IJ Noisy-GD FFMU Accuracy 90.62 87.57 88.46 88.66 78.10 88.12 86.53 88.56 78.49 87.94 89.01 90.71 Performance Errort Errorr 9.38 9.50 12.43 12.21 11.54 11.75 11.34 11.80 21.90 23.36 11.88 11.51 13.47 12.52 11.44 11.42 21.51 21.97 12.06 11.99 10.99 11.92 9.29 9.69 Errorf 10.60 12.97 13.00 13.51 22.52 12.09 14.05 12.00 21.84 13.41 12.26 10.25 Training 341 380 325 311 5,543 350 382 354 1,008 324 302 1,826 Runtime (s) Unlearning 1,744 1,957 1,792 1,739 6,701 1,036 1,004 1,830 1,743 1,000 1,006 0 Total 2,085 2,337 2,117 2,050 12,244 1,386 1,386 2,184 2,751 1,324 1,308 1,826 Table 12: Performance with 8% data removal and ResNet-18 on SVHN Metric Retrain Knowledge Distillation Rapid Retraining MacForget FedEraser VeriFi Class-Discriminative Pruning UN RCAD IJ Noisy-GD FFMU Accuracy 90.27 86.51 87.67 88.07 80.39 87.95 86.07 88.18 78.93 87.82 87.91 90.71 Performance Errort Errorr 9.73 9.52 13.49 13.43 12.33 12.29 11.93 12.03 19.61 21.85 12.05 12.08 13.93 13.61 11.82 11.48 21.07 21.14 12.18 12.05 12.09 12.20 9.29 9.67 Errorf 10.92 14.64 13.26 13.82 20.46 13.70 12.84 12.35 21.52 12.66 13.04 10.55 Training 341 385 320 307 5,788 328 328 342 1,011 324 304 1,826 Runtime (s) Unlearning 2,768 3,152 2,865 2,918 10,494 1,629 1,661 2,922 1,824 1,702 1,617 0 Total 3,109 3,537 3,185 3,225 16,282 1,957 1,989 3,264 2,835 2,026 1,921 1,826 Table 13: Performance with 10% data removal and ResNet-18 on SVHN Metric Retrain Knowledge Distillation Rapid Retraining MacForget FedEraser VeriFi Class-Discriminative Pruning UN RCAD IJ Noisy-GD FFMU Accuracy 90.93 86.07 87.14 87.97 77.16 87.65 88.49 87.42 80.35 87.72 87.13 90.71 Performance Errort Errorr 9.07 9.25 13.93 10.25 12.86 13.50 12.03 12.86 22.84 12.32 12.35 22.82 11.51 11.77 12.58 11.65 19.65 12.22 12.28 19.85 12.87 12.67 9.29 9.28 24 Errorf 10.41 13.69 13.19 13.71 22.93 13.14 8.29 12.80 19.94 14.35 13.67 11.10 Training 341 374 317 310 5,793 328 386 345 1,013 326 310 1,826 Runtime (s) Unlearning 3,331 4,622 3,382 3,603 13,180 1,942 2,054 3,578 1,860 2,224 2,081 0 Total 3,672 4,996 3,699 3,913 18,973 2,270 2,440 3,923 2,873 2,550 2,391 1,826 Fast Federated Machine Unlearning with Nonlinear Functional Theory Table 14: Performance with 15% data 
removal and ResNet-18 on SVHN Metric Retrain Knowledge Distillation Rapid Retraining MacForget FedEraser VeriFi Class-Discriminative Pruning UN RCAD IJ Noisy-GD FFMU Accuracy 90.46 86.20 87.78 86.99 77.81 87.94 87.23 86.42 80.51 87.53 88.38 90.71 Performance Errort Errorr 9.54 9.22 13.80 14.12 12.22 11.70 13.01 12.75 22.19 22.98 12.06 11.96 12.77 13.19 13.58 12.83 19.49 19.40 12.47 12.88 11.62 13.27 9.29 9.68 Errorf 10.37 15.27 12.33 13.54 23.02 13.09 12.66 13.36 20.03 13.72 13.06 10.47 Training 341 393 316 308 5,855 324 324 347 1,005 324 308 1,826 Runtime (s) Unlearning 4,712 6,954 5,069 5,418 18,942 2,989 3,120 5,393 1,784 3,698 3,163 0 Total 5,053 7,347 5,385 5,726 24,797 3,313 3,444 5,740 2,789 4,022 3,471 1,826 Table 15: Performance with 20% data removal and ResNet-18 on SVHN Metric Retrain Knowledge Distillation Rapid Retraining MacForget FedEraser VeriFi Class-Discriminative Pruning UN RCAD IJ Noisy-GD FFMU Accuracy 90.12 85.20 88.05 86.47 80.63 87.23 87.84 88.22 79.11 87.11 87.88 90.71 Performance Errort Errorr 9.88 9.16 14.80 14.89 11.95 12.00 13.53 13.26 19.37 20.59 12.77 12.48 12.16 13.03 11.78 11.51 20.89 21.04 12.89 12.32 12.12 12.60 9.29 9.68 Errorf 10.63 15.95 12.81 14.26 19.52 13.67 8.55 11.97 20.98 13.82 12.32 10.38 Training 341 379 319 310 5,737 322 334 355 1,010 335 308 1,826 Runtime (s) Unlearning 6,256 9,106 6,608 7,682 24,363 3,908 3,994 7,356 2,033 5,293 4,153 0 Total 6,597 9,485 6,927 7,992 30,100 4,230 4,328 7,711 3,043 5,628 4,461 1,826 A.3. Parameter Sensitivity In this section, we conduct more experiments to validate the sensitivity of various parameters in our FFMU method for the federated machine unlearning task. Impact of standard deviation. Figure 3 (a) and (b) measure the effect of standard deviation of the Gaussian distribution in the randomized gradient smoothing for federated machine unlearning on Errorr and Errorf by varying σ from 0.025 to 0.3. The error scores achieved by the Retrain model keep unchanged with varying σ. We have observed similar results in these two figures: The error curves by FFMU initially decrease quickly and then become stable when σ continuously increases. A suitable σ can help utilize the randomized gradient smoothing and quantization for directly training a federated machine unlearning model in advance. A too large σ beyond some thresholds does not affect the performance of machine unlearning any more. Influence of training sample percentage. Figure 4 (a) shows the influence of training sample percentage in our FFMU model by varying it from 20% to 100%. We make the observations on the quality by three machine unlearning methods. (1) The accuracy by our FFMU model is very close to that of the Retrain method in most experiments. (2) The performance curves keep increasing when the number of training samples increases. (3) FFMU outperforms FFMU-N in most tests with the smallest accuracy difference with the Retrain method. When there are many training samples available (≥ 60%), the quality improvement by FFMU is obvious. A reasonable explanation is more training data makes FFMU be more resilient to machine unlearning under suitable ratios of data removals. 25 Fast Federated Machine Unlearning with Nonlinear Functional Theory (a) Errorr (b) Errorf Figure 3: Errors with varying standard deviation on three datasets (a) Training Sample Percentage (b) Training Epoch (c) Batch Size (d) Learning Rate Figure 4: Performance with varying parameters on three datasets Impact of training epochs. 
Figure 4 (b) exhibits the sensitivity of our FFMU model to the number of training epochs by varying it from 40 to 200. As we can see, the performance curves continuously increase with more training epochs. This is consistent with the fact that more training epochs make the image classification models more resilient to machine unlearning under suitable ratios of data removals. It is observed that the accuracy scores oscillate within a range of 8.7% on the three datasets.

Sensitivity of batch size. Figure 4 (c) exhibits the sensitivity of our FFMU model to the batch size of the federated machine unlearning models by varying it from 30 to 70. It is observed that the performance curves remain relatively stable when we continuously change the batch size. This demonstrates that our FFMU method is insensitive to the batch size used for machine unlearning. No matter what the batch size is, our FFMU method achieves superior performance in all tests, showing its effectiveness for machine unlearning.

Influence of learning rates. Figure 4 (d) shows the influence of the learning rate on our FFMU model by varying it from 0.04 to 0.12. We observe that the accuracy initially rises when the learning rate increases. Intuitively, a larger learning rate can help the algorithm quickly find the optimal solution and thus improve the quality of machine unlearning. Later on, the performance curves decrease quickly when the learning rate continues to increase. A reasonable explanation is that an overly large learning rate may step over the optimal solution during the search process. Thus, it is important to determine a suitable learning rate for machine unlearning.

A.4. Experimental Details

Environment. The experiments were conducted on a compute server running Red Hat Enterprise Linux 7.2 with 2 Intel Xeon E5-2650 v4 CPUs (at 2.66 GHz), 8 NVIDIA GeForce RTX 2080 Ti GPUs (each with 11GB of GDDR6 on a 352-bit memory bus and memory bandwidth in the neighborhood of 620GB/s), 256GB of RAM, and 1TB of HDD. Overall, the experiments took about 5 days in a shared-resource setting. We expect that a consumer-grade single-GPU machine (e.g., with a 2080 Ti GPU) could complete the full set of experiments in around 7-8 days if its full resources were dedicated. The code was implemented in Python 3.7.3 and PyTorch 1.0.14. We also employ NumPy 1.16.4 and SciPy 1.3.0 in the implementation. Since the datasets used are all public and our methodologies and hyperparameter settings are explicitly described in Sections 3, 4, and A.4, our code and experiments can be easily reproduced on top of a GPU server. We promise to release our open-source code on GitHub and maintain a project website with detailed documentation for long-term access by other researchers and end-users after the paper is accepted.

Training. We study image classification networks on three standard image datasets: Fashion-MNIST¹, CIFAR-10², and SVHN³. These three image datasets are all public datasets, which researchers are allowed to use for non-commercial research and educational purposes. We use 60,000 examples as training data and 10,000 examples as test data for Fashion-MNIST. We train the machine unlearning model on the CIFAR-10 training set and test it on the CIFAR-10 test set. We use 73,257 digits as training data and 26,032 digits as test data for SVHN.
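For reference, a minimal sketch of loading these splits with torchvision's standard dataset classes is given below (our own illustration; the released code may organize data loading differently).

```python
# A minimal sketch of the data splits described above, using torchvision's standard
# dataset loaders (our illustration; the paper's released code may differ).
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# Fashion-MNIST: 60,000 training / 10,000 test examples.
fmnist_train = datasets.FashionMNIST("data", train=True, download=True, transform=to_tensor)
fmnist_test = datasets.FashionMNIST("data", train=False, download=True, transform=to_tensor)

# CIFAR-10: standard 50,000 / 10,000 split.
cifar_train = datasets.CIFAR10("data", train=True, download=True, transform=to_tensor)
cifar_test = datasets.CIFAR10("data", train=False, download=True, transform=to_tensor)

# SVHN: 73,257 training / 26,032 test digits.
svhn_train = datasets.SVHN("data", split="train", download=True, transform=to_tensor)
svhn_test = datasets.SVHN("data", split="test", download=True, transform=to_tensor)

print(len(fmnist_train), len(cifar_train), len(svhn_train))  # 60000 50000 73257
```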
We train a convolutional neural network (CNN) on Fashion-MNIST for clothing classification, LeNet on CIFAR-10 for image classification, and the ResNet-18 architecture on SVHN for street view house number identification. The neural networks are trained with Kaiming initialization (He et al., 2015) using SGD for 120 epochs with an initial learning rate of 0.05 and batch size 500. The learning rate is decayed by a factor of 0.1 at 1/2 and 3/4 of the total number of epochs. In addition, we run each experiment for 3 trials to obtain more stable results.

Implementation. To the best of our knowledge, there are only two federated machine unlearning algorithms with open-source implementations: FedEraser⁴ and RCAD⁵. We used the same model architectures as the official open-source implementations and the default parameter settings provided by the original authors for machine unlearning in all experiments. All hyperparameters are standard values from reference code or prior works. For the other federated learning or federated optimization approaches, including Knowledge Distillation, Rapid Retraining, MacForget, FedEraser, VeriFi, Class-Discriminative Pruning, UN, IJ, and Noisy-GD, to the best of our knowledge there are no publicly available open-source implementations. We did our best to implement these approaches based on the algorithm descriptions in the original papers. All hyperparameters are standard values from the reference papers. We validate the performance of different federated machine unlearning methods over a range of data removal ratios {5%, 8%, 10%, 15%, 20%}. The above open-source code from GitHub is licensed under the MIT License, which only requires preservation of copyright and license notices and permits commercial use, modification, distribution, and private use.

For our FFMU model, we performed hyperparameter selection via a parameter sweep over the standard deviation σ ∈ {0.025, 0.05, 0.1, 0.2, 0.3, 0.5, 1} in the Gaussian distribution, the quantization threshold λ ∈ {σ²/4, σ²/2, σ², 2σ², 4σ²}, the ratio of data removals ∈ {5%, 8%, 10%, 15%, 20%}, the local epochs of the machine unlearning model ∈ {1, 2, 3, 4, 5}, the global epochs of the machine unlearning model ∈ {40, 80, 120, 160, 200}, the batch size for training the model ∈ {30, 40, 50, 60, 70}, and the learning rate ∈ {0.04, 0.06, 0.08, 0.1, 0.12}. We select the best parameters over 50 epochs of training and evaluate the model at test time.

Hyperparameter settings. Unless otherwise explicitly stated, we used the following default parameter settings in the experiments.

¹ https://github.com/zalandoresearch/fashion-mnist
² https://www.cs.toronto.edu/~kriz/cifar.html
³ http://ufldl.stanford.edu/housenumbers/
⁴ https://www.dropbox.com/s/1lhx962axovbbom/FedEraser-Code.zip?dl=0
⁵ https://github.com/ars22/RCAD-regularizer

Table 16: Hyperparameter Settings

Parameter                                             Value
Training data on Fashion-MNIST                        60,000
Test data on Fashion-MNIST                            10,000
Training data on CIFAR-10                             50,000
Test data on CIFAR-10                                 10,000
Training data on SVHN                                 73,257
Test data on SVHN                                     26,032
Number of edge devices                                100
Standard deviation σ in the Gaussian distribution     0.1
Quantization threshold λ                              σ²
Ratio of data removals                                20%
Local epochs of the machine unlearning model          2
Global epochs of the machine unlearning model         200
Batch size for training the model                     50
Learning rate                                         0.1
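As a companion to the defaults in Table 16, the following hedged PyTorch sketch summarizes the local training recipe described earlier in this section (SGD with an initial learning rate of 0.05 for 120 epochs, Kaiming initialization, and a step decay by 0.1 at 1/2 and 3/4 of the epochs); `build_model` and `train_loader` are illustrative placeholders, and the released implementation may differ in its details.

```python
# A hedged PyTorch sketch of the local training recipe described in A.4
# (SGD, lr 0.05, 120 epochs, decay x0.1 at 1/2 and 3/4 of the epochs, Kaiming init).
# `build_model` and `train_loader` are illustrative placeholders.
import torch
import torch.nn as nn

def kaiming_init(module):
    # He initialization for convolutional and linear layers (He et al., 2015).
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

def train(build_model, train_loader, epochs=120, lr=0.05, device="cuda"):
    model = build_model().to(device)
    model.apply(kaiming_init)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    # Decay the learning rate by 0.1 at 1/2 and 3/4 of the total number of epochs.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[epochs // 2, 3 * epochs // 4], gamma=0.1)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
        scheduler.step()
    return model
```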
A.5. Potential Negative Societal Impacts and Limitations

In this work, the three image datasets are all openly released datasets (Xiao et al., 2017; Krizhevsky, 2009; Netzer et al., 2011), which researchers are allowed to use for non-commercial research and educational purposes. These three datasets are widely used for training and evaluating image classification models. All baseline code consists of openly accessible resources from GitHub licensed under the MIT License, which only requires preservation of copyright and license notices and permits commercial use, modification, distribution, and private use.

To the best of our knowledge, motivated by the idea of PCMU, this work is the first to simultaneously execute the training and unlearning operations to improve the unlearning efficiency with a machine unlearning certificate in the federated setting, by leveraging the theory of nonlinear functional analysis, including the Nemytskii operator and smooth manifolds.

Many machine learning applications need to collect massive amounts of data from third parties for model training. This raises a legitimate privacy risk: training data can be practically reconstructed from models (Fredrikson et al., 2015; Shokri et al., 2017; Veale et al., 2018; Bourtoule et al., 2021; Mahadevan & Mathioudakis, 2021; Marchant et al., 2022; Chen et al., 2022). In addition, modern privacy regulations, such as the European Union's General Data Protection Regulation (GDPR) (of the EU, 2016) and the California Consumer Privacy Act (CCPA) (Legislature, 2018), enforce the right to be forgotten, i.e., they entitle data owners to the right to have their private data removed at their request (Marchant et al., 2022; Liu et al., 2022c; Chundawat et al., 2022). Our framework is able to resolve data removal requests in a timely and cost-efficient manner, and can serve as an important building block for a wide variety of privacy-critical applications that usually require near-zero tolerance of data leaking, such as financial and health data analyses.

This paper is primarily of a theoretical nature. We expect our findings to produce a positive impact, i.e., to significantly improve the efficiency of federated machine unlearning models by simultaneously training and unlearning in advance. To the best of our knowledge, we do not envision any immediate negative societal impacts of our results, such as security, privacy, or fairness issues. An important product of this paper is to explore the possibility of simultaneous training and unlearning in advance as well as one-time federated unlearning. Because the randomized data smoothing and gradient quantization method involves high-dimensional double integrals or non-integrable mappings between samples and labels, the randomized gradient smoothing and quantization approach is designed to produce high-confidence certificates for certified federated machine unlearning. Our theoretical framework can inspire further development and improved implementations of fast federated machine unlearning with better applicability and efficiency at academic institutions and industrial research labs.