Bag of Tricks for Training Data Extraction from Language Models

Weichen Yu*1, Tianyu Pang2, Qian Liu2, Chao Du2, Bingyi Kang2, Yan Huang1, Min Lin2, Shuicheng Yan2

Abstract

With the advance of language models, privacy protection is receiving more attention. Training data extraction is therefore of great importance, as it can serve as a potential tool to assess privacy leakage. However, due to the difficulty of this task, most of the existing methods are proof-of-concept and still not effective enough. In this paper, we investigate and benchmark tricks for improving training data extraction using a publicly available dataset. Because most existing extraction methods use a pipeline of generating-then-ranking, i.e., generating text candidates as potential training data and then ranking them based on specific criteria, our research focuses on the tricks for both text generation (e.g., sampling strategy) and text ranking (e.g., token-level criteria). The experimental results show that several previously overlooked tricks can be crucial to the success of training data extraction. Based on the GPT-Neo 1.3B evaluation results, our proposed tricks outperform the baseline by a large margin in most cases, providing a much stronger baseline for future research. The code is available at https://github.com/weichen-yu/LM-Extraction.

Figure 1. Overview of the bag of tricks explored in this work, with an evaluation of precision (MP). Bars in pink denote the methods in the improved suffix generation, and bars in orange denote the methods in the improved suffix ranking. The dashed bars indicate the best method in each category. [Bar chart of MP (%) for Sampling Strategy (Sec 5.1), Prob. Dist. Adjustment (Sec 5.2), Exp. Bias Reduction (Sec 5.3), Look-ahead (Sec 5.4), Sen. Level Criteria (Sec 6.1), and Tok. Level Criteria (Sec 6.2).]

1. Introduction

Recent advances in language models (LMs) have led to impressive performance in a variety of downstream language tasks (Kenton & Toutanova, 2019; Brown et al., 2020). It has been demonstrated, however, that training data can be extracted from LMs due to memorization effects (Kenton & Toutanova, 2019; Carlini et al., 2019; Feldman, 2020; Brown et al., 2020). These training data may contain sensitive information such as names, email addresses, phone numbers, and physical addresses, resulting in privacy leakage that hinders the widespread adoption of LMs (Carlini et al., 2021; 2022; Lehman et al., 2021). As privacy has become an important issue of public concern, a crucial topic is to develop efficient methods for evaluating privacy leakage. Thus, the focus of our research is the adversarial task of training data extraction from LMs, a relatively new area of study (Carlini et al., 2021; Lehman et al., 2021).

Existing extraction methods have yielded successful records, but there are instances in which these methods are even less effective than simply selecting the most popular entity based on a prior score. In addition, successful data extraction requires a high generation ratio, i.e., a large number of candidates must be generated and ranked in order to identify a single successful instance. These suboptimal results suggest that, despite the viability of training data extraction and the pioneering methods demonstrated in previous research, this task is still relatively new, with an abundance of problems to solve.

*Work done during an internship at Sea AI Lab.
1 Institute of Automation, Chinese Academy of Sciences. 2 Sea AI Lab. Correspondence to: Tianyu Pang, Qian Liu, Yan Huang.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

In this study, we aim to develop techniques for efficient training data extraction. We adhere to the criteria of the recent Training Data Extraction Challenge,1 which employs a 1.3B parameter GPT-Neo model (Black et al., 2021) for targeted extraction of 1-eidetic memorized data. Targeted extraction refers to the scenario where a prefix of the data is provided, such as 'Yu's phone number is', and the adversary attempts to recover the suffix '12345'. According to Carlini et al. (2021), κ-eidetic memorization is defined as the capacity of a language model to memorize a string that appears κ times in the training material. Targeted and 1-eidetic extraction poses a greater risk and is more challenging than the non-targeted and κ-eidetic (for κ > 1) settings.

Through ablation studies, we assess a variety of simple techniques in natural language processing (NLP) and empirically evaluate their impact on successful extraction rates, as in Figure 1. Our empirical analysis reveals that the extraction performance may be sensitive to the experimental setup. With proper settings, the results show that several previously overlooked tricks can contribute to significant improvements in training data extraction. Based on the GPT-Neo 1.3B evaluation results, our proposed tricks outperform the baseline by a large margin in most cases, providing a much stronger baseline for future research. Nonetheless, utilizing more than one training data extraction trick does not necessarily boost performance; in some cases, tricks are even incompatible and result in inferior precision. These findings suggest that judicious selection and combination of the tricks are essential for optimal performance.

1 Website link of the Training Data Extraction Challenge.

2. Related Work

We briefly introduce training data extraction, membership inference attacks, and other memorization-based attacks.

2.1. Training Data Extraction

The extraction of training data from a pretrained language model, also referred to as language model data extraction, is a method for recovering the examples used to train the model. Despite being a relatively new task, many of the underlying technologies and analysis methods, including membership inference (Shokri et al., 2017) and leveraging network memorization for attacks (Thomas et al., 2020; Leino & Fredrikson, 2020), were introduced much earlier.

Carlini et al. (2021) were among the first to define the concepts of model knowledge extraction and κ-eidetic memorization, as well as to propose promising training strategies for data extraction. Both the theoretical properties of memorization and the application of model extraction in sensitive fields, such as the analysis of clinical notes, have been the focus of subsequent research in this field.

Recently, Kandpal et al. (2022) demonstrated that in language models, the efficacy of data extraction is frequently attributable to duplication in commonly used web-scraped training sets. Using nondeterminism, Jagielski et al. (2022) provided an explanation for forgetting memorized examples. Carlini et al. (2022) analyzed three factors that affect the memorization of training data. For natural data distributions, Feldman (2020) showed that label memorization is required to achieve near-optimal performance. In the application of model extraction, Lehman et al. (2021) indicated that pretrained BERT, when trained on clinical notes, poses a risk of sensitive data leakage, especially when the data exhibits a high level of repetition or 'note bloat' (Liu et al., 2022). Jayaraman et al. (2022) proposed an active extraction attack that employs canonical patterns and differential privacy to defend against pattern extraction attacks.

2.2. Membership Inference Attacks

The membership inference attack (MIA) (Shokri et al., 2017) is another adversarial task in data protection that is closely associated with training data extraction. It aims to determine whether a particular record is present in a model's training dataset, given black-box access to the model. MIA has been demonstrated to be effective on numerous machine learning tasks, including classification (Sablayrolles et al., 2019; Choquette-Choo et al., 2021; Rezaei & Liu, 2021) and generation (Hayes et al., 2019; Hilprecht et al., 2019). The methods utilized by MIA fall into two categories: classifier-based methods and metric-based methods (Hu et al., 2022). Classifier-based methods involve training a binary classifier to recognize the complex relationship between members and non-members, with shadow training being a commonly used technique (Shokri et al., 2017; He et al., 2020; Wang et al., 2021). Metric-based methods, on the other hand, make membership inferences by first calculating metrics on the model prediction vectors (Yeom et al., 2018; Salem et al., 2018; Sablayrolles et al., 2019; Song & Mittal, 2021; Choquette-Choo et al., 2021). Several defense methods based on differential privacy (Leino & Fredrikson, 2020; Naseri et al., 2020; Choquette-Choo et al., 2021), data pruning (Wang et al., 2021), data augmentation (Kaya & Dumitras, 2021), and causal inference (Tople et al., 2020) have been proposed to mitigate this vulnerability.

2.3. Other Memorization-Based Attacks

It has been discovered that large pretrained models are susceptible to memorizing information from the training data, which can lead to a variety of attacks. In addition to training data extraction and membership inference attacks, there are other memorization-based attacks that target these models. Model extraction attacks and the corresponding protection methods (Tramèr et al., 2016; Juuti et al., 2019; Gong et al., 2020; Wu et al., 2022) focus on the issue of duplicating the functionality of a given model. In this type of attack, the adversary attempts to build a second model with predictive performance similar to that of the original black-box model.

The objective of attribute inference attacks is to extract specific personal attributes such as locations, occupations, and interests from the model (Fredrikson et al., 2015; Gong & Liu, 2016; Ganju et al., 2018; Parisot et al., 2021). The objective of property inference attacks is to extract properties of the training data that the model producer may not have intended to share, such as the environment in which the data was generated or the proportion of the data that belongs to a particular class.
The primary distinction between training data extraction attacks and attribute/property inference attacks is that attribute/property inference attacks do not require prior knowledge of the attributes or properties to be extracted, whereas training data extraction attacks require the generated information to be identical to the training data at the sentence level, which is more difficult and more dangerous.

3. Preliminary

We recap the basic setups employed in our study. These setups mainly follow the guidelines of the Training Data Extraction Challenge. We then define the threat model and evaluation metrics.

3.1. Basic Setups

Dataset. The dataset used in this study is a subset of 20,000 examples from the Pile's training dataset (Gao et al., 2020). Each example consists of a 50-token prefix and a 50-token suffix. The attacker's task is to predict the suffix given the prefix. All the 100-token-long sentences in this dataset appear only once in the training set. For the purposes of this study, we divide the dataset into a training set of 19,000 samples and a testing set of 1,000 samples.

Language model. We employ the GPT-Neo 1.3B model implemented on HuggingFace Transformers (Wolf et al., 2020), which is a transformer model designed using EleutherAI's replication of the GPT-3 architecture (Brown et al., 2020) and trained on the Pile dataset. GPT-Neo is an autoregressive language model fθ, parameterized by θ, which generates a sequence of tokens x0, x1, · · · , xN via the chain rule

fθ(x0, x1, · · · , xN) = ∏_{n=0}^{N} fθ(xn | x[0,n−1]),   (1)

where x[0,n−1] denotes the preceding tokens x0, x1, · · · , xn−1.

Following Carlini et al. (2021), we assume the language model generates a suffix s by the most-likely criterion. Then we can write a formal definition of targeted extraction as follows.

Definition 1 (Targeted extraction). Given a prefix p contained in the training data and a pretrained language model fθ, targeted extraction is to generate the suffix by s = argmax_{s′} fθ(s′ | p).

As to κ-eidetic memorized data, we follow the definition in Carlini et al. (2021) that the sentence [p, s] appears in at most κ examples in the training data. In practice, the length of the generated sentence is typically fixed using truncation and concatenation techniques applied to the training dataset. If a generated sentence is shorter than the specified length, padding tokens are used to bring it up to the required length. In this study, the generated sentence length is 100.

3.3. Evaluation Metrics

Non-targeted extraction has been evaluated using the number of memorized examples in previous studies (Carlini et al., 2021). To evaluate more comprehensively, we use three metrics to evaluate performance in this targeted data extraction task: precision MP, recall MR, and Hamming distance MH.

Precision MP. The proportion of correctly generated suffixes over the total number of given prefixes is referred to as precision MP. Notice that for a correct generation, the suffix and the ground truth must be identical in both sentence length and generated tokens.

Recall MR. The proportion of correctly generated suffixes over the total number of generated suffixes is indicated by recall MR. The metric used in the Training Data Extraction Challenge is denoted by ej, which is defined as the number of correctly recovered suffixes when the number of incorrectly generated suffixes reaches a threshold of j; ej can assess the effectiveness of the attack. We define MR = ej / (ej + j) and will use MR instead of ej in the following paragraphs. In our experiments, the value of j is chosen to be proportional to the size of the test set, and it is set to 100 with a test set of 1,000 prefixes.

Hamming distance MH. The Hamming distance denotes the difference between two equal-length strings, calculated as the number of positions where the corresponding symbols differ. We can quantitatively evaluate the similarity between the generated suffixes and the ground truth using the average Hamming distance, providing a token-level evaluation of the extraction methods' performance. We compute MH = Σ_{n=1}^{N} (xn ⊕ gtn), where a ⊕ b = 0 if a = b and 1 otherwise, N is the number of tokens in a generated sentence, xn is the generated token, and gtn is the corresponding ground-truth token.
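To make the three metrics concrete, the snippet below is a minimal sketch (not the authors' evaluation code) that computes MP, MR, and MH from token-id lists; the helper names are illustrative, and j defaults to 100 as in the challenge setting described above.

```python
# Illustrative helpers for the three metrics; suffixes are lists of token ids.

def precision(generated, ground_truth):
    """M_P: fraction of prefixes whose suffix is reproduced exactly."""
    correct = sum(g == t for g, t in zip(generated, ground_truth))
    return correct / len(ground_truth)

def recall(e_j, j=100):
    """M_R = e_j / (e_j + j), where e_j is the number of correctly recovered
    suffixes when j incorrectly generated suffixes have been produced."""
    return e_j / (e_j + j)

def hamming_distance(suffix, truth):
    """M_H: number of token positions where the suffix and ground truth differ."""
    return sum(a != b for a, b in zip(suffix, truth))
```

For instance, a reported MR of 76.5% at j = 100 corresponds to roughly ej ≈ 326 correctly recovered suffixes, since 326/426 ≈ 0.765.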
A temperature T > 1 results in decreased confidence in the language model's predictions but may also increase the diversity of the generated suffixes. The study conducted by Carlini et al. (2021) found that gradually decreasing the temperature throughout the generation process can be beneficial. The effect of the temperature is presented in Table 2. It is important to note that as the temperature is increased, the number of generated suffixes required to include the ground truth also increases, causing the efficiency to degrade; a balance must therefore be struck between diversity and efficiency.

Repetition penalty is constructed on the conditional language model (Keskar et al., 2019). A repetition penalty is introduced by locally modifying the generation probability of each token based on whether it is a repetition of the previous token. The logit of the repeated token is divided by a value r before entering the softmax layer. Setting r > 1 penalizes repetition, while r < 1 encourages it. Our results in Table 3 show that the repetition penalty has mostly negative effects on the task of training data extraction.
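For reference, the following is a minimal sketch of candidate-suffix generation with GPT-Neo 1.3B through the HuggingFace generate API, exposing the sampling hyperparameters discussed above. It is not the benchmark pipeline; the prefix and number of candidates are illustrative, and the temperature, top-k, and repetition-penalty values shown are the auto-tuned ones reported later in Table 6.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B").eval()

prefix = "Yu's phone number is"          # the benchmark uses 50-token prefixes
inputs = tok(prefix, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=50,               # 50-token suffix, as in the challenge
        temperature=0.58,                # T < 1 sharpens the distribution
        top_k=24,                        # truncated sampling
        repetition_penalty=1.04,         # r > 1 penalizes repeats; r = 1 disables
        num_return_sequences=8,          # several candidate suffixes per prefix
    )
suffixes = tok.batch_decode(out[:, inputs["input_ids"].shape[1]:])
```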
Table 2. Results of MP, MR, and MH under different temperatures. Temperature = 1 is the baseline. All results are reported on 5 trials.

Temperature   MP (%)(↑)   MR (%)(↑)   MH (↓)
Varying       48.0        76.3        19.614
0.3           48.9        76.4        16.341
1             37.0        76.5        20.245

Table 3. Results of MP, MR, and MH under different repetition penalties. Repetition penalty r = 1 is the baseline. All results are reported on 5 trials.

Repetition penalty   MP (%)(↑)   MR (%)(↑)   MH (↓)
0.9                  19.8        66.4        27.927
1                    37.0        76.5        19.614
1.1                  37.3        76.5        20.181
1.2                  37.1        76.5        20.323
1.3                  36.7        76.4        20.332
1.5                  34.7        75.7        21.154

Figure 4. Generated token length w.r.t. token precision (%) for the n-th generated token. The generated suffix length is 50. [Line plot; x-axis: generated token length (0–50); y-axis: token precision (%), roughly 30–90.]

Figure 5. Histogram of the rank of the ground truth perplexity. The x-axis represents the rank of the ground truth perplexity within a list of 100 suffix perplexities.

5.3. Exposure Bias Reduction

For efficient vectorization, it is common practice to pack multiple sentences into a fixed-length sequence when training language models. As an example, the sentence 'The phone number of Yu is 12345' may be truncated or prefixed with another sentence in the training set, such as 'number of Yu is 12345' or 'Yu's address is at XXX. The phone number of Yu is 12345'. The prefix in the training set, as in Table 12, is not always a complete sentence. To better mimic the training settings, we propose to adjust the context window size and the position shifting.

5.3.1. Dynamic Context Window

The length of the training window may differ from the length of the extraction window. As a result, we propose adjusting the context window size, i.e., the number of previously generated tokens, as shown in Eq. (3). Furthermore, we encourage the results of different context window sizes to collaborate in determining the next generated token as

fθ(xn; W) = hW( fθ(xn | x[n−w1,n−1]), ..., fθ(xn | x[n−wm,n−1]) ),   (3)

where hW denotes the ensemble method and W denotes the ensemble hyperparameters, including the number of different context window sizes m and each window size wi. We use m = 4 and wi ∈ {n, n−1, n−2, n−3} in our methods. Carefully chosen hyperparameters m and wi may improve the performance even more.

We present two implementation options. The first option, as specified in Eq. (4), entails computing the probabilities generated by utilizing various lengths of previously generated tokens x[n−wi,n−1], and then producing the final probabilities via a weighted average of these probabilities, as

fθ(xn; Ww) = ( Σ_{i=1}^{m} ϵi fθ(xn | x[n−wi,n−1]) ) / ( Σ_{i=1}^{m} ϵi ),   (4)

where Ww denotes the hyperparameters in this solution, comprising m, wi, and ϵi; ϵi denotes the weighting coefficient of each probability. The second option, as specified in Eq. (5), is based on a voting mechanism, in which each model trained with a distinct context window length casts its vote for the tokens it is most confident in, formulated as

fθ(xn; Wv) = (1/m) Σ_{i=1}^{m} V( fθ(xn | x[n−wi,n−1]) ),   (5)

V(fθ(xn)) = { ρ − R(fθ(xn)), if R(fθ(xn)) ≤ ρ;  0, otherwise },   (6)

where V(·) denotes the voting function, R(·) denotes the rank function, and each window votes for the tokens in which it is confident. Wv denotes the hyperparameters in this solution, comprising wi, m, and ρ.

It is stated in Carlini et al. (2022) that the proportion of extractable sequences increases log-linearly with the number of context tokens. We observed a similar phenomenon in our experiments, where the generation accuracy of a token decreases as the prefix becomes shorter. However, we discovered that combining the probabilities produced by multiple context window lengths significantly improves extraction accuracy.

Our implementation of Ww employs the weighting coefficient ϵi = 0.9^i, and Wv assigns [5, 4, 3, 2, 1] points to its top-5 confident tokens, i.e., ρ = 5. The results presented in Table 4 show that ensemble methods can significantly improve on the baseline approach, achieving improvements of 143% and 139%, respectively, and we discovered that the weighted average strategy performs slightly better than the voting strategy. One common failure mode observed is that when a wrong token is generated, it causes subsequent tokens to also be wrong (exemplars shown in Table 12). The window size ensemble introduced here can help reduce this problem.
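To make Eq. (4) concrete, below is a minimal sketch of one decoding step with the weighted-average context window ensemble, assuming a standard HuggingFace causal-LM interface; it is not the released implementation, and the final greedy argmax is a simplification of the candidate-generation pipeline.

```python
import torch
import torch.nn.functional as F

def next_token_window_ensemble(model, ids, m=4, decay=0.9):
    """One decoding step of the weighted-average window ensemble (Eq. (4)).

    `ids` is a 1-D tensor of the token ids generated so far; window i keeps the
    last n-i tokens (i.e., drops the i oldest tokens), and eps_i = decay**i.
    """
    probs, weights = [], []
    with torch.no_grad():
        for i in range(m):
            ctx = ids[i:]                                  # window of size n - i
            logits = model(ctx.unsqueeze(0)).logits[0, -1]  # next-token logits
            probs.append(F.softmax(logits, dim=-1))
            weights.append(decay ** i)
    mixed = sum(w * p for w, p in zip(weights, probs)) / sum(weights)
    return int(mixed.argmax())                              # greedy choice under the mixture
```

In practice the per-window forward passes can be batched, and the same loop can implement the voting variant of Eqs. (5)–(6) by replacing the softmax probabilities with rank-based scores.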
5.3.2. Dynamic Position Shifting

Positional embeddings are added to the token features in models like GPT-Neo. During training, this is done per batch of sentences, causing the same sentence to have different offsets in the positional embedding in different training batches and during generation. To improve the extraction of memorized suffixes, we propose to recover the positions used during training by evaluating different shifted positions and selecting the 'best' one. Specifically, for a given prefix p, we evaluate different positions ci ∈ C, where ci is a list of consecutive natural numbers, ci = {ci1, · · · }, s.t. |ci| = |p|, and calculate the corresponding perplexity values. The position with the lowest perplexity value is then chosen as the position from which to generate the suffix as

c = argmin_{ci ∈ C} P(p, ci),   ϕ̂(xn) = ψ(cn) + ϕ(xn),   (7)

where ψ(·) denotes the positional encoding layers, ϕ(·) denotes the feature mapping function, ϕ̂ denotes the feature mapping function including positional encoding, and P computes the perplexity of the prefix. The experimental results are presented in Table 4, where C = {[0, 1, · · · , |p|], [1, 2, · · · , |p| + 1], · · · } is evaluated. The data show that, while using position shifting improves the MH metric, it may have a negative impact on precision and recall.

Table 4. Results of MP, MR, and MH under context window length adjustments. All results are reported on a single trial.

                    MP (%)(↑)   MR (%)(↑)   MH (↓)
Baseline            19.5        65.6        26.948
Context Win Ww      47.4        77.6        16.993
Context Win Wv      46.7        77.5        17.164
Position Shifting   16.4        39.0        21.154

5.4. Look-ahead

Table 12 highlights a common issue encountered during the training data extraction process, where only one or two tokens are incorrectly generated or placed in an inappropriate position. To address this problem, we propose a technique that involves looking ν steps ahead and using the probability of the subsequent tokens to inform the generation of the current token. The goal of look-ahead is to use the posterior distribution to help compute the current token generation probability. We begin by presenting the precise mathematical formulation of the optimal probability and then introduce the implementation, which employs an estimation due to efficiency considerations. The posterior is calculated as in Eq. (8), where Track is obtained by marginalizing over the intermediate tokens x′start+1, x′start+2, · · · , x′end−1:

Track(xstart, xend | xcond) = Σ_{x′start+1} Σ_{x′start+2} · · · Σ_{x′end−1} fθ(xstart | xcond) · · · .   (9)

More generally, let Track(xstart, xend | xcond) be the probability product of the track starting from xstart and ending at xend, conditioned on xcond. Then we can write the ν-step posterior as Track(xn, xn+ν | x[0,n−1]).

Table 7. Experimental results of MP, MR, and MH under look-ahead. All results are reported on a single trial.

             MP (%)(↑)   MR (%)(↑)   MH (↓)
Baseline     –           65.6        26.948
Look-ahead   33.1        71.6        24.262
             35.5        72.6        22.157
             36.7        73.0        21.333
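Because Eq. (8) is only partially recoverable here, the sketch below illustrates the look-ahead idea for ν = 1 with a crude estimate: each of the top-k candidate tokens is re-scored by its own probability multiplied by the model's best one-step continuation probability. It is an illustration of the general technique, not the paper's exact estimator, and `top_k` is an assumed pruning parameter added for efficiency.

```python
import torch
import torch.nn.functional as F

def lookahead_next_token(model, ids, top_k=10):
    """One-step look-ahead (nu = 1): re-rank the top-k next-token candidates by
    also considering the model's confidence in the token that would follow them.
    A simplified illustration of the look-ahead idea, not Eq. (8)/(9) verbatim."""
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0, -1]
        probs = F.softmax(logits, dim=-1)
        cand_p, cand_ids = probs.topk(top_k)

        scores = []
        for p, tok_id in zip(cand_p, cand_ids):
            ext = torch.cat([ids, tok_id.view(1)])                 # append candidate
            nxt = F.softmax(model(ext.unsqueeze(0)).logits[0, -1], dim=-1)
            scores.append(p * nxt.max())                           # joint with best continuation
    best = int(torch.stack(scores).argmax())
    return int(cand_ids[best])
```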
5.5. Hyperparameter Optimization

The aforementioned tricks in Section 5.1 involve various hyperparameters, and simply using the individually best parameters is usually suboptimal. Manually searching for the best hyperparameters, also known as 'babysitting', can be time-consuming. We use a versatile auto-tuning framework (Akiba et al., 2019), which incorporates efficient search and pruning strategies, to determine the optimized hyperparameters, following advanced frameworks (Snoek et al., 2012; Koch et al., 2018; Akiba et al., 2019). As the search algorithm, we use covariance matrix adaptation evolution strategies (CMA-ES) (Hansen et al., 2003). The search objective in our experiment is set to MP, and the parameters searched over include top-k, nucleus-η, typical-ϕ, temperature T, and repetition penalty r.

Table 5. Results of MP, MR, and MH under auto-tuning. All results are reported on 5 trials.

Strategy           MP (%)(↑)   MR (%)(↑)   MH (↓)
Baseline           37.0        76.5        19.614
Manual selection   48.8        76.4        16.379
Auto-tuning        49.4        76.6        16.127

Table 6. Hyperparameter selection for auto-tuning. Multiple configurations of final hyperparameters are found to yield equivalent performances, and a representative example is presented.

Parameters           Range        Step   Initial   Final
Top-k                [1, 50]      1      10        24
Nucleus-η            [0.1, 1]     0.01   0.6       0.8
Typical-ϕ            [0.1, 1]     0.01   0.6       0.9
Temperature T        [0.1, 5]     0.1    0.3       0.58
Repetition Penalty   [0.8, 1.3]   0.01   1         1.04
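Akiba et al. (2019) refers to the Optuna framework, so a search loop of the kind described above can be sketched as follows. This is a minimal sketch rather than the paper's exact setup: `extraction_precision` is a hypothetical stand-in for generating suffixes on a held-out split with the proposed hyperparameters and returning MP, the number of trials is illustrative, and the ranges and steps mirror Table 6.

```python
import optuna

def extraction_precision(params):
    # Placeholder: run suffix generation with `params` on a dev split and
    # return the resulting precision M_P. Returns 0.0 here so the sketch runs.
    return 0.0

def objective(trial):
    params = {
        "top_k": trial.suggest_int("top_k", 1, 50),
        "nucleus_eta": trial.suggest_float("nucleus_eta", 0.1, 1.0, step=0.01),       # top-p
        "typical_phi": trial.suggest_float("typical_phi", 0.1, 1.0, step=0.01),       # typical-p
        "temperature": trial.suggest_float("temperature", 0.1, 5.0, step=0.1),
        "repetition_penalty": trial.suggest_float("repetition_penalty", 0.8, 1.3, step=0.01),
    }
    return extraction_precision(params)

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.CmaEsSampler())  # CMA-ES search
study.optimize(objective, n_trials=200)
print(study.best_params)
```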