CPM: A Large-scale Generative Chinese Pre-trained Language Model

Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu†, Minlie Huang†, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun

Department of Computer Science and Technology, Tsinghua University & BAAI

† Corresponding authors: Z. Liu (liuzy@tsinghua.edu.cn) and M. Huang (aihuang@tsinghua.edu.cn)

Abstract

Pre-trained Language Models (PLMs) have proven to be beneficial for various downstream NLP tasks. Recently, GPT-3, with 175 billion parameters and 570GB of training data, drew a lot of attention due to its capacity for few-shot (even zero-shot) learning. However, applying GPT-3 to Chinese NLP tasks is still challenging, as the training corpus of GPT-3 is primarily English and its parameters are not publicly available. In this technical report, we release the Chinese Pre-trained Language Model (CPM) with generative pre-training on large-scale Chinese training data. To the best of our knowledge, CPM, with 2.6 billion parameters and 100GB of Chinese training data, is the largest Chinese pre-trained language model, which could facilitate several downstream Chinese NLP tasks, such as conversation, essay generation, cloze test, and language understanding. Extensive experiments demonstrate that CPM achieves strong performance on many NLP tasks in the few-shot (even zero-shot) settings. The code and parameters are available at https://github.com/TsinghuaAI/CPM-Generate.

1 Introduction

Pre-trained Language Models (PLMs) (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Brown et al., 2020) have been developed for a variety of tasks in Natural Language Processing (NLP), as they can learn rich language knowledge from large-scale corpora, which is beneficial for downstream tasks. ELMo (Peters et al., 2018) first introduces bidirectional language models to learn contextual word vectors via large-scale pre-training. GPT (Radford et al., 2018) applies generative pre-training to a Transformer-based language model (Vaswani et al., 2017), which improves natural language understanding on a wide range of benchmarks. BERT (Devlin et al., 2019) is proposed to pre-train deep bidirectional representations on unlabeled texts by jointly conditioning on both left and right contexts. RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2020) enhance BERT (Devlin et al., 2019) with dynamic masking, parameter sharing, and modified pre-training tasks. ERNIE (Zhang et al., 2019), KEPLER (Wang et al., 2019), and SentiLARE (Ke et al., 2020) introduce external knowledge into language representation learning through auxiliary pre-training tasks.

Among these PLMs, GPT-3 (Brown et al., 2020), with 175 billion parameters and 570GB of training data, has been the center of attention and has proven effective on various few-shot (even zero-shot) NLP tasks. The powerful text generation capability of GPT-3 makes it applicable to diverse applications, such as question answering, summarization, conversation, computing basic arithmetic, and generating various kinds of text, including essays, fiction, code, spreadsheets, etc. However, applying GPT-3 to Chinese NLP tasks is still challenging, as the training corpus of GPT-3 is primarily English (93% by word count, as reported by Brown et al. (2020)), and the parameters are not publicly available.
Although some previous works provide powerful Chinese pre-trained language models (Cui et al., 2020; Xu et al., 2020; Wei et al., 2019; Sun et al., 2019; Cui et al., 2019a), their capabilities are limited by their model sizes. Hence, how to pre-train a large-scale Chinese language model still needs more exploration, such as the construction of the Chinese vocabulary and the design of the training strategy.

In this technical report, we release the Chinese Pre-trained Language Model (CPM) with generative pre-training on large-scale Chinese corpora. CPM is a Transformer-based autoregressive language model, with 2.6 billion parameters and 100GB of Chinese training data. To the best of our knowledge, CPM is the largest Chinese pre-trained language model, which could facilitate downstream Chinese NLP tasks, such as conversation, essay generation, cloze test, and language understanding. Experiments on various Chinese NLP tasks demonstrate that CPM achieves strong performance on many NLP tasks in the few-shot (even zero-shot) settings. As the number of parameters increases, CPM performs better on most datasets, indicating that larger models are more proficient at language generation and language understanding.

          CPM-Small  CPM-Medium  CPM-Large
nparam    109M       334M        2.6B
nlayers   12         24          32
dmodel    768        1,024       2,560
nheads    12         16          32
dhead     64         64          80

Table 1: Model sizes. nparam is the number of parameters. nlayers is the number of layers. dmodel is the dimension of hidden states, which is consistent across layers. nheads is the number of attention heads in each layer. dhead is the dimension of each attention head.

The main contributions of this technical report can be summarized as follows:

• We release a Chinese autoregressive language model with generative pre-training, called CPM, which has 2.6 billion parameters.
• We construct a new sub-word vocabulary based on a word-segmented corpus to adapt to Chinese corpora and increase the batch size to 3,072 for more stable model training.
• Extensive experiments demonstrate that CPM achieves strong performance on many NLP tasks in the few-shot (even zero-shot) settings.

2 Our Approach

2.1 Chinese PLM

Our current model is a left-to-right Transformer decoder, similar to the model architecture of GPT (Radford et al., 2019). We pre-train three models of different sizes, as shown in Table 1. In order to adapt CPM to Chinese corpora, we build a new sub-word vocabulary and adjust the training batch size.

Vocabulary Construction: Previous works on Chinese pre-trained models usually adopt the sub-word vocabulary of BERT-Chinese (Devlin et al., 2019), which splits the input text into a character-level sequence. However, Chinese words usually contain several characters, and some important semantic meanings of words are lost in the character-level sequence. To solve this problem, we construct a new sub-word vocabulary containing both words and characters. For example, some common multi-character words are added to the vocabulary.

Data Source    Size
Encyclopedia   ~ 40GB
Webpage        ~ 39GB
Story          ~ 10GB
News           ~ 10GB
Dialog         ~ 1GB

Table 2: Details of training data.

Training Strategy: Since the sparseness of the word distribution of Chinese is more serious than that of English, we adopt a large batch size to make model training more stable. Compared to the batch size (1 million tokens) used for GPT-3 2.7B (Brown et al., 2020), our batch size (3 million tokens) is three times as large.
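As a quick sanity check on Table 1, the parameter counts can be roughly reproduced from the architectural settings alone, assuming a standard GPT-style decoder (about 12 · nlayers · dmodel^2 weights in the attention and feed-forward blocks) plus a token embedding. The vocabulary size of 30,000 used below is an illustrative assumption, not a number taken from the table.

```python
# Rough parameter-count check for Table 1 (a sketch, not the exact model definition).
# Assumption: a GPT-style decoder has ~12 * n_layers * d_model^2 weights in its attention
# and feed-forward blocks, plus a token embedding of size vocab * d_model.
VOCAB = 30_000  # illustrative vocabulary size (an assumption)

configs = {
    "CPM-Small":  {"n_layers": 12, "d_model": 768},
    "CPM-Medium": {"n_layers": 24, "d_model": 1024},
    "CPM-Large":  {"n_layers": 32, "d_model": 2560},
}

for name, c in configs.items():
    blocks = 12 * c["n_layers"] * c["d_model"] ** 2   # attention + FFN weights
    embed = VOCAB * c["d_model"]                      # token embedding
    print(f"{name}: ~{(blocks + embed) / 1e6:.0f}M parameters")

# The printed estimates (~108M, ~333M, ~2593M) line up with the 109M / 334M / 2.6B in Table 1.
```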
The largest model cannot be stored on a single GPU during training, so we partition the model across GPUs along the width dimension to make large-scale training feasible and to reduce data transfer between nodes.

2.2 Data Processing

Specifically, we construct a new sub-word vocabulary based on the word-segmented corpus using a unigram language model (Kudo and Richardson, 2018). Meanwhile, since word segmentation introduces extra splitters between words, we set a special token as the splitter to make the sub-word tokenization reversible. In contrast, the tokenizer of BERT-Chinese is irreversible because it inserts extra spaces between Chinese characters and treats these extra spaces the same as the original spaces in the text.

We collect different kinds of texts for pre-training, including encyclopedia, news, novels, and Q&A. The details of our training data are shown in Table 2. Since the input sequence length is usually larger than the length of a single document, we concatenate different documents, adding an "end of document" token after each one, to make full use of the input length.

2.3 Pre-training Details

Based on hyper-parameter search over the learning rate and batch size, we set the learning rate to 1.5 × 10^-4 and the batch size to 3,072, which makes model training more stable. In the first version, we still adopt dense attention, and the maximum sequence length is 1,024. We will implement sparse attention in the future. We pre-train our model for 20,000 steps, of which the first 5,000 steps are for warm-up. The optimizer is Adam (Kingma and Ba, 2015). It takes two weeks to train our largest model on 64 NVIDIA V100 GPUs.

3 Experiments

3.1 Text Classification

Dataset: We use TouTiao News Titles Classification (TNEWS), IFLYTEK app description classification (IFLYTEK), and Original Chinese NLI (OCNLI) as our benchmark datasets for text classification (Xu et al., 2020; Hu et al., 2020). Since we aim to evaluate the zero-shot ability of CPM on text classification tasks, we directly use the validation sets of these three datasets without any training instances. The sizes of the validation sets of TNEWS / IFLYTEK / OCNLI are 10K / 2.6K / 3K. Note that we exclude the instances with the label "-" in OCNLI.

Implementation Details: We calculate the perplexity of each candidate sentence-label pair and treat the pair with the lowest perplexity as the prediction. The templates of the three tasks are formulated as follows:
TNEWS: 这是关于L的文章:P (This passage is about L: P),
IFLYTEK: 这是关于L的应用程序:P (This application is about L: P),
OCNLI: S1?对,S2 (S1? Yes, S2), S1?错,S2 (S1? No, S2), S1?也许,S2 (S1? Maybe, S2),
where L is the label name, P is the input text, and S1 and S2 are the premise and hypothesis. Since TNEWS and IFLYTEK have more than 10 kinds of labels, we adopt a simpler validation setting, which randomly samples 3 false labels for each instance and performs 4-class classification for better efficiency. To make the results more stable, we repeat this procedure three times and report the averaged results. For OCNLI, which only has 3 kinds of labels, we keep the original validation set. However, the validation set of OCNLI is unbalanced: the numbers of "entailment" / "neutral" / "contradiction" instances are 947 / 1103 / 900. If the model only predicts the label "neutral", the accuracy is about 0.374.
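To make this scoring procedure concrete, the following is a minimal sketch of perplexity-based zero-shot classification under the TNEWS template, using the Hugging Face transformers interface as a stand-in for the actual evaluation code; the model identifier, the example sentence, and the candidate labels are illustrative assumptions.

```python
# A sketch of the zero-shot classification procedure described above: fill each candidate
# label into the template, score the whole sequence with the language model, and predict
# the label whose sequence has the lowest loss (i.e., the lowest perplexity).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TsinghuaAI/CPM-Generate"  # placeholder causal LM id, not the exact setup
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

def classify_tnews(text: str, labels: list) -> str:
    losses = []
    for label in labels:
        prompt = f"这是关于{label}的文章:{text}"   # TNEWS template from above
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss     # mean token NLL; lower = lower perplexity
        losses.append(loss.item())
    return labels[losses.index(min(losses))]       # argmin over candidate labels

# 4-class validation setting: the gold label plus 3 randomly sampled false labels.
print(classify_tnews("国足昨晚战平对手,球迷热议主教练的排兵布阵。", ["体育", "科技", "财经", "教育"]))
```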
          CPM-Small  CPM-Medium  CPM-Large
TNEWS     0.626      0.618       0.703
IFLYTEK   0.584      0.635       0.708
OCNLI     0.378      0.379       0.442

Table 3: Zero-shot performance on text classification tasks (accuracy). Random prediction would achieve 0.25 on TNEWS and IFLYTEK and 0.33 on OCNLI.

Results: As shown in Table 3, CPM-Large achieves promising results on these classification datasets without any training samples. Compared to random prediction, the knowledge learned from pre-training significantly improves the performance. Although the medium model is three times as large as the small model, their performances on TNEWS and OCNLI are very close. However, CPM-Large significantly outperforms the two smaller models on all three datasets. This indicates that the benefit of model size is not linear and emerges only when the model size exceeds a certain threshold. Besides, the results of CPM-Small and CPM-Medium on OCNLI are close to that of the strategy of always predicting the label "neutral". This suggests that natural language inference is harder than other downstream tasks in the zero-shot setting, which is consistent with the observation of Brown et al. (2020).

3.2 Chinese Idiom Cloze

Dataset: We use the Chinese IDiom cloze test dataset (ChID) (Zheng et al., 2019) as our benchmark dataset. Each passage in the dataset may contain multiple blanks. For each blank, there are 10 candidate idioms with 1 ground truth, and some of the false candidates are similar to the answer in meaning. The sizes of the training / validation / test sets are 520K / 20K / 20K.

Implementation Details: For the supervised setting, we use a template to convert the passage and the candidates into a natural language question. Given the passage P and 10 candidate idioms I1, I2, ..., I10, the template can be formulated as
选项1: I1 ... 选项10: I10 P 答案是:L (Option 1: I1 ... Option 10: I10 P Answer: L).
Then, we train the model to predict the answer L. Note that if there exists more than one idiom blank in a passage, we predict each one independently. Specifically, when we are predicting one idiom, we keep its blank in the passage and remove the blanks of the other idioms from the passage.
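To illustrate the supervised input construction just described, the following is a minimal sketch of converting a ChID instance into the template above. The "#idiom#" blank marker, the helper function, and the toy passage are illustrative assumptions rather than the exact preprocessing code.

```python
# A sketch of the ChID input construction (supervised setting).
# Assumption: each blank in the passage is marked with "#idiom#"; when predicting one blank
# we keep its marker and delete the markers of the other blanks, as described above.
BLANK = "#idiom#"

def build_chid_input(passage: str, candidates: list, target_blank: int) -> str:
    """Build '选项1: I1 ... 选项10: I10 P 答案是:' for one target blank of the passage."""
    options = " ".join(f"选项{i + 1}: {idiom}" for i, idiom in enumerate(candidates))
    pieces = passage.split(BLANK)
    cleaned = ""
    for i, piece in enumerate(pieces[:-1]):
        # Re-insert the marker only at the target blank; drop the others.
        cleaned += piece + (BLANK if i == target_blank else "")
    cleaned += pieces[-1]
    return f"{options} {cleaned} 答案是:"

candidates = ["画蛇添足", "亡羊补牢", "守株待兔", "刻舟求剑", "杯弓蛇影",
              "掩耳盗铃", "对牛弹琴", "狐假虎威", "塞翁失马", "井底之蛙"]
passage = f"他本来已经做完了,却又{BLANK}地改了一遍,结果反而错了;同事劝他{BLANK}也没用。"
print(build_chid_input(passage, candidates, target_blank=0))
```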
For the unsupervised setting, we fill each candidate idiom into the blank to form a group of complete passages. We also consider each idiom blank individually if there are multiple blanks in a passage. For each blank, we thus obtain 10 passages corresponding to the 10 candidate idioms. Then we calculate the perplexity of each passage and treat the one with the lowest perplexity as the prediction.

              CPM-Small  CPM-Medium  CPM-Large
Supervised    0.657      0.695       0.804
Unsupervised  0.433      0.524       0.685

Table 4: Results on the ChID dataset in the supervised and unsupervised settings (accuracy). Random prediction would achieve 0.10 in the unsupervised setting.

Results: The results are shown in Table 4. We report the accuracy on the test set for each model. For the fully supervised setting, we can see that CPM can be fine-tuned for the specific input template, solving multiple-choice tasks by unidirectional autoregressive language modeling. In our experiments, we did not spend much time designing the input template for this task, so there might exist better templates that help the model show its full ability; we leave this as future work. For the unsupervised setting, we can see that CPM produces promising results. The unsupervised result of CPM-Large even outperforms the supervised result of CPM-Small and is comparable to the supervised result of CPM-Medium, reflecting the strong power of CPM in Chinese language modeling.

3.3 Dialogue Generation

Dataset: We use Short-Text Conversation (STC) (Shang et al., 2015) as our benchmark dataset for dialogue generation, which consists of post-response pairs from Weibo. We adopt the same data split as the existing work (Wang et al., 2020). The sizes of the training / validation / test sets are 4.4M / 20K / 20K, respectively. The average length of posts / responses is 20.6 / 15.4.

Baseline: We choose CDial-GPT (Wang et al., 2020) as our baseline, which is the state-of-the-art pre-trained model for Chinese dialogue generation. We directly use the code and the pre-trained model released by the original paper.

Implementation Details: In the supervised experiment, we use a hyper-parameter setting similar to that of pre-training and fine-tune CPM on the training set of STC. In the few-shot experiment, which does not involve fine-tuning, we follow existing work (Radford et al., 2019; Brown et al., 2020) and condition the language model on a context of 4 example pairs of the format "Context: sentence Response: sentence". After a final prompt "Context: sentence Response:", we acquire the generation results with top-p sampling (Holtzman et al., 2020), where p is set to 0.9. The temperature of sampling is 0.9 in both the few-shot and supervised experiments.

Metrics: Since BLEU is not a proper metric for dialogue generation, we use embedding-based metrics (including greedy matching, embedding average, and vector extrema) to evaluate the similarity between generated responses and references (Liu et al., 2016). For diversity, we choose the number and proportion of distinct n-grams (Li et al., 2016; Xing et al., 2017; Ke et al., 2018) as our metrics.

Results: We present the main results in the few-shot and supervised settings in Table 5. We can see that CPM outperforms CDial-GPT by a large margin in the few-shot experiment, showing the generalization ability of our model. As for the supervised experiment, our model still performs better, especially on the diversity metrics.

                          Average  Extrema  Greedy  Dist-1          Dist-2
Few-shot (Unsupervised)
  CDial-GPT               0.899    0.797    0.810   1,963 / 0.011   20,814 / 0.126
  CPM-Large               0.928    0.805    0.815   3,229 / 0.007   68,008 / 0.154
Supervised
  CDial-GPT               0.933    0.814    0.826   2,468 / 0.008   35,634 / 0.127
  CPM-Large               0.934    0.810    0.819   3,352 / 0.011   67,310 / 0.233

Table 5: Results on the STC dataset in the few-shot and supervised settings.

             Average  Dist-1          Dist-2
CPM-Small    0.928    2,201 / 0.004   22,754 / 0.046
CPM-Medium   0.910    2,842 / 0.005   31,934 / 0.058
CPM-Large    0.928    3,229 / 0.007   68,008 / 0.154

Table 7: Results of CPM with different numbers of parameters on the STC dataset in the few-shot setting.
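The few-shot prompting scheme described in the implementation details above can be sketched as follows, using the Hugging Face generation interface as a stand-in for CPM's released evaluation code. The model identifier and the 4 example pairs are illustrative assumptions; the final post is taken from Table 6.

```python
# A sketch of the few-shot dialogue setup: condition on 4 example pairs in the
# "Context: ... Response: ..." format, then sample with top-p = 0.9 and temperature 0.9.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TsinghuaAI/CPM-Generate"  # placeholder causal LM id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

examples = [  # made-up demonstration pairs for illustration
    ("今天好热啊", "是啊,出门五分钟就出汗了"),
    ("推荐一部电影吧", "最近的科幻片口碑不错"),
    ("周末打算做什么", "想去爬山,你呢"),
    ("又加班到深夜了", "注意身体,早点休息"),
]
post = "每天嘻嘻哈哈的笑着,只有自己才知道活的有多累。"

prompt = "".join(f"Context: {c} Response: {r}\n" for c, r in examples)
prompt += f"Context: {post} Response:"

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,         # nucleus sampling threshold
    temperature=0.9,   # sampling temperature
    max_new_tokens=50,
)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```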
Since fine-tuning large pre-trained models on supervised downstream tasks is often challenging (Dodge et al., 2020; Mosbach et al., 2020; Lee et al., 2020), we leave further improving the performance in the supervised setting as future work. Some cases are provided in Table 6 to intuitively show the effectiveness of our model.

Post: 现在高中数学竞赛这么凶残了! (How crazy the math competition of high school is!)
CDial-GPT (Few-shot): 你还真信? (Do you really believe it?)
CDial-GPT (Supervised): 数学让我们学会了很多,好好学,明天回家 (Math lets us learn a lot. Study hard and go home tomorrow.)
CPM-Large (Few-shot): 现在都是微积分,大学的东西比高中简单多了,还有什么呢? (Now it's all calculus. The things in college are much simpler than those of high school. What else?)
CPM-Large (Supervised): 这是要逼死我吗?每道题都不会做啊 (Is this going to kill me? I can't answer any question.)

Post: 每天嘻嘻哈哈的笑着,只有自己才知道活的有多累。 (I laugh every day, but only I know how tired I am.)
CDial-GPT (Few-shot): 我也是 (Me, too.)
CDial-GPT (Supervised): 只有自己才知道活的有多累。 (Only I know how tired I am.)
CPM-Large (Few-shot): 当你想要哭泣的时候,就去那些熟悉的地方吧。 (When you want to cry, go to those familiar places.)
CPM-Large (Supervised): 真的不知道,生活怎么会这么累 (I really don't know how my life could be so tiring.)

Table 6: Examples of generated responses on STC.

We also conduct experiments to show the few-shot performance of CPM with different parameter sizes in Table 7. As the number of parameters grows, CPM can generate more diverse responses with reasonable values on the embedding-based metrics.

3.4 Question Answering

Dataset: We adopt CMRC2018 (Cui et al., 2019b) and DuReader (He et al., 2018) as our benchmarks for Question Answering (QA). CMRC2018 requires the model to extract an answer span from a Wikipedia passage for a given question, which is similar to SQuAD (Rajpurkar et al., 2016). DuReader consists of questions from real-world user logs of Baidu Search and Baidu Zhidao. The answers in DuReader are manifold, such as an entity or a description. We treat DuReader as an extractive QA task and thus ignore the instances with yes-or-no answers during evaluation.

Implementation Details: We evaluate CPM in the zero-shot (zs) and one-shot (os) settings and report the F1 score (F1) and Exact Match (EM) for both CMRC2018 and DuReader. For the zero-shot setting, we concatenate the passage and the question as input to CPM, and CPM is then required to generate an answer according to the observed (passage, question) pair. For the one-shot setting, we randomly select a ground-truth triple (passage, question, answer) from the training set and insert it before the instance as a hint for CPM to generate the answer.

          Zhidao          Search          CMRC2018
          F1     EM       F1     EM       F1      EM
s + zs    4.01   0.18     4.15   0.65     6.03    0.20
s + os    4.75   0.34     4.45   0.59     6.14    0.22
m + zs    5.29   0.29     5.03   0.61     8.60    0.53
m + os    5.76   0.47     5.14   0.55     9.00    0.75
l + zs    5.18   0.27     5.08   0.59     13.37   1.31
l + os    6.08   0.56     5.19   0.68     16.56   3.73

Table 8: Zero-shot (zs) and one-shot (os) results on Question Answering (QA) datasets, including DuReader (Zhidao and Search) and CMRC2018. We experiment with models of three different sizes: small (s), medium (m), and large (l).

Results: As shown in Table 8, we perform the experiments on the three datasets and compare models of different sizes: small (s), medium (m), and large (l). From the table, we can see that CPM performs better as the model size grows, and the large model is always the best. Moreover, the results in the one-shot setting are better than those in the zero-shot setting. We conjecture that CPM is able to imitate the format of the preceding sequence and organize its answer accordingly. We also analyze the generated answers and find that CPM tends to generate long and repetitive sentences instead of short and precise ones, which results in low scores. We believe it is worth exploring how to make CPM generate brief and proper answers in the future. In general, CPM does not achieve very high scores on either benchmark, which we conjecture is related to the format of the pre-training data.
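The zero-shot and one-shot QA inputs can be assembled as in the following sketch. The report only states that the passage and question are concatenated, with one (passage, question, answer) triple prepended in the one-shot case; the "问题:"/"答案:" field markers, the helper function, and the toy examples below are illustrative assumptions rather than the exact prompt format.

```python
# A sketch of zero-shot / one-shot QA prompt construction (field markers are assumptions).
from typing import Optional, Tuple

def build_qa_prompt(passage: str, question: str,
                    demo: Optional[Tuple[str, str, str]] = None) -> str:
    """Return the prompt fed to the language model; the answer is whatever it generates next."""
    prompt = ""
    if demo is not None:
        d_passage, d_question, d_answer = demo
        prompt += f"{d_passage}\n问题:{d_question}\n答案:{d_answer}\n\n"  # one-shot hint
    prompt += f"{passage}\n问题:{question}\n答案:"
    return prompt

# Zero-shot: just the target passage and question.
print(build_qa_prompt("故宫博物院位于北京市中心。", "故宫博物院在哪个城市?"))
# One-shot: prepend a randomly chosen training triple as a hint.
demo = ("清华大学位于北京市海淀区。", "清华大学在哪个城市?", "北京")
print(build_qa_prompt("黄鹤楼位于湖北省武汉市。", "黄鹤楼在哪个城市?", demo=demo))
```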
3.5 Entity Generation

Dataset: We use XLORE, which includes 446,236 relations and 16,284,901 entities, as our benchmark dataset for entity generation. These relations and entities are from Wikipedia and Baidu Baike.

Implementation Details: We evaluate CPM in the few-shot setting with different numbers of parameters and report BLEU-1 results. In detail, we randomly select triples (head entity, relation, tail entity) sharing the same relation from XLORE and combine N complete triples and one incomplete triple (head entity, relation) into a prompt. Then, given the prompt, the models need to predict the corresponding tail entity.

                                N = 2                      N = 4
                                Small   Medium  Large      Small   Medium  Large
主要工艺 (Main Process)          0.500   0.500   0.700      0.400   0.200   0.400
释义 (Explanation)               0.000   0.000   0.071      0.000   0.000   0.075
商品品牌 (Brand)                 0.098   0.033   0.483      0.183   0.050   0.450
学科 (Subject)                   0.000   0.025   0.124      0.059   0.053   0.108
全名 (Full Name)                 0.035   0.010   0.108      0.000   0.014   0.122
涉及领域 (Related Field)         0.042   0.065   0.104      0.063   0.037   0.125
主要作物 (Main Crop)             0.000   0.150   0.050      0.100   0.150   0.100
所在国家 (In Country)            0.033   0.033   0.033      0.050   0.000   0.050
病原类型 (Pathogen Type)         0.250   0.220   0.370      0.200   0.300   0.340
首任总统 (The First President)   0.000   0.000   0.000      0.016   0.009   0.014

Table 9: BLEU-1 results of CPM with different numbers of parameters on the XLORE dataset in the few-shot setting.

Results: We present the results in Table 9. As we can see from the table, CPM-Large achieves the best performance among the three models. Surprisingly, given a prompt with only two triples, CPM can achieve results comparable to those with four triples. This indicates that CPM can imitate the format of the prompt and probe factual knowledge to generate a proper tail entity in extreme few-shot scenarios. We also provide some cases in Table 10 to demonstrate this ability of CPM.

Relation: 首都 (Capital)
Prompt: 美国 首都 华盛顿 (America Capital Washington) 中国 首都 北京 (China Capital Beijing) 日本 首都 (Japan Capital)
CPM: 东京 (Tokyo)

Relation: 主要工艺 (Main Process)
Prompt: 酱焖辣椒 主要工艺 焖 (Sauce Braised Chili, Main Process, Stew) 当归鸭肉煲 主要工艺 煲 (Duck with Angelica, Main Process, Boil) 韭菜煎蛋饼 主要工艺 (Leek Omelette, Main Process)
CPM: 煎 (Fried)

Relation: 学科 (Subject)
Prompt: 恒星级黑洞 学科 宇宙论 (Stellar Black Hole, Subject, Cosmology) 品类需求强度 学科 品牌经济学 (Category Demand Intensity, Subject, Brand Economics) 大地构造学 学科 (Tectonic Geology, Subject)
CPM: 地质学 (Geology)

Table 10: Examples of generated entities on XLORE with CPM-Large.

4 Future Work

In the future, we will further explore the power of large-scale pre-trained models on Chinese by adding more training data and increasing the model size. Due to the extremely expensive cost of pre-training, we will try to optimize the training framework, such as the data-transfer scheme between different nodes, to accelerate the process; previous works in this direction include LAMB (You et al., 2020) and DeepSpeed (Rasley et al., 2020). Besides, it is important to reduce the model size by model compression (Sanh et al., 2019; Jiao et al., 2019; Zhang et al., 2020).

Meanwhile, we will also include diverse data to enhance model performance. For text data, we will add a multi-lingual corpus to train a large-scale Chinese-centered multi-lingual language model. For structured data such as knowledge graphs, which are important for PLMs (Peters et al., 2019; Xiong et al., 2020; Su et al., 2020), we will explore new learning algorithms to train a joint model that can learn from both texts and knowledge graphs for better general intelligence.

Acknowledgments

Thanks to the Beijing Academy of Artificial Intelligence (BAAI) for providing the computing resources and web services for this work.
In addition, we would like to thank NetEase Inc., zhihu.com, and aminer.cn for the support in collecting the Chinese corpus. Disclaimer of Warranties The text generated by CPM is automatically generated by a neural network model trained on a large number of texts, which does not represent our official attitudes and preferences. The text generated by CPM is only used for technical and scientific purposes. If it infringes on your rights and interests or violates social morality, please do not propagate it, but contact us and we will deal with it promptly. References Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In Proceedings of ICLR. Hai Hu, Kyle Richardson, Liang Xu, Lu Li, Sandra Kuebler, and Lawrence S Moss. 2020. Ocnli: Original chinese natural language inference. arXiv preprint arXiv:2010.05444. Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351. Pei Ke, Jian Guan, Minlie Huang, and Xiaoyan Zhu. 2018. Generating informative responses with controlled sentence function. In Proceedings of ACL. Pei Ke, Haozhe Ji, Siyang Liu, Xiaoyan Zhu, and Minlie Huang. 2020. SentiLARE: Sentiment-aware language representation learning with linguistic knowledge. In Proceedings of EMNLP. Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations. Taku Kudo and John Richardson. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of EMNLP. Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. 2020. Revisiting pretrained models for chinese natural language processing. In Findings of EMNLP. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite bert for self-supervised learning of language representations. In Proceedings of ICLR. Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. 2019a. Pre-training with whole word masking for chinese bert. arXiv preprint arXiv:1906.08101. Cheolhyoung Lee, Kyunghyun Cho, and Wanmo Kang. 2020. Mixout: Effective regularization to finetune large-scale pretrained language models. In Proceedings of ICLR. Yiming Cui, Ting Liu, Wanxiang Che, Li Xiao, Zhipeng Chen, Wentao Ma, Shijin Wang, and Guoping Hu. 2019b. A span-extraction dataset for chinese machine reading comprehension. In Proceedings of EMNLP. Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of NAACL-HLT. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. 
In Proceedings of NAACL-HLT. Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305. Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu, and Haifeng Wang. 2018. DuReader: a chinese machine reading comprehension dataset from real-world applications. In Proceedings of ACL Workshop. Chia-Wei Liu, Ryan Lowe, Iulian Vlad Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of EMNLP. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. 2020. On the stability of fine-tuning bert: Misconceptions, explanations, and strong baselines. arXiv preprint arXiv:2006.04884. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT. Yida Wang, Pei Ke, Yinhe Zheng, Kaili Huang, Yong Jiang, Xiaoyan Zhu, and Minlie Huang. 2020. A large-scale chinese short-text conversation dataset. In Proceedings of NLPCC. Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In Proceedings of EMNLP, pages 43–54. Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen, and Qun Liu. 2019. Nezha: Neural contextualized representation for chinese language understanding. arXiv preprint arXiv:1909.00204. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. In Proceedings of OpenAI Technical report. Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. 2017. Topic aware neural response generation. In Proceedings of AAAI. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. In Proceedings of OpenAI Technical report. Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. 2020. Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model. In Proceedings of ICLR. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100, 000+ questions for machine comprehension of text. In Proceedings of EMNLP. Liang Xu, Xuanwei Zhang, Lu Li, Hai Hu, Chenjie Cao, Weitang Liu, Junyi Li, Yudong Li, Kai Sun, Yechen Xu, et al. 2020. Clue: A chinese language understanding evaluation benchmark. arXiv preprint arXiv:2004.05986. Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of KDD, pages 3505–3506. Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. 
Neural responding machine for short-text conversation. In Proceedings of ACL-IJCNLP. Yusheng Su, Xu Han, Zhengyan Zhang, Peng Li, Zhiyuan Liu, Yankai Lin, Jie Zhou, and Maosong Sun. 2020. Contextual knowledge selection and embedding towards enhanced pre-trained language models. arXiv preprint arXiv:2009.13964. Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. Ernie: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS. Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhiyuan Liu, Juanzi Li, and Jian Tang. 2019. KEPLER: A unified model for knowledge embedding and pre-trained language representation. arXiv preprint arXiv:1911.06136. Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2020. Large batch optimization for deep learning: Training BERT in 76 minutes. In Proceedings of ICLR. Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of ACL. Zhengyan Zhang, Fanchao Qi, Zhiyuan Liu, Qun Liu, and Maosong Sun. 2020. Know what you don't need: Single-shot meta-pruning for attention heads. arXiv preprint arXiv:2011.03770. Chujie Zheng, Minlie Huang, and Aixin Sun. 2019. ChID: A large-scale Chinese IDiom dataset for cloze test. In Proceedings of ACL.

A Contributions

Zhengyan Zhang, Xu Han, and Hao Zhou implemented the large-scale models and model-parallel strategies. Huanqi Cao, Shengqi Chen, Daixuan Li, and Zhenbo Sun built the training infrastructure. Pei Ke, Deming Ye, Jian Guan, Fanchao Qi, and Xiaozhi Wang collected, filtered, and deduplicated the training data. Zhengyan Zhang, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, and Haozhe Ji implemented the downstream tasks and the software framework supporting them. Hao Zhou, Guoyang Zeng, Xu Han, and Yanan Zheng implemented the demos of language generation and knowledge retrieval using CPM. Guoyang Zeng conducted the human evaluations of the model. Hao Zhou, Zhengyan Zhang, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, and Yusheng Su wrote the paper. Zhiyuan Liu, Minlie Huang, and Wentao Han designed and led the research. Jie Tang, Juanzi Li, Xiaoyan Zhu, and Maosong Sun provided valuable advice on the research.