Multi-task Learning with BERT in NLP

Stanford CS224N Default Project

Fan Wang
Department of Computer Science
Stanford University
wang420@stanford.edu

Abstract

In natural language processing, deep learning techniques have achieved remarkable success on many different problems, but the resulting models are mostly task specific. When multiple tasks must be solved at the same time, multi-task learning is a popular paradigm for sharing information across tasks and for learning more efficiently or effectively. With the adoption of large language models such as BERT in NLP, however, there has been little research on enhancing BERT to further improve performance on additional target tasks. The downstream tasks we are interested in are Sentiment Analysis, Paraphrase Detection and Semantic Textual Similarity. In this project, we propose a multi-task learning framework based on BERT, then experiment with extensions to improve performance on the three tasks simultaneously. We have explored models with different architectures, different loss functions, different optimizers, adversarial learning and contrastive learning.

1 Key Information to include

• Mentor:
• External Collaborators (if you have any):
• Sharing project:

2 Introduction

Text classification, a classical problem in NLP, has been studied extensively with various neural network techniques, such as convolutional models (Kalchbrenner et al. (2014), Zhang et al. (2015), Conneau et al. (2016), Johnson and Zhang (2017), Zhang et al. (2017), Shen et al. (2018)), recurrent models (Liu et al. (2019), Dani et al. (2017), Seo et al. (2017)) and attention mechanisms (Yang et al. (2016), Lin et al. (2017)). More recently, models pre-trained on large corpora have been shown to be beneficial for text classification and other NLP tasks. As a state-of-the-art pre-trained language model, BERT (Devlin et al. (2018)) has achieved impressive results on many natural language understanding (NLU) tasks; however, there is little research on enhancing BERT to further improve performance on target tasks (Sun et al. (2020)).

In this research report, we investigate how to make the best use of BERT for three downstream tasks simultaneously. The tasks we study are Sentiment Analysis, Paraphrase Detection and Semantic Textual Similarity. We explore several extensions of fine-tuning BERT to enhance its performance, and we design extensive experiments to analyze the different extensions in detail. The extensions we experiment with are the following:

• Model architecture and number of layers
• Multi-task learning and training tactics
• Adversarial learning
• Contrastive learning

Our first extension is Sentence-BERT (Reimers and Gurevych (2019)), a modification of the BERT network that uses a siamese network to take a pair of sentences as input and derive semantically meaningful sentence embeddings. The next step is to apply our Sentence-BERT model in a multi-task setting, which presents a number of optimization challenges due to the complicated shape of the multiple loss functions, making it difficult to optimize and learn efficiently compared to single-task problems. To overcome this, we study the round-robin strategy, training on the total loss function instead of individual loss functions, and the gradient surgery strategy (Yu et al. (2020)). One of the issues in aggressive fine-tuning is over-fitting. To alleviate the problem, we have explored adversarial learning (Goodfellow et al. (2015)) as a regularization method. The last extension we have studied is contrastive learning, a simple framework which uses entailment pairs as positives and contradiction pairs as hard negatives (Gao et al. (2021), Jiang et al. (2022)).

3 Related Work

We first introduce BERT, then briefly review the extensions that we apply when fine-tuning the BERT model on our multi-task problem.

BERT (Devlin et al. (2018)) is a pre-trained large language model based on the transformer (Vaswani et al. (2017)) architecture. It set new state-of-the-art results for various NLP tasks, including question answering, sentence classification, and sentence-pair regression. To apply BERT to sentence-pair regression, the input consists of the two sentences separated by a special [SEP] token. Multi-head attention is applied, and the output is passed to an activation function or a simple regression layer to derive the final prediction for backpropagation. A drawback of this input structure is that no independent sentence embeddings are computed, which makes it difficult to derive sentence embeddings from BERT.

3.1 Sentence-BERT

The Sentence-BERT (Reimers and Gurevych (2019)) method adds a pooling operation to the output of BERT to derive a fixed-size sentence embedding. By using a siamese network, the sentence embeddings generated by BERT are semantically meaningful and can be compared with a similarity measure, such as cosine similarity or Euclidean distance. In the siamese network, the two BERT models have tied weights. The structure is illustrated in Figure 1.

Figure 1: SBERT architecture with siamese network structure
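To make the siamese setup concrete, below is a minimal PyTorch-style sketch of the Sentence-BERT idea: a shared BERT encoder applied to each sentence independently, mean pooling over the token states, and cosine similarity between the two fixed-size embeddings. It assumes the HuggingFace transformers package and the bert-base-uncased checkpoint; the encode helper is illustrative and is not the project's actual code.

import torch
from transformers import AutoModel, AutoTokenizer

# Sketch of the Sentence-BERT idea: one shared ("siamese") BERT encoder,
# mean pooling over token states, cosine similarity between sentence embeddings.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state       # [batch, seq_len, hidden]
    mask = batch["attention_mask"].unsqueeze(-1).float()  # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling -> [batch, hidden]

u = encode(["A man is playing a guitar."])
v = encode(["Someone is playing an instrument."])
score = torch.nn.functional.cosine_similarity(u, v)       # similarity in [-1, 1]

Because the same encoder produces both u and v, the weights are tied exactly as in the siamese structure of Figure 1.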
3.2 Gradient surgery

While the BERT model, and deep learning techniques in general, have shown remarkable promise in NLP, tasks are usually learned individually, and data availability or efficiency can be a challenge. Multi-task learning has emerged as an approach to learn more efficiently and to share knowledge across different tasks. However, multi-task learning can be very challenging when the gradients of different tasks have severely different magnitudes or point in conflicting directions. The gradient surgery method (Yu et al. (2020)) is a simple yet general approach for avoiding interference between the gradients of different loss functions. It is model-agnostic, works by projecting conflicting gradients (PCGrad), and retains optimality guarantees.

Figure 2: Conflicting gradients and PCGrad
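In our experiments we rely on the off-the-shelf PCGrad implementation from the pytorch-optimizer package (see Section 4.2); the snippet below is only a minimal sketch of the projection step, assuming each task gradient has already been flattened into a single 1-D tensor.

import random
import torch

def pcgrad_combine(task_grads):
    """Sketch of PCGrad (Yu et al., 2020): if a task's gradient conflicts with
    another task's gradient (negative dot product), project it onto the normal
    plane of that gradient. Each element of task_grads is a flat 1-D tensor."""
    projected = []
    for i, g_i in enumerate(task_grads):
        g = g_i.clone()
        others = [g_j for j, g_j in enumerate(task_grads) if j != i]
        random.shuffle(others)                      # other tasks are visited in random order
        for g_j in others:
            dot = torch.dot(g, g_j)
            if dot < 0:                             # conflicting gradient directions
                g = g - (dot / g_j.norm() ** 2) * g_j
        projected.append(g)
    return torch.stack(projected).sum(dim=0)        # combined update direction

# usage (illustrative): flatten each task's gradient into one vector, then
# update = pcgrad_combine([g_sst, g_quora, g_sts])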
3.3 Adversarial learning

Aggressive fine-tuning in a multi-task setting can cause over-fitting. Adversarial training provides a meaningful way of regularizing supervised learning algorithms to combat the problem. Among the many available adversarial techniques, FGSM (Fast Gradient Sign Method), FGM (Fast Gradient Method), and PGD (Projected Gradient Descent) are the three relevant to this project. FGSM (Goodfellow et al. (2015)) linearizes the cost function around the current value of the parameters, obtaining an optimal max-norm constrained perturbation, which can be computed efficiently by backpropagation. FGM (Miyato et al. (2016)) goes one step further by removing the sign function used in FGSM and normalizing the gradient of the loss function. One can interpret FGM and FGSM as simple one-step schemes for solving the inner maximization problem. A more powerful adversary is the multi-step variant, which is essentially PGD (Madry et al. (2017)) on the loss function.

3.4 Contrastive learning

The idea of contrastive learning originally comes from computer vision. In Chen et al. (2020), the authors show that by combining the right data augmentations with the normalized temperature-scaled cross-entropy loss, the model achieves a new state-of-the-art performance. The key idea is to learn effective representations by pulling semantically close neighbors together and pushing apart non-neighbors. The application of contrastive learning in NLP is summarized in SimCSE, Simple Contrastive Learning of Sentence Embeddings (Gao et al. (2021)), and PromCSE, Prompt-based Contrastive Learning for Sentence Embeddings (Jiang et al. (2022)). Supervised SimCSE incorporates labels from the training data into contrastive learning, and PromCSE connects contrastive learning with energy-based learning. Both approaches have shown improvements over the previous best results.

4 Approach

We start from the BERT model at the bottom and build three towers on top, one per task. The BERT layers are shared across the tasks. We experimented with a structure in which the weights of the pooling layers of task 2 and task 3 are also shared, but it yields suboptimal performance. Thus we decided not to share any additional layers besides the BERT layers. The overall network structure is depicted in Figure 3. As the baseline, we use the vanilla BERT model fine-tuned on task 1 only and evaluate it on all three tasks.

Figure 3: Multi-task network structure

4.1 Concatenated layer and feedforward network

There is no concatenated layer in task 1 since only one sentence is used as input; we use a dropout layer and a linear(BERT_Hidden_Size, Number_of_Labels) layer as the feedforward network. In task 2 and task 3, we have explored the three options shown in Figure 4.

Figure 4: Concatenated layer and feedforward network

4.2 Gradient surgery, round-robin and total loss

To train the multi-task model efficiently, we use three methods. The gradient surgery method is introduced in Yu et al. (2020); we use an off-the-shelf implementation of the algorithm from the pytorch-optimizer package on GitHub (https://github.com/jettify/pytorch-optimizer). The full update procedure is described in Table 1 below:

Gradient Surgery
1. Compute number of batches for each task
2. While All(number of batches) > 0 do:
3.   calculate loss of task 1, update batch counts
4.   calculate loss of task 2, update batch counts
5.   calculate loss of task 3, update batch counts
6.   apply PCGrad and backpropagate

Table 1: Gradient surgery algorithm

The round-robin and total loss methods are relatively straightforward compared to the gradient surgery approach; the algorithms are listed in Table 2 below. We implement the logic ourselves.

Round-robin
1. Compute number of batches for each task
2. While sum(number of batches) > 0 do:
3.   calculate loss of task 1, if loss, backpropagate
4.   calculate loss of task 2, if loss, backpropagate
5.   calculate loss of task 3, if loss, backpropagate
6.   update batch counts

Total Loss
1. Compute number of batches for each task
2. While All(number of batches) > 0 do:
3.   calculate loss of task 1, update batch counts
4.   calculate loss of task 2, update batch counts
5.   calculate loss of task 3, update batch counts
6.   calculate the sum of loss, backpropagate

Table 2: Round-robin and total loss algorithms
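As a concrete illustration of the round-robin schedule in Table 2, below is a simplified PyTorch-style sketch. It assumes one dataloader and one loss function per task; the helper names (loaders, loss_fns) are illustrative rather than the project's actual code.

def train_round_robin(model, optimizer, loaders, loss_fns):
    """Sketch of the round-robin schedule in Table 2: cycle over the tasks, take
    one batch from each task in turn, and backpropagate each task loss separately
    until every task's dataloader is exhausted."""
    iters = [iter(dl) for dl in loaders]
    remaining = [len(dl) for dl in loaders]          # number of batches left per task
    while sum(remaining) > 0:
        for task_id, (it, loss_fn) in enumerate(zip(iters, loss_fns)):
            if remaining[task_id] == 0:
                continue                             # this task has no batches left
            batch = next(it)
            optimizer.zero_grad()
            loss = loss_fn(model, batch)             # task-specific forward pass and loss
            loss.backward()
            optimizer.step()
            remaining[task_id] -= 1

The total loss variant differs only in that the three task losses are summed and backpropagated once per round instead of separately.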
4.3 Adversarial learning

The code implementing the three adversarial methods, FGSM (Goodfellow et al. (2015)), FGM (Miyato et al. (2016)) and PGD (Madry et al. (2017)), is from GitHub (https://github.com/xiaopp123/adversarial_train). We list the key update of each method below:

• FGSM: x_adv = x + ϵ · sign(∇x Loss)
• FGM: x_adv = x + ϵ · ∇x Loss / ‖∇x Loss‖2
• PGD: x_adv = Π_{x+S}(x + α · sign(∇x Loss)), i.e., an FGSM-style step followed by a projection back onto the allowed perturbation set x + S, repeated over several steps

4.4 Contrastive learning

We view contrastive learning as a data augmentation method. To set up contrastive learning with our labeled dataset, we implement the logic below ourselves. The current logic is based on a double for-loop and is therefore very slow; we note vectorizing it as a TODO for future improvement. We apply this logic when calculating the loss function of task 2.

Contrastive Learning
1. While number of batches > 0 do:
2.   calculate logits and extract label from the batch input data
3.   do i from 1 to batch size:
4.     do j from 1 to batch size:
5.       if i != j:
6.         calculate logits based on input index i and input index j
7.         set label to 0
8.   concatenate logits from step 2 and logits from step 6
9.   concatenate label from step 2 and label from step 7
10.  calculate the updated loss function

Table 3: Contrastive learning

5 Experiments

5.1 Data

We use the Stanford Sentiment Treebank dataset for Task 1: Sentiment Analysis, the Quora dataset for Task 2: Paraphrase Detection, and the SemEval STS Benchmark dataset for Task 3: Semantic Textual Similarity.

Stanford Sentiment Treebank Dataset. The Stanford Sentiment Treebank consists of 11,855 single sentences extracted from movie reviews. Each phrase is labeled as negative, somewhat negative, neutral, somewhat positive, or positive. We use the following splits:

• train (8,544 samples)
• dev (1,101 samples)
• test (2,210 samples)

Quora Dataset. The Quora dataset consists of 400,000 question pairs with labels indicating whether particular instances are paraphrases of one another. A subset of this dataset is provided with the following splits:

• train (141,506 samples)
• dev (20,215 samples)
• test (40,431 samples)

SemEval STS Benchmark Dataset. The STSB dataset consists of 8,628 sentence pairs of varying similarity on a scale from 0 (unrelated) to 5 (equivalent meaning). The splits are:

• train (6,041 samples)
• dev (864 samples)
• test (1,716 samples)

5.2 Evaluation method

Accuracy is used as the metric for the sentiment analysis and paraphrase detection tasks. The Pearson correlation coefficient is used for the semantic textual similarity task.

5.3 Experimental details

Our BERT backbone is the provided default, "bert-base-uncased", with the default learning rate (1e-5). We explored different learning rates (2e-5 and 3e-5), but they do not seem to have a big impact on performance. We use the Adam optimizer in all experiments. The batch sizes are modified to fully use the available GPU and to ensure that the number of batches in each task is roughly the same, given the imbalanced training data. Table 4 summarizes the details of the experiments we have performed. Option 2 and option 3 in the model structure column are defined in Figure 4 of Section 4.1. We also explored models with option 1 as the structure, but their performance is usually worse than the other two options. We believe the two terms, abs(u-v) and u*v, carry at least as much information as a simple cosine-similarity term, so we did not combine option 1 with the other extensions in our experiments. In addition, we observed a roughly 10% improvement in performance after switching from option 2 to option 3. Therefore, option 3 is selected.

Number  Batch Size  Model Structure  Multi-task Loss    Adversarial  Contrastive
1       8/64/8      option 2         round-robin        -            -
2       8/64/8      option 3         round-robin        -            -
3       4/64/2      option 3         total loss         -            -
4       4/64/2      option 3         gradient surgery   -            -
5       4/64/2      option 3         round-robin        FGM          -
6       4/64/2      option 3         round-robin        FGSM         -
7       4/64/2      option 3         round-robin        PGD          -
8       4/64/2      option 3         round-robin        -            Contrastive

Table 4: Experimental details
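As an illustration of the sentence-pair heads compared above, below is a PyTorch-style sketch of a head that concatenates u, v, abs(u-v) and u*v before a small feedforward network. This is only a sketch in the spirit of options 2 and 3; the exact layer sizes and the extra hidden layer are assumptions, and Figure 4 defines the actual options.

import torch
import torch.nn as nn

class PairHead(nn.Module):
    """Illustrative head for the sentence-pair tasks: the two sentence embeddings
    u and v are combined as [u, v, |u - v|, u * v] and passed through a small
    feedforward network. Layer sizes here are assumptions, not the exact option 3."""
    def __init__(self, hidden_size=768, num_labels=2, dropout=0.3):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(4 * hidden_size, hidden_size),   # assumed extra hidden layer
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, u, v):
        features = torch.cat([u, v, torch.abs(u - v), u * v], dim=-1)
        return self.ffn(features)

# usage: logits = PairHead(num_labels=2)(u, v) for paraphrase detection (task 2);
# num_labels=1 with the output treated as a similarity score is one option for STS (task 3).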
5.4 Results

Our baseline is the model fine-tuned only on the SST dataset. We denote the baseline model as experiment 0 and report its performance together with the other experiments described in Section 5.3 in Table 5.

Description    sst-dev  quora-dev  sts-dev  overall average
experiment 0   0.529    0.453      0.018    0.333
experiment 1   0.496    0.844      0.623    0.654
experiment 2   0.502    0.869      0.765    0.712
experiment 3   0.500    0.856      0.786    0.714
experiment 4   0.495    0.864      0.789    0.718
experiment 5   0.500    0.856      0.806    0.721
experiment 6   0.469    0.870      0.791    0.710
experiment 7   0.509    0.845      0.792    0.715
experiment 8   0.393    0.843      0.575    0.603

Table 5: Experimental results

Due to the limited number of submissions allowed on the test leaderboard (i.e., 3 in total), we only report model performance on the development data. The model submitted to the leaderboard is from experiment 5, with an overall performance metric of 0.717 on the test leaderboard and the following performance on each of the three tasks:

• SST test accuracy: 0.498
• Paraphrase test accuracy: 0.853
• STS test correlation: 0.799

In experiment 8, the model is trained for only 1 epoch and takes 6 hours to finish due to the double for-loop logic. Other than that, all our experiments perform better than the baseline model. Using more layers results in a roughly 10% increase starting from experiment 3. However, the overall performance hits a plateau in the (0.71, 0.72) range, and the multiple extensions we have tried do not improve performance much. Our model performs best on the Paraphrase Detection task, since the Quora dataset has the most samples, around 144K. The performance on the Semantic Textual Similarity task is the second best, since the inputs to Paraphrase Detection and Semantic Textual Similarity are both sentence pairs. The model performs worst on the Sentiment Analysis task, at only around 0.5. This is not a surprise, since its input is a single sentence.

6 Analysis

Without adding additional data to our training samples or ensembling with another model, we think the peak performance is about 0.5 for the Sentiment Analysis task, 0.87 for the Paraphrase Detection task and 0.80 for the Semantic Textual Similarity task. Due to the high imbalance between training sample sizes (task 1: 8K, task 2: 144K and task 3: 6K), the knowledge extracted from task 2 dominates our multi-task models, and the various extensions do not have a large impact on performance. It would be interesting to know the result if we were able to run experiment 8 for the full 10 epochs.

7 Conclusion

In this research project, we have fine-tuned the BERT model in a multi-task setting. We think that adding more Sentiment Analysis-like data would help to improve the model's performance on task 1, and that adding more Semantic Textual Similarity-like data would help on task 3. As next steps, we would like to fix the contrastive learning logic to observe its impact and to include another model with a different pre-trained BERT backbone.
References

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. arXiv preprint.
Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2016. Very deep convolutional networks for natural language processing. arXiv preprint.
Yogatama Dani, Chris Dyer, Wang Ling, and Phil Blunsom. 2017. Generative and discriminative text classification with recurrent neural networks. arXiv preprint.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint.
Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint.
Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. arXiv preprint.
Yuxin Jiang, Linhan Zhang, and Wei Wang. 2022. Improved universal sentence embeddings with prompt-based contrastive learning and energy-based learning. arXiv preprint.
Rie Johnson and Tong Zhang. 2017. Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 562–570.
Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. arXiv preprint.
Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint.
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-task deep neural networks for natural language understanding. arXiv preprint.
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint.
Takeru Miyato, Andrew M. Dai, and Ian Goodfellow. 2016. Adversarial training methods for semi-supervised text classification. arXiv preprint.
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint.
Minjoon Seo, Sewon Min, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Neural speed reading via Skim-RNN. arXiv preprint.
Dinghan Shen, Yizhe Zhang, Ricardo Henao, Qinliang Su, and Lawrence Carin. 2018. Deconvolutional latent-variable model for text sequence matching. In Thirty-Second AAAI Conference on Artificial Intelligence.
Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2020. How to fine-tune BERT for text classification? arXiv preprint.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint.
Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.
Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. 2020. Gradient surgery for multi-task learning. arXiv preprint.
Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.
Yizhe Zhang, Dinghan Shen, Guoyin Wang, Zhe Gan, Ricardo Henao, and Lawrence Carin. 2017. Deconvolutional paragraph representation learning. In Advances in Neural Information Processing Systems, pages 4169–4179.