Exploring Multi-Task Learning for Robust Language Encoding with BERT
Stanford CS224N Default Project

Alejandro Lozano
Department of Biomedical Data Science, Stanford University
lozanoe@stanford.edu

Laura Bravo
Department of Biomedical Data Science, Stanford University
lmbravo@stanford.edu

Abstract
Transformer-based Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP). By analyzing large amounts of text data, LLMs are capable of identifying relationships between words and phrases, as well as their context, resulting in a more nuanced language understanding. LLMs are transferable, allowing them to be pre-trained on large datasets and later fine-tuned on smaller downstream-specific tasks. However, fine-tuning can lead to catastrophic forgetting, where previously learned information is lost. In this work we propose a BERT-based architecture that promotes representation generalization by training on multiple tasks: Sentiment Analysis (SA), Paraphrase Detection (PD), and Semantic Textual Similarity (STS). Our experiments suggest that, even when accounting for task interference, a Multi-task Learning (MTL) framework is only effective when it can leverage related tasks.

1 Key Information to include
• Mentor: Hans Hanley

2 Introduction
In recent years, Natural Language Processing (NLP) has been revolutionized by transformer-based Large Language Models (LLMs) [1]. By analyzing vast amounts of text data, LLMs can identify subtle relationships between words and phrases, as well as the context in which they are used, enabling them to capture the nuances of human language better than previous approaches. Additionally, LLMs are transferable, meaning that they can be pre-trained on large amounts of data and then fine-tuned on smaller task-specific datasets, enabling the development of more efficient and effective models for specific NLP tasks without the need for extensive training data. For instance, BERT [2] (one of the first LLMs) led to state-of-the-art performance on multiple downstream tasks, out-competing the specialized systems of the time [3].
However, the fine-tuning process can often lead to catastrophic forgetting, that is, a neural network forgetting previously learned information when trained on new and unrelated data, which leads to a loss of generalizability in the learned representations [4]. This phenomenon has motivated research avenues that improve the robustness of learned representations for downstream tasks. One such approach is Multi-task Learning (MTL), which promotes representation generalization by training on multiple related tasks simultaneously. In MTL, models learn to share information across tasks via a common feature-extraction backbone, while capturing relevant features for each task with task-specific components [5]. Nevertheless, MTL involves challenges that remain open research questions, for example task interference, which occurs when tasks negatively impact each other's performance [6]. This problem is particularly relevant when tasks have different requirements or are not well aligned.
In this work, we study the problem of developing robust embeddings with LLMs for NLP downstream tasks by proposing an MTL BERT-based architecture together with strategies to handle task interference. In particular, we apply our approach to the tasks of Sentiment Analysis (SA), Paraphrase Detection (PD), and Semantic Textual Similarity (STS).
We validate our approach across all tasks by comparing it to the single-task fine-tuned paradigm and find that only when the tasks are "related" does the MTL approach provide a significant boost in performance (e.g., increased accuracy).

3 Related Work

3.1 Transformer-based Models for Natural Language Understanding
Since the introduction of BERT [2], Transformer-based pre-trained language models have dominated the field of Natural Language Understanding. Such architectures have provided unprecedented improvements in accuracy on various tasks compared to traditional models (e.g., LSTMs), at the cost of an increased number of parameters, making them computationally expensive and unreliable due to the memory limitations of available hardware [7]. The latter hinders their adoption for applications such as sentence-pair regression tasks, e.g., large-scale semantic textual similarity, clustering, paraphrase detection, and information retrieval via semantic search. Hence, several works have addressed this issue with architectural solutions. Poly-encoders, proposed by Humeau et al. [8], address the run-time overhead of BERT by computing candidate embeddings using attention. However, the score function used by this approach is not symmetric, and the computational overhead is still large for some applications (e.g., clustering). In [9], a Siamese triplet network architecture (Sentence-BERT) was proposed as a computationally efficient replacement for BERT. As an example, on a modern V100 GPU, hierarchical clustering of 10,000 sentences with standard BERT requires 65 hours (since 50 million sentence combinations must be compared), while it only takes 5 seconds with Sentence-BERT. Another successful approach was ALBERT [10], which introduced two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, showing comprehensive empirical evidence that such methods lead to models that scale much better than the original BERT.

3.2 General Robust Embeddings
Since contextualized representations retrieved from pre-trained LLMs are central to achieving high performance on downstream NLP tasks, the search for an optimal sentence embedding scheme remains an active research area in computational linguistics [11]. Although BERT-based models employ the [CLS] token vector as a "reasonable" sentence embedding (a common starting point in several works), it has been shown that such embeddings yield worse results than GloVe embeddings [12] for several applications, such as sarcasm detection [13], semantic co-occurrence [14], and STS [9]. Thus, exhaustive analyses comparing different strategies to retrieve embeddings from BERT (and other encoders such as ALBERT [10]) have been explored. For example, it has been observed that averaging the BERT tokens (a token pooling strategy, along with max pooling) for every word in a sentence provides better results than the [CLS] token [9]. Furthermore, applying a CNN after the pooling operation might slightly boost performance for some tasks, but it also has a negative impact on others [11]. Lastly, Merchant et al. [15] suggest that linguistic features are not always incorporated into the final prediction layer (nor in earlier layers), showing that the layers closer to the final BERT embedding (such as the second-to-last) might provide a boost in performance for downstream tasks.
A possible reason for this observation might be that the last layer provides embeddings that capture features necessary for BERT's pre-training pretext tasks (masked LM and next-sentence prediction).

3.3 Multi-task-based Robust Embeddings
In contrast to the selection of a pooling strategy or a layer representation, other works have focused on training strategies to obtain robust embeddings. Hence, representation learning, meta-learning, and multi-task learning frameworks have been extensively studied as plausible solutions [16, 17, 18]. Under the MTL paradigm, models are jointly trained on multiple related tasks, using shared representations to learn generalized features from a collection of tasks and integrating knowledge across domains [19, 20]. In its most general form, the MTL paradigm only requires a tailored architecture design [21, 22]. The most common approaches are parallel, hierarchical, modular, and generative adversarial architectures. In the parallel architecture, a model is shared among multiple tasks while each task has its own output layer. Hierarchical architectures explicitly model the interaction between tasks, while modular architectures decompose a given model into shared and task-specific components (learning task-invariant and task-specific features, respectively) [23, 24].
As a common extension, several MTL workflows aim to improve robustness among tasks via custom optimization processes such as dynamically tuning gradient magnitudes [25], employing meta objectives [26], and hybrid balancing methods [27]. However, these types of approaches simply ignore the feedback that conflicting gradients could provide, which might improve performance but wastes the potential joint learning space of an MTL setting. Thus, Yu et al. proposed Gradient Surgery [28], an algorithm that projects a conflicting task's gradient onto the normal plane of another task's gradient, which has shown state-of-the-art performance in several applications [29, 30, 31, 32]. Theoretically, a shared embedding increases data efficiency while making a representation more robust for related downstream tasks (providing attention to relevant features), but in practice it may lead to degraded performance, especially when tasks compete for model capacity [22, 33]. Hence, recent observations on MTL have questioned the utility of the gradient surgery, meta-objective, and dynamic gradient tuning paradigms. Xin et al. [34] suggest that such MTL strategies might not provide better performance than simply optimizing a weighted average of the task losses (obtained via hyper-parameter search). Other studies suggest that a possible reason for this observation may be a combination of several factors, since, unlike transfer-task affinities, multi-task affinities are highly sensitive to a number of components external to the joint loss function, such as dataset size and network capacity [35].

4 Approach
Figure 1: Final architecture for the Multi-Task Learning approach. All tasks share the BERT-Based Encoding Module for embedding the input sentence.

To (theoretically) improve the robustness of the learned embeddings, we experimented with the MTL paradigm. Figure 1 depicts our proposed final architecture (after exploring different combinations). We then compare the performance of MTL against single-task training using the same model, which serves as our baseline.
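To make the parallel layout of Figure 1 concrete, the following is a minimal PyTorch-style sketch, not the project's actual code, of a shared encoder feeding one head per task. The encoder and the two siamese pair heads are passed in as modules, and all class names and batch keys are illustrative assumptions.

```python
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Parallel MTL layout of Figure 1: one shared encoder, one head per task."""
    def __init__(self, encoder, pd_head, sts_head, hidden_size, num_classes=5):
        super().__init__()
        self.encoder = encoder                   # shared BERT-based encoding module h(s)
        self.sa_head = nn.Sequential(            # sentiment head: MLP producing 5 class logits
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, num_classes))
        self.pd_head, self.sts_head = pd_head, sts_head   # siamese pair heads (see Section 4)

    def forward(self, task, batch):
        if task == "sa":                         # single-sentence task
            return self.sa_head(self.encoder(batch["input_ids"], batch["attention_mask"]))
        # sentence-pair tasks run the shared encoder twice (siamese setup)
        xa = self.encoder(batch["input_ids_1"], batch["attention_mask_1"])
        xb = self.encoder(batch["input_ids_2"], batch["attention_mask_2"])
        head = self.pd_head if task == "pd" else self.sts_head
        return head(xa, xb)
```

The sentence-pair tasks reuse the same encoder twice, which is what lets all three tasks share a single BERT backbone.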
Shared BERT Backbone. To promote shared learning, all tasks share a BERT-Based Encoding Module h, which takes a sentence s as input and outputs an encoded representation e = h(s) that is then used by the task-specific heads. We process each sentence by first tokenizing the words, adding positional embeddings, and encoding them with the BERT model.

Task-specific heads. For SA we use a straightforward head composed of an MLP and a softmax function to predict the sentiment class $\hat{y}_{SA} \in \{0, 1, 2, 3, 4\}$. PD and STS are based on comparing two sentences, so our design is inspired by the work of Reimers et al. [36]. We employ a siamese network approach to process two input sentences A and B and obtain their representations $X_a$ and $X_b$. The final output (for our best-performing model) is predicted using a weighted average between cosine similarity and a mapping function $f$ over these representations, yielding $\hat{y}_{STS} \in [0, 5]$ and $\hat{y}_{PD} \in [0, 1]$. In particular,

$$\hat{y}_{task} = \alpha \, f(\mathrm{concat}(X_a, X_b, |X_a - X_b|)) + (1 - \alpha)\,\mathrm{cosine\text{-}similarity}(X_a, X_b) \quad (1)$$

Note that for STS the cosine-similarity term is multiplied by 5 to obtain $\hat{y}_{STS} \in [0, 5]$.

Task-specific head selection. We proposed four different architectures and compared their performance on every task (decoupled). For every architecture version, the most significant variations are applied to the siamese network used for STS and PD (as the exploration of an SA architecture was done in the first part of the project).
1. Architecture V1 (shown in Figure 2 in the Appendix) is a simplified version of V3 (depicted in Figure 1). Every sentence is encoded into the CLS token and fed directly into an MLP (different for every task); the resulting embeddings are then fed to a cosine-similarity function for PD and STS, and to a softmax for SA.
2. For V2 we introduced the BERT-Based Encoding Module of architecture V3 and left everything else equal to V1.
3. Architectures V3a and V3b (shown in Figure 1) are essentially the same, but use different values of $\alpha_{task}$ (shown in Equation 1), which reflects a trade-off between the MLP and the cosine-similarity function. Hence, the two differences between these and the previous architectures are the addition of a weighted average between an MLP and a cosine-similarity function as a final layer, and the fact that the MLP applied after the BERT-Based Encoding Module is shared among all tasks.
(a) For V3a we selected $\alpha_{PD} = 0.99$ (adding more weight to the MLP layer) and $\alpha_{STS} = 0.01$ (adding more weight to the cosine-similarity layer).
(b) For V3b, $\alpha_{PD} = 0.01$ (adding more weight to the cosine-similarity layer) and $\alpha_{STS} = 0.99$ (adding more weight to the MLP layer).

BERT Embeddings. For the sentence representation of our final model V3, instead of using the hidden state of the <CLS> token of the last BERT layer, we use mean pooling across all tokens of a given word from the second-to-last BERT layer and multiply each representation by its respective attention weight, as shown in Figure 1. Following [11], we added an optional 1D-convolutional layer on top of these embeddings (the selection of the best-performing BERT embedding is explained in the next section).
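The pooling scheme and Equation (1) can be sketched as follows. This is a minimal illustration, assuming a HuggingFace-style BERT that exposes hidden_states and interpreting the attention weighting as attention-mask-weighted mean pooling; the class names, MLP shape, and default values are our own assumptions rather than the project's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceEncoder(nn.Module):
    """Sentence embedding: masked mean pooling over the second-to-last BERT layer."""
    def __init__(self, bert):
        super().__init__()
        self.bert = bert  # assumed HuggingFace-style BertModel

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                        output_hidden_states=True)
        hidden = out.hidden_states[-2]                       # (B, T, H), second-to-last layer
        mask = attention_mask.unsqueeze(-1).float()          # (B, T, 1), zero for padding
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # attention-mask-weighted mean

class SiameseHead(nn.Module):
    """Equation (1): alpha * MLP(concat(Xa, Xb, |Xa - Xb|)) + (1 - alpha) * cosine(Xa, Xb)."""
    def __init__(self, hidden_size, alpha, scale=1.0):
        super().__init__()
        self.alpha, self.scale = alpha, scale   # scale = 5.0 for STS so the cosine maps to [0, 5]
        self.mlp = nn.Sequential(nn.Linear(3 * hidden_size, hidden_size), nn.ReLU(),
                                 nn.Linear(hidden_size, 1))

    def forward(self, xa, xb):
        mlp_score = self.mlp(torch.cat([xa, xb, (xa - xb).abs()], dim=-1)).squeeze(-1)
        cos_score = self.scale * F.cosine_similarity(xa, xb, dim=-1)
        return self.alpha * mlp_score + (1.0 - self.alpha) * cos_score
```

For V3a, such a head would be instantiated with alpha=0.99 for PD and alpha=0.01, scale=5.0 for STS, and could serve as the pd_head and sts_head modules in the earlier sketch.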
Loss functions. For joint training we define an MTL loss function $\mathcal{L}_{overall}$ composed by adding the individual task loss functions $\mathcal{L}_{task}$. $\mathcal{L}_{SA}$ is defined by a standard cross-entropy loss. PD is formulated as a binary classification problem, thus we use a binary cross-entropy loss, a cosine embedding loss, and a triplet loss to define $\mathcal{L}_{PD}$, in order to ensure that the embeddings are expressive, meaning that the model does not trivially learn to increase the norm of some word embeddings to capture similarity, as noted in Bordes et al. [37]. Finally, STS is a regression problem, therefore we define its loss ($\mathcal{L}_{STS}$) as a weighted average between the mean-squared-error and Lasso losses.

$$\mathcal{L}_{overall} = c_1 \mathcal{L}_{SA} + c_2 \mathcal{L}_{PD} + c_3 \mathcal{L}_{STS}$$
$$\mathcal{L}_{SA} = \mathcal{L}_{ce}$$
$$\mathcal{L}_{PD} = \mathcal{L}_{bce} + \mathcal{L}_{cosine\text{-}emb} + \mathcal{L}_{triplet\text{-}margin}$$
$$\mathcal{L}_{STS} = \gamma \mathcal{L}_{mse} + (1 - \gamma) \mathcal{L}_{lasso}$$

where the $c_i$ are scaling constants that project all $\mathcal{L}_{task}$ to the same magnitude.

Time-efficient cosine triplet loss for Paraphrase Detection. The triplet margin loss function requires an anchor, a positive, and a negative example. We compute the cosine triplet loss when the label is equal to one, setting the two encoded sentences as the anchor and positive example and a randomly sampled sentence as the negative example. A naive approach to obtain such negative representations would be to create a second data loader (shuffled differently) and sample from the two loaders at the same time in order to calculate the BERT embeddings for the positive, negative, and anchor examples. To avoid this implementation (which could potentially add more hardware constraints), we propose a time-efficient implementation where we store the BERT representations of a given batch to be used as the negative samples for the next batch. Although we implemented this idea from scratch, it is a popular approach in Computer Vision (contrastive learning) [38, 39] and in knowledge graph embedding algorithms [40].

Multi-task training strategy. We handle task interference by following the work of Yu et al. [28], which identified conflicting gradients (gradients for different tasks pointing away from one another) as the primary optimization issue in MTL. Though averaging over the task gradients can provide (under specific assumptions) a correct solution, there are key scenarios in which this may lead to degraded performance. Hence, we implement Gradient Surgery (GS) [28], a method for handling conflicting gradients across tasks, to jointly train our MTL model. Formally, given a collection of tasks $T_i \in \mathcal{T}$ for $i \in \{1, 2, \ldots, n\}$, the gradient for task $T_j$ is denoted $g_j$, while the gradient for task $T_i$, $i \neq j$, is denoted $g_i$. If a gradient conflict is encountered (a negative cosine similarity between $g_i$ and $g_j$), $g_i$ is replaced by its projection onto the normal plane of $g_j$: $g_i = g_i - \frac{g_i \cdot g_j}{\|g_j\|^2} g_j$. Otherwise, when the cosine similarity is non-negative (non-conflicting gradients), the original gradient $g_i$ is left unaltered. This process is repeated across all of the other tasks, sampled in random order.
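A minimal sketch of one Gradient-Surgery update is given below. It assumes the per-task losses come from the same forward pass and projects gradients over all trainable parameters; the function name, the small epsilon term, and the usage lines in the comments are illustrative assumptions, and the project's actual training loop (which also uses gradient accumulation) may differ.

```python
import random
import torch

def pcgrad_backward(losses, params):
    """Gradient Surgery (PCGrad): project each task gradient away from the gradients
    it conflicts with, then sum the results and write them into the .grad fields."""
    flat_grads = []
    for loss in losses:
        grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
        flat_grads.append(torch.cat([
            (g if g is not None else torch.zeros_like(p)).reshape(-1)
            for g, p in zip(grads, params)]))

    projected = [g.clone() for g in flat_grads]
    for i, g_i in enumerate(projected):
        others = [j for j in range(len(flat_grads)) if j != i]
        random.shuffle(others)                     # visit the other tasks in random order
        for j in others:
            g_j = flat_grads[j]
            dot = torch.dot(g_i, g_j)
            if dot < 0:                            # conflict: negative cosine similarity
                g_i -= (dot / (g_j.norm() ** 2 + 1e-12)) * g_j

    total = torch.stack(projected).sum(dim=0)
    offset = 0
    for p in params:                               # unflatten back into parameter gradients
        p.grad = total[offset:offset + p.numel()].view_as(p).clone()
        offset += p.numel()

# Illustrative usage for one batch (c1, c2, c3 are the magnitude-scaling constants):
# losses = [c1 * loss_sa, c2 * loss_pd, c3 * loss_sts]
# params = [p for p in model.parameters() if p.requires_grad]
# optimizer.zero_grad(); pcgrad_backward(losses, params); optimizer.step()
```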
5 Experiments

5.1 Data
For each task we use the datasets from the project instructions, including the proposed data splits. For SA: the Stanford Sentiment Treebank [41]. For PD: the Quora question pairs dataset [42]. For STS: the SemEval STS Benchmark dataset [43].

5.2 Evaluation method
We use accuracy and F1 to evaluate performance on the SA and PD tasks, as they are formulated as classification tasks. For the STS task we calculate the Pearson correlation and EM between the predicted and true similarity values.

5.3 Experimental details
For all experiments, we use an NVIDIA A10 GPU. We perform our experiments from a pre-trained checkpoint (provided for the project); thus every subsequent experiment starts (fine-tunes) from this model. In some experiments, we freeze a previously fine-tuned model and further fine-tune it for a single task (a fine-tune of the fine-tune); we refer to this as fine-tuning (and use the word training when we fine-tune the original checkpoint).

Single-task fine-tuning. We perform this experiment twice, with two different goals:
1. Architecture selection. To select the best architecture we train every model for 3 epochs with learning rate $1 \times 10^{-5}$, batch size 16, and dropout probability 0.7 (this was done due to resource constraints).
2. Baseline. To compare the performance against MTL we train the best-performing model (from the previous experiment) on a single task using the following parameters: 15 epochs (with early stopping), learning rate $5 \times 10^{-6}$, batch size 32, dropout probability 0.7.

Embedding selection. After selecting the best architecture, we explore three BERT embedding strategies (CLS token, mean pooling, mean pooling + 1D-CNN). Due to resource constraints we only train the models for 3 epochs with a dropout probability of 0.7, a batch size of 8, a learning rate of $5 \times 10^{-6}$, and a gradient surgery workflow.

MTL-GS training. For MTL we train our best-performing architecture from Table 2 (V3) using a dropout probability of 0.65, a batch size of 8 with 4 gradient accumulation steps, a learning rate of $8 \times 10^{-6}$, 15 epochs (01:23 hours per epoch), a gradient surgery workflow, and average-pooling BERT embeddings.

Table 1: Test Set Results for all tasks.
(a) Sentiment Analysis: Model Final, F1 52.35
(b) Paraphrase Detection: Model Final, F1 76.37
(c) Semantic Textual Similarity: Model Final, EM 61.08

MTL to single-task fine-tuning. After training the model with the MTL paradigm we fine-tune the resulting checkpoint for every task using the following parameters: 10 epochs, batch size 32, dropout probability 0.7, learning rate $6 \times 10^{-7}$.

Ablation studies.
1. We train MTL-GS with "similar" tasks (removing SA and leaving PD and STS) using the following parameters: 10 epochs, batch size 16, hidden dropout probability 0.7, learning rate $2 \times 10^{-6}$, with equal weights for both loss functions (unscaled, $c_2 = c_3 = 1$) and with scaled loss functions ($c_2 = 1$, $c_3 = 4$). For both scenarios, every epoch took 00:37:15 on average.
2. To see the impact of the loss we remove $\mathcal{L}_{cosine\text{-}emb}$ and $\mathcal{L}_{triplet}$ from $\mathcal{L}_{PD}$; this experiment is denoted $PD_{ce}$ (since it only uses cross-entropy). We also run this experiment with equal weights for both loss functions (unscaled, $c_2 = c_3 = 1$) and with scaled loss functions ($c_2 = 1$, $c_3 = 4$). All other parameters are left equal to the first ablation study (described above). For both scenarios, every epoch took 00:30:23 on average.

5.4 Results and Analysis
Test set results. Table 1 shows the results of our MTL approach on the test set (obtained from the leaderboard). As expected, the results are lower than those on the dev set. We also note that the relative performance ordering of the tasks is maintained. We obtain position 87 on the leaderboard with a global score of 63.27.

Dev Set Results: Architecture Selection. Table 2 shows the results of experimenting with variations of our proposed MTL architecture V3 (final) (see Figure 1). Note that these results were obtained by training the model on every single task for 3 epochs. From this, we can derive four conclusions.
1. Adding a layer on top of the BERT embeddings (before feeding them to the task-specific MLPs) further increases the performance of the architecture. This can be seen by comparing V1 against V2. Our hypothesis is that this approach works because the extra layer learns semantic features important for a downstream task, which helps the model "diverge" from the features learned for BERT's pre-training strategy.
2. Comparing the different variations of V3a, we conclude that mean-pooled tokens provide more expressive embeddings than the CLS token. This result aligns with Wang et al. [44] and Reimers et al. [9] (an initial comparison between mean pooling and the CLS token is provided in the Appendix).
3. Seemingly similar tasks require different architecture designs. As shown by V3a vs. V3b, PD greatly benefited from having an MLP as the final layer, while STS performed best when a scaled cosine-similarity layer was used. This aligns with previous observations from Reimers et al. [9] and Viji et al. [45] that using cosine similarity for STS yields the best results compared to other methods, such as inputting the two sentences to BERT jointly (separated by the [SEP] token) and applying an MLP head on top.
4. Sharing an MLP across tasks benefits the performance of the model. This observation can be seen by comparing V2 against V3a and V3b (using only the cosine-embedding results, to keep all else equal). However, it is important to clarify (as pointed out in the next subsection) that this only benefits tasks if they are similar in nature (e.g., PD and STS). This insight is supported by recent progress in MTL, where researchers exploit the complex relationships across tasks by sharing layers across different task heads [46, 47, 48].

Dev Set Results: MTL-GS, does it really work? Table 3 shows the comparison of our final method (the V3 architecture with MTL and GS) against the baselines of training a single task by decoupling our proposed architecture's task-specific heads. For clarity, Table 3 also notes how each model was fine-tuned. We highlight the following:
1. Version V3b shows how the BERT-Based Encoding Module improves the results for STS, but does not perform better than V1 (which uses CLS tokens) for SA and PD. We hypothesize that for our proposed sentence representation to be effective, it needs to be paired with an effective architecture design (especially at the final layer) in order to benefit all tasks.
2. As has been reported previously, jointly training all tasks with GS-MTL provides robust embeddings (best average over all tasks), but it leads to degraded per-task performance [22, 33], meaning that jointly training all tasks does not yield the best possible result, since single-task training for SA and PD outperforms MTL-GS (only STS benefits).
3. However, from previous experiments we observed an improvement when adding GS to the MTL optimization. Thus, we hypothesize that there was task interference (even after projecting the gradients), which could be caused by dissimilar tasks interacting. It is possible that by using all tasks, the additional examples lead to an even sparser embedding space, to the detriment of the representations of individual tasks. Following this idea, we perform an ablation study to see whether training only "similar" tasks increases performance. As shown by all the variations of GS-MTL (only PD + STS) versus GS-MTL, our hypothesis is correct, since this led to the best-performing models for both PD and STS.
4. For PD and STS, the best-performing model was obtained with scaled loss weights, as shown by MTL-GS (only PD + STS, scaled). This means that GS benefits greatly when the loss functions of the different tasks have similar magnitudes.
5. Comparing the results of our loss-function ablation study, we conclude that adding a triplet loss benefited the overall performance of MTL-GS for PD and STS. Furthermore, our implementation only adds 7 extra minutes per epoch.
6. As a final highlight, we also notice that even when a task is not directly included in MTL training, it can benefit from the other tasks in a low-data regime. For example, there is a marked improvement for SA considering that the model was never trained for SA (see rows 2 and 3).

Table 2: Results of Architecture Variations with Multi-task Learning on the Dev Set.

Architecture | Embedding Type | SA Acc | PD Acc | STS PC
V1 | CLS Token | 45.4 | 45.6 | 9.9
V2 | Mean-Pooling | – | 66.6 | 13.6
V3a (Final) | Mean-Pooling | 46.2 | 70.1 | 40.2
V3a | Mean-Pooling + CNN | – | 67.9 | 36.9
V3a | CLS Token | – | 66.8 | 38.9
V3b | Mean-Pooling | – | 37.5 | 12.3

Table 3: Results of Training Strategies for Multi-task Learning on the Dev Set. GS: Gradient Surgery; MTL: Multi-Task Learning; Ft: Fine-tuning.

Model | Training Strategy | SA Ft | PD Ft | STS Ft | SA Acc | PD Acc | STS PC | Average
V3a | Single task | ✓ | | | 51.9 | 37.5 | 43.3 | 44.23
V3a | Single task | | ✓ | | 20.8 | 71.7 | 30.7 | 41.1
V3a | Single task | | | ✓ | 37.6 | 25.5 | 43.9 | 35.67
V3a | MTL | | | | 32.6 | 58.7 | 52.7 | 48.0
V3b | GS-MTL | | | | 34.8 | 37.5 | 10.2 | 27.5
V3a | GS-MTL | | | | 46.3 | 60.1 | 63.0 | 56.47
V3a | GS-MTL | ✓ | | | 49.9 | 37.8 | 45.7 | 44.47
V3a | GS-MTL | | ✓ | | 46.3 | 65.6 | 58.9 | 56.93
V3a | GS-MTL | | | ✓ | 45.5 | 50.7 | 60.1 | 52.12
V3a | GS-MTL (only PD + STS, unscaled) | | | | 22.3 | 69.6 | 65.2 | 52.36
V3a | GS-MTL (only PD + STS, scaled) | | | | 23.8 | 76.3 | 63.4 | 54.50
V3a | GS-MTL (only PD_ce + STS, unscaled) | | | | 21.4 | 75.3 | 39.6 | 45.43
V3a | GS-MTL (only PD_ce + STS, scaled) | | | | 21.0 | 76.6 | 43.3 | 46.96

6 Conclusion
Our analysis shows that multi-task learning with gradient surgery using all tasks provides the most robust embeddings based on average performance. However, it does not provide the best possible results for all individual tasks (compared to single-task fine-tuning) unless the tasks are similar in nature (e.g., PD and STS). This finding partially aligns with current views on MTL published by Standley et al. [35]. Furthermore, MTL-GS significantly boosts performance when conflicting tasks are excluded and the losses are scaled to a similar magnitude. This shows that even when using MTL-GS it is necessary to use a workflow that regularizes (scales) the loss functions of all tasks in an MTL paradigm. This opens up future work on how to systematically define similar tasks and how to scale them jointly. Additionally, fine-tuning on top of MTL-GS with conflicting tasks does not provide a substantial benefit; contrary to what we expected, it yields lower performance than fine-tuning a single task at a time. From our architecture experiments, we found that for the PD task a final MLP layer performed better than cosine similarity, while for STS a cosine-similarity layer performed better than an MLP. This behavior highlights that fairly similar tasks can have drastic performance differences with the same architecture.

Limitations. In our work we aimed to make a fair comparison among MTL, MTL-GS, and single-task training, meaning that even though we first tried to obtain the best-performing architecture, no hyper-parameter search was done for every training strategy (which typically yields better results).
Furthermore, we only compared two learning strategies (MTL and MTL-GS) on three tasks; thus we cannot make a generalized statement regarding the performance of modern MTL paradigms against single-task training. However, we highlight recent criticisms of GS, specifically those claiming that a hyper-parameter search that scales the loss functions to a similar magnitude might provide better performance [34]. This claim could be considered unfair, since our results show that GS also benefits from scaled loss functions; hence GS should not be treated as a replacement for hyper-parameter search, but used in conjunction with it. Another limitation of our work is that we lack a metric to define "similar" tasks, which is important to ensure that the subdivision of tasks (into similar tasks) is well founded. In future work, we propose to explore a pre-evaluation workflow before training, in which we keep track of the conflicting gradients among all tasks and use this metric to assess and select the tasks to jointly train with gradient surgery.

References
[1] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022.
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[3] Ivano Lauriola, Alberto Lavelli, and Fabio Aiolli. An introduction to deep learning in natural language processing: Models, techniques, and tools. Neurocomputing, 470:443–456, 2022.
[4] Ahmet Iscen, Thomas Bird, Mathilde Caron, Alireza Fathi, and Cordelia Schmid. A memory transformer network for incremental learning. arXiv preprint arXiv:2210.04485, 2022.
[5] Rahul Manohar Samant, Mrinal Bachute, Shilpa Gite, and Ketan Kotecha. Framework for deep learning-based language models using multi-task learning in natural language understanding: A systematic literature review and future directions. IEEE Access, 2022.
[6] Chulun Zhou, Zhihao Wang, Shaojie He, Haiying Zhang, and Jinsong Su. A novel multi-domain machine reading comprehension model with domain interference mitigation. Neurocomputing, 500:791–798, 2022.
[7] Young Jin Kim and Hany Hassan Awadalla. FastFormers: Highly efficient transformer models for natural language understanding. arXiv preprint arXiv:2010.13382, 2020.
[8] Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. Real-time inference in multi-sentence tasks with deep pretrained transformers. arXiv preprint arXiv:1905.01969, 2019.
[9] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019.
[10] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
[11] Hyunjin Choi, Judong Kim, Seongho Joe, and Youngjune Gwon. Evaluation of BERT and ALBERT sentence embedding performance on downstream NLP tasks. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 5482–5487. IEEE, 2021.
[12] Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[13] Akshay Khatri et al. Sarcasm detection in tweets with BERT and GloVe embeddings. arXiv preprint arXiv:2006.11512, 2020.
[14] Leilei Gan, Zhiyang Teng, Yue Zhang, Linchao Zhu, Fei Wu, and Yi Yang. SemGloVe: Semantic co-occurrences for GloVe from BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:2696–2704, 2022.
[15] Amil Merchant, Elahe Rahimtoroghi, Ellie Pavlick, and Ian Tenney. What happens to BERT embeddings during fine-tuning? arXiv preprint arXiv:2004.14448, 2020.
[16] Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5149–5169, 2021.
[17] Jane X Wang. Meta-learning in natural and artificial intelligence. Current Opinion in Behavioral Sciences, 38:90–95, 2021.
[18] Simon Graham, Quoc Dang Vu, Mostafa Jahanifar, Shan E Ahmed Raza, Fayyaz Minhas, David Snead, and Nasir Rajpoot. One model is all you need: multi-task learning enables simultaneous histology image segmentation and classification. Medical Image Analysis, 83:102685, 2023.
[19] Michael Crawshaw. Multi-task learning with deep neural networks: A survey. arXiv preprint arXiv:2009.09796, 2020.
[20] Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. Advances in Neural Information Processing Systems, 31, 2018.
[21] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167, 2008.
[22] Chris Fifty, Ehsan Amid, Zhe Zhao, Tianhe Yu, Rohan Anil, and Chelsea Finn. Efficiently identifying task groupings for multi-task learning. Advances in Neural Information Processing Systems, 34:27503–27516, 2021.
[23] Shijie Chen, Yu Zhang, and Qiang Yang. Multi-task learning in natural language processing: An overview. arXiv preprint arXiv:2109.09138, 2021.
[24] Yu Zhang and Qiang Yang. An overview of multi-task learning. National Science Review, 5(1):30–43, 2018.
[25] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, pages 794–803. PMLR, 2018.
[26] Xian Li and Hongyu Gong. Robust optimization for multilingual translation with imbalanced data. Advances in Neural Information Processing Systems, 34:25086–25099, 2021.
[27] Liyang Liu, Yi Li, Zhanghui Kuang, J Xue, Yimin Chen, Wenming Yang, Qingmin Liao, and Wayne Zhang. Towards impartial multi-task learning. ICLR, 2021.
[28] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33:5824–5836, 2020.
[29] Lucas Mansilla, Rodrigo Echeveste, Diego H Milone, and Enzo Ferrante. Domain generalization via gradient surgery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6630–6638, 2021.
[30] Xiaojun Zhou, Yuan Gao, Chaojie Li, and Zhaoke Huang. A multiple gradient descent design for multi-task learning on edge computing: Multi-objective machine learning approach. IEEE Transactions on Network Science and Engineering, 9(1):121–133, 2021.
[31] Qiwei Bi, Jian Li, Lifeng Shang, Xin Jiang, Qun Liu, and Hanfang Yang. MTRec: Multi-task learning over BERT for news recommendation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2663–2669, 2022.
[32] Tao Qi, Fangzhao Wu, Chuhan Wu, Peiru Yang, Yang Yu, Xing Xie, and Yongfeng Huang. HieRec: Hierarchical user interest modeling for personalized news recommendation. arXiv preprint arXiv:2106.04408, 2021.
[33] Jaekeol Choi, Euna Jung, Jangwon Suh, and Wonjong Rhee. Improving bi-encoder document ranking models with two rankers and multi-teacher distillation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2192–2196, 2021.
[34] Derrick Xin, Behrooz Ghorbani, Ankush Garg, Orhan Firat, and Justin Gilmer. Do current multi-task optimization methods in deep learning even help? arXiv preprint arXiv:2209.11379, 2022.
[35] Trevor Standley, Amir Zamir, Dawn Chen, Leonidas Guibas, Jitendra Malik, and Silvio Savarese. Which tasks should be learned together in multi-task learning? In International Conference on Machine Learning, pages 9120–9132. PMLR, 2020.
[36] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, November 2019.
[37] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. Advances in Neural Information Processing Systems, 26, 2013.
[38] Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, and Diane Larlus. Hard negative mixing for contrastive learning. Advances in Neural Information Processing Systems, 33:21798–21809, 2020.
[39] Xiao Wang, Yuhang Huang, Dan Zeng, and Guo-Jun Qi. CaCo: Both positive and negative samples are directly learnable via cooperative-adversarial contrastive learning. arXiv preprint arXiv:2203.14370, 2022.
[40] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. RotatE: Knowledge graph embedding by relational rotation in complex space. arXiv preprint arXiv:1902.10197, 2019.
[41] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013.
[42] Travis Addair. Duplicate question pair detection with deep learning. Stanf. Univ. J, 2017.
[43] Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. *SEM 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 32–43, 2013.
[44] Yile Wang, Leyang Cui, and Yue Zhang. How can BERT help lexical semantics tasks? arXiv preprint arXiv:1911.02929, 2019.
[45] D Viji and S Revathy. A hybrid approach of weighted fine-tuned BERT extraction with deep siamese Bi-LSTM model for semantic text similarity identification. Multimedia Tools and Applications, 81(5):6131–6157, 2022.
[46] Tianxin Wang, Fuzhen Zhuang, Ying Sun, Xiangliang Zhang, Leyu Lin, Feng Xia, Lei He, and Qing He. Adaptively sharing multi-levels of distributed representations in multi-task learning. Information Sciences, 591:226–234, 2022.
[47] Dripta S Raychaudhuri, Yumin Suh, Samuel Schulter, Xiang Yu, Masoud Faraki, Amit K Roy-Chowdhury, and Manmohan Chandraker. Controllable dynamic multi-task architectures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10955–10964, 2022.
[48] Akari Asai, Mohammadreza Salehi, Matthew E Peters, and Hannaneh Hajishirzi. ATTEMPT: Parameter-efficient multi-task tuning via attentional mixtures of soft prompts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6655–6672, 2022.

A Appendix
Figure 2: Architecture V1 for the Multi-Task Learning approach.

A.1 BERT vs. BERT average-pooling embeddings
We performed a toy test with the following sentences:
1. "The dog is dancing on the stage"
2. "The dog is barking on the stage"
3. "The flying car is about to land in Nevada"
Our hypothesis is that even though sentences 1 and 2 are not exactly the same, their embeddings should be closer to each other than to the embedding of sentence 3. For this experiment, we use a pre-trained BERT model, compute both the standard BERT and the mean-attention embeddings, and perform a Singular Value Decomposition (SVD) to visualize them. As seen in Figure 3 (a), the mean-attention embeddings are able to capture that sentences 1 and 2 are closer to each other than to sentence 3. This is not true for the standard BERT embeddings, as shown in Figure 3 (b).

Figure 3: BERT (b) vs. BERT mean-attention (a) embeddings: SVD decomposition of the BERT embeddings for sentences 1, 2, and 3.
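A sketch of this toy test is shown below. It assumes the HuggingFace transformers API and the public bert-base-uncased checkpoint rather than the course-provided one, takes the [CLS] token as the standard BERT embedding, and interprets the mean-attention embedding as an attention-mask-weighted mean of the last hidden states; it prints the 2-D SVD projections together with pairwise cosine similarities instead of plotting them.

```python
import torch
from transformers import AutoModel, AutoTokenizer

sentences = ["The dog is dancing on the stage",
             "The dog is barking on the stage",
             "The flying car is about to land in Nevada"]

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
batch = tok(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = bert(**batch).last_hidden_state            # (3, T, 768)

mask = batch["attention_mask"].unsqueeze(-1).float()
cls_emb = hidden[:, 0]                                   # standard BERT [CLS] embeddings
mean_emb = (hidden * mask).sum(1) / mask.sum(1)          # mean-attention (masked mean) embeddings

def svd_2d(x):
    """Project the sentence vectors onto their top-2 singular directions for visualization."""
    x = x - x.mean(dim=0, keepdim=True)
    u, s, _ = torch.linalg.svd(x, full_matrices=False)
    return u[:, :2] * s[:2]

print("CLS 2-D projections:\n", svd_2d(cls_emb))
print("Mean 2-D projections:\n", svd_2d(mean_emb))
for name, emb in [("CLS", cls_emb), ("Mean", mean_emb)]:
    e = torch.nn.functional.normalize(emb, dim=-1)
    print(name, "cos(1,2) =", float(e[0] @ e[1]), "cos(1,3) =", float(e[0] @ e[2]))
```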