Finetuning minBERT Model for Multiple Downstream Tasks

Stanford CS224N Default Project

Yuan Wang
Department of Computer Science
Stanford University
ywang09@stanford.edu

Abstract

Pre-trained large language models (LLMs), such as BERT and GPT, contain rich token embeddings that are useful for various downstream tasks. Instead of building a separate model for each individual task, it can be more resource-efficient to build one model that performs multiple tasks. This paper presents the author's findings in extending and finetuning a minBERT model to perform multiple downstream tasks. Improvements over the baseline are achieved by training on additional task data, implementing a round-robin multitask training algorithm, performing additional finetuning with the minBERT parameters fixed, and searching for better hyperparameters.

1 Key Information to include

• Mentor: Anuj Nagpal
• External Collaborators (if you have any): N/A
• Sharing project: N/A

2 Introduction

The emergence of powerful pretrained large language models (LLMs) such as BERT and GPT has radically changed the landscape of natural language processing research. In the past, research mostly focused on building individual models for specific language tasks from scratch, with little knowledge shared across models and tasks. However, large attention-based language models trained heavily on simple objectives over huge text corpora have proven to produce powerful token embeddings that benefit almost every major downstream language task. As a result, researchers began to use these pretrained models as starting points for state-of-the-art models that tackle different downstream language tasks.

Since the rich token embeddings produced by LLMs contain information useful for many tasks, it is natural to reuse the same set of embeddings for multiple downstream tasks; this is the objective of multitask language models. Because token embeddings often constitute the largest portion of a language model, sharing them across tasks potentially offers a resource-efficient way to serve multiple tasks at once. In this project, the author extends and finetunes a minBERT model to perform three downstream tasks: sentiment analysis, paraphrase detection, and semantic textual similarity analysis. For simplicity, these tasks are referred to as SST, PARA, and STS.

3 Related Work

Since LLMs became available, a great deal of research and experimentation has gone into extending them to achieve state-of-the-art performance on various downstream language tasks. This rich body of work offers many interesting and promising ideas for building and improving any model that uses a pretrained LLM as its starting point. For example, Sun et al. [1] systematically experimented with ways in which BERT can be finetuned for various downstream tasks. There is also a lot of interesting research on multitask learning: Bi et al. [2] used multitask learning to improve a model's performance on news encoding and comprehension.

4 Approach

Model

The model itself has a simple structure: the minBERT layers provide sentence embeddings, which are fed into one of three downstream sub-models, depending on the task being performed. As shown in Figure 1, after the sentence embedding(s) are generated by the minBERT layers, they are fed into task-specific downstream layers. The SST downstream layers consist of a single linear layer. The PARA and STS downstream layers each consist of two linear layers, one for each input sentence, whose outputs are multiplied together and then passed through an interaction ("interact") linear layer. For all tasks, dropout is applied before the first linear layers.

Figure 1: Model Structure
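To make the head structure concrete, the listing below gives a minimal PyTorch sketch of the three task heads described above. It is illustrative only: the module names, output dimensions, and the elementwise interpretation of the multiplication are assumptions, and the actual project code may differ.

    import torch.nn as nn

    HIDDEN = 768          # assumed minBERT hidden size
    NUM_SST_CLASSES = 5   # SST sentiment levels 0-4

    class TaskHeads(nn.Module):
        """Illustrative task-specific heads on top of minBERT sentence embeddings."""

        def __init__(self, hidden=HIDDEN, dropout=0.3):
            super().__init__()
            self.dropout = nn.Dropout(dropout)                    # applied before the first linear layers
            self.sst_linear = nn.Linear(hidden, NUM_SST_CLASSES)  # SST: a single linear layer
            # PARA / STS: one linear layer per input sentence, plus an "interact" layer.
            self.para_lin1 = nn.Linear(hidden, hidden)
            self.para_lin2 = nn.Linear(hidden, hidden)
            self.para_interact = nn.Linear(hidden, 1)
            self.sts_lin1 = nn.Linear(hidden, hidden)
            self.sts_lin2 = nn.Linear(hidden, hidden)
            self.sts_interact = nn.Linear(hidden, 1)

        def sst(self, emb):
            return self.sst_linear(self.dropout(emb))             # five sentiment logits

        def para(self, emb1, emb2):
            h = self.para_lin1(self.dropout(emb1)) * self.para_lin2(self.dropout(emb2))
            return self.para_interact(h)                          # paraphrase logit

        def sts(self, emb1, emb2):
            h = self.sts_lin1(self.dropout(emb1)) * self.sts_lin2(self.dropout(emb2))
            return self.sts_interact(h)                           # similarity score

In the layer-sharing variant described below, the first linear layers of the PARA and STS heads would point to the same module objects, so gradients from both tasks flow into the shared parameters.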
Baseline

For the baseline method, I use task-by-task sequential training to train the model on all three datasets, as shown in Figure 2. Within each epoch, the model is trained first on the SST dataset, then on the PARA dataset, and finally on the STS dataset. Each epoch therefore involves three rounds of parameter updates, with each round dedicated to a single task. The model is trained for 10 epochs with a dropout rate of 0.3, a learning rate of 1e-5, and a batch size of 32.

Figure 2: Baseline: Task-by-Task Training

Round robin multitask training

In addition to the baseline method, I also implement round-robin multitask training, as shown in Figure 3 (a code sketch of this training schedule is given later in this section). Each epoch consists of multiple iterations. Within each iteration, the model is trained on one batch of data from each of the three datasets, and the parameters are then updated. Each task's batch size is proportional to the size of its dataset, so that by the end of an epoch all of the data has been trained on at least once, and most of it exactly once. Since each parameter update is based on training losses from all three datasets, this training method is designed to improve the model's overall performance on all three tasks with every update.

Figure 3: Round Robin Multitask Training

Layer sharing

In the baseline model, the PARA downstream layers and the STS downstream layers have identical structure. The two tasks are also similar in that both require the model to compare the meanings of two input sentences. It is therefore possible that shared linear layers that extract meaning from the input sentences could produce useful output for both the PARA task and the STS task. To test this hypothesis, I implemented a version of the model in which the first linear layers for the PARA task and the STS task are shared, as shown in Figure 4.

Figure 4: Layer Sharing

Additional training data

While the provided training datasets are probably the most useful for model training, additional related datasets could also improve performance. Since the provided training datasets for SST and STS are much smaller than the training dataset for PARA, I adapted outside datasets and added them to the provided ones. Please see the Data section for more details.

Additional training with minBERT parameters fixed

Updates to the minBERT layers affect the model's performance on all three tasks, whereas updates to a downstream layer only affect the model's performance on an individual task. While updating the minBERT layers is very important for improving model performance, it can also improve performance on one task at the expense of another. Therefore, I implemented additional training with the minBERT parameters fixed, as shown in Figure 5.

Figure 5: Additional Training with minBERT Parameters Fixed
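The following is a minimal sketch of the round-robin training loop and of the additional training stage with the minBERT parameters fixed. The model interface (per-task loss methods, a `bert` submodule) and the dataloaders are hypothetical stand-ins for illustration; the sketch assumes the three loaders are built with per-task batch sizes proportional to dataset size so that they have equal length, as described above.

    def train_round_robin(model, optimizer, sst_loader, para_loader, sts_loader,
                          epochs=10, normalize=False):
        # One epoch = as many iterations as there are batches per loader; each
        # iteration draws one batch from every task and performs a single update.
        for epoch in range(epochs):
            model.train()
            for sst_batch, para_batch, sts_batch in zip(sst_loader, para_loader, sts_loader):
                optimizer.zero_grad()
                loss_sst = model.sst_loss(sst_batch)    # hypothetical per-task loss methods
                loss_para = model.para_loss(para_batch)
                loss_sts = model.sts_loss(sts_batch)
                if normalize:
                    # One possible reading of the "normalized" variant: scale each task's
                    # loss by the amount of task data in the iteration (assumption);
                    # len(batch) stands for the number of examples in the batch.
                    loss = (loss_sst / len(sst_batch)
                            + loss_para / len(para_batch)
                            + loss_sts / len(sts_batch))
                else:
                    loss = loss_sst + loss_para + loss_sts
                loss.backward()
                optimizer.step()   # one update based on losses from all three tasks

    def freeze_minbert(model):
        # Additional training stage: fix the minBERT encoder so that only the
        # task-specific downstream layers receive updates.
        for param in model.bert.parameters():   # the `bert` attribute name is an assumption
            param.requires_grad = False

After calling freeze_minbert, the same training loop can be rerun so that only the downstream layers are updated.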
Other implementations and experiments

In addition to the extensions and experiments described above, I ran a few further experiments. I tried using only a portion of the PARA dataset in the training process, so that it is similar in size to the other two datasets (see the Data section). I also searched for better hyperparameters by increasing the number of training epochs and varying the learning rate and the batch size.

5 Experiments

5.1 Data

For model training, we are provided with three datasets. For SST, we are provided with a subset of the Stanford Sentiment Treebank dataset, which contains about 8.5k rows. For PARA, we are provided with a subset of the Quora dataset, which contains about 141.5k rows. For STS, we are provided with a subset of the SemEval STS dataset, which contains about 6.0k rows. The Quora dataset is more than ten times larger than each of the other two datasets.

In addition to the provided data, I also utilized two outside datasets. For SST, I used the train and dev subsets of the CFIMDB dataset, which contain about 1.9k rows in total. For STS, I used the SICK2014 dataset, which contains about 10k rows in total [3]. The SICK2014 dataset consists of sentence pairs, each with a relatedness score that measures how closely the two sentences in the pair are related to each other.

Unlike the SST dataset, which measures sentence sentiment in five discrete levels (0 to 4), the CFIMDB dataset labels sentence sentiment as positive or negative, and its sentences are said to be highly polar. To make the CFIMDB data compatible with the SST dataset, I created two versions of it. In one (polar) version, I assign a sentiment score of 0 to negative and 4 to positive; in the other (not polar) version, I assign a score of 1 to negative and 3 to positive.

The STS data uses a continuous score between 0 and 5 to measure the similarity between two sentences, whereas the SICK2014 dataset uses a continuous score between 1 and 5 to measure relatedness. To make the latter compatible with the former, I proportionally rescale the scores from [1, 5] to [0, 5] using s_new = (s_old - 1) * 1.25, where s_new is the rescaled score and s_old is the original score.
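The listing below is a minimal sketch of the label harmonization just described: the CFIMDB polar / not-polar mapping onto the SST scale and the SICK2014 score rescaling. It assumes CFIMDB's binary labels are encoded as 0 (negative) and 1 (positive); the function names are illustrative.

    def map_cfimdb_label(label, polar=True):
        # Map CFIMDB's binary sentiment onto the SST 0-4 scale, either to the
        # extremes (polar version) or to the milder scores 1 and 3 (not-polar version).
        if polar:
            return 4 if label == 1 else 0
        return 3 if label == 1 else 1

    def rescale_sick_score(score):
        # SICK2014 relatedness lies in [1, 5]; STS similarity lies in [0, 5].
        # Proportional rescaling: s_new = (s_old - 1) * 1.25.
        return (score - 1.0) * 1.25

    # Quick sanity check of the endpoints.
    assert rescale_sick_score(1.0) == 0.0 and rescale_sick_score(5.0) == 5.0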
5.2 Evaluation method

I used the model's prediction accuracy on the provided dev datasets as the primary evaluation metric.

5.3 Experimental results

Figure 6 shows the results I obtained from running the experiments described in the Approach section. The blue numbers represent improvements over the baseline, whereas the red numbers represent deteriorations in performance.

Figure 6: Experiment Results

The changes that improved performance include increasing the number of epochs trained [2], training on additional datasets ([5]-[9]), using (un-normalized) round-robin multitask training [10] (due to limited computation, all round-robin multitask training runs use a batch size of 8), and performing additional training with the minBERT parameters fixed [12]. On the other hand, the changes that did not lead to improvement are decreasing the batch size to 8 [2], sharing linear layers between PARA and STS [3], cutting down PARA training (only 8,000 rows of the Quora dev dataset are used for training) [4], and performing normalized round-robin multitask training (where, within each iteration, gradients are normalized by the amount of task data trained on) [11].

5.4 Final results

In the final version, I combined all the changes that improved the model's performance over the baseline, and was able to achieve a non-trivial improvement, as shown in Figure 7. The final version was trained using the round-robin multitask method on both the provided training datasets and the additional CFIMDB (polar version) and SICK2014 datasets. It was first trained for 20 epochs with the minBERT parameters actively updated; the best parameters so far were then loaded, and the model was trained for another 9 epochs with the minBERT parameters fixed and a learning rate of 1e-6.

Figure 7: Final Results

My submission to the test set leaderboard yielded an SST accuracy of 0.510, a PARA accuracy of 0.788, an STS accuracy of 0.531, and an overall average accuracy of 0.610. While there is a non-trivial improvement from the baseline version to the final version, the gain in accuracy is not as large as I expected; there are likely many unexplored directions that could further improve model performance.

6 Analysis

One interesting phenomenon is that training the model on the high-volume Quora dataset seems to benefit performance on the other two tasks, apparently by producing more powerful and pertinent minBERT embeddings. This can be seen by comparing the baseline [0] with the version in which PARA training was cut [4], and by comparing the unnormalized and normalized versions of round-robin multitask training [10] [11]. This is surprising, because it was anticipated that balancing model training towards the SST and STS tasks and away from the PARA task, whose training dataset is much larger regardless of whether additional data is added, might improve performance on the former two tasks at a potentially slight expense to the PARA task. However, this balancing strategy actually tends to decrease model accuracy on all three datasets, as illustrated in Figure 8. It seems that training on the Quora dataset has a significant enriching effect on the minBERT parameters that improves performance on the other two tasks, especially the STS task.

Figure 8: Round Robin Multitask Training: a Comparison

7 Conclusion

This project found several measures that improve the performance of the multitask minBERT model over the baseline: increasing the number of epochs trained, training on additional datasets, using round-robin multitask training instead of task-by-task training, and performing additional training with the minBERT parameters fixed. Combining these changes improved the overall performance of the model by about 5% across all three tasks. It was also discovered that training on the high-volume PARA (Quora) dataset helps improve model performance on the other two tasks. With more time and resources, it would be worthwhile to experiment with different model designs, including altering the downstream layers and the loss functions.

References

[1] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. How to fine-tune BERT for text classification? In China National Conference on Chinese Computational Linguistics, pages 194–206. Springer, 2019.

[2] Qiwei Bi, Jian Li, Lifeng Shang, Xin Jiang, Qun Liu, and Hanfang Yang. MTRec: Multi-task learning over BERT for news recommendation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2663–2669, 2022.

[3] Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli.
The SICK (Sentences Involving Compositional Knowledge) dataset for relatedness and entailment. In Association for Computational Linguistics (ACL), 2014.