SerBERTus: A SMART Three-Headed BERT Ensemble
Stanford CS224N Default Project
Matthew Hayes
Department of Computer Science, Stanford University
mhayes3@stanford.edu
Mentor: Gabriel Poesia
No external collaborators. No shared project.

Abstract

We examine different architectures, learning methods, and hyperparameter choices for fine-tuning the 110 million parameter BERT-Base model on three different tasks: five-class sentiment analysis on the Stanford Sentiment Treebank (SST) (Socher et al., 2013); binary paraphrase detection on Quora Question Pairs (QQP); and regression on Semantic Textual Similarity (STS) (Agirre et al., 2013). We find that a strong SMART loss (Jiang et al., 2020) combined with a novel architecture that replicates only the BERT layers closest to the task-specific heads brings the greatest improvement on the three tasks without too severe an increase in model size and resource consumption. Our best model is an ensemble achieving a mean performance of 78.61% on the test set.

1 Introduction

We are not the strongest or the fastest species, but our intelligence and our use of language to communicate intelligent ideas is perhaps the most distinguishing feature of the human species and the cause of our dominance: no other species records and shares knowledge in such depth or breadth. Correspondingly, there is a wide variety of tasks that can be expressed in our natural language and that the field of Natural Language Processing attempts to automate. While we can and have carefully architected solutions for each task individually, general-purpose alternatives in aggregate reduce human effort and computational resources, and can improve performance.

Scaling Vaswani et al. (2017)'s Transformer architecture, Radford et al. (2018) (GPT) and then Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) outperformed all prior models on 9 and 11 General Language Understanding Evaluation (GLUE) tasks, respectively, by extensively pre-training on simple but highly general word-prediction tasks and then further finetuning the models independently on more specific tasks. The pretraining-finetuning setup, coupled with large, publicly available, general-purpose models, has changed the field, but how to exploit these models to their fullest potential is still not well understood.

2 Related Work

Large pretrained models had success prior to GPT and BERT. Howard and Ruder (2018), for example, explored different methods for fine-tuning general-purpose language models consisting of stacked layers of LSTMs. They proposed and achieved state-of-the-art performance using, among other techniques, a lower learning rate for the layers close to the input, which contain more general language encodings than the layers closer to the task-specific output, and keeping the parameters of these lower layers completely frozen at the start and unfreezing one layer at a time with each successive epoch.

Instead of Radford et al. (2018)'s next-token prediction task using transformer decoders, Devlin et al. (2018) found that stacking bidirectional transformer encoder layers and pretraining on the Masked Language Model (MLM) and Next Sentence Prediction (NSP) objectives produced higher-quality representations for the same model size. BERT's pretraining inputs consist of two sequences separated by a special [SEP] token and preceded by a special [CLS] token. Half the time the second sequence actually follows the first in the source document, and the other half of the time it is selected at random; the NSP task is to use the representation in the [CLS] position to predict which is the case. MLM predicts the correct original token for a randomly modified 15% of input tokens: 80% of that 15% are replaced by a special [MASK] token, 10% are replaced with a random token from the vocabulary, and 10% are left unchanged.
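As a concrete illustration of this corruption scheme, the following is a minimal sketch of BERT-style MLM masking in PyTorch. It is not Devlin et al.'s reference code; the argument names (`input_ids`, `special_token_mask`, `mask_token_id`, `vocab_size`) are hypothetical stand-ins for whatever tokenizer and batch format is in use.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, special_token_mask,
                select_prob=0.15, mask_prob=0.8):
    """Illustrative BERT-style MLM corruption (not the reference implementation):
    choose ~15% of non-special tokens as prediction targets, then replace 80% of
    them with [MASK], 10% with a random vocabulary token, and leave 10% unchanged."""
    device = input_ids.device
    labels = input_ids.clone()

    # Positions that become MLM targets; special tokens ([CLS], [SEP], padding) are never chosen.
    targets = (torch.rand(input_ids.shape, device=device) < select_prob) & ~special_token_mask
    labels[~targets] = -100  # conventional "ignore" label for cross-entropy losses

    corrupted = input_ids.clone()
    roll = torch.rand(input_ids.shape, device=device)
    # 80% of targets -> [MASK]
    corrupted[targets & (roll < mask_prob)] = mask_token_id
    # 10% of targets -> a random vocabulary token; the remaining 10% are left unchanged.
    random_positions = targets & (roll >= mask_prob) & (roll < mask_prob + 0.1)
    corrupted[random_positions] = torch.randint(
        vocab_size, input_ids.shape, device=device)[random_positions]
    return corrupted, labels
```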
Liu et al. (2019b) later found that dispensing with the NSP task and simply performing the MLM task on longer sequences, with bigger batches and a different random token selection for each epoch, was sufficient to meet and exceed BERT's performance.

Sun et al. (2019) examined decisions for further pretraining, multitask finetuning, and target-task finetuning when adapting BERT for text classification. Using IMDb (Maas et al., 2011), they found that 100K further MLM and NSP pretraining steps are optimal. For fine-tuning, they find a learning rate of 2e-5 to outperform any higher learning rate; higher rates seem more subject to the new tasks overwriting the generality from BERT's pretraining, or "catastrophic forgetting" (McCloskey and Cohen, 1989). Similar to Howard and Ruder (2018), further reducing this learning rate by a factor of 0.95 for each successive transformer layer approaching the input outperformed a uniform learning rate or a decay factor of 0.9.

Jiang et al. (2020) took a less heuristic approach, instead introducing an additional smoothness-inducing regularization term to the loss. They randomly perturb each embedded input by a small amount, as measured by the ℓ_∞ norm, and penalize the model for the change in output, as measured by symmetric KL-divergence for classification and squared loss for regression. While the additional terms approximately halve the maximum batch size that can be used and increase BERT finetuning time, they find significant evaluation improvements on GLUE.

3 Approach

We first complete a minimal BERT (Devlin et al., 2018) implementation referencing the provided skeleton code, the project handout, Vaswani et al. (2017), and Assignment 5's minGPT implementation (https://github.com/karpathy/minGPT) for the multi-headed self-attention module. Adding skip connections, layer normalizations, a position-wise feed-forward layer, and 10% dropout, we complete the BERT encoder layer. Stacking 12 such layers and applying an additional feed-forward "pooling" layer with tanh activation to the resulting embedding in the [CLS] position, we load pretrained BERT-Base weights from the web and pass the provided sanity test. On top of the pooling output, the baseline classifier applies dropout and a dense layer with one output for each possible class.

Referencing Kingma and Ba (2014) and Loshchilov and Hutter (2017), we complete the ADAMW implementation, pass the provided optimizer test, and compare to the provided reference accuracies for training on the Stanford Sentiment Treebank (SST) (Socher et al., 2013) or CFIMDB (Maas et al., 2011) using the provided hyperparameters, with the pretrained BERT-Base weights either frozen or finetuned.
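For reference, the decoupled weight-decay update that distinguishes ADAMW from Adam with L2 regularization can be sketched as below. This is an illustrative re-implementation of the update rule in Loshchilov and Hutter (2017), not the course skeleton or PyTorch's optimized optimizer; the `state` dictionary is a hypothetical container for the per-parameter step counts and moment estimates.

```python
import torch

@torch.no_grad()
def adamw_step(params, state, lr=1e-5, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
    """One AdamW update over `params`. Pass the same `state` dict on every call."""
    beta1, beta2 = betas
    for p in params:
        if p.grad is None:
            continue
        if p not in state:
            state[p] = {"step": 0, "m": torch.zeros_like(p), "v": torch.zeros_like(p)}
        s = state[p]
        s["step"] += 1
        # Decoupled weight decay: shrink the weights directly, never through the gradient.
        if weight_decay > 0.0:
            p.mul_(1.0 - lr * weight_decay)
        # Exponential moving averages of the gradient and its element-wise square.
        s["m"].mul_(beta1).add_(p.grad, alpha=1.0 - beta1)
        s["v"].mul_(beta2).addcmul_(p.grad, p.grad, value=1.0 - beta2)
        # Bias-corrected estimates, then the usual Adam step.
        m_hat = s["m"] / (1.0 - beta1 ** s["step"])
        v_hat = s["v"] / (1.0 - beta2 ** s["step"])
        p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)
```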
Extending from the provided baseline for SST, we establish baselines for Quora Question Pairs (QQP) paraphrase detection (https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) and the Semantic Textual Similarity (STS) task (Agirre et al., 2013) by finetuning single-logit BERT instances with binary cross-entropy and squared-error losses, respectively. Our multitask baseline, "triple BERT", is thus an ensemble of three BERT-Base models, independently fine-tuned to maximize performance on each task's dev set. While this has the disadvantages of more storage and less generalization, it permits independent design iteration for each dataset.

For STS and QQP, we run our initial experiments by simply concatenating the two passages in a random order, separated by BERT's [SEP] token. Alternatively, we embed the two passages through BERT separately, similar to Reimers and Gurevych (2019), and then combine the outputs. This approach excels when the task is to find the closest sentence to a query from a large set, since the embeddings can be cached: the query does not have to be re-embedded with each candidate. With STS and QQP, however, we are given only two sentences to compare for each example, and caching their embeddings is not helpful. Without the cross attention between the two inputs we see a degradation in performance in our initial experiments, whether using cosine similarity or regression on the concatenated embeddings (optionally also concatenating their normalized difference).

We evaluate the default hyperparameter choices for dropout, optimizer, learning rate, and number of epochs using the small SST dataset, and, time permitting, validate our findings on STS before attempting to apply them to the larger QQP. When training on all three datasets, a single task is used for each batch and gradient step, but we randomly shuffle the batches from all three tasks. We find this performs well in practice without any weighting to account for the different dataset sizes.

Following Sun et al. (2019), we implement and evaluate a learning rate that decays with each successive BERT layer closer to the input. We extend this idea by entirely fixing the token and position embeddings at the input before the first BERT layer, or by increasing the learning rate of the classification heads relative to the BERT parameters. We also implement the MLM task to explore the benefit of additional pretraining, but following Liu et al. (2019b) do not implement the NSP task.

We explore the addition of Jiang et al. (2020)'s "SMART loss", using an implementation from the web (https://github.com/archinetai/smart-pytorch), to see if its additional regularization effect is independent of the other methods, or if one is strictly better. Instead of minimizing the typical loss $L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell\left(f(x_i;\theta), y_i\right)$ for some task-specific $\ell$ (in our case cross entropy for SST and QQP, squared error for STS), they minimize $L(\theta) + \lambda_s R_s(\theta)$, where $R_s(\theta) = \frac{1}{n}\sum_{i=1}^{n} \max_{\|\tilde{x}_i - x_i\|_p \le \epsilon} \ell_s\left(f(\tilde{x}_i;\theta), f(x_i;\theta)\right)$ for some sampled embedded input perturbation $\tilde{x}$. We use the ADAMW optimizer to solve this minimization instead of adopting their Bregman proximal point method.
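The sketch below shows one way to compute this regularizer with a single perturbation step, roughly following the description above; the third-party implementation we use differs in details such as how the perturbation is initialized and normalized. `model_from_embeds` is a hypothetical callable mapping input embeddings to logits, and the classification case (symmetric KL) is shown.

```python
import torch
import torch.nn.functional as F

def symmetric_kl(p_logits, q_logits):
    """Symmetrized KL-divergence between two sets of classification logits."""
    p_log = F.log_softmax(p_logits, dim=-1)
    q_log = F.log_softmax(q_logits, dim=-1)
    return (F.kl_div(q_log, p_log, log_target=True, reduction="batchmean")
            + F.kl_div(p_log, q_log, log_target=True, reduction="batchmean"))

def smart_regularizer(model_from_embeds, embeds, logits,
                      epsilon=1e-6, sigma=1e-5, step_size=1e-3):
    """One-step approximation of R_s: start from Gaussian noise, take a sign-gradient
    ascent step on the output divergence, clamp the perturbation to the l_inf ball of
    radius epsilon, and penalize the resulting change in the model's predictions."""
    noise = (torch.randn_like(embeds) * sigma).requires_grad_()
    adv_loss = symmetric_kl(model_from_embeds(embeds + noise), logits.detach())
    grad, = torch.autograd.grad(adv_loss, noise)
    delta = (noise + step_size * grad.sign()).clamp(-epsilon, epsilon).detach()
    return symmetric_kl(model_from_embeds(embeds + delta), logits)

# Training step (classification): loss = task_loss + lambda_s * smart_regularizer(...)
# For STS regression, replace symmetric_kl with a squared-error penalty.
```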
We also consider different architectures for the three tasks. Expanding from our triple-BERT baseline, we explore the simple but novel ideas of adding a fourth or fifth BERT to a simultaneously trained ensemble. With "static BERT" we concatenate the embedding from a frozen pretrained instance, to provide generality and further mitigate "catastrophic forgetting" (McCloskey and Cohen, 1989), while the fine-tuned BERTs can focus on adapting to their tasks. With "multi-BERT", we concatenate the embedding of an instance fine-tuned on all three tasks, for any transferable representation. To measure alternatives between a completely shared BERT and three independent BERTs, we allow independently finetuning only the k BERT layers closest to the outputs, while sharing the 12 − k layers closer to the input, naming this approach CerBERTus for its three heads, a reference to the monstrous watchdog of Greek mythology. The architectures are illustrated in Figure 8.
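A rough sketch of the CerBERTus weight sharing is given below. It assumes a hypothetical `bert` module exposing `embeddings`, `encoder.layers` (a list of 12 transformer layers whose forward takes hidden states and an attention mask), and a `pooler`; these names and signatures are illustrative, not the project skeleton's exact interface.

```python
import copy
import torch.nn as nn

class CerBERTus(nn.Module):
    """Shared lower BERT layers with k task-specific upper layers and one head per task.
    `task_out_dims` is e.g. {"sst": 5, "qqp": 1, "sts": 1}."""
    def __init__(self, bert, task_out_dims, k=2, hidden=768):
        super().__init__()
        self.embeddings = bert.embeddings
        # The 12 - k layers closest to the input are shared by all tasks.
        self.shared = nn.ModuleList(bert.encoder.layers[: 12 - k])
        # The top k layers, the pooler, and the linear head are copied per task.
        self.task_tops = nn.ModuleDict({
            task: nn.ModuleList(copy.deepcopy(bert.encoder.layers[12 - k:]))
            for task in task_out_dims
        })
        self.poolers = nn.ModuleDict({task: copy.deepcopy(bert.pooler) for task in task_out_dims})
        self.heads = nn.ModuleDict({task: nn.Linear(hidden, d) for task, d in task_out_dims.items()})

    def forward(self, input_ids, attention_mask, task):
        h = self.embeddings(input_ids)
        for layer in self.shared:
            h = layer(h, attention_mask)
        for layer in self.task_tops[task]:
            h = layer(h, attention_mask)
        cls = self.poolers[task](h[:, 0])   # pooled [CLS] embedding
        return self.heads[task](cls)
```

For k = 2 and three tasks this adds roughly four extra transformer layers' worth of parameters over a single BERT-Base, consistent with the approximate parameter counts reported in Table 7.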
Finally, our experiments yield many models that may be complementary. We implement ensembling the predictions of our best models, which Liu et al. (2019b) have shown to work well for BERT. We use unweighted averaging of the ensemble's unrounded predictions, and only add models to each task's ensemble if they improve or do not change dev set performance. We explore the tradeoff between a single multi-BERT instance with approximately 110M parameters, the best model trained in a single session for each task (110M-220M parameters per task), and ensembling several training sessions for each task (up to 1B parameters per task).

4 Experiments

4.1 Data

The Stanford Sentiment Treebank (Socher et al., 2013) consists of movie reviews from the website rottentomatoes.com. We examine a version of the fine-grained 5-class task, where sentiments range from negative (0) to positive (4). Munikar et al. (2019) report a 53.2% accuracy when finetuning BERT-Base on the public version of the dataset, and we are provided a benchmark mean accuracy of 51.5% with a standard deviation of 0.4%. That dataset contains 8,544 train examples, with a majority label of "somewhat positive" (3) at 27.2%. STS's labels are averages of human judgements in the range [0, 5], thus continuous and naturally modelled with regression; the median label of its 6,041 training examples is 3. At 141,506 training examples, QQP has more than 16 times as many as SST or STS, so a single epoch takes significant compute time. The mode of its binary labels is 0 (not paraphrase) at 62.5%. Devlin et al. (2018) report a Pearson correlation of 86.5% when finetuning BERT-Base on the public version of STS, and Jiang et al. (2020) report a 90.9% accuracy on the public QQP for their BERT-Base reimplementation.

4.2 Evaluation method

Consistent with the provided leaderboard, we measure model quality by accuracy on SST and QQP, Pearson correlation on STS, and the average of these three metrics. We also consider the trade-off between these metrics and the increased number of model parameters and training time when training multiple BERT instances at once or ensembling separately trained models.

4.3 Experimental details & results

Figure 1: Effect of learning rate on SST accuracy
Figure 2: SST accuracy over 100 epochs

4.3.1 Base learning rate and number of epochs

Using the default parameters of 10% dropout within the BERT layers and 30% in the classification head, and our custom ADAMW optimizer implementation with zero weight decay, we find the default learning rate of 1e-5 to be a good setting on SST. As shown in Figure 1, it converges only a little slower than the 2e-5 rate on the training set but still quickly enough within 10 epochs, unlike 5e-6, which does not quite converge. Additionally, 10 epochs seem sufficient to find the maximum dev set performance, which peaks after approximately 3-4. Consistent with Nakkiran et al. (2019)'s findings for transformer-based models, we do not observe an epoch-wise "double descent" phenomenon when training for 100 epochs (Figure 2). Unless otherwise specified, for the remainder of our experiments we keep the learning rate fixed at 1e-5 and report the best dev performance over 10 epochs of training.

4.3.2 Dropout

dropout prob.      | 0%   | 10%  | 20%  | 30%  | 50%  | 70%  | 90%
mean SST accuracy  | 51.7 | 51.6 | 51.5 | 51.4 | 51.6 | 51.4 | 51.4
max SST accuracy   | 52.6 | 52.0 | 52.1 | 52.1 | 52.0 | 51.9 | 52.1
mean STS corr.     | 86.6 | 86.4 | 86.2 | 86.2 | 86.4 | 86.4 | 86.8
max STS corr.      | 86.6 | 86.5 | 86.3 | 86.5 | 86.6 | 86.4 | 87.0

Table 1: Performance on SST and STS varying head dropout

Figure 3: Effect of BERT hidden layer dropout rate on SST accuracy

The default parameters include a dropout probability of 30% between the [CLS] pooling layer and the output logits. Holding all other default hyperparameters fixed, we experiment with several other dropout values in the model heads for three different random seeds on SST. We allow up to 30 epochs for the 90% dropout, but find the best dev performance is still within the first 10. We also try two different random seeds on STS, but using PyTorch's ADAMW implementation with a weight decay of 1e-4. We find the largest STS correlation with a dropout of 90%, but it takes 19 or 22 epochs to converge. Overall, for both datasets, we find that removing the head dropout entirely improves performance within 10 epochs (Table 1).

Fixing no dropout in the model head, we also experiment with BERT's internal hidden layer dropout probability, whose default is 10%, using layer decay 0.95, PyTorch ADAMW weight decay 1e-5, and batch size 128. We do not find any other value to perform better, though 15% performs just as well. For the higher values of 40% and 50%, we allow up to 30 epochs (Figure 3). We do not vary the additional 10% dropout applied to the normalized attention scores.

4.3.3 ADAMW Optimizer

Holding dropout constant at zero and keeping the rest of the default parameters, we examine effects of the ADAMW optimizer. At the default weight-decay setting of zero, we compare the performance of our custom ADAMW implementation to the version in PyTorch. Unsurprisingly, we find that PyTorch's implementation is slightly faster and yields slightly better accuracies (Table 2), likely due to subtle optimizations. Now using PyTorch's implementation, we consider increasing the weight decay parameter from the default value of zero. We find the common weight decay setting of 1e-2 to be too high, with a small weight decay of 1e-5 performing the best across three random seeds on SST only, while the slightly larger 1e-4 seems to strike a balance when training on all three datasets at once (Table 3).

                           | custom | PyTorch
mean SST accuracy          | 52.6   | 52.7
max SST accuracy           | 52.8   | 53.2
mean 10-epoch dur. (mm:ss) | 19:34  | 19:09

Table 2: SST accuracy and speed comparison of our custom ADAMW implementation to PyTorch's

weight decay       | 1e-5 | 1e-4 | 1e-3 | 1e-2
multi QQP acc.     | 86.7 | 87.7 | 88.9 | 88.7
multi STS corr.    | 78.7 | 87.0 | 86.8 | 86.5
multi SST acc.     | 52.7 | 52.4 | 50.4 | 50.7
multi dataset mean | 72.7 | 75.7 | 75.4 | 75.3
SST-only mean      | 52.6 | 52.5 | 51.9 | 51.6
SST-only max       | 53.1 | 53.1 | 53.1 | 52.8

Table 3: Effect of weight decay

4.3.4 Learning rate decay

We examine using layer decay with our custom ADAMW implementation without weight decay. Consistent with Sun et al. (2019), we find that a layer decay value of 0.95 outperforms a fixed learning rate or a decay rate of 0.9. Extending the idea outside the transformer layers, we find that freezing the input word and position embeddings or halving their learning rate does not improve performance on SST. Similarly, a significantly increased learning rate in the classification head degrades performance; however, a rate of 1.5 times that of the rest of the model slightly improves performance (Table 4).
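In code, these discriminative rates amount to building per-parameter ADAMW parameter groups. The sketch below assumes hypothetical parameter names of the form `bert_layers.<i>.` for the transformer stack (index 0 closest to the input) and treats everything outside the layers and embeddings as the head; the exact names depend on the model definition.

```python
def layerwise_lr_groups(model, base_lr=1e-5, layer_decay=0.95,
                        head_lr_scale=1.0, embed_lr_scale=1.0, num_layers=12):
    """Build optimizer parameter groups so that each BERT layer closer to the input
    gets its learning rate multiplied by `layer_decay`, with separate scales for the
    input embeddings and the task heads. Parameter names here are assumptions."""
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if name.startswith("bert_layers."):
            layer_idx = int(name.split(".")[1])                  # 0 = closest to the input
            lr = base_lr * layer_decay ** (num_layers - 1 - layer_idx)
        elif "embedding" in name:
            lr = base_lr * embed_lr_scale                        # 0.0 effectively freezes the embeddings
        else:
            lr = base_lr * head_lr_scale                         # pooler / classification heads
        groups.append({"params": [param], "lr": lr})
    return groups

# e.g. optimizer = torch.optim.AdamW(layerwise_lr_groups(model), lr=1e-5, weight_decay=0.0)
```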
setting     | mean | max
embed 0     | 51.3 | 51.8
embed 5e-6  | 51.9 | 52.1
layer 0.95  | 52.4 | 53.2
layer 0.9   | 52.2 | 52.9
all 1e-5    | 51.7 | 52.6
head 1.5e-5 | 52.1 | 52.6
head 2e-5   | 51.9 | 52.4
head 1e-3   | 50.6 | 51.1
head 1e-4   | 51.7 | 51.7

Table 4: The effect of higher learning rates closer to the task-specific heads. "Embed" denotes changing only the rate of the initial word and position embedding matrices. "Layer" denotes changing only the BERT layer learning rates, with a multiplicative factor for each successive BERT layer closer to the input embedding; e.g. "layer 0.9" denotes that the last BERT layer has a learning rate of 1e-5, the second to last 9e-6, the third to last 8.1e-6, etc. "Head" denotes changing only the learning rate in the model heads.

4.3.5 SMART Loss

Figure 4: Effect of SMART loss (Jiang et al., 2020) weight λ_S on SST accuracy

We use Jiang et al. (2020)'s SMART loss with symmetric KL-divergence for the classification tasks SST and QQP, and squared error for STS regression. Using their recommended defaults of 1 sampling step, ϵ = 1e-6, σ = 1e-5, η = 1e-3, and p = ∞, we vary only the weight λ_S, using PyTorch's ADAMW implementation with 1e-4 weight decay. As shown in Figure 4, we find that on SST λ_S ∈ [1, 10] performs well, with λ_S ≥ 10 requiring more than 10 epochs to converge. Three instances of λ_S = 10 found the best dev accuracy after 10, 11, or 12 of 20 epochs, while for λ_S = 100 the best dev performance was after 44 of 45 epochs, and may have continued increasing.

4.3.6 Architecture

Using no weight decay, we find that providing the classifier head with both a static BERT instance and one with parameters that can be fine-tuned does not significantly improve performance, actually reducing average performance over three SST random seeds and two STS random seeds (Table 5). Given its additional memory requirement, we exclude it from the remainder of our experiments.

         | static + finetune | finetune only
SST mean | 51.5              | 51.7
SST max  | 52.5              | 52.6
STS mean | 86.5              | 86.6
STS max  | 86.7              | 86.6

Table 5: Effect of adding a static BERT instance on SST and STS

In Table 6 we consider other architecture alternatives on all three tasks. We do not use weight decay, layer decay, or SMART loss, and use the largest batch size that fits into 16GB of GPU memory. We find that 2 task-specific layers perform better than none or more than 2. A combination of three task-specific models and a multitask model also performs well, but uses much more GPU RAM and thus requires a smaller batch size and more compute time.

architecture | multi | 2 head layers | 4 head layers | 6 head layers | triple (baseline) | multi + triple
best QQP     | 88.9  | 89.0          | 88.6          | 88.8          | 88.9              | 89.4
best STS     | 87.0  | 87.9          | 87.6          | 87.6          | 82.5              | 87.2
best SST     | 51.6  | 52.4          | 51.6          | 51.9          | 52.1              | 52.4
avg of best  | 75.8  | 76.4          | 75.9          | 76.1          | 74.5              | 76.4
best avg     | 75.7  | 76.1          | 75.5          | 75.7          | --                | 75.9
batch size   | 32    | 24            | 20            | 16            | 32                | 8

Table 6: Performance of different architectures on all three tasks. We report the best accuracy or Pearson correlation on the dev set achieved for each task (potentially in different epochs), as well as the best average performance on the dev set when all heads are constrained to the same epoch (not applicable to the triple baseline, whose three models are trained independently).

4.3.7 Further pre-training

We perform further pretraining on the MLM objective with, as in Sun et al. (2019), a batch size of 32, followed by fine-tuning without weight decay or layer decay.
On SST we find that the additional pretraining generally does not improve performance. For the small SST dataset, we try three different sets of parameters during pretraining, shown in Figure 5. With a 1e-5 learning rate, we see a negative correlation between the number of additional pretraining steps and the best dev performance after fine-tuning. When decaying the 1e-5 base pretraining learning rate by a factor of 0.95 per layer approaching the input, the degradation is not as bad, but still none of the models with additional pretraining performs as well as the model without it. When we try a smaller learning rate of 5e-6, however, we do see some improvement in dev accuracy, peaking around 16K steps.

On QQP, we find that pretraining with the 1e-5 learning rate performs well, as shown in Figure 6. On this dataset, the peak finetuned dev accuracy is achieved after approximately 66K pretraining steps, as opposed to the 100K found by Sun et al. (2019). We also see an improvement in STS correlation with a small number of pretraining steps at a 1e-5 learning rate (Figure 7), but due to the small size of the dataset, diminishing returns, and time constraints, we do not investigate beyond 3K additional pretraining steps.

Figure 5: Additional pretraining on SST: effect on accuracy after finetuning
Figure 6: Additional pretraining on QQP: effect on accuracy after finetuning

4.3.8 Ensembling and test results

We submit two ensembles to the leaderboard for test evaluation. One ensemble consists of three models: for each task, the single model that performed best on that task's dev set. For both QQP and STS this was a 2-head-layer CerBERTus model with SMART λ_S = 12 and layer decay factor 0.95, but from different epochs, while the SST model was independently trained on the one task without layer decay and with λ_S = 10. For the second ensemble we average the predictions of between 7 and 9 of our best performing models per task from all experiments, only adding models to the ensemble if they improve the dev performance. They individually achieve at least 53% dev accuracy on SST, 87% dev Pearson correlation on STS, and 88.5% dev accuracy on QQP. Table 7 shows that the larger ensemble achieves the best performance overall, but is only a minor improvement over the three-model ensemble, which itself brings only a minor improvement over a single multi-BERT or CerBERTus model.

Category         | minimal size | single base, same epoch | single base, same epoch              | best of each                                    | best ensemble
Architecture     | multi-BERT   | 2-layer CerBERTus       | 3-layer CerBERTus (before above STS) | 2-layer CerBERTus (QQP, STS), triple-BERT (SST) | 7-9 models per task
SMART weight λ_S | 12           | 15                      | 12                                   | 12 (QQP, STS), 10 (SST)                         | several
Layer decay      | 0.95         | 0.95                    | 0.95                                 | 0.95 (QQP, STS), 1.0 (SST)                      | several
Approx. params   | 110M         | 138M                    | 153M                                 | 386M                                            | 2.6B+
SST dev          | 53.32        | 52.41                   | 53.04                                | 54.59                                           | 55.85
STS dev          | 89.38        | 89.86                   | 89.99                                | 89.93                                           | 90.45
QQP dev          | 88.67        | 89.64                   | 89.72                                | 90.27                                           | 90.94
avg dev          | 77.12        | 77.30                   | 77.58                                | 78.26                                           | 79.08
SST test         | --           | --                      | --                                   | 55.48                                           | 54.75
STS test         | --           | --                      | --                                   | 89.77                                           | 90.25
QQP test         | --           | --                      | --                                   | 90.11                                           | 90.83
avg test         | --           | --                      | --                                   | 78.45                                           | 78.61

Table 7: Comparison of ensemble size and performance on the dev and test sets
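The greedy construction of each task's ensemble can be summarized by the following sketch, where `candidates` is a hypothetical list of per-model unrounded dev-set predictions and `score_fn` is the task metric (accuracy or Pearson correlation); a candidate is kept only when the unweighted average does not hurt dev performance.

```python
import numpy as np

def greedy_ensemble(candidates, dev_labels, score_fn):
    """Add candidate models one at a time, keeping each only if the unweighted
    average of the ensemble's unrounded predictions does not hurt dev performance."""
    selected = []
    best_score = -np.inf
    for preds in candidates:
        trial = selected + [preds]
        score = score_fn(np.mean(trial, axis=0), dev_labels)
        if score >= best_score:          # keep ties: "improve or do not change"
            selected, best_score = trial, score
    return selected, best_score

# e.g. for SST: score_fn = lambda avg_probs, y: (avg_probs.argmax(-1) == y).mean()
```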
5 Analysis

Examining the dev set predictions of each single best model, mistakes on SST are typically off by one class, and many people might agree with its predictions; e.g. the model rates "It 's a stunning lyrical work of considerable force and truth." as a positive review, but the ground truth is only "somewhat positive". In the cases where it is off by two, the labels are sometimes even more questionable, such as the neutral label of "It 's a coming-of-age story we've all seen bits of in other films – but it 's rarely been told with such affecting grace and cultural specificity," which is predicted positive. But some reviews are more subtle, without any words or short phrases clearly identifying the sentiment, as in the "somewhat positive" "[Lawrence bounces] all over the stage, dancing, running, sweating, mopping his face and generally displaying the wacky talent that brought him fame in the first place," which is predicted "somewhat negative": "sweating" and "wacky talent" on their own are not clear, but when combined with the human experience, the review paints a playful scene. There are not any examples where the model is wrong by three or four classes.

For QQP many mistakes are borderline, e.g. multi-part questions or slight variations: "How can I lose fat as a teenager?" is predicted a paraphrase of "How can I lose fat as a 15 year old?". But sometimes it seems the word embeddings have overfit, e.g. predicting "Is there any evolutionary advantage of baldness?" to be a paraphrase of "Was there any evolutionary advantage for beards?": beards and baldness both involve head hair, but the pretrained BERT must have learned the difference to do well on MLM. For STS we see the model may have lost the meaning of expressions or be relying too heavily on word overlap, as it gives "Work into it slowly" and "It seems to work" a score of 2.7 (≈ "mostly equivalent") when they are actually about different topics.

Our findings on dropout and weight decay are somewhat surprising, as we expect additional regularization to be helpful, particularly on small datasets like SST and STS. However, BERT's hidden layer and attention probability dropouts may be sufficiently regularizing on their own. Using a different head dropout probability for each task may have performed better: our experiments showed that STS benefits from a very high dropout rate, while without the ability to combine all of the already robust pooling dimensions the model may be forced to learn cruder approximations for fine-grained sentiment. In previous literature, we typically see the weight decay parameter fixed at the common value of 0.01, which may not be well tuned for BERT fine-tuning: potentially this higher value causes the pretrained parameter values to decay too much during fine-tuning.

In reference to Sun et al. (2019), our results on layer decay and further pretraining seem reasonable. Despite our differing data, we also found 0.95 to be the best layer decay factor. Halving or completely freezing the input embedding learning rate may not have allowed enough specialization, and a smaller reduction may be effective. Similarly, when drastically increasing the head learning rate, the gradient steps may have become too out of sync with those in the BERT layers. Additional pretraining appears to be highly dataset-specific. Sun et al. (2019) also found a degradation in performance when performing additional pretraining using only the small TREC dataset (Li and Roth, 2002), and a large dataset from a similar domain may have helped SST. Even for the larger QQP, Sun et al. (2019)'s 100K additional pretraining steps were not optimal, and we did not see nearly as smooth a relationship with finetuned dev performance. Though Liu et al. (2019b) found that the NSP task is overall unnecessary for good fine-tuned performance, it may have given us more consistent results given our reliance on the [CLS] representation.
For model architecture, the static BERT dimensions may have been too highly correlated with some of those from fine-tuning, allowing the model to cheat regularization and overfit: perhaps a higher head dropout rate would have been helpful in this case. On the other hand, the regularizing effect of multitask training (Liu et al., 2019a) may have been sufficient for multi-BERT to effectively complement triple-BERT. However, factoring in model size and resource consumption, multi-BERT alone or a 2- or 3-head-layer CerBERTus model seems to be an excellent balance, consistent with the existing understanding that the layers closer to the input are very generic and fine-tuning them to specific tasks does not bring much benefit.

6 Conclusion

We learn that the additional regularization introduced by the SMART loss is highly effective. While a combination of the triple-BERT architecture and multi-BERT works well, the middle-ground CerBERTus achieves comparable performance with significantly fewer parameters and less compute time. In the future we might revisit separate embeddings with cross attention for QQP and STS, further pretraining on the smaller datasets by utilizing data from the same domain, or the use of other token embeddings in addition to the [CLS] token. Many of our choices were made with evaluation on only a single dev dataset, and using cross-validation would allow us to ensure we are not overfitting to it. For the few examples that exceed BERT's 512-token limit, we might also revisit how we handle truncation, e.g. truncating the middle instead of the end of the text as in Sun et al. (2019). We might also explore using Howard and Ruder (2018)'s slanted triangular learning rate schedule.

References

Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. *SEM 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 32–43, Atlanta, Georgia, USA. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification.

Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. 2020. SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization.

Xin Li and Dan Roth. 2002. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4487–4496. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach.
Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.

Michael McCloskey and Neal J. Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. Volume 24 of Psychology of Learning and Motivation, pages 109–165. Academic Press.

Manish Munikar, Sushil Shakya, and Aakash Shrestha. 2019. Fine-grained sentiment classification using BERT.

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. 2019. Deep double descent: Where bigger models and more data hurt.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune BERT for text classification?

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.

A Appendix

Figure 7: Effect of additional pretraining steps on STS correlation
Figure 8: Illustration of architectures. Note that only the embedding in the [CLS] position is fed to the linear heads, after passing through the pooling layer.