Contrastive Learning for Sentence Embeddings in BERT and its Smaller Variants

Stanford CS224N Custom Project

Vrishab Krishna
Department of Computer Science
Stanford University
vrishab@stanford.edu

Rohan Bansal
Department of Computer Science
Stanford University
robansal@stanford.edu

Abstract

Contrastive learning is a method of learning representations by exploiting invariances in the data under augmentations and encouraging the embeddings of augmented samples to remain close together. An appealing property of such approaches is that they enable models to perform better on a range of tasks even when trained on smaller amounts of data, and they allow smaller models to perform as well as their larger counterparts. In this project, we demonstrate that both supervised and unsupervised contrastive learning approaches improve semantic performance for smaller BERT architectures (including BERT-small and BERT-mini) on both pre-training and downstream objectives, while improving the representational uniformity of the embeddings and retaining broad downstream flexibility. Our results indicate that we can continue to maximize performance in smaller transformer architectures and produce results comparable to larger state-of-the-art architectures at a fraction of the computing cost and training time. We conclude by outlining new areas of research that may provide even larger boosts to semantic performance, including self-supervised approaches from computer vision that have been shown to perform well on comparable objectives.

1 Key Information to include

External collaborators (if any): None. External mentor (if any): None. Sharing project: False.

2 Introduction

Self-supervised learning (SSL) has produced tremendous improvements in NLP models and data representations without the need for intensive and noisy labeling on ill-defined tasks. Objectives such as Next Sentence Prediction (NSP) and Masked Language Modeling (MLM) have been used to instill language-specific priors into models, removing the need for extremely large amounts of labelled data for a particular task and yielding more general models that can serve as a precursor to fine-tuned models for classification, regression, and generative tasks.

A promising sub-field of self-supervised learning is contrastive learning, where the goal of the optimization is to distinguish between similar and dissimilar samples in the data. This involves capitalizing on fundamental structures within the data to develop compressed, expressive, and robust representations. One way this is done is by using priors on invariances within the data to label pairs of datapoints as similar (positive) or dissimilar (negative). In computer vision, such methods have been very successful because image transformations like affine perturbations, color jitter, and noise are easy to apply and, used in moderation, do not significantly alter the semantics of an image. Positive samples are two augmented versions of the same starting image, and negative samples are augmented versions of different images. SSL methods like SimCLR (Chen et al., 2020) and DINO (Caron et al., 2021) have provided significant boosts when trained on fractions of the labeled data. In NLP, we look at similar approaches in the context of sentences and their representations.
The issue is that augmentations and invariances are more complex in natural language than in images: linguistic structure makes it difficult to generate alternate views of the same sentence that do not perturb its semantics. Nevertheless, contrastive approaches have yielded significant improvements in few-shot and fine-tuning accuracy as well as in generalization. Hence, even smaller models contrastively fine-tuned with such methods could potentially match the performance of much larger models trained in the vanilla fashion. Distillation between larger and smaller models has shown that equivalent performance can be reached with an order of magnitude fewer parameters. SSL methods could therefore provide ways for smaller models to be trained to better accuracy without the need to train larger models at all.

3 Related Work

In the context of sentence embeddings, recent work has shown that different views of the same sentence can be generated with different dropout masks in the model (Yao et al., 2021; Yan et al., 2021). Using the InfoNCE contrastive loss (very similar to the loss used by SimCLR for images (Chen et al., 2020)), these methods optimize the generated embeddings across a corpus, obtaining state-of-the-art results with BERT-base and BERT-large on semantic textual similarity (STS) datasets (Rethmeier and Augenstein, 2021; Gao et al., 2021). For a positive pair (x, a+) and K negative pairs (x, a-_i), the loss is

L_{InfoNCE} = -\log \frac{e^{s(x, a^{+})}}{e^{s(x, a^{+})} + \sum_{i=1}^{K} e^{s(x, a_{i}^{-})}}    (1)

One issue with such contrastive schemes is feature suppression: in the case of the SimCSE work above, the embeddings sometimes fail to separate textual and semantic components. Wang et al. (2022) addressed this by adding soft negative samples to force a distinction between textual and semantic similarity. They constrain the cosine similarity difference between positive pairs and soft negative pairs, denoted \Delta, to the interval [-\beta, -\alpha] by proposing a bidirectional margin loss:

L_{BML} = \mathrm{ReLU}(\Delta + \alpha) + \mathrm{ReLU}(-\Delta - \beta)    (2)

The final objective function for soft negative contrastive training is then

L_{SNCSE} = L_{InfoNCE} + \lambda L_{BML}    (3)

where \lambda controls the weight of the soft negative samples. This approach was shown to alleviate some of the feature suppression issues for larger BERT models; the authors also experimented with treating soft negative examples as purely negative (in an objective identical to L_{InfoNCE}) but found no marked improvement (Wang et al., 2022). (A minimal code sketch of this combined objective is given at the end of this section.)

These contrastive methods have provided significant performance increases across datasets and tasks when applied to large models like BERT-base and BERT-large. However, to our knowledge, it has not previously been tested how smaller models like BERT-small and BERT-mini hold up under SimCSE and SNCSE pre-training. We hope to show that smaller models can reach the accuracy of BERT-base and BERT-large on downstream tasks with only a fraction of the training data. This is particularly useful because it opens up the use of highly accurate models in resource-constrained settings.
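For concreteness, the following is a minimal PyTorch sketch of the combined objective in equations (1)-(3). It is illustrative rather than a reproduction of the SimCSE or SNCSE implementations: cosine similarity is assumed for s(., .), the sign convention for \Delta (soft-negative similarity minus positive similarity) is an assumption, and practical implementations typically add a temperature inside the exponentials and operate on full in-batch similarity matrices.

```python
# Minimal sketch of the combined SNCSE-style objective from equations (1)-(3).
# Variable names and shapes (anchor, positive, negatives, soft_negative) are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives):
    """Eq. (1): anchor/positive are (d,) vectors, negatives is (K, d); s(.,.) is cosine similarity."""
    pos = F.cosine_similarity(anchor, positive, dim=0)                 # scalar s(x, a+)
    neg = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=1)   # (K,) values s(x, a_i^-)
    logits = torch.cat([pos.unsqueeze(0), neg])                        # positive similarity in slot 0
    return -torch.log_softmax(logits, dim=0)[0]                        # -log(e^pos / (e^pos + sum e^neg))

def bml(delta, alpha=0.1, beta=2.0):
    """Eq. (2): zero exactly when delta lies in the interval [-beta, -alpha]."""
    return F.relu(delta + alpha) + F.relu(-delta - beta)

def sncse_loss(anchor, positive, negatives, soft_negative, alpha=0.1, beta=2.0, lam=1e-3):
    """Eq. (3): InfoNCE plus the weighted bidirectional margin loss on the soft negative."""
    # Assumed sign convention: delta = s(x, soft negative) - s(x, positive).
    delta = (F.cosine_similarity(anchor, soft_negative, dim=0)
             - F.cosine_similarity(anchor, positive, dim=0))
    return info_nce(anchor, positive, negatives) + lam * bml(delta, alpha, beta)
```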
4 Approach

Our goal is to study the improvement provided by the previously described pre-training methods on top of (i.e., after) the regular NSP pre-training used to train the vanilla BERT variants. We also contextualize these improvements in terms of training time (the number of pre-training epochs needed to achieve a result) as well as the amount of data used for training on downstream tasks. We run a comprehensive set of experiments, spanning different architecture sizes, datasets, and augmentation methods, to best assess the maximal performance that these techniques can provide.

We use the SimCSE and SNCSE repositories for our experiments, adapting and significantly simplifying their original code with the Hugging Face transformers module to make it more modular for the experiments we hope to run in the future. Figure 1 visualizes the supervised and unsupervised approaches proposed by Gao et al. (2021). The figure highlights that the supervised task is essentially a binary classification task in which embeddings are drawn closer together for entailment pairs and pushed apart for all other pairs. In contrast, the unsupervised task uses two different dropout masks to predict whether a sentence is identical to itself and does not require annotated data. For the soft negative unsupervised task, we follow the best-performing hyperparameter selection in the original work (Wang et al., 2022), setting alpha = 0.1, beta = 2.0, and lambda = 1 x 10^-3.

Figure 1: A depiction of the supervised and unsupervised SimCSE approaches for pre-training (Gao et al., 2021).

As baselines we use the vanilla BERT models: the pre-trained, publicly available BERT-large, BERT-base, BERT-small, and BERT-mini checkpoints, each of which is trained on the NSP and MLM tasks on 2.5 billion words from Wikipedia (Devlin et al., 2018). We also write custom code to further pre-train these vanilla models on the same random sample of Wikipedia sentences used for contrastive learning, in order to provide a stronger baseline and control for the effect of the additional data and training steps. The checkpointed models are obtained from the Hugging Face transformers repository and run on sentences from each dataset to generate the embeddings that are subsequently evaluated.

5 Experiments

5.1 Data

Our datasets fall into two broad categories: those used for contrastive pre-training and those used in downstream tasks as a proxy for embedding quality.

5.1.1 Contrastive Pretraining

Supervised: a collection of 570k annotated sentence pairs from the SNLI dataset, labeled as either entailment (similar) or contradiction (different).

General Unsupervised: a collection of 1 million randomly sampled sentences from Wikipedia, where each sentence is considered similar to itself and distinct from all other sentences.

Soft Unsupervised: generated from the same collection of 1 million randomly sampled Wikipedia sentences; soft negative samples are drawn by applying explicit parser-based negation to relevant sentences, and hard negatives are simply the remaining sentences.
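As a concrete illustration of the unsupervised setup above, the sketch below shows how a positive pair can be formed from a single sentence by encoding it twice with dropout left active, following the SimCSE recipe: the two views of a sentence form a positive pair, and the views of other sentences in the batch act as negatives. The checkpoint name and CLS pooling are assumptions (one publicly available BERT-small checkpoint), not necessarily the exact configuration used in our experiments.

```python
# Sketch: unsupervised SimCSE-style positive pairs via two dropout-perturbed forward
# passes over the same sentences. "prajjwal1/bert-small" is one public BERT-small
# checkpoint and is used here only as an illustrative assumption.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-small")
model = AutoModel.from_pretrained("prajjwal1/bert-small")
model.train()  # keep dropout ACTIVE so the two passes see different dropout masks

sentences = ["The cat sat on the mat.", "A new article was added to Wikipedia."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():  # gradients omitted here; during pre-training these passes are differentiated
    view1 = model(**batch).last_hidden_state[:, 0]  # CLS pooling, first view
    view2 = model(**batch).last_hidden_state[:, 0]  # second view, different dropout mask

# (view1[i], view2[i]) form a positive pair; (view1[i], view2[j != i]) act as in-batch negatives.
```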
5.1.2 Downstream

Sentiment Classification (SST-2): a collection of 65k movie-review sentences labeled as positive or negative.

Question Answering (QNLI): a collection of 110k question-answer pairs randomly sampled from Wikipedia (and answered by human experts).

Semantic Similarity (STSB): the STS benchmark, consisting of 8.5k sentence pairs with human-annotated similarity scores.

5.2 Evaluation method

For the pre-training evaluation, we follow a similar methodology to SimCSE and utilize the SentEval package with different poolers. To check the quality of the embeddings, we evaluate them zero-shot: there is no training on the evaluation dataset itself. Predictions are made using the cosine similarity of the embeddings across multiclass and binary tasks. This fundamentally evaluates whether the embeddings generated by the model have the predictive capacity to solve the problem on their own, with no task-specific training. The outputs are then scored with Pearson's and Spearman's correlation to measure the agreement between predictions and the ground truth. We report these scores individually for the STS12-15 and SICK datasets as well as averaged across all datasets. Because this evaluation is zero-shot and the models do not see these semantic datasets during pre-training, we use the comparable vanilla BERT checkpoints as baselines. Since our goal is to measure whether contrastive pre-training provides additional benefits on top of NSP pre-training, the embeddings from these checkpoints are a fair baseline.

We use three downstream tasks to evaluate the quality of the embeddings. SST-2 and QNLI are binary classification tasks (where we use accuracy as our metric), while STSB is a regression task for which we report the average of Pearson's and Spearman's correlation.
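The following is a minimal sketch of the zero-shot evaluation just described: embed both sentences of each pair, score them with cosine similarity, and correlate the scores with the gold annotations. The pairs/gold arguments and the default checkpoint name are placeholders; this is not the SentEval harness itself, which additionally handles dataset loading and the different poolers.

```python
# Sketch of the zero-shot STS evaluation described in Section 5.2: embed both sentences,
# score with cosine similarity, and correlate against gold similarity scores.
# `pairs`/`gold` are placeholders for an STS split (e.g. STS12-15 or SICK).
import torch
import torch.nn.functional as F
from scipy.stats import pearsonr, spearmanr
from transformers import AutoModel, AutoTokenizer

def embed(sentences, model, tokenizer):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return model(**batch).last_hidden_state[:, 0]  # CLS pooling

def evaluate_sts(pairs, gold, model_name="bert-base-uncased"):
    """pairs: list of (sent_a, sent_b); gold: list of human similarity scores."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    emb_a = embed([a for a, _ in pairs], model, tokenizer)
    emb_b = embed([b for _, b in pairs], model, tokenizer)
    preds = F.cosine_similarity(emb_a, emb_b, dim=1).tolist()
    return pearsonr(preds, gold)[0], spearmanr(preds, gold)[0]
```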
5.3 Experimental details

The models (BERT-large, BERT-base, BERT-small, and BERT-mini) were first pre-trained, starting from the uncased checkpoint of each variant, on one of the three datasets described above using the transformers library (Wolf et al., 2020). We use a batch size of 64 with a learning rate of 3 x 10^-5 in the Adam optimizer, training for 1 epoch for the general unsupervised task and 3 epochs for the supervised task, as suggested by Gao et al. (2021). These experiments use the traditional InfoNCE loss described above; we also add an auxiliary MLM objective to make the task more difficult, and our final objective is simply the sum of the two. For the soft unsupervised task, we use the bidirectional margin loss together with the MLM task, and perform a grid search over learning rates of {2 x 10^-4, 3 x 10^-5, 5 x 10^-5} while training for 3 epochs, as suggested by Oord et al. (2018).

After this pre-training, we performed our sentence-embedding-based STS evaluation with different poolers (including the CLS token, averaging the embeddings, and random sampling) to quantify the improvement from contrastive learning on unseen datasets relative to the original baselines. Finally, we trained each model on the downstream tasks presented above, performing a comprehensive grid search over learning rates of {2 x 10^-5, 3 x 10^-5, 5 x 10^-5}, batch sizes of {16, 32, 64}, and epochs of {2, 3, 4}, and chose the best-performing combination for each model. All experiments are performed on 4 A100 GPUs with distributed training.

5.4 Results

Model                 STS12   STS13   STS14   STS15   SICK    Avg.
BERT-Mini             30.52   32.97   27.11    7.40   51.93   26.05
BERT-Small            33.80   35.39   30.93   11.55   54.68   29.28
BERT-Base             17.19   29.06   19.55    7.16   35.11   17.60
BERT-Large            18.82   28.53   23.69    9.29   35.09   17.92
BERT-Small (Sup.)     73.81   76.42   74.60   14.19   79.62   56.80
BERT-Small (Unsup.)   60.38   75.08   66.56   14.67   68.76   50.42
BERT-Small (Soft)     53.76   63.12   56.11   14.80   63.15   44.56
BERT-Mini (Sup.)      69.51   70.28   69.26   13.08   76.27   53.91
BERT-Mini (Unsup.)    55.13   63.76   56.23   13.95   60.45   44.68
BERT-Mini (Soft)      56.15   63.31   55.80   13.77   62.15   44.82

Table 1: Aggregated results after pre-training for the custom SentEval evaluation.

Model                 SST-2   QNLI    STSB
BERT-Base             93.5    90.5    85.8
BERT-Large            94.9    92.7    86.5
BERT-Mini             85.9    84.1    75.4
BERT-Small            89.7    86.4    78.8
BERT-Small (Sup.)     87.5    85.0    84.6
BERT-Small (Unsup.)   86.9    85.2    84.5
BERT-Mini (Sup.)      82.9    82.88   78.3
BERT-Mini (Unsup.)    82.0    82.7    77.5

Table 2: Downstream GLUE experiments.

We observe that this brief period of pre-training (< 5 minutes on 2 A100 GPUs) allows the small and mini BERT models to improve drastically on the majority of the STS datasets. For the sentence-embedding-based evaluation, we tried different pooling strategies, including taking the CLS token, averaging the embeddings, and randomly sampling from each sentence; we saw little difference in results, so Table 1 reports numbers for these smaller variants using the traditional CLS pooling. These results indicate that even unsupervised pre-training on these smaller architectures produces better embeddings than the vanilla BERT-base and BERT-large checkpoints, despite BERT-small and BERT-mini being roughly 1/4 and 1/10 the size of BERT-base, respectively. It is also important to note that conducting further standard pre-training on the vanilla models did not improve their semantic performance. In fact, the numbers highlight that the released checkpoints for the larger BERT models exhibit particularly poor semantic performance. Additionally, the models trained with soft negative samples generally lag behind and do not outperform the traditional supervised and unsupervised InfoNCE objectives.

We then provide the downstream results in Table 2, where we observe that the contrastively pre-trained models perform slightly worse than their traditional counterparts on the sentiment classification (SST-2) and question answering (QNLI) tasks, but actually outperform them on the Pearson score for semantic similarity after being trained on the STSB dataset. The BERT-small model provides nearly identical performance to both BERT-base and BERT-large on the STSB task, which indicates the ability of contrastive learning to provide a large improvement despite the significant difference in model size. Additionally, all smaller-variant runs take under 12 minutes to complete downstream training on 2 A100 GPUs.

6 Analysis

Figure 2: Downstream STSB scores for supervised, unsupervised, and standard BERT models.

The pre-training results above indicate that contrastive learning approaches are highly effective at improving semantic representations, and that different self-supervised and contrastive methods can yield smaller models with representations on par with larger models on some tasks. The drastic increase in performance relative to the BERT baselines also emphasizes the poor quality of the released embeddings and their tendency to predict high similarity for the vast majority of sentences. We provide three examples of sentence pairs that highlight the kinds of improvements the supervised, contrastive BERT-small makes compared to the vanilla BERT-base:

- We observe an improvement in score from .967 -> .863 for the pair "I have this book to read" and "I have to read this book". This example highlights how changes in word order can often lead to different sentence meanings, a difference the traditional BERT models struggle to contextualize.
- We observe an improvement in score from .802 -> .853 for the pair "There's a man on a bicycle" and "That man is riding a bike". Here, the sentences are nearly identical in meaning but use different phrasing, which leads to a lower score under BERT-base.

- We observe a large improvement in score from .932 -> .167 for the pair "Have you seen my cat" and "I play the piano". These sentences are completely unrelated apart from the first-person subject, yet BERT-base assigns an unreasonably high similarity score.

A known issue with BERT embeddings is that they are highly clustered and lack uniformity, which is reflected in BERT's poor performance on the STS tasks. The examples above illustrate how contrastive learning leads to better uniformity and reduces both the outsized influence of surface wording and the anisotropic nature of the vanilla BERT embeddings. These results also confirm the findings of Gao et al. (2021) and Wang et al. (2022) that contrastive learning produces higher-quality retrieval of similar sentences for a given query sentence. The insensitivity of the sentence evaluation tasks to the choice of pooler also lends credence to the idea of improved embedding distributions and indicates that the model is learning intrinsically better representations for frequent words in the training corpus. Additionally, the performance of the unsupervised models (which were trained on the Wikipedia corpus), together with the lack of improvement shown by the vanilla models when trained for more steps with the standard pre-training approach, serves as a partial ablation and suggests that the contrastive learning objective itself, rather than the data or additional training, is responsible for the lift. Separately, the lagging performance of the soft negative approach may imply that the smaller models are not flexible enough to capture more nuanced differences, such as the addition of a single negation term, through a purely unsupervised objective, given that soft negatives have been shown to outperform the InfoNCE loss for both BERT-base and BERT-large in Wang et al. (2022).

The downstream performance of the models also tends to align with prior literature and our expectations. We observe incrementally worse performance on both the sentiment classification and question answering tasks, which indicates that the traditional BERT checkpoints are better able to fit these specific datasets without necessarily capturing the true semantics of the English language. However, the small size of this gap suggests that the contrastively pre-trained embeddings remain sufficiently flexible for general tasks and would be an appropriate alternative to the vanilla checkpoints released in the original BERT paper. In contrast, on the downstream STSB task the contrastively pre-trained models outperform their standard counterparts; we visualize these changes in Figure 2. Generally, the supervised approaches perform slightly better than the unsupervised approach, which is expected and in line with the pre-training STS results, although this difference is nearly negligible. Additionally, the Pearson curves flatten significantly beginning with BERT-small, and the improvements from scaling the parameter count are relatively minimal. This indicates that the contrastive approaches do have a limit on the added value they can provide, and we hypothesize that access to more high-quality annotated data may allow for even larger improvements in BERT-base and BERT-large.
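As a rough illustration of the clustering effect discussed above, one simple diagnostic is the mean pairwise cosine similarity over a sample of unrelated sentences: values near 1 indicate the anisotropic, tightly clustered geometry attributed to the vanilla checkpoints, while values near 0 suggest more uniform embeddings. This is a hypothetical, illustrative check, not the analysis procedure used in this work.

```python
# Rough anisotropy check: average pairwise cosine similarity across a sample of
# unrelated sentences. Values near 1 correspond to the clustering described above;
# values near 0 correspond to more uniform (isotropic) embeddings. Illustrative only.
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(embeddings):
    """embeddings: (N, d) tensor, e.g. CLS vectors for N unrelated sentences."""
    normed = F.normalize(embeddings, dim=1)
    sims = normed @ normed.T                              # (N, N) cosine similarity matrix
    n = sims.size(0)
    off_diag = sims[~torch.eye(n, dtype=torch.bool)]      # drop the trivial self-similarities
    return off_diag.mean().item()
```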
7 Conclusion

In this work, we demonstrate that contrastive learning approaches in pre-training, as presented by Gao et al. (2021) and Wang et al. (2022), provide significant semantic improvements to BERT embeddings, particularly in smaller variants such as BERT-small and BERT-mini. Through ablation experiments and downstream training, we also highlight that these semantic embeddings are a viable alternative to current state-of-the-art approaches and provide the flexibility needed to adapt them to a wide variety of fine-tuning objectives. Additionally, the smaller size of these models allows for quicker and cheaper training and reduces the hardware overhead associated with general use cases. The improvements appear even with unsupervised objectives using easily accessible data, such as the Wikipedia corpus, further reducing the need for human-annotated, semantically labeled data.

An avenue we would like to explore further is the use of different augmentations and methods to produce the two views of the data in contrastive learning. SimCSE uses different dropout masks across two forward passes (Gao et al., 2021), but a host of other transforms could be applied in tandem during training, such as reordering and corruption (see Bhattacharjee et al. (2022) for a more comprehensive list). For future work, we intend to combine several of these transforms, such as dropout, latent/embedding perturbation, word deletion, and reordering, and test which combinations result in marked metric improvements. We are particularly interested in whether these kinds of augmentations can boost the performance of soft contrastive sampling, an approach that provided significant gains for larger BERT architectures but disappointing performance for the smaller variants. We would also like to determine whether such corruption methods could allow linear probing to provide high-quality results, where the semantic embeddings remain fixed and only a classification head is trained for each fine-tuning task. This would further reduce training cost and allow retention of semantic detail even in other domains.

Finally, another aspect of future work we would like to consider is the use of more modern contrastive self-supervised approaches from computer vision. DINO is a self-distillation method with a teacher and a student network: the student is trained to mimic the activations of the teacher under augmented samples, while the teacher's weights are updated as an exponential moving average of the student's weights (Caron et al., 2021). This method showed emergent properties in vision transformers and obtained state-of-the-art results across several tasks, including image retrieval, copy detection, and semantic layout, which suggests it may further improve semantic performance in textual domains. To our knowledge, this kind of self-distillation pre-training has not been applied to BERT, nor has it been tested on smaller variants. It would be interesting to see how methods that were highly effective in computer vision fare in NLP, and to understand the potential reasons for their success or failure.
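To make this future-work direction concrete, the core mechanical piece of DINO-style self-distillation is the exponential-moving-average (EMA) teacher update; a minimal sketch follows. The momentum value is a typical choice from the vision literature and is an assumption here, as is the idea of applying the update to BERT encoders.

```python
# Sketch of the EMA teacher update at the heart of DINO-style self-distillation:
# the teacher is never trained directly; its weights track a moving average of the
# student's. The momentum value (0.996) is a common choice, not one fixed by this work.
import copy
import torch

@torch.no_grad()
def update_teacher(teacher, student, momentum=0.996):
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

# Usage: the teacher starts as a frozen copy of the student and is updated after each step.
# student = AutoModel.from_pretrained("bert-base-uncased")
# teacher = copy.deepcopy(student)
# for p in teacher.parameters():
#     p.requires_grad_(False)
# ... after each optimizer step on the student:
# update_teacher(teacher, student)
```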
References

Amrita Bhattacharjee, Mansooreh Karami, and Huan Liu. 2022. Text transformations in contrastive self-supervised learning: A review.

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding.

Nils Rethmeier and Isabelle Augenstein. 2021. A primer on contrastive pretraining in language processing: Methods, lessons learned and perspectives.

Hao Wang, Yangguang Li, Zhen Huang, Yong Dou, Lingpeng Kong, and Jing Shao. 2022. SNCSE: Contrastive learning for unsupervised sentence embedding with soft negative samples. arXiv preprint arXiv:2201.05979.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38-45, Online. Association for Computational Linguistics.

Yuanmeng Yan, Rumei Li, Sirui Wang, Fuzheng Zhang, Wei Wu, and Weiran Xu. 2021. ConSERT: A contrastive framework for self-supervised sentence representation transfer.

Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. 2021. CPT: Colorful prompt tuning for pre-trained vision-language models.