Few-shot Classification of Disaster-related Tweets
Stanford CS224N Custom Project

Jubayer Ibn Hamid
Department of Computer Science, Stanford University
jubayer@stanford.edu

Jitendra Nath Pandey
Department of Computer Science, Stanford University
jnpandey@stanford.edu

Sheikh Rifayat Daiyan Srijon
Department of Computer Science, Stanford University
srijon@stanford.edu

Abstract

Social media is a powerful tool for helping emergency aid centres and response operators coordinate a response to a crisis. Platforms like Twitter allow information to travel fast, making coordination with people at the scene easier and therefore allowing response operators to attain higher situational awareness (Vieweg, 2012). However, the lack of filtration methods on these platforms means that false news can spread, and the resulting skepticism has curtailed our ability to respond to crises in a timely manner. An AI-driven solution to this problem needs to perform well even when it has been trained on a small labelled dataset. As Chowdhury et al. (2020) discuss, most of the work in this domain has dealt with classifying posts written in English only, and if one were to finetune a model for each disaster, the dataset would be even smaller. In this project, we analyse the performance of language models (both base models and those that incorporate few-shot learning techniques) in classifying disaster-related Tweets as either true or false on few-shot datasets. In particular, we analyse the performance of base DistilBERT models with pretraining, with supervised contrastive learning (which augments the loss function to obtain better results with fewer training examples), and with Prototypical Neural Networks. We find that large language models like DistilBERT are good at few-shot classification of disaster-related tweets even without incorporating few-shot learning techniques, and that they show less degradation in performance as the dataset shrinks. Our research reinforces the hypothesis from OpenAI (2020) that pretraining scaled-up language models on large corpora of data improves task-agnostic performance through strong, generalised representations of language, and that finetuning on noisy datasets worsens performance in few-shot learning. Our analysis of the results suggests that large pretrained language models perform very well at few-shot learning because the strong representations of language they learn make them task-agnostic few-shot learners. In particular, we find that, comparatively, fine-tuning can even worsen performance when noisy datasets damage the representational learning of these large language models.

1 Key Information to include

• Mentor: Swastika Dutta
• External Collaborators (if you have any): N/A
• Sharing project: N/A

Stanford CS224N Natural Language Processing with Deep Learning

2 Introduction

Disasters are high-pressure situations in which response operators need to act fast and deploy resources very efficiently. For that, it is crucial that they have access to key information that increases their situational awareness. Research suggests that social media can be a powerful tool in enabling that (Vieweg, 2012). However, most social media platforms, including Twitter, lack a filtration method specifically aimed at filtering posts about disasters on the basis of whether they are true or not.
As such, social media posts are both high reward (if we respond correctly to posts that spread true information) and high risk (if we respond on the basis of misinformation). It is therefore imperative that we can filter disaster-related posts based on whether they spread misinformation or true information.

One of the biggest challenges in this area is the shortage of data. First, there is a shortage of data in multiple languages; this severely limits the performance of models on non-English languages, as seen in the results section of Chowdhury et al. (2020). Furthermore, even for English, there are strong imbalances in the data. For example, in the CREDBANK dataset (Mitra and Gilbert, 2015), the vast majority of examples (> 95 percent) have been labelled as certainly accurate whereas only one has been labelled as certainly inaccurate, which suggests that data imbalances are very likely in the realm of disaster-tweet classification. In many cases, we want to finetune models for each instantiation of a disaster using only data from that disaster, and in such cases the insufficiency of data becomes an even more acute problem.

In this project, we implemented and analysed the performance of three different models: a DistilBERT model with cross-entropy loss, a DistilBERT model with supervised contrastive loss, and a prototypical neural network. We trained these models on datasets of various sizes and evaluated them to analyse their performance in few-shot classification. We conclude that the DistilBERT and the Prototypical Neural Network perform better (with some differences in precision versus recall) at few-shot classification but, overall, baseline language models are good at learning from small datasets, which is consistent with the findings of OpenAI (2020).

3 Related Work

Few-shot classification aims to learn a classifier using a small number of labelled training examples. Several different approaches have been taken to train a model to do this. Initialisation-based methods tackle the problem by training models to learn how to finetune: some learn good model initialisations (Finn et al., 2017) whereas others learn an optimiser (Ravi and Larochelle, 2017). In this paper, we explore distance-metric learning based methods; specifically, we analyse the performance of Prototypical Neural Networks (Snell et al., 2017), which learn to embed classes into a class space and to embed each input into the same space. The class label is then found by measuring similarity via norms in that class space.

Supervised contrastive loss is a family of loss functions widely used in classification problems in natural language processing (Gunel et al., 2021). In the case of binary classification, we work with a batch of training examples of size $N$: $\{x_i, y_i\}_{i=1,\dots,N}$. Let $N_{y_i}$ be the total number of examples in the batch that have the same label as $y_i$. Let $y_{i,c}$ be the true label and $\hat{y}_{i,c}$ the model output for the probability of the $i$-th example belonging to class $c$. Finally, suppose $\Phi(\cdot) \in \mathbb{R}^d$ is the encoder that outputs the $\ell_2$-normalised final encoder hidden layer before the softmax projection. The overall loss is a weighted average of the cross-entropy (CE) loss and the supervised contrastive learning (SCL) loss:

$$\mathcal{L} = (1 - \lambda)\,\mathcal{L}_{CE} + \lambda\,\mathcal{L}_{SCL},$$

where $\lambda$ is a hyperparameter, $\mathcal{L}_{CE}$ is the cross-entropy loss

$$\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{2} y_{i,c} \cdot \log \hat{y}_{i,c},$$

and the contrastive loss is

$$\mathcal{L}_{SCL} = -\sum_{i=1}^{N} \sum_{j=1}^{N} \mathbf{1}\{i \neq j\} \cdot \mathbf{1}\{y_i = y_j\} \, \frac{1}{N_{y_i} - 1} \log \frac{\exp\left(\Phi(x_i) \cdot \Phi(x_j)/\tau\right)}{\sum_{k=1}^{N} \mathbf{1}\{i \neq k\} \exp\left(\Phi(x_i) \cdot \Phi(x_k)/\tau\right)},$$

where $\tau$ is a temperature hyperparameter. Our final implementation of the supervised contrastive loss function was inspired by the implementation in Khosla et al. (2020).
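To make the combined objective concrete, the following is a minimal PyTorch sketch of how the SCL term and the combined loss can be computed from a batch of $\ell_2$-normalised encoder outputs. It is an illustrative sketch under the notation above, not our exact training code; the tensor names (features, labels, class_logits) are our own.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features: torch.Tensor,
                                labels: torch.Tensor,
                                tau: float = 0.3) -> torch.Tensor:
    """L_SCL over one batch.

    features: (N, d) l2-normalised encoder outputs Phi(x_i).
    labels:   (N,) integer class labels y_i.
    """
    n = features.size(0)
    # Pairwise similarities Phi(x_i) . Phi(x_k) / tau.
    sims = features @ features.T / tau                                  # (N, N)
    # 1{i != k}: exclude self-similarity from the denominator.
    not_self = ~torch.eye(n, dtype=torch.bool, device=features.device)
    log_denom = torch.log((torch.exp(sims) * not_self).sum(dim=1, keepdim=True))
    log_prob = sims - log_denom                    # log softmax over k != i
    # 1{y_i = y_j} 1{i != j}: positive pairs share a label and are distinct.
    positives = (labels.unsqueeze(0) == labels.unsqueeze(1)) & not_self
    # Divide by N_{y_i} - 1, the number of positives for each anchor i.
    n_pos = positives.sum(dim=1).clamp(min=1)
    return (-(log_prob * positives).sum(dim=1) / n_pos).sum()

def combined_loss(class_logits, features, labels, lam=0.9, tau=0.3):
    """L = (1 - lambda) * L_CE + lambda * L_SCL."""
    ce = F.cross_entropy(class_logits, labels)
    scl = supervised_contrastive_loss(features, labels, tau)
    return (1 - lam) * ce + lam * scl
```

With lam = 0.9 the contrastive term dominates, which matches the setting we report in Section 5.3.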
Prototypical Neural Networks (PNNs) are a family of neural networks that aim to perform few-shot classification, including across unseen classes. Let $S_i$ be the set of all training examples belonging to class $i$. A PNN computes an $M$-dimensional representation (called a prototype) for each class through an embedding function $f_\phi : \mathbb{R}^D \to \mathbb{R}^M$ with parameters $\phi$. The prototype for class $i$ is defined as

$$c_i = \frac{1}{|S_i|} \sum_{(x_j, y_j) \in S_i} f_\phi(x_j).$$

Using the prototypes, we can compute the probability of an input $x$ belonging to class $i$ as

$$p_\phi(y = i \mid x) = \frac{\exp\left(-d(f_\phi(x), c_i)\right)}{\sum_k \exp\left(-d(f_\phi(x), c_k)\right)},$$

where $d(x, y)$ is the Euclidean distance between $x$ and $y$.
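As a concrete illustration of this distance-metric formulation, here is a minimal PyTorch sketch of a prototypical head operating on encoder embeddings. The episode construction and variable names (support_emb, query_emb) are illustrative assumptions rather than a verbatim copy of our implementation.

```python
import torch

def class_prototypes(support_emb: torch.Tensor,
                     support_labels: torch.Tensor,
                     num_classes: int) -> torch.Tensor:
    """c_i: mean embedding f_phi(x) over the support examples of class i.

    support_emb:    (S, M) embeddings of the support set.
    support_labels: (S,) integer labels in [0, num_classes).
    Returns a (num_classes, M) tensor of prototypes.
    """
    return torch.stack([
        support_emb[support_labels == c].mean(dim=0)
        for c in range(num_classes)
    ])

def prototype_probabilities(query_emb: torch.Tensor,
                            prototypes: torch.Tensor) -> torch.Tensor:
    """p_phi(y = i | x): softmax over negative Euclidean distances.

    query_emb:  (Q, M) embeddings of the query examples.
    prototypes: (C, M) class prototypes.
    Returns (Q, C) class probabilities.
    """
    dists = torch.cdist(query_emb, prototypes)   # (Q, C) distances d(f_phi(x), c_i)
    return torch.softmax(-dists, dim=1)
```

Training proceeds by backpropagating the negative log-probability of each query example's true class through the shared encoder $f_\phi$ (in our case, a DistilBERT encoder).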
4 Approach

The first model we trained is a DistilBERT model. We pretrained this model with masked-language modelling. Instead of pretraining on all kinds of tweets, we pretrained on disaster-related tweets only, so that the model learns stronger representations for words that specifically appear in disaster-related tweets. We then finetuned the model by adding a softmax classifier. In this first model, we used only cross-entropy loss.

Next, we trained a DistilBERT model with supervised contrastive loss and tuned the hyperparameters λ and τ. To ensure that the model is forced to capture similarities between examples within a class and to contrast them with examples in other classes, we required λ ≥ 0.5.

Lastly, we trained a Prototypical Neural Network using a learnable embedding matrix: a DistilBERT model with a prototypical neural network head.

Across all models, we used dropout to prevent overfitting on small datasets. All models were pretrained and fine-tuned (HuggingFace) with 5 transformer blocks, incorporating dropout (to reduce overfitting) and layer normalisation. We also employed early stopping when we saw insignificant improvement in performance, to prevent overfitting, which we think is an important concern when training on few-shot datasets.

The models were trained to identify a tweet related to a disaster as either true (correct information) or false (misinformation). Note that we are specifically not training the models to identify uninformative posts; we only want to classify posts as either true or false.
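The two-stage recipe above (domain-adaptive masked-language-model pretraining followed by fine-tuning a classifier head with early stopping) can be sketched with the HuggingFace transformers API roughly as follows. This is a simplified sketch under assumed inputs: tweet_corpus, train_split and val_split stand for already tokenised HuggingFace datasets of unlabelled disaster tweets and labelled true/false tweets, and the output directories and epoch counts are illustrative rather than our exact configuration.

```python
from transformers import (AutoModelForMaskedLM, AutoModelForSequenceClassification,
                          AutoTokenizer, DataCollatorForLanguageModeling,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

def pretrain_and_finetune(tweet_corpus, train_split, val_split):
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    # Stage 1: masked-language-model pretraining on disaster-related tweets only,
    # so the encoder learns stronger representations of disaster vocabulary.
    mlm_model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
    mlm_trainer = Trainer(
        model=mlm_model,
        args=TrainingArguments(output_dir="distilbert-disaster-mlm",
                               num_train_epochs=3),
        train_dataset=tweet_corpus,
        data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                      mlm_probability=0.15),
    )
    mlm_trainer.train()
    mlm_trainer.save_model("distilbert-disaster-mlm")

    # Stage 2: fine-tune with a softmax classifier head (cross-entropy loss),
    # stopping early if validation loss does not improve for 10 epochs.
    clf = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-disaster-mlm", num_labels=2)
    clf_trainer = Trainer(
        model=clf,
        args=TrainingArguments(
            output_dir="distilbert-disaster-clf",
            learning_rate=1e-5,
            num_train_epochs=200,
            evaluation_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
            metric_for_best_model="eval_loss",
            greater_is_better=False,
        ),
        train_dataset=train_split,
        eval_dataset=val_split,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],
    )
    clf_trainer.train()
    return clf
```

The supervised-contrastive and prototypical variants replace the standard cross-entropy objective in the second stage with the losses described in Section 3.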
5 Experiments

5.1 Data

We trained on three datasets: two from Kaggle (Kaggle) and the CRISIS6 dataset (CrisisMMD). These are labelled datasets in which Tweets are labelled as either true or false. We tested our algorithms on four dataset sizes: 5700 (full), 1000, 100 and 10 examples. Each of the smaller datasets was created by randomly sampling datapoints from the full dataset.

5.2 Evaluation method

We evaluate the performance of each model on each dataset using both accuracy and the F1 score, which combines the precision and recall of a model. To define the F1 score, define the following:

True Positives (TP): number of samples correctly predicted as "positive."
False Positives (FP): number of samples wrongly predicted as "positive."
True Negatives (TN): number of samples correctly predicted as "negative."
False Negatives (FN): number of samples wrongly predicted as "negative."

Then, the F1 score can be computed as

$$F_1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)}.$$

5.3 Experimental details

First, we report the hyperparameters for our models. Note that for all models, we stopped training early if we saw no improvement in validation loss over 10 epochs, in order to prevent overfitting on small datasets.

For the DistilBERT model with cross-entropy loss, the hyperparameters (after tuning) are as follows:

Hyperparameter                           Value
Learning rate                            10^-5
Decay factor                             0.8 (after every 10 epochs)
Number of training epochs                200
Dropout (for each transformer block)     0.1
Dropout (final)                          0.5
Number of transformer blocks             5

For the DistilBERT model with supervised contrastive learning loss, the hyperparameters (after tuning) are as follows:

Hyperparameter                           Value
Learning rate                            10^-5
Decay factor                             0.8 (after every 10 epochs)
Number of training epochs                200
Dropout (for each transformer block)     0.1
Dropout (final)                          0.5
λ (weight of the SCL loss)               0.9
Number of transformer blocks             5

For the Prototypical Neural Network model, the hyperparameters (after tuning) are as follows:

Hyperparameter                           Value
Learning rate                            6 · 10^-5
Decay factor                             0.8 (after every 10 epochs)
Number of training epochs                200
Dropout (for each transformer block)     0.1
Dropout (final)                          0.5
λ (weight of the SCL loss)               0.9
Number of transformer blocks             5

As mentioned before, we used four different dataset sizes. For the smaller ones, we randomly sampled x training examples, where x is the size of the training set. We resampled if we saw significant imbalance in the sampled dataset.

5.4 Results

First, we analyse the F1 scores attained by the three models on the different dataset sizes. We observe that the performance of DistilBERT with the supervised contrastive learning (SCL) loss falls rapidly as the dataset shrinks to 10 examples. On the smallest dataset of 10 examples, SCL attains an F1 score of only 0.186 while the other models attain more than 0.6. Before this point, however, SCL performed as well as the other models; in fact, it performed better than DistilBERT with cross-entropy loss on 100 training examples. With the full dataset, SCL performs better than all other models: SCL attains an F1 score of 0.813, whereas DistilBERT attains a marginally lower 0.806 and the prototypical neural network attains 0.784. It is clear that the performance of DistilBERT with cross-entropy loss and of the Prototypical Neural Network is more stable across dataset sizes, with less variance in model performance.

Next, we analyse the accuracy of the models. We note that DistilBERT with cross-entropy loss is superior to all other models at all dataset sizes in terms of accuracy. Once again, the performance of SCL drops significantly when the dataset shrinks to 10 examples: on 10 examples it attains an accuracy of only 0.552, while on datasets of size at least 100 it attained an accuracy of at least 0.76. Before that, it attained approximately the same accuracy as DistilBERT with cross-entropy loss and marginally outperformed the prototypical neural network.

6 Analysis

Firstly, we observe that the base DistilBERT model with cross-entropy loss performs notably well at few-shot learning, which reaffirms the analysis in OpenAI (2020). Note that the accuracy attained by the DistilBERT model changes very little as the dataset size changes. This coincides with the hypothesis that scaling up language models greatly improves task-agnostic, few-shot performance: the DistilBERT model is a large language model with 66 million parameters that has been trained on 3.3 billion words. In fact, using the supervised contrastive learning loss or a prototypical head on top of the DistilBERT model seemed to either make no improvement or sometimes worsen performance. We believe this is because large language models attain strong representations of language in general, and Tweets are not out-of-distribution enough to hurt their performance in this particular domain.

Furthermore, we notice that there is a significant change in performance as the dataset is shrunk from 100 to 10 examples, compared to the other reductions. This suggests that there is significant noise in the dataset which "cancels out" only when the dataset is extremely large. When models are finetuned on a small, noisy dataset, their performance shows large variance. Notably, the performance degradation when shrinking from 5700 to 100 training examples is not as large as when shrinking from 100 to 10. Our hypothesis is that a training set of 100 examples is still fairly representative of the distribution of Tweets related to disasters, but 10 examples is harmful for robustness.

Our hypothesis is further strengthened by observing the performance of the prototypical neural networks. Note that for binary classification the prototype for each class is

$$c_i = \frac{1}{|S_i|} \sum_{(x_j, y_j) \in S_i} f_\phi(x_j).$$

If the dataset is extremely noisy, this unweighted mean embedding of each of the two classes will vary a lot: with 768-dimensional embeddings, the variance $|I^\top \Sigma I|$, where $I$ is the identity and $\Sigma$ is the covariance matrix of $f_\phi$, is going to be very large. This means that the model's performance on a different noisy dataset of small size will be poor. In comparison, DistilBERT without the prototypical head does not stray far from its original representations when it is finetuned with cross-entropy loss on the small, noisy datasets.
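The claim that a prototype estimated from a small, noisy support set is unreliable can be illustrated with a short simulation. The sketch below uses synthetic Gaussian "embeddings" with an arbitrary noise scale, so it is a toy numerical illustration of the variance argument, not an experiment from this project.

```python
import torch

torch.manual_seed(0)

def prototype_error(support_size: int, dim: int = 768,
                    noise: float = 1.0, trials: int = 200) -> float:
    """Average distance between the estimated prototype c_i and the true
    class mean (taken to be the origin) for a given support-set size."""
    errors = []
    for _ in range(trials):
        # Synthetic noisy embeddings f_phi(x) ~ N(0, noise^2 I).
        support = noise * torch.randn(support_size, dim)
        prototype = support.mean(dim=0)          # unweighted mean, as in a PNN
        errors.append(prototype.norm().item())
    return sum(errors) / len(errors)

for n in (10, 100, 5700):
    print(f"support size {n:>4d}: prototype error ~ {prototype_error(n):.3f}")
# The error scales roughly as noise * sqrt(dim / n), so a 10-example support
# set yields a far noisier prototype than a 100- or 5700-example one.
```

This mirrors the argument above: a prototypical head fitted on a tiny, noisy support set transfers poorly, whereas the underlying DistilBERT representations change little under cross-entropy fine-tuning.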
7 Conclusion

In conclusion, we find that base large language models that do not incorporate few-shot learning loss functions are relatively better at classifying disaster-related Tweets than those that do. Further work could explore other large language models on few-shot learning datasets. Furthermore, due to limited GPU capacity, we could not run some other interesting experiments; future work should explore deploying pretrained large language models for few-shot learning without any finetuning, by simply adding classifier heads on top of them.

References

Beliz Gunel, Jingfei Du, Alexis Conneau, and Ves Stoyanov. 2021. Supervised contrastive learning for pre-trained language model fine-tuning. In ICLR 2021.

Jishnu Ray Chowdhury, Cornelia Caragea, and Doina Caragea. 2020. Cross-lingual disaster-related multi-label tweet classification with manifold mixup. In Association for Computational Linguistics 2020.

CrisisMMD. CrisisMMD: Multimodal crisis dataset.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70.

HuggingFace. Fine-tuning a masked language model. https://huggingface.co/course/chapter7/3?fw=tf.

Kaggle. Disaster tweets dataset. https://www.kaggle.com/datasets/vstepanenko/disaster-tweets.

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised contrastive learning. arXiv preprint arXiv:2004.11362.

Tanushree Mitra and Eric Gilbert. 2015. CREDBANK: A large-scale social media corpus with associated credibility annotations. In Ninth International AAAI Conference on Web and Social Media (ICWSM), Vol. 9, No. 1.

OpenAI. 2020. Language models are few-shot learners.

Sachin Ravi and Hugo Larochelle. 2017. Optimization as a model for few-shot learning. In Proceedings of the International Conference on Learning Representations (ICLR).

Jake Snell, Kevin Swersky, and Richard S. Zemel. 2017. Prototypical networks for few-shot learning. CoRR, abs/1703.05175.

Sarah Elizabeth Vieweg. 2012. Situational awareness in mass emergency: A behavioral and linguistic analysis of microblogged communications. PhD thesis, University of Colorado at Boulder.