HiddenCut: Simple Data Augmentation for Natural Language Understanding with Better Generalization

Jiaao Chen, Dinghan Shen1, Weizhu Chen1, Diyi Yang
Georgia Institute of Technology, 1 Microsoft Dynamics 365 AI
{jchen896,dyang888}@gatech.edu, {dishen,wzchen}@microsoft.com

Abstract

Fine-tuning large pre-trained models with task-specific data has achieved great success in NLP. However, it has been demonstrated that the majority of information within the self-attention networks is redundant and not utilized effectively during the fine-tuning stage. This leads to inferior results when generalizing the obtained models to out-of-domain distributions. To this end, we propose a simple yet effective data augmentation technique, HiddenCut, to better regularize the model and encourage it to learn more generalizable features. Specifically, contiguous spans within the hidden space are dynamically and strategically dropped during training. Experiments show that our HiddenCut method outperforms the state-of-the-art augmentation methods on the GLUE benchmark, and consistently exhibits superior generalization performance on out-of-distribution and challenging counterexamples. We have publicly released our code at https://github.com/GT-SALT/HiddenCut.

1 Introduction

Fine-tuning large-scale pre-trained language models (PLMs) has become a dominant paradigm in the natural language processing community, achieving state-of-the-art performance on a wide range of natural language processing tasks (Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019a; Joshi et al., 2019; Sun et al., 2019; Clark et al., 2019; Lewis et al., 2020; Bao et al., 2020; He et al., 2020; Raffel et al., 2020). Despite this great success, due to the huge gap between the number of model parameters and the amount of task-specific data available, the majority of the information within the multi-layer self-attention networks is typically redundant and ineffectively utilized for downstream tasks (Guo et al., 2020; Gordon et al., 2020; Dalvi et al., 2020). As a result, after task-specific fine-tuning, models are very likely to overfit and make predictions based on spurious patterns (Tu et al., 2020; Kaushik et al., 2020), making them less generalizable to out-of-domain distributions (Zhu et al., 2019; Jiang et al., 2019; Aghajanyan et al., 2020).

In order to improve the generalization abilities of over-parameterized models with a limited amount of task-specific data, various regularization approaches have been proposed, such as adversarial training that injects label-preserving perturbations in the input space (Zhu et al., 2019; Liu et al., 2020; Jiang et al., 2019), generating augmented data via carefully-designed rules (McCoy et al., 2019; Xie et al., 2020; Andreas, 2020; Shen et al., 2020), and annotating counterfactual examples (Goyal et al., 2019; Kaushik et al., 2020). Despite substantial improvements, these methods often require significant computational and memory overhead (Zhu et al., 2019; Liu et al., 2020; Jiang et al., 2019; Xie et al., 2020) or human annotations (Goyal et al., 2019; Kaushik et al., 2020).

In this work, to alleviate the above issues, we rethink the simple and commonly-used regularization technique, dropout (Srivastava et al., 2014), in pre-trained transformer models (Vaswani et al., 2017). With multiple self-attention heads in transformers, dropout converts some hidden units to zeros in a random and independent manner.
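As a point of reference for the discussion that follows, the snippet below is a minimal PyTorch illustration (not taken from the HiddenCut codebase) of what standard dropout does to a layer's hidden states: each of the L × D units is zeroed independently at random, with no regard for which token it belongs to.

```python
import torch
import torch.nn as nn

# Hidden states for one sentence: L tokens, each a D-dimensional vector.
L, D = 8, 16
hidden = torch.randn(L, D)

# Standard dropout zeroes each of the L*D units independently at random
# (and rescales the survivors by 1/(1-p)), regardless of which token a unit
# belongs to or how informative it is for the task.
dropout = nn.Dropout(p=0.1)
dropped = dropout(hidden)  # the module is in training mode by default

print((dropped == 0).float().mean())  # roughly 10% of the units are zeroed
```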
Although PLMs are already equipped with dropout regularization, they still suffer from inferior performance on out-of-distribution cases (Tu et al., 2020; Kaushik et al., 2020). The underlying reasons are two-fold: (1) the linguistic relations among words in a sentence are ignored when hidden units are dropped randomly. In reality, these masked features can be easily inferred from the surrounding unmasked hidden units through the self-attention networks. Therefore, redundant information still exists and gets passed to the upper layers. (2) Standard dropout assumes, through its random sampling procedure, that every hidden unit is equally important, failing to characterize the different roles these features play in distinct tasks. As a result, the learned representations are not generalizable enough when applied to other data and tasks. To drop information more effectively, Shen et al. (2020) recently introduced Cutoff to remove tokens/features/spans in the input space. Even though models do not see the removed information during training, examples with large noise may be generated when key clues for prediction are completely removed from the input.

To overcome these limitations, we propose a simple yet effective data augmentation method, HiddenCut, to regularize PLMs during the fine-tuning stage. Specifically, the approach is based on the linguistic intuition that hidden representations of adjacent words are more likely to contain similar and redundant information. HiddenCut drops hidden units more structurally by masking the whole hidden information of contiguous spans of tokens after every encoding layer. This encourages models to fully utilize all the task-related information, instead of learning spurious patterns, during training. To make the dropping process more efficient, we dynamically and strategically select the informative spans to drop by introducing an attention-based mechanism. Since HiddenCut is performed in the hidden space, the impact of the dropped information is only mitigated rather than completely removed, avoiding injecting too much noise into the input. We further apply a Jensen-Shannon Divergence consistency regularization between the original and the augmented examples to model the consistent relations between them.

To demonstrate the effectiveness of our method, we conduct experiments comparing HiddenCut with previous state-of-the-art data augmentation methods on 8 natural language understanding tasks from the GLUE (Wang et al., 2018) benchmark for in-distribution evaluations, and on 5 challenging datasets that cover single-sentence tasks, similarity and paraphrase tasks, and inference tasks for out-of-distribution evaluations. We further perform ablation studies to investigate the impact of different selecting strategies on HiddenCut's effectiveness. Results show that our method consistently outperforms baselines, especially on out-of-distribution and challenging counterexamples. To sum up, our contributions are:

• We propose a simple data augmentation method, HiddenCut, to regularize PLMs during fine-tuning by cutting contiguous spans of representations in the hidden space.
• We explore and design different strategic sampling techniques to dynamically and adaptively construct the set of spans to be cut.
• We demonstrate the effectiveness of HiddenCut through extensive experiments on both in-distribution and out-of-distribution datasets.
2 Related Work

2.1 Adversarial Training

Adversarial training methods usually regularize models by applying perturbations to the input or hidden space (Szegedy et al., 2013; Goodfellow et al., 2014; Madry et al., 2017) with additional forward-backward passes, which influence the model's predictions and confidence without changing human judgements. Adversarial-based approaches have been actively applied to various NLP tasks in order to improve models' robustness and generalization abilities, such as sentence classification (Miyato et al., 2017), machine reading comprehension (MRC) (Wang and Bansal, 2018) and natural language inference (NLI) tasks (Nie et al., 2020). Despite its success, adversarial training often requires extensive computational overhead to calculate the perturbation directions (Shafahi et al., 2019; Zhang et al., 2019a). In contrast, our HiddenCut adds perturbations in the hidden space in a more efficient way that requires no extra computation, as the designed perturbations can be directly derived from self-attentions.

2.2 Data Augmentation

Another line of work to improve model robustness directly designs data augmentation methods to enrich the original training set, such as creating syntactically-rich examples with specific rules (McCoy et al., 2019; Min et al., 2020), crowdsourcing counterfactual augmentation to avoid learning spurious features (Goyal et al., 2019; Kaushik et al., 2020), or combining examples in the dataset to increase compositional generalizability (Jia and Liang, 2016; Andreas, 2020; Chen et al., 2020b,a). However, these methods either require careful design to infer labels for the generated data (McCoy et al., 2019; Andreas, 2020) or extensive human annotation (Goyal et al., 2019; Kaushik et al., 2020), which makes them hard to generalize to different tasks/datasets. Recently, Shen et al. (2020) introduced a set of cutoff augmentations which directly create partial views to augment training in a more task-agnostic way. Inspired by these prior works, our HiddenCut aims at improving models' generalization abilities to out-of-distribution data via linguistically-informed, strategic dropping of spans of hidden information in transformers.

2.3 Dropout-based Regularization

Variations of dropout (Srivastava et al., 2014) have been proposed to regularize neural models by injecting noise through dropping certain information so that models do not overfit the training data. However, the major efforts have recently been devoted to convolutional neural networks and tailored to the structure of images, such as DropPath (Larsson et al., 2017), DropBlock (Ghiasi et al., 2018), DropCluster (Chen et al., 2020c) and AutoDropout (Pham and Le, 2021). In contrast, our work takes a closer look at transformer-based models and introduces HiddenCut for natural language understanding tasks. HiddenCut is closely related to DropBlock (Ghiasi et al., 2018), which drops contiguous regions from a feature map. However, different from images, hidden dimensions in PLMs that contain syntactic/semantic information for NLP tasks are more closely related (e.g., NER and POS information), and simply dropping spans of features in certain hidden dimensions might still lead to information redundancy.

3 HiddenCut Approach

To regularize transformer models in a more structural and efficient manner, in this section, we introduce a simple yet effective data augmentation technique, HiddenCut, which reforms dropout into cutting contiguous spans of hidden representations after each transformer layer (Section 3.1).
Intuitively, the proposed approach encourages the models to fully utilize all the hidden information within the self-attention networks. Furthermore, we propose an attention-based mechanism to strategically and judiciously determine the specific spans to cut (Section 3.2). The schematic diagram of HiddenCut applied to the transformer architecture, and its comparison to dropout, is shown in Figure 1.

3.1 HiddenCut

For an input sequence s = {w_0, w_1, ..., w_L} with L tokens associated with a label y, we employ a pre-trained transformer model f_{1:M}(·) with M layers, such as RoBERTa (Liu et al., 2019), to encode the text into hidden representations. Thereafter, an inference network g(·) is learned on top of the pre-trained model to predict the corresponding labels. In the hidden space, after layer m, every word w_i in the input sequence is encoded into a D-dimensional vector h_i^m ∈ R^D, and the whole sequence can be viewed as a hidden matrix H^m ∈ R^{L×D}.

With multiple self-attention heads in the transformer layers, it has been found that there is extensive redundant information across the h_i^m ∈ H^m that are linguistically related (Dalvi et al., 2020) (e.g., words that share similar semantic meanings). As a result, the information removed by the standard dropout operation may be easily inferred from the remaining unmasked hidden units. The resulting model might easily overfit to certain high-frequency features without utilizing all the important task-related information in the hidden space (especially when task-related data is limited). Moreover, the model also suffers from poor generalization ability when applied to out-of-distribution cases.

Inspired by Ghiasi et al. (2018) and Shen et al. (2020), we propose to improve dropout regularization in transformer models by creating augmented training examples through HiddenCut, which drops a contiguous span of hidden information encoded in every layer, as shown in Figure 1 (c). Mathematically, in every layer m, a span of hidden vectors S ∈ R^{l×D}, with length l = αL, in the hidden matrix H^m ∈ R^{L×D} is converted to 0, and the corresponding attention masks are adjusted to 0, where α is a pre-defined hyper-parameter indicating the dropping extent of HiddenCut. After being encoded and hiddencut through all the hidden layers of the pre-trained encoder, an augmented training example f^{HiddenCut}(s) is created for learning the inference network g(·) to predict task labels.

Figure 1: Illustration of the differences between Dropout (a) and HiddenCut (b), and the position of HiddenCut in transformer layers (c). A sentence in the hidden space can be viewed as an L × D matrix, where L is the length of the sentence and D is the number of hidden dimensions. The cells in blue represent masked units. Dropout masks random independent units in the matrix, while our HiddenCut selects and masks a whole span of hidden representations based on the attention weights received in the current layer. In our experiments, we perform HiddenCut after the feed-forward network in every transformer layer.
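To make the operation concrete, here is a minimal PyTorch sketch of the span-dropping step for a single layer's hidden matrix. It is an illustrative reimplementation rather than the released code, and it picks the span start uniformly at random; Section 3.2 below replaces this with the attention-based choice.

```python
import torch

def hiddencut(hidden, attention_mask, alpha=0.1):
    """Zero out one contiguous span of hidden vectors and its attention mask.

    hidden:         (L, D) hidden states after one transformer layer
    attention_mask: (L,)   1 for positions the model may attend to, 0 otherwise
    alpha:          fraction of the sequence to cut, so the span length is l = alpha * L
    """
    L = hidden.size(0)
    span_len = max(1, int(alpha * L))
    # Uniform start position for simplicity; the attention-based choice of the
    # start token is described in Section 3.2.
    start = torch.randint(0, L - span_len + 1, (1,)).item()

    hidden = hidden.clone()
    attention_mask = attention_mask.clone()
    hidden[start:start + span_len] = 0.0        # drop the whole span of hidden vectors
    attention_mask[start:start + span_len] = 0  # so later layers cannot attend to it
    return hidden, attention_mask

# Toy usage: a 12-token sentence with 768-dimensional hidden states.
h = torch.randn(12, 768)
mask = torch.ones(12, dtype=torch.long)
h_cut, mask_cut = hiddencut(h, mask, alpha=0.1)
```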
3.2 Strategic Sampling

Different tasks rely on learning distinct sets of information from the input to predict the corresponding task labels. Performing HiddenCut randomly might be inefficient, especially when most of the dropping happens at task-unrelated spans, which fails to effectively regularize the model to take advantage of all the task-related features. To this end, we propose to select the spans to be cut dynamically and strategically in every layer. In other words, we mask the most informative span of hidden representations in one layer to force models to discover other useful clues to make predictions, instead of relying on a small set of spurious patterns.

Attention-based Sampling Strategy The most direct way is to define the set of tokens to be cut by utilizing the attention weights assigned to tokens in the self-attention layers (Kovaleva et al., 2019). Intuitively, we can drop the spans of hidden representations that are assigned high attention by the transformer layers. As a result, information redundancy is alleviated and models are encouraged to attend to other important information. Specifically, we first derive the average attention for each token, a_i, from the attention weight matrix A ∈ R^{P×L×L} after the self-attention layers, where P is the number of attention heads and L is the sequence length:

a_i = \frac{\sum_{j=1}^{P} \sum_{k=1}^{L} A[j][k][i]}{P}

We then sample the start token h_i for HiddenCut from the set containing the top βL tokens with the highest average attention weights (β is a pre-defined parameter). HiddenCut is then performed to mask the hidden representations between h_i and h_{i+l}. Note that the salient sets are different across layers and are updated throughout training.

Other Sampling Strategies We also explore other widely used word-importance discovery methods to find a set of tokens to be strategically cut by HiddenCut, including:

• Random: All spans of tokens are viewed as equally important and are thus randomly cut.
• LIME (Ribeiro et al., 2016) defines the importance of tokens by examining local faithfulness, where weights of tokens are assigned by classifiers trained on sentences whose words are randomly removed. We utilized LIME on top of an SVM classifier to pre-define a fixed set of tokens to be cut.
• GEM (Yang et al., 2019b) utilizes orthogonal bases to calculate novelty scores that measure the new semantic meaning in tokens, significance scores that estimate the alignment between the semantic meaning of tokens and the sentence-level meaning, and uniqueness scores that examine the uniqueness of the semantic meaning of tokens. We compute the GEM scores using the hidden representations at every layer to generate the set of tokens to be cut, which is updated during training.
• Gradient (Baehrens et al., 2010): We define the set of tokens to be cut based on the rankings of the absolute values of the gradients they receive at every layer in the backward pass. This set is updated during training.
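A sketch of the attention-based selection is given below, assuming the layer's attention weights are available as a (P, L, L) tensor laid out as in the formula above; the exact tensor layout and the uniform choice among the top-βL candidates are illustrative assumptions, not the authors' implementation.

```python
import torch

def attention_based_span(attn, alpha=0.1, beta=0.4):
    """Choose the span to cut from one layer's attention weights.

    attn: (P, L, L) attention matrix where attn[j][k][i] is the weight that
          head j puts on token i when encoding token k (layout assumed here).
    Returns (start, span_len) for the span of hidden states to zero out.
    """
    P, L, _ = attn.shape
    # Average attention received by each token: a_i = sum_j sum_k A[j][k][i] / P
    a = attn.sum(dim=(0, 1)) / P                 # shape (L,)

    span_len = max(1, int(alpha * L))
    k = max(1, int(beta * L))
    candidates = torch.topk(a, k).indices        # the beta*L most-attended tokens
    start = candidates[torch.randint(0, k, (1,))].item()
    start = min(start, L - span_len)             # keep the span inside the sentence
    return start, span_len

# Toy usage with random attention weights for a 12-token sentence and 12 heads.
attn = torch.softmax(torch.randn(12, 12, 12), dim=-1)
start, span_len = attention_based_span(attn)
```

The returned start index would replace the uniform choice in the earlier hiddencut sketch; because the attention weights change across layers and training steps, the candidate set is recomputed each time.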
3.3 Objectives

During training, for an input text sequence s with a label y, we generate N augmented examples {f_1^{HiddenCut}(s), ..., f_N^{HiddenCut}(s)} by performing HiddenCut in the pre-trained encoder f(·). The whole model g(f(·)) is then trained with several objectives, including the general classification losses (L_{ori} and L_{aug}) on data-label pairs and a consistency regularization (L_{js}) (Miyato et al., 2017, 2018; Clark et al., 2018; Xie et al., 2019; Shen et al., 2020) across different augmentations:

L_{ori} = CE(g(f(s)), y)

L_{aug} = \sum_{i=1}^{N} CE(g(f_i^{HiddenCut}(s)), y)

L_{js} = \sum_{i=1}^{N} KL[ p(y \mid g(f_i^{HiddenCut}(s))) \| p_{avg} ]

where CE and KL represent the cross-entropy loss and KL-divergence respectively, and p_{avg} stands for the average prediction across the original text and all the augmented examples. Combining these three losses, our overall objective function is:

L = L_{ori} + \gamma L_{aug} + \eta L_{js}

where γ and η are the weights used to balance the contributions of learning from the original data and the augmented data.
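For illustration, a minimal PyTorch sketch of how these three terms could be combined for a single training example is shown below, assuming g(f(·)) returns class logits and that the N HiddenCut views have already been produced; the helper name and tensor shapes are assumptions rather than the paper's code.

```python
import torch
import torch.nn.functional as F

def hiddencut_loss(logits_orig, logits_aug_list, label, gamma=1.0, eta=1.0):
    """Combine the classification and consistency terms for one example.

    logits_orig:     (C,) logits for the original input, g(f(s))
    logits_aug_list: list of N (C,) logits, one per HiddenCut-augmented view
    label:           gold label y as a Python int
    """
    y = torch.tensor([label])
    l_ori = F.cross_entropy(logits_orig.unsqueeze(0), y)
    l_aug = sum(F.cross_entropy(z.unsqueeze(0), y) for z in logits_aug_list)

    # p_avg: average prediction over the original and all augmented views.
    probs = [F.softmax(z, dim=-1) for z in [logits_orig] + logits_aug_list]
    p_avg = torch.stack(probs).mean(dim=0)

    # KL[p_i || p_avg] for each augmented prediction p_i; note that F.kl_div
    # takes the log-probabilities of p_avg as its first argument and p_i as target.
    log_p_avg = p_avg.log()
    l_js = sum(F.kl_div(log_p_avg, F.softmax(z, dim=-1), reduction="sum")
               for z in logits_aug_list)

    return l_ori + gamma * l_aug + eta * l_js

# Toy usage with 3 classes and N = 2 augmented views.
loss = hiddencut_loss(torch.randn(3), [torch.randn(3), torch.randn(3)], label=1)
```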
4 Experiments

4.1 Datasets

We conducted experiments on both in-distribution and out-of-distribution datasets to demonstrate the effectiveness of our proposed HiddenCut.

In-Distribution Datasets We mainly trained and evaluated our methods on the widely-used GLUE benchmark (Wang et al., 2018), which covers a wide range of natural language understanding tasks: single-sentence tasks, including (i) the Stanford Sentiment Treebank (SST-2), which predicts whether the sentiment of a movie review is positive or negative, and (ii) the Corpus of Linguistic Acceptability (CoLA), which predicts whether a sentence is linguistically acceptable or not; similarity and paraphrase tasks, including (i) Quora Question Pairs (QQP), which predicts whether two questions are paraphrases, (ii) the Semantic Textual Similarity Benchmark (STS-B), which predicts the similarity rating between two sentences, and (iii) the Microsoft Research Paraphrase Corpus (MRPC), which predicts whether two given sentences are semantically equivalent; and inference tasks, including (i) Multi-Genre Natural Language Inference (MNLI), which classifies the relationship between two sentences as entailment, contradiction, or neutral, (ii) Question Natural Language Inference (QNLI), which predicts whether a given sentence is the correct answer to a given question, and (iii) Recognizing Textual Entailment (RTE), which predicts whether the entailment relation holds between two sentences. Accuracy was used as the evaluation metric for most of the datasets, except that Matthews correlation was used for CoLA and Spearman correlation was used for STS-B.

Out-Of-Distribution Datasets To demonstrate the generalization abilities of our proposed methods, we directly evaluated on 5 different out-of-distribution challenging sets, using the models fine-tuned on the GLUE benchmark datasets:

• Single-Sentence Tasks: Models fine-tuned on SST-2 are directly evaluated on two recent challenging sentiment classification datasets: the IMDB Contrast Set (Gardner et al., 2020) including 588 examples and the IMDB Counterfactually Augmented Dataset (Kaushik et al., 2020) including 733 examples. Both were constructed by asking NLP researchers (Gardner et al., 2020) or Amazon Mechanical Turkers (Kaushik et al., 2020) to make minor edits to examples in the original IMDB dataset (Maas et al., 2011) so that the sentiment labels change while the major content stays the same.
• Similarity and Paraphrase Tasks: Models fine-tuned on QQP are directly evaluated on the recently introduced challenging paraphrase dataset PAWS-QQP (Zhang et al., 2019b), which has 669 test cases. PAWS-QQP contains sentence pairs with high word overlap but different semantic meanings, created via word swapping and back-translation from the original QQP dataset.
• Inference Tasks: Models fine-tuned on MNLI are directly evaluated on two challenging NLI sets: HANS (McCoy et al., 2019) with 30,000 test cases and Adversarial NLI (A1 dev set) (Nie et al., 2020) including 1,000 test cases. The former was constructed by using syntactic rules (lexical overlap, subsequence and constituent) to generate non-entailment examples with high premise-hypothesis overlap from MNLI. The latter was created by an adversarial human-and-model-in-the-loop framework (Nie et al., 2020) to create hard examples based on BERT-Large models (Devlin et al., 2019) pre-trained on SNLI (Bowman et al., 2015) and MNLI.

Method          MNLI  QNLI  QQP   RTE   SST-2  MRPC  CoLA  STS-B  Avg
RoBERTa-base    87.6  92.8  91.9  78.7  94.8   89.5  63.6  91.2   86.3
ALUM            88.1  93.1  92.0  80.2  95.3   90.9  63.6  91.1   86.8
Token Cutoff    88.2  93.1  91.9  81.2  95.1   91.1  64.1  91.2   87.0
Feature Cutoff  88.2  93.3  92.0  81.6  95.3   90.7  63.6  91.2   87.0
Span Cutoff     88.4  93.4  92.0  82.3  95.4   91.1  64.7  91.2   87.3
HiddenCut†      88.2  93.7  92.0  83.4  95.8   92.0  66.2  91.3   87.8

Table 1: In-distribution evaluation results on the dev sets of the GLUE benchmark. † means our proposed method.

Method          IMDB-Cont.  IMDB-CAD  PAWS-QQP  HANS  AdvNLI (A1)
RoBERTa-base    84.6        88.4      38.4      67.8  31.2
Span Cutoff     85.5        89.2      38.8      68.4  31.1
HiddenCut†      87.8        90.4      41.5      71.2  32.8

Table 2: Out-of-distribution evaluation results on 5 different challenging sets (single-sentence: IMDB-Cont. and IMDB-CAD; similarity and paraphrase: PAWS-QQP; inference: HANS and AdvNLI (A1)). † means our proposed method. For all the datasets, we did not use their training sets to further fine-tune the models derived from GLUE.

4.2 Baselines

We compare our methods with several baselines:

• RoBERTa (Liu et al., 2019) is used as our base model. Note that RoBERTa is regularized with dropout during fine-tuning.
• ALUM (Liu et al., 2020) is the state-of-the-art adversarial training method for neural language models, which regularizes fine-tuning via perturbations in the embedding space.
• Cutoff (Shen et al., 2020) is a recent data augmentation method for natural language understanding tasks that removes information in the input space, with three variations: token cutoff, feature cutoff, and span cutoff.

4.3 Implementation Details

We used the RoBERTa-base model (Liu et al., 2019) to initialize all the methods. Note that HiddenCut is agnostic to the type of pre-trained model. We followed Liu et al. (2019) to set a linear decay scheduler with a warmup ratio of 0.06 for training. The maximum learning rate was selected from {5e-6, 8e-6, 1e-5, 2e-5} and the maximum number of training epochs was set to either 5 or 10. These hyper-parameters are shared across all the models. The HiddenCut ratio α was set to 0.1 after a grid search over {0.05, 0.1, 0.2, 0.3, 0.4}. The selecting ratio β in the important-set sampling process was set to 0.4 after a grid search over {0.1, 0.2, 0.4, 0.6}. The weights γ and η in our objective function were both 1. All the experiments were performed using a GeForce RTX 2080Ti.

4.4 Results on In-Distribution Datasets

Based on Table 1, we observed that, compared to RoBERTa-base with only dropout regularization, ALUM, with perturbations in the embedding space through adversarial training, has better results on most of these GLUE tasks. However, the additional backward passes needed to determine the perturbation directions in ALUM can bring in significantly more computational and memory overhead. By masking different types of input during training, Cutoff increased the performances while being more computationally efficient. In contrast to Span Cutoff, HiddenCut not only introduced zero additional computation cost, but also demonstrated stronger performances on 7 out of 8 GLUE tasks, especially when the size of the training set is small (e.g., an increase of 1.1 on RTE and 1.5 on CoLA). Moreover, HiddenCut achieved the best average result compared to previous state-of-the-art baselines.
These in-distribution improvements indicated that, by strategically dropping contiguous spans in the hidden space, HiddenCut not only helps pre-trained models utilize hidden information in a more effective way, but also injects less noise during the augmentation process compared to Cutoff. For example, Span Cutoff might bring in additional noise for CoLA (which aims to judge whether input sentences are linguistically acceptable or not) when one span in the input is removed, since this might change the labels.

4.5 Results on Out-Of-Distribution Datasets

To validate the better generalizability of HiddenCut, we tested our models trained on SST-2, QQP and MNLI directly on 5 out-of-distribution/out-of-domain challenging sets in zero-shot settings. As mentioned earlier, these out-of-distribution sets were either constructed from in-domain/out-of-domain data and further edited by humans to make them harder, or generated by rules that exploit spurious correlations such as lexical overlap, which makes them challenging for most existing models. As shown in Table 2, Span Cutoff slightly improved the performances compared to RoBERTa by adding extra regularization through creating restricted inputs. HiddenCut significantly outperformed both RoBERTa and Span Cutoff. For example, it consistently outperformed Span Cutoff by 2.3% (87.8% vs. 85.5%) on IMDB-Cont., 2.7% (41.5% vs. 38.8%) on PAWS-QQP, and 2.8% (71.2% vs. 68.4%) on HANS. These superior results demonstrated that, by dynamically and strategically dropping contiguous spans of hidden representations, HiddenCut was able to better utilize all the important task-related information, which improved model generalization to out-of-distribution and challenging adversarial examples.

4.6 Ablation Studies

This section presents our ablation studies on different sampling strategies and the effect of important hyper-parameters in HiddenCut.

4.6.1 Sampling Strategies in HiddenCut

We compared different ways to cut hidden representations (DropBlock (Ghiasi et al., 2018), which randomly drops spans in certain random hidden dimensions instead of the whole hidden space) and different sampling strategies for HiddenCut described in Section 3.2 (including Random, LIME (Ribeiro et al., 2016), GEM (Yang et al., 2019b), Gradient (Yeh et al., 2019), Attention), based on the performances on SST-2 and QNLI. For these strategies, we also experimented with a reverse set, denoted by "-R", where we sampled outside the important set given by the above strategies.

Strategy     SST-2  QNLI
RoBERTa      94.8   92.8
DropBlock    95.4   93.2
Random       95.4   93.5
LIME         95.2   93.1
LIME-R       95.3   93.2
GEM          95.5   93.4
GEM-R        95.1   93.2
Gradient     95.6   93.6
Gradient-R   95.1   93.4
Attention    95.8   93.7
Attention-R  94.6   93.4

Table 3: The performances on SST-2 and QNLI with different strategies when dropping information in the hidden space. Different sampling strategies combined with HiddenCut are presented. "-R" means sampling outside the set to be cut given by these strategies.

From Table 3, we observed that (i) sampling from important sets resulted in better performances than random sampling, and sampling outside the defined importance sets usually led to inferior performances. This highlights the importance of strategically selecting spans to drop. (ii) Sampling from dynamic sets, where tokens are sampled by their probabilities, often outperformed sampling from pre-defined fixed sets (LIME), indicating the effectiveness of dynamically adjusting the sampling sets during training.
(iii) The attention-based strategy outperformed all other sampling strategies, demonstrating the effectiveness of our proposed sampling strategy for HiddenCut. (iv) Completely dropping the spans of hidden representations generated better results than only removing certain dimensions in the hidden space, which further validated the benefit of HiddenCut over DropBlock in natural language understanding tasks.

4.6.2 The Effect of HiddenCut Ratios

The length of the spans dropped by HiddenCut is an important hyper-parameter, controlled by the HiddenCut ratio α and the length of the input sentence. α can also be interpreted as the extent of the perturbations added to the hidden space. We present the results of HiddenCut on MNLI with different α from {0.05, 0.1, 0.2, 0.3, 0.4} in Table 5. HiddenCut achieved the best performance with α = 0.1, and the performance gradually decreased with higher α, since larger noise might be introduced when dropping more hidden information. This suggests the importance of balancing the trade-off between applying proper perturbations to regularize models and injecting potential noise.

α     0.05   0.1    0.2    0.3    0.4
MNLI  88.07  88.23  88.13  88.07  87.64

Table 5: Performances on MNLI with different HiddenCut ratios α, which control the length of the span to cut in the hidden space.

4.6.3 The Effect of Sampling Ratios

The number of words that are considered important and selected by HiddenCut is also an influential hyper-parameter, controlled by the sampling ratio β and the length of the input sentence. As shown in Table 6, we compared the performances on SST-2 with different β from {0.1, 0.2, 0.4, 0.6}. When β is too small, the number of words in the important set is limited, which might lead HiddenCut to consistently drop certain hidden spans during the entire training process. The low diversity reduces the improvements over baselines. When β is too large, the important set might cover all the words except stop words in the sentence. As a result, the attention-based strategy effectively becomes random sampling, which leads to lower gains over baselines. The best performance was achieved when β = 0.4, indicating a reasonable trade-off between diversity and efficiency.

β      0.1    0.2    0.4    0.6
SST-2  95.18  95.30  95.76  95.46

Table 6: Performances on SST-2 with different sampling ratios β, which control the size of the important token set from which HiddenCut samples.
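As a quick numeric illustration of how these two ratios interact, consider the best settings reported above (α = 0.1, β = 0.4) applied to a hypothetical 30-token sentence:

```python
# Worked example with the best settings reported above.
L = 30         # tokens in a hypothetical input sentence
alpha = 0.1    # HiddenCut ratio: span length l = alpha * L
beta = 0.4     # sampling ratio: size of the important-token set

span_len = int(alpha * L)    # 3 hidden vectors are zeroed in each layer
candidates = int(beta * L)   # the span start is sampled from the 12 most-attended tokens

print(span_len, candidates)  # 3 12
```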
Method     Original and Counterfactual Sentences            Prediction
RoBERTa    I would rate 8 stars out of 10                   Positive
HiddenCut  I would rate 8 stars out of 10                   Positive
RoBERTa    The movie became more and more intriguing        Positive
HiddenCut  The movie became more and more intriguing        Positive
RoBERTa    I would rate 8 stars out of 20                   Positive
HiddenCut  I would rate 8 stars out of 20                   Negative
RoBERTa    The movie became only slightly more intriguing   Positive
HiddenCut  The movie became only slightly more intriguing   Negative

Table 4: Visualization of the attention weights at the last layer of the models. The sentences in the first section are from IMDB with positive labels; the sentences in the second section are constructed by changing ratings or diminishing via qualifiers (Kaushik et al., 2020) to flip their corresponding labels. Deeper blue represents that those tokens receive higher attention weights.

4.7 Visualization of Attentions

To further demonstrate the effectiveness of HiddenCut, we visualize the attention weights that the special start token ("<s>") assigns to other tokens at the last layer, via several examples and their counterfactual examples in Table 4. We observed that RoBERTa only assigned higher attention weights to certain tokens such as "8 stars", "intriguing" and especially the end special token "</s>", while largely ignoring other context tokens that were also important for making the correct predictions, such as scale descriptions (e.g., "out of 10") and qualifier words (e.g., "more and more"). This was probably because words like "8 stars" and "intriguing" were highly correlated with the positive label, and RoBERTa might overfit to such patterns without proper regularization. As a result, when the scale of the ratings (e.g., from "10" to "20") or the qualifier words changed (e.g., from "more and more" to "only slightly more"), RoBERTa still predicted the label as positive even when the ground truth is negative. With HiddenCut, models mitigated the impact of tokens with higher attention weights and were encouraged to utilize all the related information. As a result, the attention weights with HiddenCut were more uniformly distributed, which helped models make the correct predictions for out-of-distribution counterfactual examples. Taken together, HiddenCut helps improve the model's generalizability by facilitating learning from more task-related information.

5 Conclusion

In this work, we introduced a simple yet effective data augmentation technique, HiddenCut, to improve model robustness on a wide range of natural language understanding tasks by dropping contiguous spans of hidden representations in the hidden space, directed by a strategic attention-based sampling strategy. Through HiddenCut, transformer models are encouraged to make use of all the task-related information during training rather than relying only on certain spurious clues. Through extensive experiments on in-distribution datasets (the GLUE benchmark) and out-of-distribution datasets (challenging counterexamples), HiddenCut consistently and significantly outperformed state-of-the-art baselines, and demonstrated superior generalization performance.

Acknowledgment

We would like to thank the anonymous reviewers and the members of the Georgia Tech SALT group for their feedback. This work is supported in part by grants from Amazon and Salesforce.

References

Armen Aghajanyan, Akshat Shrivastava, Anchit Gupta, Naman Goyal, Luke Zettlemoyer, and Sonal Gupta. 2020. Better fine-tuning by reducing representational collapse. arXiv preprint arXiv:2008.03156.

Jacob Andreas. 2020. Good-enough compositional data augmentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7556–7566, Online. Association for Computational Linguistics.

David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. 2010. How to explain individual classification decisions. Journal of Machine Learning Research, 11(61):1803–1831.

Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, et al. 2020. Unilmv2: Pseudo-masked language models for unified language model pre-training. arXiv preprint arXiv:2002.12804.

Liyan Chen, P. Gautier, and Sergül Aydöre. 2020c. Dropcluster: A structured dropout for convolutional networks. ArXiv, abs/2002.02997.

Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2019.
Electra: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations. Kevin Clark, Minh-Thang Luong, Christopher D. Manning, and Quoc V. Le. 2018. Semi-supervised sequence modeling with cross-view training. In EMNLP. Fahim Dalvi, Hassan Sajjad, Nadir Durrani, and Yonatan Belinkov. 2020. Analyzing redundancy in pretrained transformer models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4908– 4926, Online. Association for Computational Linguistics. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT. Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models’ local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1307–1323, Online. Association for Computational Linguistics. Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics. G. Ghiasi, Tsung-Yi Lin, and Quoc V. Le. 2018. Dropblock: A regularization method for convolutional networks. In NeurIPS. Jiaao Chen, Zhenghui Wang, Ran Tian, Zichao Yang, and Diyi Yang. 2020a. Local additivity based data augmentation for semi-supervised ner. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1241–1251. Mitchell Gordon, Kevin Duh, and Nicholas Andrews. 2020. Compressing BERT: Studying the effects of weight pruning on transfer learning. In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 143–155, Online. Association for Computational Linguistics. Jiaao Chen, Zichao Yang, and Diyi Yang. 2020b. MixText: Linguistically-informed interpolation of hidden space for semi-supervised text classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2147– 2157, Online. Association for Computational Linguistics. Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Counterfactual visual explanations. In ICML, pages 2376–2384. Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Demi Guo, Alexander M. Rush, and Yoon Kim. 2020. Parameter-efficient transfer learning with diff pruning. Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. Deberta: Decoding-enhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654. the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics. Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. 
In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22, Berlin, Germany. Association for Computational Linguistics. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. 2019. Smart: Robust and efficient fine-tuning for pretrained natural language models through principled regularized optimization. arXiv preprint arXiv:1911.03437. Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2019. Spanbert: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77. Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. 2020. Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations. Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China. Association for Computational Linguistics. Gustav Larsson, M. Maire, and Gregory Shakhnarovich. 2017. Fractalnet: Ultra-deep neural networks without residuals. ArXiv, abs/1605.07648. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2020. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. SCL. Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. 2020. Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics. Junghyun Min, R. Thomas McCoy, Dipanjan Das, Emily Pitler, and Tal Linzen. 2020. Syntactic data augmentation increases robustness to inference heuristics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2339–2352, Online. Association for Computational Linguistics. Takeru Miyato, Andrew M. Dai, and Ian J. Goodfellow. 2017. Adversarial training methods for semi-supervised text classification. arXiv: Machine Learning. Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. 2018. Virtual adversarial training: a regularization method for supervised and semisupervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979– 1993. Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. 
Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online. Association for Computational Linguistics. Hieu Pham and Quoc V. Le. 2021. Autodropout: Learning dropout patterns to regularize deep networks. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "why should i trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, page 1135–1144, New York, NY, USA. Association for Computing Machinery. Ali Shafahi, Mahyar Najibi, Mohammad Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. 2019. Adversarial training for free! In Advances in Neural Information Processing Systems, pages 3358–3369. Dinghan Shen, M. Zheng, Y. Shen, Yanru Qu, and W. Chen. 2020. A simple but tough-to-beat data augmentation approach for natural language understanding and generation. ArXiv, abs/2009.13818. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958. Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. Ernie: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Lifu Tu, Garima Lalwani, Spandana Gella, and He He. 2020. An empirical study on robustness to spurious correlations using pre-trained language models. Transactions of the Association for Computational Linguistics, 8:621–633. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, L. Kaiser, and Illia Polosukhin. 2017. Attention is all you need. ArXiv, abs/1706.03762. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. In BlackboxNLP@EMNLP. Yicheng Wang and Mohit Bansal. 2018. Robust machine comprehension models via adversarial training. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 575–581, New Orleans, Louisiana. Association for Computational Linguistics. Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. 2019. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848. Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. 2020. Unsupervised data augmentation for consistency training. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019a. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pages 5754– 5764. Ziyi Yang, Chenguang Zhu, and Weizhu Chen. 2019b. Parameter-free sentence embedding via orthogonal basis. 
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 638–648, Hong Kong, China. Association for Computational Linguistics. Chih-Kuan Yeh, Cheng-Yu Hsieh, Arun Sai Suggala, David I. Inouye, and Pradeep Ravikumar. 2019. On the (in)fidelity and sensitivity of explanations. In NeurIPS. Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, and Bin Dong. 2019a. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2(3). Yuan Zhang, Jason Baldridge, and Luheng He. 2019b. PAWS: Paraphrase adversaries from word scrambling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1298–1308, Minneapolis, Minnesota. Association for Computational Linguistics. Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. 2019. Freelb: Enhanced adversarial training for natural language understanding. In International Conference on Learning Representations.