TreeMix: Compositional Constituency-based Data Augmentation for Natural Language Understanding

Le Zhang (Fudan University) zhangle18@fudan.edu.cn
Zichao Yang (CMU) yangtze2301@gmail.com
Diyi Yang (Georgia Tech) dyang888@gatech.edu

Abstract

Data augmentation is an effective approach to tackle over-fitting. Many previous works have proposed different data augmentation strategies for NLP, such as noise injection, word replacement, and back-translation. Though effective, they miss one important characteristic of language: compositionality, i.e., the meaning of a complex expression is built from its sub-parts. Motivated by this, we propose a compositional data augmentation approach for natural language understanding called TreeMix. Specifically, TreeMix leverages constituency parsing trees to decompose sentences into constituent sub-structures and the Mixup data augmentation technique to recombine them to generate new sentences. Compared with previous approaches, TreeMix introduces greater diversity to the samples generated and encourages models to learn the compositionality of NLP data. Extensive experiments on text classification and SCAN demonstrate that TreeMix outperforms current state-of-the-art data augmentation methods.

S1: They will find little interest in this poor film.
S2: It comes as a touching love story.

Method | Example
EDA (Wei and Zou, 2019) | They will this find little interest in bad movie.
AEDA (Karimi et al., 2021) | They will find ? little in ! this poor movie;.
Noise (Xie et al., 2017) | Thes will fi little intres _ this poor film .
SSMix (Yoon et al., 2021) | They will find little interest in love poor film
Replacement (Kolomiyets et al., 2011) | They will find limited interest in this odd film.
Back Translation (Edunov et al., 2018) | They will show little interest in this strange film.
TreeMix | They will find little interest in this touching love story.

Table 1: Input-level DAs for text classification. EDA includes random deletion, swapping, and insertion. AEDA randomly inserts punctuation. SSMix swaps tokens based on their saliency. The replacement method randomly substitutes words with synonyms. In back translation, the source sentences are first translated into another language and then back again.

1 Introduction

Data augmentation (DA) has won great popularity in natural language processing (NLP) (Chen et al., 2021; Feng et al., 2021) due to the increasing demand for data and the expensive cost of annotation. DA aims at increasing the quantity and diversity of datasets by generating more samples based on existing ones, which helps make the training process more consistent and improves the model's capacity for generalization (Xie et al., 2020). For instance, existing DA methods often leverage word-level manipulation (Wei and Zou, 2019; Kobayashi, 2018; Karimi et al., 2021) and model-based sentence generation (Edunov et al., 2018; Ng et al., 2020). As mixup-based augmentation (Zhang et al., 2018) has achieved huge success in computer vision (Yun et al., 2019; Uddin et al., 2021; Kim et al., 2021), some recent works have started to adapt mixup to NLP, both at the hidden level (Guo et al., 2019; Chen et al., 2020b) and at the input level (Yoon et al., 2021; Shi et al., 2021). Despite these empirical successes, DA methods still suffer from key limitations. Simple rule-based augmentation methods (Wei and Zou, 2019; Kobayashi, 2018; Karimi et al., 2021) show little to no effect on large pretrained language models.
While mixup-based augmentation methods demonstrate huge potential, such interpolation at the hidden or input level has limited capability to capture explicit linguistic properties in text (Guo et al., 2019; Chen et al., 2020b; Yoon et al., 2021). Moreover, current DA methods exhibit limited ability in compositional generalization. Consider the following examples from a BERT-based model fine-tuned on the SST2 dataset from the GLUE benchmark, with the model's confidence in its prediction shown after each sentence:

This film is good and everyone loves it. (99%)
This film is poor and I do not like it. (99%)
This film is good and I do not like it. (99%)

The first two examples are correctly classified. Although the last one is composed of fragments from the first two, the model fails to produce a correct or plausible label (in terms of characterizing the sentence's sentiment), demonstrating poor compositional generalization. However, compositionality is a key aspect of language: the meaning of a complex sentence is built from its subparts. Prior work also shows that syntax trees (e.g., tree-based LSTMs) are helpful for modeling sentence structures for better text classification (Shi et al., 2018). However, leveraging compositional structures for data augmentation has not received much attention in the language technologies communities, with a few exceptions in semantic parsing (Andreas, 2020; Herzig and Berant, 2021).

To this end, we propose a compositional data augmentation method for natural language understanding, i.e., TreeMix (Figure 1). TreeMix is an input-level mixup method that utilizes constituency parsing information: different fragments (phrases of subtrees) from different sentences are recombined to create new examples that were never seen in the training set, and new soft labels are strategically created for these examples at the same time. In this way, TreeMix not only exploits compositional linguistic features to increase the diversity of the augmentation, but also provides reasonable soft labels for these mixed examples. Empirically, we find that TreeMix outperforms existing data augmentation methods significantly on a set of widely used text classification benchmarks. To validate the compositional effectiveness of TreeMix, we experiment with SCAN (Lake and Baroni, 2018), a task that requires strong compositional generalization, and find that TreeMix exhibits a reasonable ability to generalize to new structures built from components observed during training.

2 Related Work

2.1 Generic Data Augmentation

Most prior work operates data augmentation at different levels (Chen et al., 2021). Token-level DA methods manipulate tokens or phrases while preserving the syntax, semantic meaning, and labels of the original text, such as synonym substitution (Wang and Yang, 2015; Zhang et al., 2015; Fadaee et al., 2017; Kobayashi, 2018; Miao et al., 2020), where synonyms are detected following pre-defined rules or by word embedding similarities. These methods provide limited improvement (Chen et al., 2021) over large pretrained language models (PLMs). Besides, introducing noise by random insertion, replacement, deletion, and swapping (Wang et al., 2018; Wei and Zou, 2019; Karimi et al., 2021; Xie et al., 2020) is expected to improve the robustness of the model.
Sentence-level DA methods increase diversity by generating distinct examples, such as via paraphrasing (Yu et al., 2018; He et al., 2020; Xie et al., 2020; Kumar et al., 2020; Chen et al., 2020b; Cai et al., 2020) or back translation (Sennrich et al., 2016; Edunov et al., 2018). Another line of work uses label-conditioned generation methods that train a conditional generation model such as GPT-2 or a VAE to create new examples given labels as conditions (Bergmanis et al., 2017; Liu et al., 2020b,a; Ding et al., 2020; Anaby-Tavor et al., 2020). Although these methods can produce novel and diverse text patterns that do not exist in the original datasets, they require extensive training.

Hidden-level DA methods mainly manipulate hidden representations by perturbation (Miyato et al., 2019; Zhu et al., 2020; Jiang et al., 2020; Chen et al., 2020c; Shen et al., 2020; Hsu et al., 2017, 2018; Wu et al., 2019; Malandrakis et al., 2019) and interpolation like mixup (Zhang et al., 2018) to generate new examples (Miao et al., 2020; Cheng et al., 2020; Chen et al., 2020b; Guo et al., 2019, 2020; Chen et al., 2020a).

2.2 Compositional Data Augmentation

Compositional augmentation aims at increasing the diversity of the datasets and improving the compositional generalization capability of the resulting models (Jia and Liang, 2016; Andreas, 2020). These methods often recombine different components from different sentences to create new examples following a set of pre-designed linguistic rules, such as lexical overlaps (Andreas, 2020), neural-symbolic stack machines (Chen et al., 2020d), and substructure substitution (Shi et al., 2021). Compositional methods have been applied in a set of NLP tasks, such as sequence labeling (Guo et al., 2020), semantic parsing (Andreas, 2020), constituency parsing (Shi et al., 2020, 2021), dependency parsing (Dehouck and Gómez-Rodríguez, 2020; Shi et al., 2021), named entity recognition (Dai and Adel, 2020), text generation (Feng et al., 2020), and text classification (Yoon et al., 2021; Shi et al., 2021).

[Figure 1: Illustration of TreeMix for single sentence classification. Given {x_i: "They will find little interest in this poor film.", y_i: 0} and {x_j: "It comes as a touching transcendent love story.", y_j: 1}, TreeMix runs constituency parsing to obtain T(x_i) and T(x_j), selects subtrees, substitutes one subtree for the other, and constructs a new label, yielding the new sample {x̄: "They will find little interest in a touching transcendent love story.", ȳ: [7/12, 5/12] = [0.583, 0.417]}.]

Our work also falls into this category. The most relevant are Shi et al. (2021) and Yoon et al. (2021). However, Shi et al. (2021) only performs constituent substructure combinations with examples from the same category, and is thus inadequate for creating sufficiently diverse augmentations with newly created labels. Besides, Yoon et al. (2021) simply swaps the most and least salient spans, relying heavily on the model's ability to estimate salient spans and failing to consider the sentences' linguistic structures. Our proposed TreeMix fills these gaps by allowing the composition of sentences from different label categories, by utilizing rich constituency-based structures in text, and by strategically generating soft labels for these augmented instances.
3 Method

Our work is motivated by Mixup (Zhang et al., 2018), which creates virtual samples by mixing inputs. Given two randomly drawn examples (x_i, y_i) and (x_j, y_j), where x denotes the input sample and y the corresponding one-hot label, Mixup creates a new sample by:

x̃ = λx_i + (1 − λ)x_j,  ỹ = λy_i + (1 − λ)y_j,

where λ ∈ [0, 1]. Mixup can be easily implemented in continuous space, hence some prior works (Chen et al., 2020b) have extended it to NLP by performing interpolation in hidden space. We improve upon Mixup by incorporating the compositionality of language, a key characteristic that is essential to generalization but that neural models often fall short of capturing (Lake and Baroni, 2018). Instead of interpolating with the whole sample, TreeMix, our newly proposed method, creates new sentences by removing phrases from sentences and inserting subparts from other sentences. TreeMix makes use of constituency trees to decompose a sentence into meaningful constituent parts, which can then be removed and recombined to generate new augmentation samples. We aim to improve models' compositional generalization ability by training on a large amount of samples produced by TreeMix. An example of using TreeMix for single sentence classification is shown in Figure 1.

3.1 TreeMix

Let x_i = {x_i^1, x_i^2, ..., x_i^l} denote a sequence with length l, and let y_i be its corresponding label in one-hot encoding. We run a constituency parser on x_i to get its parsing tree T(x_i). In order to get meaningful subparts of a sequence, we traverse the parsing tree recursively and collect all the subtrees with more than one child. Denote the collection of subtrees as S(x_i) = {t_i^k}, where t_i^k denotes the k-th subtree of sample x_i. A subtree t_i^k covers a continuous span t_i^k ≜ [x_i^{r_k}, ..., x_i^{s_k}] of x_i that starts at index r_k and ends at index s_k. For example, as shown in the left part of Figure 1, the subtrees of the example sentence cover spans such as "this poor film", "in this poor film", "little interest", etc.

Sentence: They will find little interest in this poor film.

[λ_L, λ_U] | possible selected subtrees
[0.1, 0.3] | (little interest), (this poor film)
[0.3, 0.5] | (in this poor film)
[0.5, 0.7] | (little interest in this poor film)

Table 2: Examples of possible candidate subtrees with different λ intervals.

For a given sample (x_i, y_i), we randomly sample another data point (x_j, y_j) from the training set. We run the constituency parser on both sentences and get their subtree sets S(x_i) and S(x_j), from which we sample subtrees to exchange. We introduce two additional hyper-parameters λ_L and λ_U to constrain the length of the subtrees to sample. Measured in terms of the ratio of subtree length to original sentence length, λ_L and λ_U set the lower and upper limits on the subtrees to sample. Intuitively, λ controls the granularity of the phrases that we aim to exchange, which should be reasonable: if the exchanged phrase is too short, the exchange cannot introduce enough diversity into the augmented sample; if it is too long, the process might inject too much noise into the original sentence. We define λ as a ratio in order to be invariant to the length of the original sentences. Table 2 shows some subtree examples with different length constraints. We define the length-constrained subtree set as:

S_λ(x) ≜ {t | t ∈ S(x), s.t. |t|/|x| ∈ [λ_L, λ_U]}.

Here |·| denotes the length of a sequence or a subtree.
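To make the subtree-collection step concrete, below is a minimal sketch (our illustration, not the authors' released code) that enumerates multi-child subtree spans of a constituency tree and filters them by the ratio |t|/|x|. It assumes NLTK is available and hard-codes a parse tree; the paper instead obtains trees from the Stanford CoreNLP parser.

```python
# A minimal sketch of length-constrained subtree-span collection, assuming
# NLTK. The parse tree is hard-coded for illustration; the paper uses the
# Stanford CoreNLP toolkit to produce constituency trees for real data.
from nltk.tree import Tree

def collect_spans(tree, lam_l=0.1, lam_u=0.3):
    """Return (start, end) token spans of subtrees with more than one child
    whose length ratio |t|/|x| lies inside [lam_l, lam_u]."""
    n = len(tree.leaves())
    spans = []

    def traverse(node, offset):
        width = len(node.leaves())
        if len(node) > 1 and lam_l <= width / n <= lam_u:
            spans.append((offset, offset + width))
        child_offset = offset
        for child in node:
            if isinstance(child, Tree):
                traverse(child, child_offset)
                child_offset += len(child.leaves())
            else:
                child_offset += 1

    traverse(tree, 0)
    return spans

# Toy parse of the Table 2 sentence.
t = Tree.fromstring(
    "(S (NP (PRP They)) (VP (MD will) (VP (VB find)"
    " (NP (NP (JJ little) (NN interest))"
    " (PP (IN in) (NP (DT this) (JJ poor) (NN film)))))) (. .))")
print(collect_spans(t, 0.1, 0.3))
# -> [(3, 5), (6, 9)], i.e. "little interest" and "this poor film"
```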
For two sentences x_i and x_j, we randomly sample two subtrees t_i^k ∈ S_λ(x_i) and t_j^l ∈ S_λ(x_j) and construct a new sample by replacing t_i^k with t_j^l, i.e.,

x̄ ≜ [x_i^1, ..., x_i^{r_k − 1}, x_j^{r_l}, ..., x_j^{s_l}, x_i^{s_k + 1}, ..., x_i^l],   (1)

where t_j^l = [x_j^{r_l}, ..., x_j^{s_l}] replaces t_i^k = [x_i^{r_k}, ..., x_i^{s_k}]. Figure 1 shows an example of TreeMix, where the subtree "a touching transcendent love story" replaces the subtree "this poor film".

Label Creation for TreeMix. Creating a valid label for the augmented sample x̄ is a challenging problem. Similar to Mixup (Zhang et al., 2018), we use a convex combination of the original labels of the two sentences as the new label for the augmented sample:

ȳ = (l_i − |t_i^k|) / (l_i − |t_i^k| + |t_j^l|) · y_i + |t_j^l| / (l_i − |t_i^k| + |t_j^l|) · y_j,   (2)

where l_i is the length of x_i and |t_i^k|, |t_j^l| are the lengths of the subtrees. In the new sentence, l_i − |t_i^k| words from x_i are kept and |t_j^l| words from sentence x_j are inserted. In Equation 2, (l_i − |t_i^k|) / (l_i − |t_i^k| + |t_j^l|) is the fraction of words that come from x_i, which determines the weight of y_i. The label is thus created based on the conjecture that the change in labels is proportional to the length changes in the original sentences. We provide a set of augmentation examples from TreeMix in Table A.1 in the Appendix.

Algorithm 1: Dataset construction
Input: Original dataset D; data size multiplier β; parameters λ_L and λ_U
Output: Augmentation dataset D′
while |D′| < β|D| do
    Randomly select two samples (x_i, y_i) and (x_j, y_j) ∈ D
    (x̄, ȳ) = TreeMix((x_i, y_i), (x_j, y_j))
    D′ ← D′ ∪ {(x̄, ȳ)}
end

Pairwise Sentence Classification Task. The above mainly used single sentence classification as the running example for TreeMix. TreeMix can easily be extended to the pairwise sentence classification problem, where the label is the relationship between the two sentences. Formally, for a given sample (x_i, x′_i, y_i), we randomly sample another sample (x_j, x′_j, y_j), run the parser, and get the subtree sets of each sentence: S(x_i), S(x′_i) and S(x_j), S(x′_j). Then we randomly sample subtrees t_i^k ∈ S_λ(x_i), t′_i^{k′} ∈ S_λ(x′_i) and t_j^l ∈ S_λ(x_j), t′_j^{l′} ∈ S_λ(x′_j). We construct x̄ by replacing t_i^k with t_j^l, and x̄′ by replacing t′_i^{k′} with t′_j^{l′}. The new label is created as:

ȳ = (l_i + l′_i − |t_i^k| − |t′_i^{k′}|) / (l_i + l′_i − |t_i^k| − |t′_i^{k′}| + |t_j^l| + |t′_j^{l′}|) · y_i + (|t_j^l| + |t′_j^{l′}|) / (l_i + l′_i − |t_i^k| − |t′_i^{k′}| + |t_j^l| + |t′_j^{l′}|) · y_j.   (3)

The meanings of the notations are the same as in Equation 2. Our main algorithm is shown in Algorithm 1. Although not all sentences created by TreeMix are fluent or even valid new sentences, they contain subparts with different meanings that encourage the models to build rich representations of sentences in a compositional manner. Note that the augmented labels are convex combinations of the original labels: only when the model learns the representations of the two parts together can it predict both labels with the correct weights.

3.2 Training Objective

Our model is trained on a combination of the original samples and the augmentation samples to obtain a trade-off between regularization and noise injection. The final training objective is:

L = E_{(x,y)∼D} [−y^⊺ log P_θ(y|x)] + γ E_{(x̄,ȳ)∼D′} [−ȳ^⊺ log P_θ(ȳ|x̄)],   (4)

where γ is the weight [1] on the augmentation samples.

[1] Section B in the Appendix presents discussions on how the objective and different weight parameters affect the result.
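As an illustration of Equations 1, 2, and 4 (our sketch, not the authors' released implementation), the following code performs the span swap and the soft-label construction, reproducing the [7/12, 5/12] label from Figure 1; subtree spans are supplied directly here rather than coming from a parser.

```python
# A minimal sketch of TreeMix mixing (Eq. 1) and label creation (Eq. 2),
# assuming subtree spans have already been extracted from a parser.
import torch
import torch.nn.functional as F

def treemix(tokens_i, label_i, span_i, tokens_j, label_j, span_j, num_classes=2):
    ri, si = span_i          # [ri, si) span of the removed subtree t_i^k
    rj, sj = span_j          # [rj, sj) span of the inserted subtree t_j^l
    new_tokens = tokens_i[:ri] + tokens_j[rj:sj] + tokens_i[si:]
    kept = len(tokens_i) - (si - ri)   # l_i - |t_i^k| words kept from x_i
    inserted = sj - rj                 # |t_j^l| words taken from x_j
    y = torch.zeros(num_classes)
    y[label_i] += kept / (kept + inserted)
    y[label_j] += inserted / (kept + inserted)
    return new_tokens, y

x_i = "They will find little interest in this poor film .".split()
x_j = "It comes as a touching transcendent love story .".split()
# Swap "this poor film" (tokens 6:9) for "a touching transcendent love story" (3:8).
x_bar, y_bar = treemix(x_i, 0, (6, 9), x_j, 1, (3, 8))
print(" ".join(x_bar))  # They will find little interest in a touching ... story .
print(y_bar)            # tensor([0.5833, 0.4167]), i.e. [7/12, 5/12]

# Soft-label cross-entropy for the augmented term of the objective (Eq. 4).
def soft_ce(logits, soft_targets):
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()
```

Note that when the two samples share a label, ȳ collapses back to a one-hot vector; it is the cross-category pairs (Section 5.4) that yield genuinely soft targets.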
4 Experiment

4.1 Datasets

To test TreeMix's effectiveness, we experiment with a variety of text classification benchmarks, as shown in Table 3. We use accuracy as the metric, and exclude those datasets from GLUE (Wang et al., 2019) that are not suitable for mixup: CoLA, which measures linguistic acceptability and would be ruined by mixup operations, and WNLI, which is too small to show a method's validity.

Dataset | Task | Class | Size
Single Sentence Classification
SST2 | Sentiment | 2 | 67k/1.8k
TREC-fine | Question | 47 | 5.5k/500
TREC-coarse | Question | 6 | 5.5k/500
AG_NEWS | News | 4 | 12k/4k
IMDb | Sentiment | 2 | 12.5k/12.5k
Pair Sentence Classification
RTE | NLI | 2 | 3.5k/300
MRPC | Paraphrase | 2 | 3.7k/400
QNLI | QA | 2 | 105k/5.5k
QQP | Paraphrase | 2 | 364k/40.4k
MNLI | NLI | 3 | 393k/9.8k

Table 3: Dataset or split name, task category, number of label classes, and sizes used for training and testing. For tasks from GLUE, Size indicates (#train:#validation); for TREC, AG_NEWS, and IMDb, Size indicates (#train:#test).

4.2 Experiment Setup

The proposed TreeMix method creates new samples by combining text spans based on constituency tree information, so we use the Stanford CoreNLP toolkit [2] to obtain parsing-related information (Manning et al., 2014). We use the pretrained language model bert-base-uncased for sequence classification from HuggingFace. With seeds ranging from 0 to 4 and λ_L = 0.1, λ_U = 0.3, we use TreeMix to generate two and five times more samples than the original training set [3]. In the training stage, we replicate the original dataset to the same size as the augmentation dataset, to ensure that the model receives the same amount of data from the original dataset and the augmentation dataset in each training batch. If not specified, we train the model for 5 epochs, with a maximum sequence length of 128 and a batch size of 96. The model is optimized using the AdamW optimizer with an eps of 1e-8 and a learning rate of 2e-5. Table C.1 in the Appendix contains detailed hyper-parameter settings for each dataset.

[2] The specific version is 3.9.2.
[3] Section D.1 in the Appendix presents a robustness check on how different amounts of augmented data affect the result.

4.3 Baselines

We compare TreeMix with the following baselines: (1) No augmentation (BERT): standard training without any augmentation. (2) EDA, which randomly performs insertion, replacement, swapping, and deletion on the text. (3) AEDA, which randomly inserts punctuation into the text. (4) Back translation (BT) (Edunov et al., 2018): texts are translated between English and German using Transformer architectures trained on WMT16 English-German. (5) GPT3Mix (Yoo et al., 2021), which designs prompts and utilizes GPT-3 to generate new examples to train the model. (6) SSMix (Yoon et al., 2021), which applies mixup based on the saliency (Simonyan et al., 2014) of tokens, similar to PuzzleMix (Kim et al., 2020) and SaliencyMix (Uddin et al., 2021). (7) EmbedMix, the pretrained-language-model version of WordMixup in Guo et al. (2019), which performs mixup at the embedding level. (8) TMix (Chen et al., 2020b), which first encodes the two inputs separately, then performs linear interpolation of the two embeddings at a certain encoder layer, and finally forward-passes the combined embedding through the remaining layers.

5 Results and Analysis

5.1 Performance on Full Datasets

The results of TreeMix on the entire datasets are shown in Table 4.
TreeMix outperforms all baselines significantly on single sentence classification tasks, demonstrating the superiority of using compositional substructures for substitution and augmentation. For instance, on SST2 it improves over BERT by 0.98%, more than double the improvement of any other method. This is because, unlike SSMix, which substitutes text spans based on saliency, TreeMix makes use of constituency information to identify linguistically informed sentence substructures; by recombining these components, the compositional diversity of the datasets can be maximized. With the TreeMix-generated samples, the model sees more combinations of substructures in the training stage than are available in the original corpus, leading to better generalization ability.

TreeMix is also very effective on sentence relationship classification tasks. For example, it improves by 2.47% on the RTE dataset, whereas the best improvement among other methods is only 0.3%, and it improves by 0.82% on QNLI, where other data augmentation methods have little effect. We hypothesize that when two constituent parts from one sentence pair are embedded into another sentence pair, the inherent relationship is also embedded. This helps the model learn to identify two pairs of relationships in a single sample, which further increases its capacity to categorize these challenging adversarial sentences.

Since TreeMix works by increasing dataset diversity and providing models with more text patterns to learn, it yields very significant improvements on relatively small datasets such as RTE and TREC, compared to large datasets such as AG_NEWS, QQP, and MNLI that already contain a lot of diversity and text patterns.

5.2 Influence of Constituency Information

To determine the importance of constituency information, we designed a Random Mixup (RandMix) that randomly selects text spans as long as the ratio of span length to sentence length is less than a particular threshold λ_rand [4]. The rest of the setup for RandMix is the same as for TreeMix. We compare TreeMix and RandMix on single sentence classification datasets in Figure 2; a sketch of RandMix follows at the end of this subsection.

[4] We observed that λ_rand ∼ U(0, 0.3) is optimal, and we use this setting for the experiments.

[Figure 2: Accuracy of BERT, BERT+RandMix, and BERT+TreeMix on single sentence classification datasets (SST2, TREC-f, TREC-c, IMDb, AG_NEWS); scores are averaged over 5 random seeds.]

We found that both RandMix and TreeMix are quite effective, but TreeMix outperforms RandMix on most datasets. For instance, TreeMix exceeds RandMix by 0.8% on SST2, 0.6% on TREC-f, and 0.5% on TREC-c. One exception is IMDb, where the average sentence length is much longer. The poorer performance of TreeMix there is due to sparse parsing results on long sentences: since there are many subtrees, substituting any single part might bring very minimal change to the entire sentence.
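For reference, here is a minimal sketch of RandMix as described above (our reading of the baseline, not released code): spans are random substrings capped by λ_rand, and labels are mixed by the same length-proportional rule as TreeMix.

```python
# A minimal sketch of RandMix: like TreeMix, but spans are random
# substrings rather than constituency subtrees.
import random

def randmix(tokens_i, label_i, tokens_j, label_j, lam_rand, num_classes=2):
    # Sample span lengths capped at lam_rand * sentence length.
    max_i = max(1, int(lam_rand * len(tokens_i)))
    max_j = max(1, int(lam_rand * len(tokens_j)))
    len_i = random.randint(1, max_i)
    len_j = random.randint(1, max_j)
    ri = random.randrange(0, len(tokens_i) - len_i + 1)
    rj = random.randrange(0, len(tokens_j) - len_j + 1)
    new_tokens = tokens_i[:ri] + tokens_j[rj:rj + len_j] + tokens_i[ri + len_i:]
    kept, inserted = len(tokens_i) - len_i, len_j
    y = [0.0] * num_classes
    y[label_i] += kept / (kept + inserted)
    y[label_j] += inserted / (kept + inserted)
    return new_tokens, y

random.seed(0)
x, y = randmix("they will find little interest in this poor film .".split(), 0,
               "it comes as a touching love story .".split(), 1,
               lam_rand=random.uniform(0, 0.3))  # threshold drawn per footnote 4
print(" ".join(x), y)
```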
5.3 Influence of Training Set Size

To examine the influence of TreeMix with different training set sizes, we uniformly sample 1%, 2%, 5%, 10%, and 20% of the data from the training set to investigate TreeMix in low-resource situations. The entire test set is used to evaluate the model's generalization ability. Since TreeMix generates more examples for training, we use RandMix to generate the same number of extra samples as a comparison to ensure the data size is fair. The results are summarized in Figure 3.

We found that: (1) TreeMix outperforms RandMix in all settings, further demonstrating the advantage of compositional substructures with constituency information over randomly selected spans. (2) Both mixup methods can significantly improve the model's performance in the case of extreme data scarcity (e.g., 1% and 2%). (3) When the amount of data is sufficient (e.g., more than 5%), TreeMix outperforms RandMix by a significant margin; however, TreeMix only slightly outperforms RandMix when there is a severe lack of data (e.g., 1% and 2%). This is because very small datasets often contain very limited structures, constraining TreeMix's ability to increase text patterns and compositional diversity. (4) The relative improvement of TreeMix over conventional training without augmentation diminishes as the amount of data increases, largely because additional augmented text patterns might overlap with those already existing in the dataset, resulting in limited improvement.

Model | SST2 | TREC-f | TREC-c | IMDb | AG NEWS | MRPC | RTE | QNLI | QQP | MNLI
BERT | 92.96† | 92.36 | 97.08† | 93.63 | 94.67 | 84.90 | 68.15 | 90.54 | 90.67 | 84.27†
BERT+EDA | 92.57† | 91.95 | 97.20† | 93.62 | 94.67 | - | - | - | - | -
BERT+AEDA | 92.48 | 92.15 | 96.68 | 93.59 | 94.22 | - | - | - | - | -
BERT+BT | 93.25† | 92.15 | 97.60† | - | - | 82.13 | 67.40 | - | - | -
BERT+GPT3Mix | 93.14† | - | - | - | - | - | - | - | - | -
BERT+SSMix | 93.03† | 92.80 | 97.44† | 93.74 | 94.64 | 84.31 | 68.40 | 90.60 | 90.75 | 84.54†
BERT+EmbedMix | 93.03† | 92.32 | 97.52† | 93.72 | 94.72 | 85.34 | 68.37 | 90.44 | 90.58 | 84.35†
BERT+TMix | 92.20† | 92.68 | 96.79† | 93.69 | 94.69 | 85.69 | 68.45 | 90.48 | 90.66 | 84.30†
BERT+TreeMix | 93.92 | 93.20 | 97.95 | 94.34 | 94.72 | 85.34 | 70.62 | 91.36 | 90.88 | 84.45

Table 4: Results of the comparison with baselines on full datasets; TREC-f and TREC-c indicate TREC-fine and TREC-coarse respectively. Scores are averaged over 5 random seeds. For GLUE tasks we report accuracy on the validation sets; for the other datasets we report test accuracy. † denotes a result extracted from the original paper. "-" marks settings that are not reported: EDA and AEDA seriously damage the sentence relationship and harm accuracy; GPT3Mix only reports full-data results on SST2 in the original paper; and we only report back translation on small datasets due to its heavy computational cost.

[Figure 3: Results on SST2 varying data size (BERT, BERT+RandMix, BERT+TreeMix; x-axis: percentage of data; y-axis: accuracy). Scores are averaged over 5 random seeds.]

5.4 Influence of Cross-Category Mixing

Different from prior work (Shi et al., 2021), TreeMix allows the composition of sentences from different label categories. To test whether this cross-label-category mixup is more effective than within-label-category mixup, we conducted ablation studies with TreeMix on samples from the same class [5]. Table 5 shows the results.

[5] Section D.2 in the Appendix discusses the effect of constraining the length and label of the swapped subtree on the result.

Datasets | BERT | TM(same) | TM(cross)
SST2 | 92.96 | 93.78 | 93.92
TREC-fine | 92.36 | 92.60 | 93.20
TREC-coarse | 97.08 | 97.74 | 97.95
IMDb | 93.63 | 94.22 | 94.34
AGNEWS | 94.67 | 94.47 | 94.71
MRPC | 84.90 | 85.34 | 85.34
RTE | 68.15 | 70.25 | 70.62
QNLI | 90.54 | 90.87 | 91.36
QQP | 90.67 | 90.85 | 90.88
MNLI | 84.27 | 84.33 | 84.45

Table 5: Performance with TreeMix performed (1) within the same classes, TM(same), and (2) across different classes, TM(cross), averaged over 5 runs.

Across all datasets, we found that TreeMix combining data from different classes is more effective than combining data from the same class, consistent with the findings in Zhang et al. (2018).
When given only labels from one category, current models have a tendency to make simple or spurious judgments based on the most frequently occurring words. However, the semantics of a sentence are complicated beyond simple words. For example, a model is likely to classify a sentence like "I like this good movie" as positive because of the words "like" and "good", but if "good movie" is replaced with "bad film", the model must perceive the different constituent parts within the sentence. This ability can only be obtained when the model is trained on cross-category generated samples.

5.5 Influence of Length Ratio

The only constraint we impose on TreeMix is the length ratio of the subtree, controlled by λ. We select subtrees that are between 10% and 30% and between 30% and 50% of the length of the sentence, respectively. Table 6 shows the results.

Dataset | BERT | λ = [0.1, 0.3] | λ = [0.3, 0.5]
SST2 | 92.96 | 93.92 | 93.05
TREC-fine | 92.36 | 93.2 | 92.25
TREC-coarse | 97.08 | 97.95 | 96.94
IMDb | 93.63 | 94.34 | 93.29
AG NEWS | 94.67 | 94.72 | 94.53
MRPC | 84.90 | 85.34 | 84.93
RTE | 68.15 | 70.62 | 70.35
QNLI | 90.54 | 91.36 | 90.78
QQP | 90.67 | 90.88 | 90.54
MNLI | 84.27 | 84.45 | 83.78

Table 6: Performance with different length ratio intervals λ.

On all datasets, λ = [0.1, 0.3] outperforms λ = [0.3, 0.5], in line with Zhang et al. (2018)'s observation that too-high mixup ratio values can lead to underfitting. A linguistic explanation is as follows: with λ = [0.3, 0.5], TreeMix may select longer text spans, which usually contain unique constituency components like SBAR; the exchange of these spans severely damages the sentence's semantic and grammatical structure, confusing the model. As a result, TreeMix with larger switching spans performs poorly, and even worse than the baseline on some datasets.

5.6 Compositional Generalization

To quantify TreeMix's overall ability of compositional generalization beyond classification tasks, we conducted experiments on the SCAN dataset (Lake and Baroni, 2018), a command execution dataset widely used to test for systematic compositionality. It contains simple source commands and target action sequences. We test on commonly used challenging splits: addprim-jump, addprim-turn-left, and around-right, where primitive commands (e.g., "jump") only appear alone during training but are combined with other modifiers (e.g., "jump twice") during testing. A model that works well on this task should learn to compose the primitive commands with the modifiers and generate the corresponding execution.

Method | JUMP | TURN-L | AROUND-R
Baseline | 0† | 49%† | 0†
WordDrop | 0† | 51%† | 0†
SwitchOut | 0† | 49%† | 0†
SeqMix | 16%† | 99%† | 0†
TreeMix | 72% | 99% | 0%
GECA | 87%† | - | 82†
GECA+WordDrop | 51%† | - | 61†
GECA+SwitchOut | 77%† | - | 73†
GECA+SeqMix | 98%† | - | 89†
GECA+TreeMix | 99% | - | 91%

Table 7: Experimental results (accuracy) on SCAN. † denotes a result extracted from the original paper.

With TreeMix, we can generate compositional commands that are not seen in the training set. The new command generation process is the same as in single sentence classification, except that we increase the length constraint λ_U to 1 to allow the exchange of commands consisting of only one word. After we synthesize new commands, we follow the rules in Lake and Baroni (2018) to translate valid commands into actions and filter out ungrammatical commands.
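To illustrate this command-to-action translation and filtering, here is a toy re-implementation of a subset of the SCAN interpretation rules (our sketch based on Lake and Baroni (2018), with abbreviated action names; it covers only the constructs shown and returns None for commands it cannot parse, which is how ungrammatical recombinations would be filtered out).

```python
# A toy interpreter for a subset of SCAN: turns recombined commands into
# action sequences and rejects ungrammatical ones (returns None).
PRIM = {"walk": "WALK", "look": "LOOK", "run": "RUN", "jump": "JUMP",
        "turn": ""}  # "turn" contributes no action beyond its direction
TURN = {"left": "LTURN", "right": "RTURN"}

def interpret(command):
    words = command.split()
    # Conjunction: "x and y" -> x y ; "x after y" -> y x.
    for conj, flip in (("and", False), ("after", True)):
        if conj in words:
            i = words.index(conj)
            a = interpret(" ".join(words[:i]))
            b = interpret(" ".join(words[i + 1:]))
            if a is None or b is None:
                return None
            return (b + a) if flip else (a + b)
    # Repetition: "x twice" / "x thrice".
    if words and words[-1] in ("twice", "thrice"):
        inner = interpret(" ".join(words[:-1]))
        return None if inner is None else inner * (2 if words[-1] == "twice" else 3)
    # Directed forms: "u left", "u opposite left", "u around left".
    if len(words) >= 2 and words[-1] in TURN:
        turn, verb = TURN[words[-1]], words[0]
        if verb not in PRIM:
            return None
        act = [PRIM[verb]] if PRIM[verb] else []
        if len(words) == 2:
            return [turn] + act
        if len(words) == 3 and words[1] == "opposite":
            return [turn, turn] + act
        if len(words) == 3 and words[1] == "around":
            return ([turn] + act) * 4
        return None
    # Bare primitive.
    if len(words) == 1 and PRIM.get(words[0]):
        return [PRIM[words[0]]]
    return None

print(interpret("jump around right"))   # ['RTURN', 'JUMP'] repeated 4 times
print(interpret("jump right twice"))    # a valid recombined command
print(interpret("jump around around"))  # None -> filtered out
```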
We follow the settings in Andreas (2020) and use the following data augmentation methods as baselines: (1) WordDrop, which drops words randomly; (2) SwitchOut (Wang et al., 2018), which randomly replaces words with other random words from the same vocabulary; (3) SeqMix (Guo et al., 2020), which creates new synthetic examples by softly combining input/output sequences from the training set; and (4) GECA (Andreas, 2020), which performs enumerated valid swaps.

As shown in Table 7, TreeMix outperforms SwitchOut and WordDrop on all splits. TreeMix by itself does not perform as well as GECA, but when combined with GECA it demonstrates very strong results. TreeMix outperforms SeqMix in all splits, because TreeMix can more precisely find the linguistically rich compositional segments of a sentence, as evidenced by the comparisons of TreeMix with SSMix in Section 5.1 and with RandMix in Section 5.3. A closer look at the augmented samples shows that TreeMix can generate all possible combinations of "jump" and other modifiers like "left" and "around"; these previously unseen command combinations further validate TreeMix's ability to improve the dataset's compositional diversity.

TreeMix demonstrates weak performance on the around-right split, where the model observes the commands "around" and "right" in isolation at the training stage and has to derive the meaning of "around right" at test time. This is because the word "around" cannot be parsed as a single subtree for swapping; instead, it always appears in a subtree together with the word "left", preventing TreeMix from generating the phrase "around right". Despite this limitation on around-right, TreeMix performs well on all other splits and can be easily combined with other data augmentation methods, demonstrating the compositional generalization ability of TreeMix beyond classification tasks.

6 Conclusion

This work introduced TreeMix, a compositional data augmentation approach for natural language understanding. TreeMix leverages constituency parsing trees to decompose sentences into substructures and further uses the mixup data augmentation technique to recombine them into new augmented sentences. Experiments on text classification and semantic parsing benchmarks demonstrate that TreeMix outperforms prior strong baselines, especially in low-resource settings and compositional generalization.

Acknowledgements

The authors would like to thank the reviewers for their helpful insights and feedback. This work is funded in part by a grant from Salesforce.

References

Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich, Amir Kantor, George Kour, Segev Shlomov, Naama Tepper, and Naama Zwerdling. 2020. Do not have enough data? Deep learning to the rescue! In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7383–7390. AAAI Press. Jacob Andreas. 2020. Good-enough compositional data augmentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7556–7566, Online. Association for Computational Linguistics. Toms Bergmanis, Katharina Kann, Hinrich Schütze, and Sharon Goldwater. 2017. Training data augmentation for low-resource morphological inflection.
In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 31–39, Vancouver. Association for Computational Linguistics. Hengyi Cai, Hongshen Chen, Yonghao Song, Cheng Zhang, Xiaofang Zhao, and Dawei Yin. 2020. Data manipulation: Towards effective instance learning for neural dialogue generation via learning to augment and reweight. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6334–6343, Online. Association for Computational Linguistics. Jiaao Chen, Derek Tam, Colin Raffel, Mohit Bansal, and Diyi Yang. 2021. An empirical survey of data augmentation for limited data learning in nlp. Jiaao Chen, Zhenghui Wang, Ran Tian, Zichao Yang, and Diyi Yang. 2020a. Local additivity based data augmentation for semi-supervised NER. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1241–1251, Online. Association for Computational Linguistics. Jiaao Chen, Zichao Yang, and Diyi Yang. 2020b. MixText: Linguistically-informed interpolation of hidden space for semi-supervised text classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2147– 2157, Online. Association for Computational Linguistics. Luoxin Chen, Weitong Ruan, Xinyue Liu, and Jianhua Lu. 2020c. SeqVAT: Virtual adversarial training for semi-supervised sequence labeling. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8801–8811, Online. Association for Computational Linguistics. Xinyun Chen, Chen Liang, Adams Wei Yu, Dawn Song, and Denny Zhou. 2020d. Compositional generalization via neural-symbolic stack machines. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. Yong Cheng, Lu Jiang, Wolfgang Macherey, and Jacob Eisenstein. 2020. AdvAug: Robust adversarial augmentation for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5961–5970, Online. Association for Computational Linguistics. Xiang Dai and Heike Adel. 2020. An analysis of simple data augmentation for named entity recognition. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3861–3867, Barcelona, Spain (Online). International Committee on Computational Linguistics. Mathieu Dehouck and Carlos Gómez-Rodríguez. 2020. Data augmentation via subtree swapping for dependency parsing of low-resource languages. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3818–3830, Barcelona, Spain (Online). International Committee on Computational Linguistics. Bosheng Ding, Linlin Liu, Lidong Bing, Canasai Kruengkrai, Thien Hai Nguyen, Shafiq Joty, Luo Si, and Chunyan Miao. 2020. DAGA: Data augmentation with a generation approach for low-resource tagging tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6045–6057, Online. Association for Computational Linguistics. Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500, Brussels, Belgium. Association for Computational Linguistics. Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. 
Data augmentation for low-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 567–573, Vancouver, Canada. Association for Computational Linguistics. Steven Y. Feng, Varun Gangal, Dongyeop Kang, Teruko Mitamura, and Eduard Hovy. 2020. GenAug: Data augmentation for finetuning text generators. In Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 29–42, Online. Association for Computational Linguistics. Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. 2021. A survey of data augmentation approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 968–988, Online. Association for Computational Linguistics. Demi Guo, Yoon Kim, and Alexander Rush. 2020. Sequence-level mixed sample data augmentation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5547–5552, Online. Association for Computational Linguistics. Hongyu Guo, Yongyi Mao, and Richong Zhang. 2019. Augmenting data with mixup for sentence classification: An empirical study. Junxian He, Jiatao Gu, Jiajun Shen, and Marc’Aurelio Ranzato. 2020. Revisiting self-training for neural sequence generation. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. Jonathan Herzig and Jonathan Berant. 2021. Spanbased semantic parsing for compositional generalization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 908–921, Online. Association for Computational Linguistics. Wei-Ning Hsu, Hao Tang, and James R. Glass. 2018. Unsupervised adaptation with interpretable disentangled representations for distant conversational speech recognition. ArXiv, abs/1806.04872. Wei-Ning Hsu, Yu Zhang, and James Glass. 2017. Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation. 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22, Berlin, Germany. Association for Computational Linguistics. Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. 2020. SMART: Robust and efficient fine-tuning for pretrained natural language models through principled regularized optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2177–2190, Online. Association for Computational Linguistics. Akbar Karimi, Leonardo Rossi, and Andrea Prati. 2021. Aeda: An easier data augmentation technique for text classification. Jang-Hyun Kim, Wonho Choo, Hosan Jeong, and Hyun Oh Song. 2021. Co-mixup: Saliency guided joint mixup with supermodular diversity. Jang-Hyun Kim, Wonho Choo, and Hyun Oh Song. 2020. Puzzle mix: Exploiting saliency and local statistics for optimal mixup. 
In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 5275–5285. PMLR. Sosuke Kobayashi. 2018. Contextual augmentation: Data augmentation by words with paradigmatic relations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 452–457, New Orleans, Louisiana. Association for Computational Linguistics. Oleksandr Kolomiyets, Steven Bethard, and Marie-Francine Moens. 2011. Model-portability experiments for textual temporal analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 271–276, Portland, Oregon, USA. Association for Computational Linguistics. Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2020. Data augmentation using pre-trained transformer models. In Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems, pages 18–26, Suzhou, China. Association for Computational Linguistics. Brenden M. Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 2879–2888. PMLR. Dayiheng Liu, Yeyun Gong, Jie Fu, Yu Yan, Jiusheng Chen, Jiancheng Lv, Nan Duan, and Ming Zhou. 2020a. Tell me how to ask again: Question data augmentation with controllable rewriting in continuous space. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5798–5810, Online. Association for Computational Linguistics. Ruibo Liu, Guangxuan Xu, Chenyan Jia, Weicheng Ma, Lili Wang, and Soroush Vosoughi. 2020b. Data boost: Text data augmentation through reinforcement learning guided conditional generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9031–9041, Online. Association for Computational Linguistics. Nikolaos Malandrakis, Minmin Shen, Anuj Goyal, Shuyang Gao, Abhishek Sethi, and Angeliki Metallinou. 2019. Controlled text generation for data augmentation in intelligent artificial agents. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 90–98, Hong Kong. Association for Computational Linguistics. Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland. Association for Computational Linguistics. Zhengjie Miao, Yuliang Li, Xiaolan Wang, and Wang-Chiew Tan. 2020. Snippext: Semi-supervised opinion mining with augmented data. In WWW '20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020, pages 617–628. ACM / IW3C2. Takeru Miyato, Shin-Ichi Maeda, Masanori Koyama, and Shin Ishii. 2019. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979–1993. Nathan Ng, Kyunghyun Cho, and Marzyeh Ghassemi. 2020.
SSMBA: Self-supervised manifold based data augmentation for improving out-of-domain robustness. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1268–1283, Online. Association for Computational Linguistics. Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics. Dinghan Shen, Ming Zheng, Yelong Shen, Yanru Qu, and Weizhu Chen. 2020. A simple but tough-to-beat data augmentation approach for natural language understanding and generation. ArXiv, abs/2009.13818. Haoyue Shi, Karen Livescu, and Kevin Gimpel. 2020. On the role of supervision in unsupervised constituency parsing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7611–7621, Online. Association for Computational Linguistics. Haoyue Shi, Karen Livescu, and Kevin Gimpel. 2021. Substructure substitution: Structured data augmentation for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3494–3508, Online. Association for Computational Linguistics. Haoyue Shi, Hao Zhou, Jiaze Chen, and Lei Li. 2018. On tree-based neural sentence modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4631–4641, Brussels, Belgium. Association for Computational Linguistics. Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. A. F. M. Shahab Uddin, Mst. Sirazam Monira, Wheemyung Shin, TaeChoong Chung, and Sung-Ho Bae. 2021. Saliencymix: A saliency guided data augmentation strategy for better regularization. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. William Yang Wang and Diyi Yang. 2015. That’s so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2557–2563, Lisbon, Portugal. Association for Computational Linguistics. Xinyi Wang, Hieu Pham, Zihang Dai, and Graham Neubig. 2018. SwitchOut: an efficient data augmentation algorithm for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 856–861, Brussels, Belgium. Association for Computational Linguistics. Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP), pages 6382–6388, Hong Kong, China. Association for Computational Linguistics. Zhanghao Wu, Shuai Wang, Yanmin Qian, and Kai Yu. 2019. Data augmentation using variational autoencoder for embedding based speaker verification. In INTERSPEECH. Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. 2020. 
Unsupervised data augmentation for consistency training. Ziang Xie, Sida I. Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, and Andrew Y. Ng. 2017. Data noising as smoothing in neural network language models. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. Kang Min Yoo, Dongju Park, Jaewook Kang, SangWoo Lee, and Woomyeong Park. 2021. Gpt3mix: Leveraging large-scale language models for text augmentation. Soyoung Yoon, Gyuwan Kim, and Kyumin Park. 2021. SSMix: Saliency-based span mixup for text classification. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3225–3234, Online. Association for Computational Linguistics. Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. Qanet: Combining local convolution with global self-attention for reading comprehension. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net. Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Seong Joon Oh, Youngjoon Yoo, and Junsuk Choe. 2019. Cutmix: Regularization strategy to train strong classifiers with localizable features. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 6022–6031. IEEE. Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. 2018. mixup: Beyond empirical risk minimization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net. Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 649–657. Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. 2020. Freelb: Enhanced adversarial training for natural language understanding. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

A Augmentation examples

Original Sentence 1 | Original Sentence 2 | New Sentence
a love story and a murder mystery that expands into a meditation on the deep deceptions of innocence [1] | really an advantage to invest such subtlety and warmth in an animatronic bear when the humans are acting like puppets [0] | a love story and are acting like puppets that expands into a meditation on the deep deceptions of innocence [0.21 0.79]
the attempt to build up a pressure cooker of horrified awe [0] | had the ability to mesmerize, astonish and entertain [1] | the attempt to build up the ability of horrified awe [0.8 0.2]
rest contentedly with the knowledge that he 's made at least one damn fine horror movie [1] | minor film [0] | rest contentedly with the knowledge that he 's made minor film damn fine horror movie [0.13 0.87]
might just be better suited to a night in the living room than a night at the movies [0] | are made for each other . [1] | might just be better suited to each other room than a night at the movies [0.86 0.14]
is a touching reflection on aging , suffering and the prospect of death [1] | keep upping the ante on each other [1] | is each other on aging , suffering and the prospect of death [0 1]
is dark , brooding and slow , and takes its central idea way too seriously [0] | merely pretentious [0] | is dark , brooding and slow , and takes merely pretentious too seriously [1. 0.]

Table A.1: Examples of TreeMix on the SST2 dataset. The number following each sentence is its label; bold tokens are the selected phrases for substitution.
B The necessity of merged loss techniques

We provide a detailed discussion of the techniques proposed in Section 3.2. We first investigate the noise contained in the augmentation dataset, and then examine how an unbalanced dataset affects performance. In the second part, we vary the weight parameter γ to see how it affects the model's learning process.

B.1 Noise and Imbalance

As previously stated, all mixup methods introduce noise into the dataset. In text, this noise includes grammatical structure confusion and mixed semantic meanings within sentences. The model will be overwhelmed by the noise if trained solely on the generated augmentation dataset, and will even perform worse than the baseline. In terms of the imbalance problem, we find that training the model without replicating the original dataset to the same size as the augmentation dataset hurts the model's performance. The results are shown in Table B.1.

Dataset | BERT | Merged Loss | Augmentation only | No Replicate
SST2 | 92.96 | 93.92 | 92.57 | 93.05
TREC-f | 92.36 | 93.2 | 90.44 | 92.42
TREC-c | 97.08 | 97.95 | 96.42 | 97.21
IMDb | 93.63 | 94.34 | 92.37 | 93.7
AG NEWS | 94.67 | 94.72 | 93.98 | 94.65
MRPC | 84.9 | 85.34 | 83.93 | 85.02
RTE | 68.15 | 70.63 | 65.45 | 69.56
QNLI | 90.54 | 91.36 | 88.72 | 91.04
QQP | 90.67 | 90.88 | 89.24 | 90.72
MNLI | 84.27 | 84.45 | 83.78 | 84.35

Table B.1: Merged Loss indicates results following the techniques in Section 3.2; Augmentation only indicates the model is trained on the generated dataset alone; No Replicate indicates Merged Loss without replication of the original training set.

B.2 Weight parameter

We vary the weight parameter γ to find the optimal balance point between diversity and linguistic grammar; the results are shown in Figure 4. Performance on the two kinds of classification tasks follows a similar pattern: accuracy increases with increasing weight, and then decreases rapidly after reaching the highest point. Performance is weaker than the baseline when the weight value exceeds 0.7. We find the model achieves the best performance with γ ∈ {0.2, 0.5}. For single sentence classification tasks, γ = 0.5 always gives higher accuracy, and γ = 0.2 is better for the sentence relation classification datasets.
[Figure 4: The performance when varying the value of the weight parameter γ on single sentence classification (left) and sentence relation classification (right).]

Datasets | epochs | batch size | aug batch size | val steps | sequence length | aug weight
SST2 | 5 | 96 | 96 | 100 | 128 | 0.5
TREC-f | 20 | 96 | 96 | 100 | 128 | 0.5
TREC-c | 20 | 96 | 96 | 100 | 128 | 0.5
IMDb | 5 | 8 | 8 | 500 | 512 | 0.5
AGNEWS | 5 | 96 | 96 | 500 | 128 | 0.5
MRPC | 10 | 32 | 32 | 100 | 128 | 0.2
RTE | 5 | 32 | 32 | 50 | 128 | -0.2
QNLI | 5 | 96 | 96 | 100 | 128 | 0.2
QQP | 5 | 96 | 96 | 300 | 128 | 0.2
MNLI | 5 | 96 | 96 | 100 | 128 | 0.2

Table C.1: Best settings for the different datasets.

C Hyper-parameters for each dataset

We explore different parameter combinations and find the best ones for each task, as shown in Table C.1. There are some exceptions. On the TREC datasets, the model cannot converge even with 10 epochs, so we increase the number of training epochs to 20. IMDb's examples are extremely long, with an average length of more than 200 words; accordingly, we increase the truncation length to 512 and reduce the batch size to 8 to fully capture the semantic meaning. RTE is the most unusual. First, when we train on the original RTE dataset, the accuracy deviation is substantial, reaching up to 4%. Second, we find that γ = −0.2 is optimal for this set, which contradicts our previous findings.

D Ablation Study

Shi et al. (2021) proposed a similar approach that uses constituency information for mixup. There are a few significant differences between our approaches. To begin with, their method is overly restricted: they only perform mixup between examples from the same category, and they require the substituted subtree's label to be the same. Second, because they are limited to same-class examples, they cannot devise a method for assigning a soft label to the example. In contrast, in the previous sections we used TreeMix with only the length constraint. In this section, we investigate several other constraints on the subtree selection process; we achieve better performance than Shi et al. (2021) by giving the subtree selection process more freedom, and we validate that their work is a special case of our method by examining how other constraints affect performance. The values in this section are averages of five runs with seeds ranging from 0 to 4.

D.1 What is the difference between different amounts of data?

In theory, TreeMix can generate an almost unlimited amount of augmented data. In practice, however, it can only improve performance up to a point: once the size of the augmentation dataset reaches a certain limit, additional data stops helping. We therefore investigated how much augmentation data the model needs. Table D.1 shows the results of producing twice and five times as much augmentation data.

Dataset | Size | BERT | TM(x2) | TM(x5)
RTE | 3.5k | 68.15 | 70.57 | 70.62
MRPC | 3.7k | 84.90 | 85.22 | 85.37
TREC-f | 5.5k | 92.36 | 93.2 | 92.85
TREC-c | 5.5k | 97.08 | 97.71 | 97.95
IMDb | 12.5k | 93.63 | 94.34 | 94.24
SST2 | 67k | 92.96 | 93.92 | 93.92
QNLI | 105k | 90.54 | 91.36 | 91.34
AGNEWS | 120k | 94.67 | 94.71 | 94.69
QQP | 364k | 90.67 | 90.88 | 90.83
MNLI | 393k | 84.27 | 84.45 | 84.41

Table D.1: Performance on all datasets with different amounts of augmentation data. TM(x2) indicates generating twice as much augmentation data as the original data; TM(x5) indicates five times as much. Datasets are listed in order of size.

The key to getting the best results is to strike a balance between the original dataset and the augmentation dataset in terms of diversity and linguistic confusion.
With more augmentation data, the model will learn more patterns while also observing more grammatically poor samples, which can negatively impact performance. We discovered that augmentation datasets twice the size of the original dataset produce the best results for larger datasets. This is in line with our earlier analysis: large datasets inherently include more patterns and diversity, which helps the model generalize better, so maintaining the original linguistic grammar while increasing diversity is more important for them. When working with smaller datasets, it is better to train with more augmentation data; for models trained on these datasets, we believe diversity is more important than linguistic grammar. TREC-fine is an exception. We attribute this to the dataset's excessive number of classes (up to 47 classes within only 5.5k training samples): each class has a very limited number of samples, and if we create too many augmented samples, the limited samples of each category are insufficient to resist the injected linguistic noise. As a result, for TREC-fine, x2 is preferable to x5. In general, we recommend generating five times as much augmentation data for a smaller dataset, and twice as much for a larger dataset.

D.2 Is it beneficial to keep the swapped subtree's label or length the same?

[Figure 5: Performance on SST2 with different subtree selection constraints (TreeMix(label), TreeMix(length), TreeMix). The green part is BERT's performance and the orange part is the improvement from TreeMix when applying different constraints. TreeMix(label) selects only subtrees with the same phrase label, TreeMix(length) selects only subtrees with the same length, and TreeMix applies no constraints.]

Each subtree has its own label (e.g., VP or NP) and corresponds to a specific text span. When selecting subtrees, we can use these characteristics as additional constraints. Figure 5 shows the results. The experimental results clearly show that performance suffers when we impose restrictions on the subtree selection process. We hypothesize that this is because, in datasets with similar sentence lengths, subtrees of the same phrase label or phrase length tend to have similar structures (e.g., tree height, relative position in the sentence). Although the exchange of such subtrees can retain the original linguistic grammar of the text to some extent (e.g., replacing a noun phrase with another noun phrase will not significantly disrupt the sentence) and maintain a similar sentence length, it cannot exploit the potential compositional diversity in the datasets as efficiently as TreeMix without any constraints, resulting in lower-diversity augmentation datasets and limited improvement over the baseline. Comparing TreeMix(label) and TreeMix(length), we find that TreeMix(label) prefers simple phrases such as NP and VP, because these are the most common phrases occurring in sentences, and exchanging them does not improve the diversity of the datasets. For example, in "I like this apple," replacing "apple" with "orange" will not provide innovative text patterns.