Continual Learning for Text Classification with Information Disentanglement Based Regularization

Yufan Huang∗, Yanzhe Zhang∗, Jiaao Chen, Xuezhi Wang1, Diyi Yang
Georgia Institute of Technology, 1Google
{yhuang704, jiaaochen, dyang888}@gatech.edu, 1xuezhiw@google.com

Abstract

Continual learning has become increasingly important as it enables NLP models to constantly learn and gain knowledge over time. Previous continual learning methods are mainly designed to preserve knowledge from previous tasks, without much emphasis on how to generalize models well to new tasks. In this work, we propose an information disentanglement based regularization method for continual learning on text classification. Our proposed method first disentangles text hidden spaces into representations that are generic to all tasks and representations specific to each individual task, and further regularizes these representations differently to better constrain the knowledge required to generalize. We also introduce two simple auxiliary tasks, next sentence prediction and task-id prediction, for learning better generic and specific representation spaces. Experiments conducted on large-scale benchmarks demonstrate the effectiveness of our method in continual text classification tasks with various sequences and lengths over state-of-the-art baselines.

1 Introduction

Computational systems in real-world scenarios frequently face changing environments, and are thus often required to learn continually from dynamic streams of data, building on what was learned before (Parisi et al., 2019). While humans intrinsically acquire and transfer knowledge continually throughout their lifespans, most machine learning models suffer from catastrophic forgetting: when learning new tasks, models dramatically and rapidly forget knowledge from previous tasks (McCloskey and Cohen, 1989). As a result, Continual Learning (CL) (Ring, 1998; Thrun, 1998) has received increasing attention recently, as it can enable models to perform positive transfer (Perkins et al., 1992) as well as remember previously seen tasks.

∗ Equal contribution.

A growing body of research has been conducted to equip neural networks with continual learning abilities (Kirkpatrick et al., 2017; Lopez-Paz and Ranzato, 2017; Aljundi et al., 2018). Existing continual learning methods for NLP tasks can be broadly categorized into two classes: purely replay-based methods (d'Autume et al., 2019; Sun et al., 2019; Holla et al., 2020), where examples from previous tasks are stored and re-trained on during the learning of the new task to retain old information, and regularization-based methods (Wang et al., 2019; Han et al., 2020), where constraints are added on model parameters to prevent them from changing too much while learning new tasks. The former usually stores an extensive amount of data from old tasks (d'Autume et al., 2019) or trains language models conditioned on task identifiers to generate sufficient examples (Sun et al., 2019), which significantly increases memory costs and training time. The latter utilizes previous examples efficiently via constraints added on the text hidden space or model parameters, but it generally views all information as equally important and regularizes it to the same extent (Wang et al., 2019), making it hard for models to differentiate representations that need to be retained from those that need a large degree of updates.
However, we argue that when learning new tasks, task generic information and task specific information should be treated differently, as the generic representations might function consistently across tasks while the task specific representations might need to change significantly. To this end, we propose an information disentanglement based regularization method for continual learning on text classification. Specifically, we first disentangle the text hidden representation space into a task generic space and a task specific space using two auxiliary tasks: next sentence prediction for learning task generic information, and task identifier prediction for learning task specific representations. When training on new tasks, we constrain the task generic representations to stay relatively stable while allowing the task specific representations to be more flexible. To further alleviate catastrophic forgetting without much increase in memory or training time, we augment our regularization-based method by storing and replaying only a small amount of representative examples (e.g., 1% of samples, selected by memory selection rules such as K-Means (MacQueen et al., 1967)).

To sum up, our contributions are threefold:
• We propose an information disentanglement based regularization method for continual text classification, to better learn and constrain task generic and task specific knowledge.
• We augment the regularization approach with a memory selection rule that requires only a small number of replayed examples.
• Extensive experiments conducted on five benchmark datasets demonstrate the effectiveness of our proposed method compared to state-of-the-art baselines.

2 Related Work

Continual Learning Existing continual learning research can be broadly divided into four categories: (i) replay-based methods, which remind models of information from seen tasks via experience replay (d'Autume et al., 2019), distillation (Rebuffi et al., 2017), representation alignment (Wang et al., 2019) or optimization constraints (Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2018b), using examples sampled from previous tasks (Rebuffi et al., 2017; d'Autume et al., 2019) or synthesized with generative models (Shin et al., 2017; Sun et al., 2019); (ii) regularization-based methods, which constrain the model's output (Li and Hoiem, 2017), hidden space (Rannen et al., 2017), or parameters (Lopez-Paz and Ranzato, 2017; Zenke et al., 2017; Aljundi et al., 2018) from changing too much in order to retain learned knowledge; (iii) architecture-based methods, where different tasks are associated with different components of the overall model to directly minimize the interference between new and old tasks (Rusu et al., 2016; Mallya and Lazebnik, 2018); and (iv) meta-learning-based methods, which learn robust model initializations (Obamuyide and Vlachos, 2019) or data representations (Javed and White, 2019; Holla et al., 2020; Wang et al., 2020) before training on task sequences to alleviate forgetting. Among these approaches, replay-based and regularization-based methods have been widely applied to NLP tasks to enable large pre-trained models (Devlin et al., 2018; Radford et al., 2019) to continually acquire novel world knowledge from streams of textual data without forgetting already learned knowledge.
For instance, replaying examples has shown promising performance for text classification (d'Autume et al., 2019; Sun et al., 2019; Holla et al., 2020), relation extraction (Wang et al., 2019; Obamuyide and Vlachos, 2019; Han et al., 2020) and question answering (d'Autume et al., 2019; Sun et al., 2019; Wang et al., 2020). However, replay-based methods often suffer from large memory costs or considerable training time, due to the need to store an extensive amount of text (d'Autume et al., 2019; Wang et al., 2019) or to train language models to generate a sufficient number of examples (Sun et al., 2019). Recently, regularization-based methods (Wang et al., 2019) have also been applied to directly constrain the knowledge deposited in model parameters without abundant rehearsal examples. Despite better efficiency compared to replay-based methods, current regularization-based approaches often fail to generalize well to new tasks, as they treat and constrain all information equally and thus limit the needed updates for parameters that are specific to different tasks. To overcome these limitations, we propose to first distinguish, through information disentanglement, the hidden spaces that need to be retained from those that need to be updated substantially, and then regularize the different spaces separately, to better remember previous knowledge as well as transfer to new tasks. In addition, we enhance our regularization method by replaying only a limited amount of examples selected by K-means as the memory selection rule.

Textual Information Disentanglement Our work is related to information disentanglement for text data, which has been extensively explored in generation tasks like style transfer (Fu et al., 2017; Zhao et al., 2018; Romanov et al., 2018; Li et al., 2020), where text hidden representations are often disentangled into sentiment (Fu et al., 2017; John et al., 2018), content (Romanov et al., 2018; Bao et al., 2019) and syntax (Bao et al., 2019) information, through supervised learning from pre-defined labels (John et al., 2018) or unsupervised learning with adversarial training (Fu et al., 2017; Li et al., 2020). Building on these prior works, we differentiate the task generic space from the task specific space via supervision from two simple yet effective auxiliary tasks: next sentence prediction and task identifier prediction.

3 Problem Formulation

In this work, we focus on continual learning for a sequence of text classification tasks {T_1, ..., T_n}, where we learn a model f_\theta(\cdot); \theta is a set of parameters shared by all tasks, and each task T_i contains a different set of sentence-label training pairs (x^i_{1:m}, y^i_{1:m}). After learning all tasks in the sequence, we seek to minimize the generalization error on all tasks (Parisi et al., 2019):

R(f_\theta) = \sum_{i=1}^{n} \mathbb{E}_{(x^i, y^i) \sim T_i} \left[ \mathcal{L}(f_\theta(x^i), y^i) \right]

We use two techniques commonly adopted for this problem setting in our proposed model:
• Regularization: in order to preserve knowledge stored in the model, a constraint is added on the model's output (Li and Hoiem, 2017), hidden space (Zenke et al., 2017), or parameters (Lopez-Paz and Ranzato, 2017; Zenke et al., 2017; Aljundi et al., 2018) to prevent them from changing too much while learning new tasks.
• Replay: when learning new tasks, experience replay (Rebuffi et al., 2017) is commonly used to recover knowledge from previous tasks: a memory buffer first stores seen examples from previous tasks, and the stored data is then replayed together with the training set for the current task.
Formally, after training on task t − 1 (t ≥ 2), γ|S_{t−1}| examples are randomly sampled from the (t−1)-th training set S_{t−1} into the memory buffer M, where 0 ≤ γ ≤ 1 is the store ratio. Data from M is then merged with the t-th training set S_t when learning task t.

4 Method

In continual learning, the model needs to adapt to new tasks quickly while maintaining the ability to recover information from previous tasks; hence, not all information stored in the hidden representation space should be treated equally. For example, syntactic knowledge might be shared globally across all tasks, like the ability to recognize word order and grammar, while some knowledge is unique to certain tasks and should be preserved separately. This key observation motivates us to propose an information disentanglement based regularization for continual text classification, to retain shared knowledge while adapting specific knowledge to streams of tasks (Section 4.1). We also incorporate a small set of representative replay samples to alleviate catastrophic forgetting (Section 4.3). Our model architecture is shown in Figure 1.

Figure 1: Our proposed model architecture. We disentangle the hidden representation into a task generic space and a task specific space via different inductive biases. When training on new tasks, the different spaces are regularized separately. In addition, a small portion of previous data is stored and replayed.

4.1 Information Disentanglement (ID)

This section describes how we disentangle sentence representations into a task generic space and a task specific space, and how separate regularizations are imposed on them for continual text classification. Formally, for a given sentence x, we first use a multi-layer encoder B(·), e.g., BERT (Devlin et al., 2018), to get the hidden representation r, which contains both task generic and task specific information. We then introduce two disentanglement networks G(·) and S(·) to extract the generic representation g and the specific representation s from r. For new tasks, we learn the classifiers by utilizing information from both spaces, and we allow the different spaces to change to different extents to best retain knowledge from previous tasks.

Task Generic Space The task generic space is the hidden space containing information generic to the different tasks in a task sequence. When switching from one task to another, the generic information should remain roughly the same; e.g., syntactic knowledge should not change much across the learning of a sequence of tasks. To extract the task generic information g from the hidden representation r, we leverage the next sentence prediction task (Devlin et al., 2018) to learn the generic information extractor G(·). More specifically, we insert a [SEP] token into each training example during tokenization to form a sequence pair labeled IsNext, and swap the first and second sequences to form a sentence pair labeled NotNext. To distinguish IsNext pairs from NotNext pairs, the extractor G(·) needs to learn the contextual dependencies between the two segments, which is beneficial for understanding every example and generic across individual tasks. Denote x̃ as the NotNext example corresponding to x (IsNext), and l ∈ {0, 1} as the label for next sentence prediction. We build a sentence relation predictor f_nsp on top of the generic feature extractor G(·):

\mathcal{L}_{nsp} = \mathbb{E}_{x \in S_t \cup M} \left[ \mathcal{L}(f_{nsp}(G(B(x))), 0) + \mathcal{L}(f_{nsp}(G(B(\tilde{x}))), 1) \right]

where \mathcal{L} is the cross entropy loss, M is the memory buffer, and S_t is the t-th training set.
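To make this concrete, the following is a minimal sketch, not the authors' released code, of how the IsNext/NotNext pairs and the generic-space NSP loss could be implemented with PyTorch and HuggingFace Transformers. The split_halves helper, the midpoint split, the [CLS] pooling of B(x), and the module names (generic_enc, nsp_head) are our own illustrative assumptions.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
generic_enc = nn.Sequential(nn.Linear(768, 128), nn.Tanh())  # G(.): task generic extractor
nsp_head = nn.Linear(128, 2)                                 # f_nsp(.): sentence relation predictor
ce = nn.CrossEntropyLoss()

def split_halves(text):
    # Assumed segmentation: split each example at its midpoint so the two
    # halves can be paired as segment A / segment B around [SEP].
    words = text.split()
    mid = max(1, len(words) // 2)
    return " ".join(words[:mid]), " ".join(words[mid:])

def nsp_loss(texts):
    seg_a, seg_b = zip(*(split_halves(t) for t in texts))
    # IsNext keeps the original segment order (label 0); NotNext swaps the
    # two segments (label 1), following Section 4.1.
    first = list(seg_a) + list(seg_b)
    second = list(seg_b) + list(seg_a)
    labels = torch.tensor([0] * len(texts) + [1] * len(texts))
    enc = tokenizer(first, second, padding=True, truncation=True, return_tensors="pt")
    r = bert(**enc).last_hidden_state[:, 0]   # r = B(x), [CLS] representation
    logits = nsp_head(generic_enc(r))         # f_nsp(G(B(x)))
    return ce(logits, labels)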
Task Specific Space Models also need task specific information to perform well on each task. For example, in sentiment classification, words like "good" or "bad" can be very informative, but they might not generalize well to tasks like topic classification. We thus employ a simple task identifier prediction task on the task specific representation s: for any given example, we want to predict which task the example belongs to. This simple auxiliary setup encourages s to embed different information for different tasks. The loss for the task identifier predictor f_task is:

\mathcal{L}_{task} = \mathbb{E}_{(x,z) \in S_t \cup M} \mathcal{L}(f_{task}(S(B(x))), z)

where z is the corresponding task id for x.

Text Classification To adapt to the t-th task, we combine the task generic representation g = G(B(x)) and the task specific representation s = S(B(x)) to perform text classification, where we minimize the cross entropy loss:

\mathcal{L}_{cls} = \mathbb{E}_{(x,y) \in S_t \cup M} \mathcal{L}(f_{cls}(g \circ s), y)

Here y is the corresponding class label for x, f_cls(·) is the class predictor, and ◦ denotes the concatenation of the two representations.

4.2 ID Based Regularization

To further prevent severe distortion when training on new tasks, we employ regularization on both the generic representations g and the specific representations s. Different from previous approaches (Li and Hoiem, 2017; Zenke et al., 2017; Aljundi et al., 2018), which treat all the spaces equally, we allow regularization to different extents on g and s, as knowledge in different spaces should be preserved separately to encourage both more positive transfer and less forgetting. Specifically, before training all the modules on task t, we first compute the generic and specific representations of all sentences x from the training set S_t of the current task t and the memory buffer M_t, using the trained B^{t−1}(·), G^{t−1}(·) and S^{t−1}(·) from the previous task t − 1: g^{t−1} = G^{t−1}(B^{t−1}(x)) and s^{t−1} = S^{t−1}(B^{t−1}(x)), in order to preserve the knowledge captured by the previous model. The computed generic and specific representations are saved. While learning from the training pairs of task t, we impose two regularization losses separately:

\mathcal{L}_{greg} = \mathbb{E}_{x \in S_t \cup M_t} \| G^{t-1}(B^{t-1}(x)) - G(B(x)) \|^2
\mathcal{L}_{sreg} = \mathbb{E}_{x \in S_t \cup M_t} \| S^{t-1}(B^{t-1}(x)) - S(B(x)) \|^2

4.3 Memory Selection Rule

Since we only store a small number of examples to balance replay against the extra memory cost and training time, we need to select them carefully in order to utilize the memory buffer M efficiently. If two stored examples are very similar, storing only one of them could achieve similar results in the future; thus, the stored examples should be as diverse and representative as possible. To this end, after training on the t-th task, we employ K-means (MacQueen et al., 1967) to cluster all the examples from the current training set S_t: for each x ∈ S_t, we use its embedding B(x) as the input feature for K-means. We set the number of clusters to γ|S_t| and only select the example closest to each cluster's centroid, following Wang et al. (2019) and Han et al. (2020).

4.4 Overall Objective

The final objective for continual learning on text classification is:

\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{nsp} + \mathcal{L}_{task} + \lambda_g \mathcal{L}_{greg} + \lambda_s \mathcal{L}_{sreg}    (1)

We set the coefficients of the first three loss terms to 1 for simplicity and only introduce two coefficients to tune: λg and λs. In practice, L_task and L_cls are also computed on each generated NotNext example x̃, and L_greg and L_sreg are only optimized starting from the second task.
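For reference, a minimal sketch of how the losses in Equation 1 could be combined in one training step is shown below; it is our illustration under assumed module names, not the authors' implementation. Unlike Section 4.2, which caches g^{t−1} and s^{t−1} before training on task t, this sketch recomputes them on the fly with a frozen copy of the previous model, and it omits applying L_task and L_cls to the NotNext examples.

import torch
import torch.nn.functional as F

def idbr_step(model, prev_model, batch, lambda_g, lambda_s, first_task=False):
    # Assumed batch layout: tokenized inputs x, class labels y, task ids z,
    # plus pre-built IsNext/NotNext inputs nsp_x and their labels nsp_y.
    x, y, z, nsp_x, nsp_y = batch

    r = model.bert(**x).last_hidden_state[:, 0]      # r = B(x)
    g, s = model.G(r), model.S(r)                    # disentangled representations

    loss = F.cross_entropy(model.f_cls(torch.cat([g, s], dim=-1)), y)  # L_cls
    loss = loss + F.cross_entropy(model.f_task(s), z)                  # L_task

    r_nsp = model.bert(**nsp_x).last_hidden_state[:, 0]
    loss = loss + F.cross_entropy(model.f_nsp(model.G(r_nsp)), nsp_y)  # L_nsp

    if not first_task:  # L_greg / L_sreg are only used from the second task on
        with torch.no_grad():  # regularization targets from the previous model
            r_old = prev_model.bert(**x).last_hidden_state[:, 0]
            g_old, s_old = prev_model.G(r_old), prev_model.S(r_old)
        loss = loss + lambda_g * ((g - g_old) ** 2).sum(dim=-1).mean()  # L_greg
        loss = loss + lambda_s * ((s - s_old) ** 2).sum(dim=-1).mean()  # L_sreg
    return loss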
The full ID based regularization (IDBR) algorithm is shown in Algorithm 1.

Algorithm 1 IDBR
Input: Training sets {S1, ..., Sn}, replay frequency β, store ratio γ, coefficients λg, λs
Output: Optimal models B, G, S, f_nsp, f_task, f_cls
  M = {}  ▷ Initialize memory buffer
  Initialize B using pretrained BERT
  Initialize G, S, f_nsp, f_task, f_cls
  for t = 1, ..., n do
      if t ≥ 2 then
          Store G(B(x)), S(B(x)), ∀x ∈ St ∪ M
          for batches ∈ St do
              Optimize L in Equation 1
              if step mod β = 0 then  ▷ Replay
                  Sample t − 1 batches from M
                  Optimize L in Equation 1
              end if
          end for
      else  ▷ No regularization on the 1st task
          for batches ∈ St do
              Optimize L = Lcls + Lnsp + Ltask
          end for
      end if
      C = K-Means(St, nclu = γ|St|)  ▷ C: centroids
      C′ = {examples closest to the centers in C}
      M ← M ∪ C′  ▷ Add to memory
  end for
  return B, G, S, f_nsp, f_task, f_cls

5 Experiment

5.1 Datasets

Following MBPA++ (d'Autume et al., 2019), we use five text classification datasets (Zhang et al., 2015) to evaluate our method: AG News (news classification), Yelp (sentiment analysis), DBPedia (Wikipedia article classification), Amazon (sentiment analysis), and Yahoo! Answers (Q&A classification). A summary of the datasets is shown in Table 1. We merge the label spaces of Amazon and Yelp considering their domain similarity; therefore, there are 33 classes in total.

Dataset   Class  Type       Train   Test
AGNews    4      News       8000    7600
Yelp      5      Sentiment  10000   7600
Amazon    5      Sentiment  10000   7600
DBPedia   14     Wikipedia  28000   7600
Yahoo     10     Q&A        20000   7600
Table 1: Dataset statistics used for Setting (Sampled). Type denotes the domain of the classification task.

5.2 Experiment Setup

Due to resource limitations, for most of our experiments we randomly sample 2,000 examples per class for every task and create a reduced dataset; see Table 1 for the train/test size of each dataset. We name this setting Setting (Sampled), and tune all hyperparameters on its basis. Beyond that, to compare with the previous state of the art, we also conduct experiments on the same training and test sets as MBPA++ (d'Autume et al., 2019) and LAMOL (Sun et al., 2019), which contain 115,000 training examples and 7,600 test examples for each task. We name the latter Setting (Full).

Our experiments are mainly conducted on the task sequences shown in Table 2. To minimize the effect of task order and task sequence length on the results, we examine both length-3 and length-5 task sequences in various orders. The first three task sequences are cyclic permutations of ag → yelp → yahoo, three classification tasks in different domains (news classification, sentiment analysis, Q&A classification). The last four length-5 task sequences follow d'Autume et al. (2019).

Order  Task Sequence
1      ag yelp yahoo
2      yelp yahoo ag
3      yahoo ag yelp
4      ag yelp amazon yahoo dbpedia
5      yelp yahoo amazon dbpedia ag
6      dbpedia yahoo ag amazon yelp
7      yelp ag dbpedia amazon yahoo
Table 2: Seven different random task sequences used for the experiments. The first six are used in Setting (Sampled); the last four are used in Setting (Full).

5.3 Baselines

We compare our proposed model with the following baselines in our experiments:
• Finetune (Yogatama et al., 2019): finetune the BERT model sequentially, without the episodic memory module or any other loss.
• Replay (Wang et al., 2019; d'Autume et al., 2019): the Finetune model augmented with an episodic memory.
• Regularization: on top of Replay, with an L2 regularization term.
• MBPA++ (d'Autume et al., 2019): augments the BERT model with an episodic memory module and stores all seen examples. MBPA++ performs experience replay at training time, and uses KNN to select examples for local adaptation at test time.
• LAMOL (Sun et al., 2019): a language model that generates pseudo training samples for experience replay. Text classification is performed in a Q&A format.
• Multi-task learning (MTL): the model is trained on all tasks simultaneously, which can be considered an upper bound for continual learning methods since it has access to data from all tasks at the same time.

5.4 Implementation Details

We use pretrained BERT-base-uncased from HuggingFace Transformers (Wolf et al., 2020) as our base feature extractor. The task generic encoder and the task specific encoder are each one linear layer followed by a Tanh activation, and their output sizes are both 128 dimensions. The predictors built on top of the encoders are each one linear layer followed by a softmax activation. We use AdamW (Loshchilov and Hutter, 2017) as the optimizer. For all modules except the task-id predictor, we set the learning rate to lr = 3e-5; for the task-id predictor, we set its learning rate to lr_task = 5e-4. The weight decay for all parameters is 0.01. For experience replay, we set the store ratio γ = 0.01, i.e., we store 1% of the seen examples in the episodic memory module. Besides, we set the replay frequency β = 10, i.e., we perform experience replay once every ten steps. For information disentanglement, we mainly tune the coefficients of the regularization losses. For batches from the memory buffer M, we set λg to 5.0 and select the best λs from {1.0, 2.0, 3.0, 4.0, 5.0}. For batches from the current training set S, we set λg to 0.5 and select the best λs from {0.1, 0.2, 0.3, 0.4, 0.5}.

6 Results and Discussion

We evaluated models after training on all tasks and report their average accuracy on all test sets as our metric. Table 3 summarizes our results in Setting (Sampled). While continual finetuning suffered from severe forgetting, experience replay with 1% stored examples achieved promising results, which demonstrates the importance of experience replay for continual learning in NLP. Beyond that, simple regularization turned out to be a robust method on top of experience replay, showing consistent improvements on all six orders. Our proposed Information Disentanglement Based Regularization (IDBR) further improves over regularization consistently under all circumstances.

Table 4 compares IDBR with the previous state of the art, MBPA++ and LAMOL, in Setting (Full). We list some differences among the settings in the table. Although MBPA++ applies local adaptation at test time, IDBR still outperformed it by a clear margin. We achieved comparable results with LAMOL even though LAMOL requires task identifiers during inference, which makes its prediction easier.

6.1 Impact of the Lengths of Task Sequences

Comparing the results of length-3 and length-5 sequences in Table 3, we found that the gap between IDBR and multi-task learning became bigger when the length of the task sequence changed from 3 to 5. To better understand how IDBR gradually forgets, we followed Chaudhry et al.
(2018a) to measure forgetting as follows:

F_k = \mathbb{E}_{j=1 \ldots k-1} \left[ f_j^k \right], \quad f_j^k = \max_{l \in \{1, \ldots, k-1\}} a_{l,j} - a_{k,j}

where a_{l,j} is the model's accuracy on task j after being trained on task l.

Length-3 Task Sequences
Model           Order 1  Order 2  Order 3  Average
Finetune        25.79    36.56    41.01    34.45
Replay          69.32    70.25    71.31    70.29
Regularization  71.50    70.88    72.93    71.77
IDBR            71.80    72.72    73.08    72.53
MTL             74.16    74.16    74.16    74.16

Length-5 Task Sequences
Model           Order 4  Order 5  Order 6  Average
Finetune        32.37    32.22    26.44    30.34
Replay          68.25    70.52    70.24    69.67
Regularization  72.28    73.03    72.92    72.74
IDBR            72.63    73.72    73.23    73.19
MTL             75.09    75.09    75.09    75.09
Table 3: Summary of results on Setting (Sampled), using averaged accuracy after training on the last task. All results are averaged over 3 runs. The p-values of a paired t-test between the nine numbers of Regularization and IDBR are 0.018 on length-3 and 0.009 on length-5, demonstrating a significant difference.

Model      TT  TI  LA  LM  Order 4  Order 5  Order 6  Order 7  Average
MBPA++ †   -   -   ✓   -   70.7     70.2     70.9     70.8     70.7
MBPA++ ††  -   -   ✓   -   74.9     73.1     74.9     74.1     74.3
LAMOL ††   ✓   ✓   -   ✓   76.1     76.1     77.2     76.7     76.5
IDBR       ✓   -   -   -   75.9     76.2     76.4     76.7     76.3
Table 4: Summary of results on Setting (Full) (length-5 task sequences), using averaged accuracy after training on the last task. Our results are averaged over 3 runs. † means numbers are taken from d'Autume et al. (2019); †† means numbers are taken from Sun et al. (2019). TT: whether the task-id is available during training. TI: whether the task-id is available during inference. LA: whether local adaptation is needed during inference. LM: whether a language model is trained.

Order    After 2 tasks  After 3 tasks  After 4 tasks  After 5 tasks
4        0.64           3.18           3.60           3.46
5        1.63           2.56           2.17           2.33
6        0.07           1.56           2.20           2.88
Average  0.78           2.43           2.66           2.89
Table 5: Forgetting measure (Chaudhry et al., 2018a), calculated each time training on a new task finishes. All results are averaged over 3 runs.

On orders 4, 5 and 6, we calculated the forgetting every time IDBR finished training on a new task and summarize the results in Table 5. For continual learning, we hypothesize that the model is prone to more severe forgetting as the task sequence becomes longer. We found that although there was a big drop after training on the 3rd task, IDBR maintained stable performance as the length of the task sequence increased; in particular, after training on the 4th and 5th tasks, the forgetting increment was relatively small, which demonstrated the robustness of IDBR.

Figure 2: t-SNE visualization of (a) the task generic hidden space and (b) the task specific hidden space of IDBR.

Model          Accuracy
Regularization 73.03
IDBR w/o Lnsp  73.17
IDBR w/o Ltask 73.29
IDBR           73.72
Table 6: Comparison among using task-id prediction only, next sentence prediction only, and both of them. All results are averaged over 3 runs.

6.2 Visualizing Disentangled Spaces

To study whether our task generic encoder G tends to learn more generic information and our task specific encoder S captures more task specific information, we used t-SNE (Maaten and Hinton, 2008) to visualize the two hidden spaces of IDBR, using the final model trained on order 2. The results are shown in Figure 2, where Figure 2a visualizes the task generic space and Figure 2b visualizes the task specific space. We observe that, compared with the task specific space, generic features from different tasks were more mixed, which demonstrates that next sentence prediction helped the task generic space to be more task-agnostic than the task specific space, which was induced to learn separate representations for different tasks.
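A visualization of this kind can be produced along the following lines (a sketch assuming scikit-learn's TSNE and matplotlib; the encode helper standing in for B(·) and the NumPy-returning G and S wrappers are our own assumptions, not part of the released code).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_disentangled_spaces(task_texts, encode, G, S):
    # task_texts: one list of sentences per task; encode returns B(x) as a
    # NumPy array, and G / S apply the trained extractors (NumPy output too).
    feats_g, feats_s, labels = [], [], []
    for task_id, texts in enumerate(task_texts):
        r = encode(texts)
        feats_g.append(G(r))
        feats_s.append(S(r))
        labels.extend([task_id] * len(texts))
    labels = np.array(labels)
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    panels = [("Task Generic Space", np.vstack(feats_g)),
              ("Task Specific Space", np.vstack(feats_s))]
    for ax, (title, feats) in zip(axes, panels):
        emb = TSNE(n_components=2, init="pca").fit_transform(feats)
        for task_id in np.unique(labels):
            pts = emb[labels == task_id]
            ax.scatter(pts[:, 0], pts[:, 1], s=5, label=f"task {task_id}")
        ax.set_title(title)
    axes[0].legend()
    plt.show()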
Considering we only employed two simple auxiliary tasks, the effect of information disentanglement was noticeable.

6.3 Ablation Studies

Effect of Disentanglement To demonstrate that each auxiliary task of our information disentanglement helps the learning process, we performed an ablation study on the two auxiliary tasks, using order 5 as a case study. The results are summarized in Table 6. We found that both task-id prediction and next sentence prediction contribute to the final performance. Furthermore, the performance gain was much larger when combining the two auxiliary tasks. Intuitively, the model needs both tasks to disentangle the representation well, since it is easy for the model to ignore one of the spaces if the constraint is not imposed appropriately. The results suggest that the two tasks are likely complementary to each other in helping the model learn better disentangled representations.

Model          Order 4  Order 5  Order 6  Average
Reg only on s  72.05    72.54    72.61    72.40
Reg only on g  72.01    72.98    72.73    72.57
Reg on both    72.63    73.72    73.23    73.19
Table 7: Comparison among regularization on the task specific space only, on the task generic space only, and on both. All results are averaged over 3 runs.

Impact of Regularization To study the effect of regularization on the task generic hidden space g and the task specific hidden space s, we performed an ablation study that applied regularization on only g or only s, and compared the results with regularization on both in Table 7. We found that regularization on both spaces results in much better performance than regularization on only one of them, which demonstrates the necessity of both regularizers. While we may expect to give the specific space more tolerance to change, we found that applying no regularization to it leads to severe forgetting of the previously learnt task specific embeddings; hence it is necessary to add a regularizer over this space as well. Beyond that, we also observed that under most circumstances, adding regularization on the task generic space g results in a more significant gain than adding regularization on the task specific space s, consistent with our intuition that the task generic space changes less across tasks, and thus preserving it better helps more in alleviating catastrophic forgetting.

Rules    Order 1  Order 2  Order 3  Average
Random   71.52    72.60    73.03    72.38
K-Means  71.80    72.72    73.05    72.52
Table 8: Comparison between different selection rules: selecting stored examples randomly or by K-Means. All results are averaged over 3 runs.

Impact of K-Means To test our hypothesis that, when the memory budget is limited, selecting the most representative subset of examples is vital to the success of continual learning, we performed an ablation study on orders 1, 2 and 3 using IDBR with and without K-Means. The results are shown in Table 8. We found that using K-Means helps boost the overall performance to some extent. Specifically, the improvement brought by K-Means was larger on the more challenging orders, i.e., orders on which IDBR had worse performance. This is because forgetting is more severe on these challenging orders, so the model needs more examples from previous tasks to help it retain previous knowledge. Thus, under the same memory budget constraint, diversity across saved examples helps the model better recover knowledge learned from previous tasks.
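For completeness, the selection rule of Section 4.3 can be sketched as follows (assuming scikit-learn's KMeans; the embed helper standing in for B(x) and the function name are illustrative, not the authors' code).

import numpy as np
from sklearn.cluster import KMeans

def select_memory(examples, embed, store_ratio=0.01):
    # embed(x) stands in for the [CLS] representation B(x) of example x.
    feats = np.stack([embed(x) for x in examples])
    n_clusters = max(1, int(store_ratio * len(examples)))  # gamma * |S_t|
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(feats)
    selected = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        if len(members) == 0:
            continue
        # keep the single example closest to this cluster's centroid
        dists = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        selected.append(examples[int(members[np.argmin(dists)])])
    return selected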
7 Conclusion

In this work, we introduce an information disentanglement based regularization (IDBR) method for continual text classification, which disentangles the hidden space into a task generic space and a task specific space and further regularizes them differently. We also leverage K-Means as the memory selection rule to help the model benefit from the augmented episodic memory module. Experiments conducted on five benchmark datasets demonstrate that IDBR achieves better performance than previous state-of-the-art baselines on sequences of text classification tasks with various orders and lengths. We believe the proposed approach can also be extended to continual learning for other NLP tasks such as sequence generation and sequence labeling, and we plan to explore these directions in the future.

Acknowledgment

We would like to thank the anonymous reviewers for their helpful comments, and the members of the Georgia Tech SALT group for their feedback.

References

Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. 2018. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pages 139–154.
Yu Bao, Hao Zhou, Shujian Huang, Lei Li, Lili Mou, Olga Vechtomova, Xinyu Dai, and Jiajun Chen. 2019. Generating sentences from disentangled syntactic and semantic spaces. arXiv preprint arXiv:1907.05789.
Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. 2018a. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pages 532–547.
Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. 2018b. Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420.
Cyprien de Masson d'Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. 2019. Episodic memory in lifelong language learning. arXiv preprint arXiv:1906.01076.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2017. Style transfer in text: Exploration and evaluation. arXiv preprint arXiv:1711.06861.
Xu Han, Yi Dai, Tianyu Gao, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2020. Continual relation learning via episodic memory activation and reconsolidation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6429–6440.
Nithin Holla, Pushkar Mishra, Helen Yannakoudakis, and Ekaterina Shutova. 2020. Meta-learning with sparse experience replay for lifelong language learning. arXiv preprint arXiv:2009.04891.
Khurram Javed and Martha White. 2019. Meta-learning representations for continual learning. In Advances in Neural Information Processing Systems, pages 1820–1830.
Vineet John, Lili Mou, Hareesh Bahuleyan, and Olga Vechtomova. 2018. Disentangled representation learning for non-parallel text style transfer. arXiv preprint arXiv:1808.04339.
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.
Yuan Li, Chunyuan Li, Yizhe Zhang, Xiujun Li, Guoqing Zheng, Lawrence Carin, and Jianfeng Gao. 2020.
Complementary auxiliary classifiers for label-conditional text generation. In AAAI, pages 8303–8310.
Zhizhong Li and Derek Hoiem. 2017. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947.
David Lopez-Paz and Marc'Aurelio Ranzato. 2017. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476.
Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.
James MacQueen et al. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. Oakland, CA, USA.
Arun Mallya and Svetlana Lazebnik. 2018. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765–7773.
Michael McCloskey and Neal J Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier.
Abiola Obamuyide and Andreas Vlachos. 2019. Meta-learning improves lifelong relation extraction. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 224–229.
German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. 2019. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71.
David N Perkins, Gavriel Salomon, et al. 1992. Transfer of learning. International Encyclopedia of Education, 2:6452–6457.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
Amal Rannen, Rahaf Aljundi, Matthew B Blaschko, and Tinne Tuytelaars. 2017. Encoder based lifelong learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 1320–1328.
Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. 2017. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010.
Mark B Ring. 1998. CHILD: A first step towards continual learning. In Learning to Learn, pages 261–292. Springer.
Alexey Romanov, Anna Rumshisky, Anna Rogers, and David Donahue. 2018. Adversarial decomposition of text representation. arXiv preprint arXiv:1808.09042.
Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. 2016. Progressive neural networks. arXiv preprint arXiv:1606.04671.
Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. 2017. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990–2999.
Fan-Keng Sun, Cheng-Hao Ho, and Hung-Yi Lee. 2019. LAMOL: Language modeling for lifelong language learning. In International Conference on Learning Representations.
Sebastian Thrun. 1998. Lifelong learning algorithms. In Learning to Learn, pages 181–209. Springer.
Hong Wang, Wenhan Xiong, Mo Yu, Xiaoxiao Guo, Shiyu Chang, and William Yang Wang. 2019. Sentence embedding alignment for lifelong relation extraction. arXiv preprint arXiv:1903.02588.
Zirui Wang, Sanket Vaibhav Mehta, Barnabás Póczos, and Jaime Carbonell. 2020. Efficient meta lifelong-learning with limited memory. arXiv preprint arXiv:2010.02500.
Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony Moi, Pierric Cistac, Morgan Funtowicz, Joe Davison, Sam Shleifer, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
Dani Yogatama, Cyprien de Masson d'Autume, Jerome Connor, Tomas Kocisky, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, et al. 2019. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373.
Friedemann Zenke, Ben Poole, and Surya Ganguli. 2017. Continual learning through synaptic intelligence. Proceedings of Machine Learning Research, 70:3987.
Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.
Junbo Zhao, Yoon Kim, Kelly Zhang, Alexander Rush, and Yann LeCun. 2018. Adversarially regularized autoencoders. In International Conference on Machine Learning, pages 5902–5911. PMLR.