Luna: Linear Unified Nested Attention

Xuezhe Ma* (ISI, USC) xuezhema@isi.edu
Xiang Kong* (LTI, CMU) xiangk@cs.cmu.edu
Sinong Wang* (Facebook AI) sinongwang@fb.com
Chunting Zhou (LTI, CMU) chuntinz@cs.cmu.edu
Jonathan May (ISI, USC) jonmay@isi.edu
Hao Ma, Luke Zettlemoyer (Facebook AI) {haom, lsz}@fb.com

* Equal contribution.

Abstract

The quadratic computational and memory complexities of the Transformer's attention mechanism have limited its scalability for modeling long sequences. In this paper, we propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, yielding only linear (as opposed to quadratic) time and space complexity. Compared to a more traditional attention mechanism, Luna introduces an additional sequence with a fixed length as input and an additional corresponding output, which allows Luna to perform the attention operation linearly while also storing adequate contextual information. We perform extensive evaluations on three benchmarks of sequence modeling tasks: long-context sequence modeling, neural machine translation, and masked language modeling for large-scale pretraining. Competitive or even better experimental results demonstrate both the effectiveness and efficiency of Luna compared to a variety of strong baseline methods, including full-rank attention and other efficient sparse and dense attention methods. The implementation of our model is available at https://github.com/XuezheMax/fairseq-apollo.

1 Introduction

Transformers (Vaswani et al., 2017) are surprisingly versatile models that perform well on a wide range of language and vision tasks, including machine translation (Vaswani et al., 2017; Ott et al., 2018), language understanding (Devlin et al., 2019), image recognition (Dosovitskiy et al., 2020) and bioinformatics (Madani et al., 2020). Attention (Bahdanau et al., 2015) provides the key mechanism that captures contextual information from the entire sequence by modeling pairwise interactions between the inputs at every timestep. However, a common weakness of Transformers is the quadratic time and memory complexity of the attention mechanism w.r.t. the length of the input sequence, which prohibitively restricts their potential application to tasks requiring longer input sequences.

Figure 1: Trade-off between accuracy (y-axis: average LRA score w/o Retrieval), speed (x-axis: relative speed compared to Transformer) and memory (circle radius) on LRA, comparing Luna-16/128/256 with Transformer, BigBird, Synthesizer, Sinkhorn, Linformer, Performer, Reformer, Linear Transformer and Local Attention.

A number of techniques have been recently introduced to improve the time and memory efficiency of Transformer models ('xformers') (Tay et al., 2020b, 2021). One popular technique uses sparsity to restrict the range of the attention field, such as local attention (Parmar et al., 2018), blockwise attention (Qiu et al., 2019), strided attention patterns (Child et al., 2019; Beltagy et al., 2020), compressed attention (Liu et al., 2018), and attention with learnable patterns (Kitaev et al., 2020; Tay et al., 2020a; Roy et al., 2021). Another emerging approach is to improve efficiency by leveraging low-rank approximations of the attention matrix.
Linformer (Wang et al., 2020), for example, projects the length dimension of the key and value matrices to a fixed-dimensional representation by assuming low-rank structure in the full-rank attention matrix. Recently, some kernel-based methods, such as Linear Transformer (Katharopoulos et al., 2020), Performer (Choromanski et al., 2020) and Random Feature Attention (Peng et al., 2021), attempt to efficiently approximate regular (softmax) full-rank attention through kernelization. Although these models demonstrate better asymptotic complexity for long sequences, their efficiency gains are less prominent for moderate-length sequences and their performance remains behind that of Transformers with regular attention.

In this work, we propose a linear unified nested attention mechanism (Luna), which uses two nested attention functions to approximate the regular softmax attention in the Transformer (§2). Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. Then, the packed sequence is unpacked using the second attention function (§3.1). Compared to a more traditional attention mechanism, Luna introduces an additional sequence with a fixed length as input and an additional corresponding output. Importantly, the extra input allows Luna to perform the attention operation linearly, as efficiently as Linformer (Wang et al., 2020), while also storing adequate contextual information. Unlike Linformer, Luna is capable of modeling variable-length sequences and of performing autoregressive (causal) attention (§3.3, §3.4).

We perform extensive experiments on three sequence modeling tasks, including long-context sequence modeling, neural machine translation, and masked language modeling for large-scale pretraining and downstream task finetuning. Compared to a variety of strong baseline models, Luna achieves competitive or even better performance, while achieving notable efficiency gains in both speed and memory (see Figure 1). More importantly, Luna obtains superior performance with small projection lengths such as 16 (§4).

2 Background

2.1 Attention

The traditional attention mechanism is a function:

    Y = Attn(X, C) = ω( X W_Q (C W_K)^T / √d ) C W_V    (1)

where the attention function Attn : R^{n×d} × R^{m×d} → R^{n×d} takes as input two sequences: the query sequence X ∈ R^{n×d} with length n and the context sequence C ∈ R^{m×d} with length m, and outputs one sequence Y ∈ R^{n×d} with the same length n as the query X. d is the embedding dimension, and W_Q, W_K, W_V ∈ R^{d×d} are three learnable parameters that project the input sequences into the spaces of query, key and value matrices: Q = X W_Q, K = C W_K, V = C W_V. ω is an activation function, e.g. the softmax function in regular attention. Note that the formulation in (1) applies both to cross-attention, where C and X are the representations from the Transformer encoder and decoder, respectively, and to self-attention, where X and C are the same sequence (X = C). In practice, the multi-head variant of attention (Vaswani et al., 2017), which performs the attention function h times in parallel, is commonly used. Throughout this paper, we omit h for simplicity.

In particular, the matrix A = ω(QK^T / √d_k) ∈ R^{n×m} in (1) is called the attention matrix; it specifies the alignment scores between every pair of tokens in the query sequence X and the context sequence C. Calculating A takes O(nm) time and space, which is quadratic with respect to the sequence length and becomes a significant bottleneck when processing long sequences.
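To make the shapes and the O(nm) cost in (1) concrete, the following is a minimal, single-head, unbatched PyTorch sketch of Attn(X, C) with the softmax activation. It is intended only as an illustration of the formulation above, not as the released implementation (which is multi-head and batched).

```python
import math
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Regular softmax attention Attn(X, C) from Eq. (1): O(n*m) time and memory."""

    def __init__(self, d: int):
        super().__init__()
        self.w_q = nn.Linear(d, d, bias=False)  # W_Q
        self.w_k = nn.Linear(d, d, bias=False)  # W_K
        self.w_v = nn.Linear(d, d, bias=False)  # W_V
        self.scale = 1.0 / math.sqrt(d)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # x: (n, d) query sequence, c: (m, d) context sequence
        q, k, v = self.w_q(x), self.w_k(c), self.w_v(c)
        a = torch.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)  # (n, m) attention matrix A
        return a @ v  # (n, d)
```

For example, `Attention(64)(torch.randn(128, 64), torch.randn(256, 64))` returns a (128, 64) tensor, and the intermediate attention matrix has shape (128, 256).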
2.2 Transformer Layers

The other two key components of the Transformer, besides attention, are the position-wise feed-forward network (FFN) and layer normalization (Ba et al., 2016). Technically, the position-wise feed-forward layer operates on each position independently, and layer normalization plays a crucial role in controlling the gradient scales (Xiong et al., 2020). Each Transformer layer can be expressed as:

    X_A = LayerNorm(Attn(X, C) + X)
    X′ = LayerNorm(FFN(X_A) + X_A)    (2)

where X and C are the two input sequences and X′ is the output of the Transformer layer. The Transformer layer in (2) adopts the original post-layer-normalization architecture (Vaswani et al., 2017; Devlin et al., 2019), which places layer normalization after the residual connection, rather than pre-layer normalization (Vaswani et al., 2018; Wang et al., 2019).

Figure 2: Illustration of the architecture of one Transformer encoder layer (left) versus one Luna encoder layer (right).

3 Linear Unified Nested Attention (Luna)

Our goal is to design an efficient attention mechanism to solve the quadratic complexity problem of full attention. We first introduce the proposed linear unified nested attention mechanism, named Luna attention (§3.1), and the architecture of each Luna layer (§3.2). Then, we present the variant of Luna for causal attention, named Luna causal attention (§3.3). Finally, we discuss the differences between Luna and three closely related models: Linformer (Wang et al., 2020), Set Transformer (Lee et al., 2019) and Shared Workspace (Goyal et al., 2021) (§3.4).

3.1 Pack and Unpack Attention

The key idea behind Luna is to decouple the regular attention function in (1) into two nested attention operations, both of which have linear efficiency. To achieve this, besides the original query and context input sequences, Luna introduces an extra input that is a sequence with fixed (constant) length. With this extra input as the query sequence, Luna uses its first attention, named pack attention, to pack the context sequence into a fixed-length sequence. Formally, let P ∈ R^{l×d} denote the extra input sequence with fixed length l. The pack attention first packs C into Y_P with P as the query sequence:

    Y_P = Attn(P, C)    (3)

where Attn(·, ·) is the regular attention function in (1), C ∈ R^{m×d} is the context sequence, and Y_P ∈ R^{l×d} is the output of the pack attention, which we name the packed context. Since the length of P is a constant l, the complexity of pack attention is O(lm), which is linear with respect to m. To unpack the sequence back to the length of the original query sequence X, Luna leverages its second attention, named unpack attention:

    Y_X = Attn(X, Y_P)    (4)

where X ∈ R^{n×d} is the original query sequence. Similar to pack attention, the complexity of unpack attention is O(ln), which is also linear with respect to n.

Encoding Contextual Information in P. The next question is where the extra input sequence P comes from. One straightforward choice is to formulate P as a learnable parameter of each Luna layer. One obvious drawback of this method, however, is that P would not capture any contextual information. To enhance the capacity of the Luna model, we propose to formulate Y_P as an additional output of each Luna layer, corresponding to P.
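The pack-and-unpack scheme of (3)-(4) is straightforward to express in code. Below is a minimal sketch that composes two instances of the Attention module sketched in §2.1; the class name and structure are illustrative only, and the released Luna layers additionally handle batching, multiple heads and parameter sharing.

```python
import torch.nn as nn

class LunaAttention(nn.Module):
    """Nested pack/unpack attention: Y_P = Attn(P, C), Y_X = Attn(X, Y_P)."""

    def __init__(self, d: int):
        super().__init__()
        self.pack_attn = Attention(d)    # packs the length-m context into length l
        self.unpack_attn = Attention(d)  # unpacks back to the query length n

    def forward(self, x, p, c):
        # x: (n, d) query, p: (l, d) fixed-length extra input, c: (m, d) context
        y_p = self.pack_attn(p, c)       # (l, d) packed context, cost O(l*m)
        y_x = self.unpack_attn(x, y_p)   # (n, d) unpacked output, cost O(l*n)
        return y_x, y_p                  # y_p serves as the P input of the next layer
```

Because l is a constant, the two attention matrices have shapes (l, m) and (n, l), so time and memory grow linearly in the sequence lengths rather than quadratically.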
Formally, the Luna attention function LunaAttn(·, ·, ·) takes three sequences as input and generates two sequences as output:

    Y_X, Y_P = LunaAttn(X, P, C)    (5)

where the computation of Y_P and Y_X is given in (3) and (4). By stacking multiple layers of Luna attention, the output Y_P from the previous layer, which captures contextual information of C, is employed as the input P of the next layer. For the first layer of Luna, we formulate P as learnable positional embeddings² (Vaswani et al., 2017).

Reducing the Number of Parameters. Due to the two nested attention operations, there are two sets of parameters (W_Q, W_K, W_V) in a single Luna attention function. There are several techniques to reduce the number of parameters, such as parameter sharing (Xia et al., 2019). In this work, we follow Wang et al. (2020) to share W_K and W_Q in each layer, and conduct experiments to analyze the performance decline against Luna with full sets of parameters (§4.2).

3.2 Luna Layers

The Luna attention is used as a drop-in replacement for the regular attention. We incorporate the position-wise feed-forward network and layer normalization into Luna layers. Concretely, layer normalization is applied to both Y_X and Y_P, while the FFN is applied only to Y_X:

    Y_X, Y_P = LunaAttn(X, P, C)
    X_A, P_A = LayerNorm(Y_X + X), LayerNorm(Y_P + P)
    X′, P′ = LayerNorm(FFN(X_A) + X_A), P_A    (6)

where X′ and P′ are the two outputs of the Luna layer. The graphical specification of one Luna layer is illustrated in Figure 2.

3.3 Luna Causal Attention

As discussed in Tay et al. (2020b), the ability to support causal autoregressive decoding, i.e. attending solely to past and current tokens, is required when designing efficient self-attention mechanisms. However, due to the pack attention that packs the long sequence X into a fixed (shorter) length, it is not straightforward to support causal attention in Luna. To design causal attention in Luna, we need to assume that the input P contains no information about X, i.e. P will not leak any future information of X to the history.

Before we describe the Luna causal attention mechanism, we first define a causal function f : R^{n×d1} × R^{n×d1} × R^{n×d2} → R^{n×d2}:

    F = f(X, Y, Z),  where  F_t = (1/t) Σ_{j=1}^{t} X_t Y_j^T Z_j    (7)

where F ∈ R^{n×d2} and F_t denotes the t-th row of F. From the definition of f in (7), we see that F_t can only access information from the past and present rows of X, Y and Z.

To perform Luna causal attention, we first compute the attention matrix of the pack attention: A_pack = ω(P X^T / √d). For simplicity, we omit the learnable parameters, e.g. W_Q, W_K, W_V in (1). Note that for A_pack, we cannot use the softmax function for ω, as the normalization term in softmax leaks future information of X to the history. Inspired by the causal attention mechanism in Linear Transformer (Katharopoulos et al., 2020), we use two activation functions: 1) ω(·) = elu(·) + 1, based on the exponential linear unit (Clevert et al., 2016); 2) ω(·) = softplus(·), based on the softplus function (Glorot et al., 2011). With the causal function f in (7), we compute the attention matrix of the unpack attention: A_unpack = ω(f(X, X, A_pack^T)). Unlike A_pack, we can use ω(·) = softmax(·) for A_unpack, because the normalization is along the l-dimension rather than the n-dimension of X. Finally, the output Y is computed by Y = f(A_unpack, A_pack^T, X). The complexity of the causal attention in Luna is still linear: O(ln).
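To make these steps concrete, here is a small PyTorch sketch of the causal function f in (7) and of the resulting causal Luna attention, with projections and multiple heads omitted and softplus chosen for ω of the pack attention. It is an illustrative reading of the equations rather than the released implementation, and it materializes the prefix sums in parallel for clarity; autoregressive decoding would instead update them sequentially over t.

```python
import math
import torch

def causal_f(x, y, z):
    """Causal function f from Eq. (7): F_t = (1/t) * sum_{j<=t} (x_t . y_j) z_j.

    x, y: (n, d1), z: (n, d2) -> (n, d2); row t only depends on rows <= t of y and z.
    """
    n = x.size(0)
    prefix = torch.cumsum(torch.einsum('jp,jq->jpq', y, z), dim=0)  # (n, d1, d2): prefix sums of y_j z_j^T
    t = torch.arange(1, n + 1, device=x.device, dtype=x.dtype).unsqueeze(-1)
    return torch.einsum('tp,tpq->tq', x, prefix) / t

def luna_causal_attention(x, p):
    """x: (n, d) input sequence, p: (l, d) extra input assumed to carry no information about x."""
    d = x.size(-1)
    a_pack = torch.nn.functional.softplus(p @ x.transpose(0, 1) / math.sqrt(d))  # (l, n); elu(.)+1 also possible
    a_unpack = torch.softmax(causal_f(x, x, a_pack.transpose(0, 1)), dim=-1)     # (n, l), normalized over l
    return causal_f(a_unpack, a_pack.transpose(0, 1), x)                         # (n, d)
```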
One drawback of Luna causal attention, shared with the causal attention in Random Feature Attention (RFA) (Peng et al., 2021) and Linear Transformer (Katharopoulos et al., 2020), is that it must be computed sequentially over timesteps t.

The sources of P. In the formulation of causal attention, P is expected to contain no information about X. Thus, we need to formulate P based on the usage mode of the causal attention. For the encoder-decoder mode in sequence-to-sequence modeling (e.g. machine translation), we can use the packed output from the Luna encoder as P. For the decoder-only mode (e.g. language modeling), P might be formulated as a learnable parameter of each layer.

² We also experimented with sinusoidal positional embeddings, and obtained similar results.

Table 1: Experimental results on the Long Range Arena (LRA) benchmark. For Luna, we explore three projected dimensions: 16, 128 and 256. 'Avg. (w/o rtl)' denotes the averaged accuracy over all tasks excluding Retrieval. The performance of previous works is taken from Tay et al. (2021).

Models                 ListOps  Text   Retrieval  Image  Pathfinder  Avg.   Avg. (w/o rtl)
Transformer            36.37    64.27  57.46      42.44  71.40       54.39  53.62
Transformer (re-impl)  37.11    65.21  79.14      42.94  71.83       59.24  54.27
Local Attention        15.82    52.98  53.39      41.46  66.63       46.06  44.22
Sparse Trans.          17.07    63.58  59.59      44.24  71.71       51.24  49.15
Longformer             35.63    62.85  56.89      42.22  69.71       53.46  52.60
Linformer              35.70    53.94  52.27      38.56  76.34       51.36  51.14
Reformer               37.27    56.10  53.40      38.07  68.50       50.67  49.99
Sinkhorn Trans.        33.67    61.20  53.83      41.23  67.45       51.39  50.89
Synthesizer            36.99    61.68  54.67      41.61  69.45       52.88  52.43
BigBird                36.05    64.02  59.29      40.83  74.87       55.01  53.94
Linear Trans.          16.13    65.90  53.09      42.34  75.30       50.55  49.92
Performer              18.01    65.40  53.82      42.77  77.05       51.41  50.81
Luna-16                37.43    65.74  79.38      46.39  78.36       61.46  56.98
Luna-128               38.01    65.74  79.55      47.47  78.89       61.93  57.53
Luna-256               37.98    65.78  79.56      47.86  78.55       61.95  57.54

3.4 Discussion

Relation to Linformer and Shared Workspace. One previous work closely related to Luna is Linformer (Wang et al., 2020). Linformer linearly projects the context sequence C ∈ R^{m×d} into a sequence with fixed length l: C′ = EC, where C′ ∈ R^{l×d} is the projected context sequence and E ∈ R^{l×m} is a learnable projection matrix of each layer. Then, the attention operation is applied to the query X and the projected context C′. The pack attention in Luna is a generalization of the linear projection in Linformer. There are two main advantages of Luna over Linformer: i) with pack attention as the projection method, Luna is able to model sequences of varying lengths, whereas Linformer requires the length of all input sequences to be the same m, because the shape of the projection matrix E depends on m; ii) Luna achieves better expressiveness than Linformer, not only due to the more general projection method but also by encoding adequate contextual information into the projection via P (see §3.1). Experimental improvements over a non-contextual projection demonstrate the effectiveness of Luna (see §4.2). In contemporaneous and independent work, Goyal et al. (2021) formulate a contextual P as a shared global workspace, which shares a similar instantiation with Luna.

Relation to Set Transformer. The additional input P in Luna can be regarded as a side memory module that can access the entire sequence to gather contextual information. From this point of view, Luna is also closely related to Set Transformer (Lee et al., 2019), an early model that integrates a side memory module into Transformers.
Similar to the projection matrix in Linformer, the inducing points in Set Transformer are learnable parameters. Thus, these inducing points can be regarded as a non-contextual version of P in Luna. Moreover, Set Transformer is designed for set-input problems, i.e. problems wherein the input is a set of features and the model is thereby invariant to permutation or ordering of the input features (Tay et al., 2020b), whereas Luna attention is used as a drop-in replacement for regular softmax attention.

4 Experiments

4.1 Long-Context Sequence Modeling

We evaluate the effectiveness and efficiency of Luna on the Long Range Arena (LRA) benchmark recently introduced by Tay et al. (2021), which is designed for evaluating efficient Transformer models under the long-context scenario. The benchmark collects five tasks: ListOps (Nangia and Bowman, 2018), byte-level text classification (Text; Maas et al., 2011), byte-level document retrieval (Retrieval; Radev et al., 2013), image classification on sequences of pixels (Image; Krizhevsky et al., 2009) and Pathfinder (Linsley et al., 2018). These tasks consist of input sequences ranging from 1K to 8K tokens and span a variety of data types and modalities.

Table 2: Training speed (steps per second, higher is better) and peak memory consumption (lower is better) of different models on byte-level text classification with various input lengths (1K, 2K, 3K and 4K), reported relative to the vanilla Transformer.

Model            Speed 1K / 2K / 3K / 4K   Memory 1K / 2K / 3K / 4K
Transformer      1.0 / 1.0 / 1.0 / 1.0     1.00 / 1.00 / 1.00 / 1.00
Local Attention  1.1 / 1.7 / 3.2 / 5.3     0.49 / 0.29 / 0.19 / 0.14
Linformer        1.2 / 1.9 / 3.7 / 5.5     0.44 / 0.21 / 0.18 / 0.10
Reformer         0.5 / 0.4 / 0.7 / 0.8     0.56 / 0.37 / 0.28 / 0.24
Sinkhorn Trans.  1.1 / 1.6 / 2.9 / 3.8     0.55 / 0.31 / 0.21 / 0.16
Synthesizer      1.1 / 1.2 / 2.9 / 1.4     0.76 / 0.75 / 0.74 / 0.74
BigBird          0.9 / 0.8 / 1.2 / 1.1     0.91 / 0.56 / 0.40 / 0.30
Linear Trans.    1.1 / 1.9 / 3.7 / 5.6     0.44 / 0.22 / 0.15 / 0.11
Performer        1.2 / 1.9 / 3.8 / 5.7     0.44 / 0.22 / 0.15 / 0.11
Luna-16          1.2 / 1.8 / 3.7 / 5.5     0.44 / 0.23 / 0.17 / 0.10
Luna-128         1.1 / 1.7 / 3.4 / 5.1     0.49 / 0.28 / 0.21 / 0.14
Luna-256         1.1 / 1.7 / 3.3 / 4.9     0.60 / 0.33 / 0.23 / 0.16

To ensure fair comparisons, for all tasks except Retrieval we closely follow the model configurations in Tay et al. (2021), such as data preprocessing, data split and model architecture. For the Retrieval task, we find that models are not fully converged when trained for 5K steps as stated in Tay et al. (2021); we therefore train models for 20K steps on this task and obtain much better results. For a direct comparison, besides the average performance of models across all tasks, we also report the average accuracy on tasks excluding Retrieval. We run each experiment five times with different random seeds and report the average accuracy. The hyperparameters for each task are shown in Appendix A.1.

Results. The results of various models on the LRA benchmark are presented in Table 1. For our proposed method, we report results from models with three different projected dimensions (16, 128 and 256). First, we note that Luna consistently achieves good results on all tasks compared to the Transformer model, and significantly outperforms all the other baseline methods in terms of average accuracy. Looking at per-task accuracy, Luna wins over the baseline models on three out of five tasks and performs comparably with the best-performing model on the other two, i.e. ListOps and byte-level text classification.
Notably, Luna improves over the Transformer model on image classification and Pathfinder by a large margin. Second, although Luna achieves the best average performance with a projection dimension of 256, it also performs considerably well with smaller projection dimensions (16 and 128). This demonstrates the effectiveness of Luna even with small projected dimensions.

Memory and Speed Efficiency. Luna employs two nested linear attention functions to reduce the time and memory complexity compared to vanilla softmax attention. Here, we examine the speed and memory footprint of various models with varying input lengths (1K, 2K, 3K and 4K). Following Tay et al. (2021), all models are evaluated on the byte-level classification task with the same batch size. The results are shown in Table 2. In terms of memory efficiency, Luna with a projected dimension of 16 is highly memory-efficient, consuming only 10% of the memory of the vanilla Transformer at 4K input length. With larger projected dimensions, i.e. 128 and 256, Luna requires more memory but is still competitive with other efficient Transformer models. In terms of time efficiency, Luna-16 speeds up over the standard Transformer by 1.2-5.5 times, depending on the sequence length. Compared to other efficient Transformers, Luna-16 performs comparably with the fastest models, i.e. Performer and Linformer. Overall, our models are competitive in both time and memory efficiency, while attaining the best performance on the LRA benchmark (see Figure 1). In addition, we plot the trade-off among memory, speed and averaged LRA score (without Retrieval) in Figure 1. Models such as Linformer and Performer are faster and require less memory, but sacrifice performance; Luna, in contrast, retains superior performance even with a small projected dimension (l = 16) while remaining competitive in speed and memory.

Contextual information in P of Luna. A popular way to perform classification with Transformer-based models is to prepend a special symbol, [CLS], to every input example; the last hidden state of this symbol is regarded as the aggregate sequence representation. In Luna, we introduce an extra model input P which not only allows us to compute the attention mechanism efficiently but also learns contextual information. In principle, the P from the last layer is capable of representing the input sequence. To validate this, we extract P at the last layer and apply mean pooling over positions to obtain the final feature for classification. We test its performance on three long-text modeling tasks in LRA (Tay et al., 2021), i.e. ListOps, Text and Retrieval, and report results in Table 3. P-based methods obtain better scores than the [CLS]-based ones across all tasks, validating the ability of P to encode contextual information of the input sequence.

Table 3: Performance comparison of two sentence representation methods on the LRA benchmark.

Models           ListOps  Text   Retrieval  Avg.
Luna-16, [CLS]   37.43    65.74  79.38      60.85
Luna-16, P       38.06    65.81  80.22      61.36
Luna-128, [CLS]  38.01    65.74  79.55      61.10
Luna-128, P      38.27    65.89  80.27      61.48
Luna-256, [CLS]  37.98    65.78  79.56      61.11
Luna-256, P      38.36    66.07  80.25      61.56

4.2 Machine Translation

To evaluate Luna on sequence-to-sequence modeling, we conduct experiments on a standard machine translation benchmark, the WMT'14
English-German (EN→DE) dataset (4.5M sentence pairs). The data split and preprocessing steps follow those of Vaswani et al. (2017), using the scripts from FairSeq (Ott et al., 2019). We share the source and target vocabularies within the language pair, with 37K byte pair encoding (BPE) types (Sennrich et al., 2016). The Luna models closely follow the architecture of Transformer-base: 6 encoder and decoder layers with 8 attention heads and d_model/d_hidden = 512/2048. We train the Transformer-base model with two optimization methods, Adam (Kingma and Ba, 2015) and Apollo (Ma, 2020), and find that Apollo achieves better performance. Therefore, we use Apollo as the optimizer for all Luna models. For each experiment, we conduct distributed training across eight NVIDIA Tesla V100 GPUs with a maximum batch size of 8192 tokens per GPU. Further details are provided in Appendix A.2.

Table 4: Test BLEU on WMT'14 EN→DE.

Model                       BLEU  # Param.
Transformer-base (Adam)     27.8  64.9M
Transformer-base (Apollo)   28.3  64.9M
RFA (k = 256)               27.2  66.2M
Luna-16, elu, tied kv       27.1  69.6M
Luna-32, elu, tied kv       27.3  69.7M
Luna-16, softplus, tied kv  27.3  69.6M
Luna-32, softplus, tied kv  27.5  69.7M
Luna-16, elu                27.4  77.5M
Luna-32, elu                27.6  77.6M
Luna-16, softplus           27.6  77.5M
Luna-32, softplus           27.8  77.6M

Results. Table 4 presents the test BLEU scores of Luna on WMT'14 EN→DE, along with Transformer-base and Random Feature Attention (RFA) as baselines, together with the number of parameters of each model. Different from Peng et al. (2021), where random feature attention is applied only to the decoder, the RFA model in Table 4 applies random feature attention in both the encoder and decoder for a fair comparison; k = 256 is the number of feature maps in RFA. For Luna, we report the performance of models with different projected lengths (l = 16 and l = 32), different activation functions ω for the causal attention (§3.3), i.e. elu(·) + 1 and softplus(·), and with or without parameter sharing. From Table 4, the first observation is that softplus(·) consistently outperforms elu(·) + 1; we therefore use softplus(·) as the default activation function in the implementation. Another interesting observation is that Luna with a small projected length (l = 16) obtains performance similar to RFA with k = 256 feature maps. Luna with l = 32 achieves competitive performance, but still falls behind the Transformer-base model. Further improving the machine translation performance of Luna is left to future work. Finally, we evaluate Luna with and without parameter sharing. Although there are two sets of parameters (W_Q, W_K, W_V) in a single Luna attention function, as mentioned in §3.1, we tie W_K with W_V to reduce the number of parameters, and the performance decline is marginal. As a result, Luna with shared parameters has only 7% and 5% more parameters than the vanilla Transformer and RFA models, respectively.

Effect of Encoding Contextual Information into P. As discussed in §3.4, one advantage of Luna over Linformer is that it incorporates contextual information into P by formulating it as an extra input. To investigate the importance of this design, we conduct experiments on WMT'14 that compare Luna with a baseline model in which P is formulated as a non-contextual learnable parameter of each layer. For both the contextual and non-contextual models, we train Luna with l = 16, parameter sharing and softplus. Table 5 lists the BLEU scores on the development and test sets.

Table 5: Dev and Test BLEU on WMT'14 EN→DE with contextual and non-contextual P.

Model           Dev.  Test
Non-Contextual  24.4  25.2
Contextual      25.9  27.3
Luna with contextual P significantly outperforms the baseline with non-contextual P, demonstrating the effectiveness of this design in Luna.

4.3 Masked Language Modeling for Large-Scale Pretraining

One popular application of Transformers is to pretrain a large-scale language model on a large amount of data and then fine-tune it on a wide range of downstream tasks, e.g. BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). We therefore pretrain a Luna-based language model with the RoBERTa-base model configuration on two versions of pretraining data: 1) the BERT version with BookCorpus (Zhu et al., 2015) and English Wikipedia (16GB in total); 2) the RoBERTa version with BookCorpus, English Wikipedia, CC-News (Nagel, 2016), OpenWebText (Gokaslan and Cohen, 2019) and Stories (Trinh and Le, 2018) (160GB in total). For Luna models, we set l = 128. On the larger training corpus (160GB), we train models with and without parameter sharing. We compare our models with RoBERTa-base, BERT-base and Linformer, all trained on the same training data. Experimental details are provided in Appendix A.3.

Finetuning Luna. After obtaining the pretrained Luna-based language model, we finetune it on various natural language processing tasks, including sentiment classification (SST-2; Socher et al., 2013), natural language inference (QNLI; Rajpurkar et al., 2016), textual similarity (QQP; Chen et al., 2018) and question answering (RACE; Lai et al., 2017, and CommonsenseQA, CSQA; Talmor et al., 2019). For GLUE tasks, following Liu et al. (2019), we consider a limited hyperparameter sweep for each task, with batch sizes in {16, 32} and learning rates in {5e-6, 1e-5, 2e-5}, with a linear warmup over the first 6% of steps followed by a linear decay to 0. Finetuning is performed for 20 epochs with early stopping based on each task's evaluation metric on the dev set³. For QA tasks, we concatenate each candidate answer with the corresponding question and passage. We then encode every candidate and pass the [CLS] output at the last layer through a fully-connected layer, which is used to predict the correct answer. We truncate question-answer pairs that are longer than 128 tokens and, if needed, the passage, so that the total length is at most 512 tokens. Following Liu et al. (2019), we try a small range of values for the hyperparameters, i.e. batch size in {16, 32}, learning rate in {1e-5, 2e-5, 3e-5} and dropout in {0.0, 0.1, 0.2}. For other configurations, such as warmup steps and the optimizer, we follow those in Liu et al. (2019).

The results are reported in Table 6. We observe that on the smaller dataset (16GB), our Luna model achieves similar or slightly better downstream results compared to the other pretrained language models. On QNLI and SST-2, Luna obtains the best performance among all models, demonstrating its strong ability to learn language representations. On the larger dataset (160GB), however, the performance of Luna is slightly worse than RoBERTa with the vanilla Transformer architecture. One possible reason is that the capacity of Luna is lower than that of the vanilla Transformer, due to the efficient attention mechanism. This is supported by the observation that Luna with full sets of parameters achieves better performance than Luna with parameter sharing, since the former has higher capacity.

³ We observed that Luna finetuning requires more epochs than the vanilla Transformer (20 vs. 10). We also finetuned RoBERTa for 20 epochs but did not obtain better results.

Table 6: Performance of various models on the development sets of benchmark natural language understanding tasks.

Model              data   SST-2  QNLI  QQP   RACE   CSQA
BERT-base          16GB   92.7   88.4  89.6  64.2   –
RoBERTa-base       16GB   93.1   90.9  90.9  65.6   53.3
Linformer-128      16GB   92.4   90.4  90.2  –      –
Luna-128, tied kv  16GB   93.1   91.2  90.8  65.2   53.1
RoBERTa-base       160GB  94.8   92.8  91.9  73.50  63.61
Luna-128, tied kv  160GB  94.3   91.5  91.2  71.50  61.48
Luna-128           160GB  94.6   92.2  91.3  72.25  62.08
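To make the GLUE finetuning sweep described above concrete, the following short sketch enumerates the grid and the linear warmup/decay schedule from the text; the helper is an illustrative placeholder under those stated settings, not the released finetuning scripts.

```python
import itertools

# Grid from the text: batch size in {16, 32}, learning rate in {5e-6, 1e-5, 2e-5},
# up to 20 epochs, linear warmup over the first 6% of updates, then linear decay to 0.
BATCH_SIZES = [16, 32]
LEARNING_RATES = [5e-6, 1e-5, 2e-5]
MAX_EPOCHS = 20
WARMUP_FRACTION = 0.06

def lr_at(step: int, total_steps: int, peak_lr: float) -> float:
    """Learning rate at a given update: linear warmup for 6% of steps, then linear decay to 0."""
    warmup = max(1, int(WARMUP_FRACTION * total_steps))
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup))

# The six configurations tried for each GLUE task (the best dev-set metric is kept).
for batch_size, peak_lr in itertools.product(BATCH_SIZES, LEARNING_RATES):
    print(f"batch_size={batch_size}, peak_lr={peak_lr:.0e}, "
          f"lr at mid-training={lr_at(5000, 10000, peak_lr):.2e}")
```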
5 Related Work

There has been significant prior work on improving the efficiency of Transformers, beyond the three closely related works discussed in §3.4. Common techniques include, but are not limited to, weight sharing (Dehghani et al., 2018), quantization (Shen et al., 2020; Fan et al., 2020), sparse attention (Parmar et al., 2018; Kitaev et al., 2020), side memory modules (Lee et al., 2019; Gupta and Berant, 2020; Goyal et al., 2021), and low-rank or compressed context (Wang et al., 2020; Ainslie et al., 2020). In this section, we briefly review some recently proposed methods; for a detailed overview we refer the reader to Tay et al. (2020b).

Sparse Attention. The general idea of these methods is that, instead of attending to the whole sequence, each token only attends to a fixed, predefined range, such as local neighborhoods and strided or "dilated" windows. Popular methods include local attention (Parmar et al., 2018), blockwise attention (Qiu et al., 2019), strided attention patterns (Child et al., 2019; Beltagy et al., 2020), and compressed attention (Liu et al., 2018). To make this range more flexible, Reformer (Kitaev et al., 2020) employs a hash-based similarity measure to efficiently cluster tokens into chunks, and Routing Transformer (Roy et al., 2021) employs online k-means clustering on the tokens. The Sinkhorn sorting network (Tay et al., 2020a) exposes the sparsity in attention weights by learning to sort blocks of the input sequence.

Kernel Methods. A recently popular way to improve the efficiency of Transformers is to avoid explicitly computing the n × m attention matrix A in (1) by re-writing it with kernels. Typical models leveraging kernelization are Linear Transformer (Katharopoulos et al., 2020), Performer (Choromanski et al., 2020) and Random Feature Attention (Peng et al., 2021). Since kernels are a form of approximation of the attention matrix, they can also be viewed as a form of low-rank method (Choromanski et al., 2020) that compresses the context to a shorter length, like Linformer (Wang et al., 2020) and the proposed Luna model.

Recurrence. The simplest technique to reduce the complexity of Transformers is to chunk input sequences into fixed blocks, with the obvious disadvantage of losing contextual information from past chunks. As discussed in Tay et al. (2020b), these models can be regarded as fixed-pattern models. Transformer-XL (Dai et al., 2019) proposed a natural extension of the blockwise method that connects the blocks via a recurrence mechanism. Compressive Transformer (Rae et al., 2020) further extends Transformer-XL by maintaining a fine-grained memory of past chunk activations, which are discarded in Transformer-XL. Technically, Luna can be adapted into a recurrent method by simply using P as an inherent memory module that maintains the recurrence across segments.
6 Conclusion

We have introduced Luna, a simple, efficient and effective linear attention mechanism that can be used as a drop-in substitute for regular softmax attention. By introducing an extra input with a fixed length, Luna is capable of capturing adequate contextual information while performing the attention operation linearly. On three sequence modeling tasks, i.e. long-context sequence modeling, neural machine translation, and large-scale pretraining and finetuning, Luna achieves comparable or even better performance than a variety of strong baselines, while attaining notable efficiency gains in both speed and memory. In future work, we are interested in combining Luna with recurrence methods, where P can be used as a running memory across segments of inputs. Another interesting direction would be to apply Luna to other tasks with long input sequences, such as document-level summarization and translation.

Acknowledgments and Disclosure of Funding

This material is based on research sponsored by the Air Force Research Laboratory (AFRL) under agreement number FA8750-19-1-1000. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation therein. Xiang Kong was supported by U.S. DARPA AIDA Program No. FA8750-18-2-0014. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory, DARPA or the U.S. Government.

References

Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. ETC: Encoding long and structured inputs in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 268–284, 2020.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), 2015.

Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.

Zihan Chen, Hongbo Zhang, Xiaoji Zhang, and Leqi Zhao. Quora question pairs. University of Waterloo, 2018.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.

Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations (ICLR), 2016.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, 2019.

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In International Conference on Learning Representations (ICLR), 2018.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Remi Gribonval, Herve Jegou, and Armand Joulin. Training with quantization noise for extreme fixed-point compression. arXiv preprint arXiv:2004.07320, 2020.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323. JMLR Workshop and Conference Proceedings, 2011.

Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. URL: https://skylion007.github.io/OpenWebTextCorpus, 2019.

Anirudh Goyal, Aniket Didolkar, Alex Lamb, Kartikeya Badola, Nan Rosemary Ke, Nasim Rahaman, Jonathan Binas, Charles Blundell, Michael Mozer, and Yoshua Bengio. Coordination among neural modules through a shared global workspace. arXiv preprint arXiv:2103.01197, 2021.

Ankit Gupta and Jonathan Berant. GMAT: Global memory augmentation for transformers. arXiv preprint arXiv:2006.03274, 2020.

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.

Alex Krizhevsky et al. Learning multiple layers of features from tiny images. Technical Report, University of Toronto, 2009.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, 2017.

Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pages 3744–3753. PMLR, 2019.

Drew Linsley, Junkyung Kim, Vijay Veerabadran, Charles Windolf, and Thomas Serre. Learning long-range spatial dependencies with horizontal gated recurrent units. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/ec8956637a99787bd197eacd77acce5e-Paper.pdf.

Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating Wikipedia by summarizing long sequences. In International Conference on Learning Representations (ICLR), 2018.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov.
RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

Xuezhe Ma. Apollo: An adaptive parameter-wise diagonal quasi-Newton method for nonconvex stochastic optimization. arXiv preprint arXiv:2009.13586, 2020.

Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, 2011.

Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R Eguchi, Possu Huang, and Richard Socher. ProGen: Language modeling for protein generation. bioRxiv, 2020.

Sebastian Nagel. CC-News. URL: http://web.archive.org/save/http://commoncrawl.org/2016/10/newsdatasetavailable, 2016.

Nikita Nangia and Samuel Bowman. ListOps: A diagnostic dataset for latent tree learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 92–99, 2018.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 1–9, 2018.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.

Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International Conference on Machine Learning, pages 4055–4064. PMLR, 2018.

Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong. Random feature attention. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=QtTKTdVrFBB.

Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, and Jie Tang. Blockwise self-attention for long document understanding. arXiv preprint arXiv:1911.02972, 2019.

Dragomir R Radev, Pradeep Muthukrishnan, Vahed Qazvinian, and Amjad Abu-Jbara. The ACL anthology network corpus. Language Resources and Evaluation, 47(4):919–944, 2013.

Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, Chloe Hillier, and Timothy P Lillicrap. Compressive transformers for long-range sequence modeling. In International Conference on Learning Representations (ICLR), 2020.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, 2016.

Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, 2016.

Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Q-BERT: Hessian based ultra low precision quantization of BERT.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8815–8821, 2020.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, 2019.

Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. Sparse Sinkhorn attention. In International Conference on Machine Learning, pages 9438–9447. PMLR, 2020a.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732, 2020b.

Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long Range Arena: A benchmark for efficient transformers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=qVyeW-grC2k.

Trieu H Trinh and Quoc V Le. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847, 2018.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, et al. Tensor2Tensor for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 193–199, 2018.

Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F Wong, and Lidia S Chao. Learning deep transformer models for machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1810–1822, 2019.

Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.

Yingce Xia, Tianyu He, Xu Tan, Fei Tian, Di He, and Tao Qin. Tied transformers: Neural machine translation with shared encoder and decoder. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5466–5473, 2019.

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the Transformer architecture. In International Conference on Machine Learning, pages 10524–10533. PMLR, 2020.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27, 2015.
Appendix: Luna: Linear Unified Nested Attention

A Experimental Details

A.1 Long-Context Sequence Modeling

For all tasks except Retrieval, we closely follow the model configurations in Tay et al. (2021), such as data preprocessing, data split, model architecture and batch size. To guarantee convergence, we train models on the Retrieval task for 20k steps instead of the 5k steps prescribed in Tay et al. (2021). The hyperparameters of the models on these tasks are listed in Table 7. We mainly tune three hyperparameters: learning rate, dropout and attention dropout. For the other main hyperparameters, such as batch size, number of layers and number of warmup steps, we follow the guidance of Tay et al. (2021).

Table 7: Hyperparameters of models on the LRA tasks. LR and Attn-Dropout denote the learning rate and attention dropout.

Tasks       LR    Dropout  Attn-Dropout
ListOps     1e-4  0.1      0.1
Text        5e-5  0.3      0.3
Retrieval   5e-5  0.1      0.1
Image       5e-3  0.1      0.3
Pathfinder  1e-3  0.2      0.1

A.2 Neural Machine Translation

Our experiments on WMT 2014 English-German are based on the Transformer-base model (Vaswani et al., 2017), with the implementation from the FairSeq package (Ott et al., 2019). This dataset contains 4.5M parallel sentence pairs for training. We follow the standard setting (Vaswani et al., 2017), using Newstest2013 as the validation set and Newstest2014 as the test set. The dataset is pre-processed following Ma (2020), using the scripts from the FairSeq package⁴. Specifically, we use 512-dimensional word embeddings and 6-layer encoders/decoders with 8 attention heads and 2048 feed-forward dimensions. We apply label smoothing of 0.1 (Szegedy et al., 2016) and perform 500,000 updates in total to train each model. For Adam, we use a starting learning rate of 0.0005, set β = (0.9, 0.98), and apply the decoupled weight decay technique (AdamW) (Loshchilov and Hutter, 2019). For all models trained with Apollo, we set the learning rate to 0.1, β = 0.9 and ε = 1e-4. For learning rate scheduling, we apply a linear warmup of the learning rate for both Adam and Apollo: 4000 updates for Adam and 1000 updates for Apollo. After the warmup, we apply the inverse square root decay (Vaswani et al., 2017) to Adam; for Apollo, following Ma (2020), we decay the learning rate at 300,000 and 450,000 updates with decay rate 0.1. Gradient clipping at 1.0 is applied to all optimization methods, and the dropout ratio is set to 0.1. Weight decay rates are 1e-4 for Adam and 1e-8 for Apollo. The decoding beam size is set to 5, and the checkpoints of the last 10 epochs are averaged before evaluation. For each experiment, we conducted distributed training across eight NVIDIA Tesla V100 GPUs with a maximum batch size of 8192 tokens per GPU (8192 × 8 tokens per batch in total).

⁴ https://github.com/pytorch/fairseq

A.3 Masked Language Modeling for Large-Scale Pretraining and Finetuning

We pre-trained all the models on 64 Tesla V100 GPUs with the standard masked-language-modeling (MLM) objective and two pre-training corpora: (i) the BERT version with BookCorpus (Zhu et al., 2015) and English Wikipedia (16GB in total); (ii) the RoBERTa version with BookCorpus, English Wikipedia, CC-News (Nagel, 2016), OpenWebText (Gokaslan and Cohen, 2019) and Stories (Trinh and Le, 2018) (160GB in total). We use the standard Adam optimizer with a linear-decay learning rate scheduler. Table 8 describes the hyperparameters for pre-training the Luna-128 model.
For the finetuning stage, we closely follow the training configurations used in the released RoBERTa finetuning scripts for the different tasks; the main hyperparameters are listed in Table 9.

Table 8: Hyperparameters for pre-training Luna-128 on the two public corpora.

Hyperparameter         Luna (16GB)  Luna (160GB)
Number of Layers       12           12
Hidden size            768          768
FFN inner hidden size  3072         3072
Attention heads        12           12
Attention head size    64           64
Dropout                0.1          0.1
Attention Dropout      0.1          0.1
Warmup Steps           15k          24k
Peak Learning Rate     6e-4         6e-4
Batch Size             2k           8k
Weight Decay           0.01         0.01
Max Steps              250k         500k
Learning Rate Decay    Linear       Linear
Adam ε                 1e-6         1e-6
Adam β1                0.9          0.9
Adam β2                0.98         0.98
Gradient Clipping      1.0          1.0
Projection Length      128          128

Table 9: Hyperparameters for finetuning Luna on GLUE, RACE and CSQA.

Hyperparameter       GLUE    RACE   CSQA
Learning Rate        1e-5    1e-5   1e-5
Batch Size           32      64     64
Weight Decay         0.1     0.01   0.01
Max Epochs           20      20     20
Learning Rate Decay  Linear  Fixed  Polynomial
Warmup Steps         6%      150    150
Dropout              0.1     0.1    0.2
Attention Dropout    0.1     0.1    0.0
Activation Dropout   0.1     0.0    0.1