Adaptive Multi-Resolution Attention with Linear Complexity

Yao Zhang∗,1, Yunpu Ma∗,1, Thomas Seidl1, Volker Tresp1,2
1 Institute of Informatics, LMU Munich, 2 Corporate Technology, Siemens AG
Yao.Zhang@campus.lmu.de, yunpu.ma@dbs.ifi.lmu.de, seidl@dbs.ifi.lmu.de, volker.tresp@siemens.com
∗ These authors contributed equally to this work.

arXiv:2108.04962v1 [cs.LG] 10 Aug 2021. Preprint. Under review.

Abstract

Transformers have improved the state-of-the-art across numerous tasks in sequence modeling. Besides the quadratic computational and memory complexity w.r.t. the sequence length, the self-attention mechanism only processes information at the same scale, i.e., all attention heads are in the same resolution, resulting in the limited power of the Transformer. To remedy this, we propose a novel and efficient structure named Adaptive Multi-Resolution Attention (AdaMRA for short), which scales linearly with the sequence length in terms of time and space. Specifically, we leverage a multi-resolution multi-head attention mechanism, enabling attention heads to capture long-range contextual information in a coarse-to-fine fashion. Moreover, to capture the potential relations between the query representation and clues of different attention granularities, we leave the decision of which resolution of attention to use to the query, which further improves the model's capacity compared to the vanilla Transformer. In an effort to reduce complexity, we adopt kernel attention without degrading the performance. Extensive experiments on several benchmarks demonstrate the effectiveness and efficiency of our model, which achieves a state-of-the-art speed-memory-accuracy trade-off. To facilitate the use of AdaMRA by the scientific community, the code implementation will be made publicly available.

1 Introduction

The recent emergence of the Transformer has drastically reshaped the landscape of natural language processing research. Transformers have demonstrated superior performance in a wide variety of tasks, such as machine translation [Vaswani et al., 2017], natural language inference [Williams et al., 2017], text classification [Howard and Ruder, 2018], question answering [Rajpurkar et al., 2016], automatic speech recognition [Dong et al., 2018], image generation [Parmar et al., 2018] and image captioning [Xu et al., 2015]. The key innovation in Transformers is the introduction of a multi-head self-attention mechanism, which models pairwise interactions of the input sequence regardless of the distance between tokens. This operation has been shown to be quite effective. Nonetheless, despite several notable successes of Transformers, computing the attention matrix, which is their key component, turns out to be a major efficiency bottleneck due to its quadratic time and space complexity with respect to the sequence length. Therefore, the maximum sequence length is restricted by the amount of memory available. This inherent limitation of Transformers has prevented them from being successfully applied to domains requiring longer sequence lengths, like document classification. Further, building large Transformer-based models in practice is notoriously expensive. Although the fine-tuning stage is relatively inexpensive, the memory issue still restricts the scenarios in which these models can be used. Besides the computational cost, qualitative analysis
of attention heads [Vaswani et al., 2017] suggests that heads tend to favor flatter or more peaked distributions depending on what phenomena they capture. Thus, using an extremely long sequence may limit the power of the model.

To this end, a wide spectrum of efficient, fast Transformers has been proposed to tackle these limitations. For instance, [Beltagy et al., 2020, Wang et al., 2020a, Tay et al., 2020a, Child et al., 2019, Zaheer et al., 2020, Correia et al., 2019, Qiu et al., 2019, Roy et al., 2021, Kitaev et al., 2020, Zhao et al., 2019, Vyas et al., 2020] address the problematic complexity by limiting the number of keys that each query attends to. However, these methods either break long-term dependencies or hurt time efficiency. There is also a long line of research on using a dense attention matrix defined by low-rank kernels substituting the softmax [Katharopoulos et al., 2020, Choromanski et al., 2020, Xiong et al., 2021, Peng et al., 2021]. Although these approaches achieve a better speed-memory-accuracy trade-off, they still suffer from the aforementioned limitations of the self-attention mechanism. Another prominent line of work is to increase the memory capacity [Sukhbaatar et al., 2019, Ye et al., 2019, Rae et al., 2019]. However, these works still process information at the same scale.

[Figure 1: Trade-off between accuracy (LRA score, y-axis), model speed (examples per sec, x-axis) and memory footprint (size of circle) for AdaMRA (ours), BigBird, Nyströmformer, Linear Transformer, Transformer, Linformer, Reformer and Performer. Our AdaMRA achieves a state-of-the-art speed-memory-accuracy trade-off.]

Besides the computational cost, the self-attention mechanism of all models mentioned above only processes information at a single scale, i.e., all attention heads are in the same resolution. However, inspired by the fact that the information in most domains has a hierarchical structure, for instance, word- and sentence-level information in the domain of text/language, or low- and high-level features in the domain of images, we suggest that processing information in a coarse-to-fine fashion could be beneficial for capturing hierarchically structured information. We thus propose Adaptive Multi-Resolution Attention (AdaMRA), a linear time and space attention mechanism that captures long-distance dependencies in a coarse-to-fine manner. To be more precise, unlike the vanilla Transformer, which maintains a constant resolution throughout all attention heads, AdaMRA employs multi-resolution attention heads that vary in their level of abstraction. Moreover, to capture the potential relations between the query representation and clues of different attention granularities, each query is routed to the corresponding attention head. Furthermore, we adopt kernel attention [Katharopoulos et al., 2020] without sacrificing performance.

We evaluate the proposed method on the Long-Range Arena (LRA) benchmark [Tay et al., 2020b] and show that AdaMRA achieves promising performance while having a linear computational complexity with respect to the sequence length. Impressively, as shown in Figure 1, the average LRA score increases by 4.32 and 3.66 points over the vanilla Transformer and the previous best performing model BigBird, respectively. In terms of time and space efficiency, AdaMRA is around 10 times faster than the vanilla Transformer on GPU, while occupying 5 times less GPU memory.
2 Related Work

In this section, we briefly review the most relevant works that aim to address the Transformers' large memory and computational requirements.

Efficient Self-Attention A conceptual way of reducing the complexity of the full attention is to limit the number of elements accessible to the attention. [Qiu et al., 2019, Zaheer et al., 2020, Beltagy et al., 2020, Child et al., 2019] achieve this by using fixed, predefined patterns such as local windows and block patterns of fixed stride. Another line of work is to consider which part of the inputs should be attended to by learning to assign tokens to buckets or clusters before performing attention. [Kitaev et al., 2020] uses locality-sensitive hashing to group tokens together, [Roy et al., 2021, Vyas et al., 2020] employ online k-means to learn the space-partitioning centroids, and [Tay et al., 2020a] sorts keys in a block-wise fashion. However, these methods may lack the flexibility to look at the full sequence and thus restrict the model's capacity to capture long-distance dependencies. Moreover, additional computation steps required by some approaches (e.g., LSH in [Kitaev et al., 2020]) might undermine their final efficiency gains. Unlike these works, our method uniquely incorporates pooling-based compression to capture context information of different scales with only a small additional computation budget while maintaining excellent performance.

Kernel Attention Another method is to improve efficiency by leveraging low-rank approximations of the softmax attention matrix. Katharopoulos et al. [Katharopoulos et al., 2020] interpret the softmax as a kernel and approximate the attention matrix via kernel approximation. Subsequently, this strategy is also employed by [Xiong et al., 2021, Peng et al., 2021, Choromanski et al., 2020]. In [Xiong et al., 2021], the approximation of standard softmax attention is based on adapting the Nyström method, while [Peng et al., 2021, Choromanski et al., 2020] leverage random feature methods to approximate the softmax function. Although these approaches achieve a better speed-memory-accuracy trade-off, their performance is still affected by the quality of the approximation and by the fully-connected nature of self-attention in the Transformer, which, as suggested by [Guo et al., 2019], is not a good inductive bias.

Increasing Memory Capacity Memory is crucial for many tasks. However, extending the memory span is computationally expensive due to the attention mechanism's quadratic time and space complexity. Several recent works have proposed strategies to increase the memory capacity of Transformers. BP-Transformer [Ye et al., 2019] is designed to incorporate the common-sense inductive bias of the hierarchical linguistic structure within the sentence, i.e., each query attends to context information from fine-grained to coarse-grained as the relative distance increases. [Rae et al., 2019] uses a pooling operator (e.g., max/mean pooling) to reduce the number of memories in the past, where all memories are equally compressed regardless of the content of the current query. In [Sukhbaatar et al., 2019], each attention head separately learns its temporal context size from data. The works mentioned above focus on increasing memory capacity without actually changing the memory resolution. Our work differs from theirs in that we focus on capturing long-term dependencies in a multi-resolution fashion, which, in turn, indirectly reduces our model's memory footprint.
3 Model

In this section, we formalize the proposed method. In Section 3.1, we first briefly revisit the attention mechanism and present its computational complexity. We then introduce AdaMRA in Section 3.2. An interpretation of AdaMRA is provided in Section 3.3. We close with a practical analysis of AdaMRA's complexity in Section 3.4.

3.1 Revisiting Self-Attention and its Linearization

The self-attention function calculates, for every token, a weighted average of the feature representations of all other tokens with a weight proportional to a normalized similarity score between representations. Formally, let X = {x^(1), ..., x^(n)} ∈ R^{n×d} denote an input sequence comprising n tokens of dimension d. Given three matrices Q, K and V, which are linear projections of the layer's input X,

    Q = X W^Q,   K = X W^K,   V = X W^V,                                                        (1)

where Q, K, V ∈ R^{n×d} and W^Q, W^K, W^V ∈ R^{d×d}. Following common terminology, Q, K, and V are referred to as the queries, keys, and values, respectively. The keys are used to compute a similarity score between each item and each query; the values of each item are then weighted by the normalized similarity score, and the attention outputs, for each query, the weighted sum of the values. Thus, the generalized attention function for any similarity function can be written as:

    Attention(Q_i, K, V) = Score(Q_i, K, V) = \frac{\sum_{j=1}^{n} sim(Q_i, K_j) V_j}{\sum_{j=1}^{n} sim(Q_i, K_j)}        (2)

[Figure 2: The architecture of AdaMRA. In this example, the number of heads H = 3, the number of subheads S = 2, the sequence length n = 6 and the compression rates c = (1/1, 1/2, 1/3). The top right shows the construction of the multi-resolution memory/context (coarse-, medium- and fine-grained keys K̃^h and values Ṽ^h via segment means); the bottom right diagrams 6 tokens being routed to the heads by the router.]

According to [Vaswani et al., 2017], the unified similarity function can take the form of the softmax. Therefore, the quadratic complexity emerges from the computation of the similarity score between every pair of tokens. In order to define an attention function, sim(·) in Eq. 2 needs to be a non-negative function, which includes all kernels k(x, y): R^F × R^F → R_+ [Katharopoulos et al., 2020]. Given such a kernel with a feature representation φ(x), we can rewrite the generalized attention function (Eq. 2) as follows,

    Attention(Q_i, K, V) = \frac{\sum_{j=1}^{n} \phi(Q_i)^T \phi(K_j) V_j^T}{\sum_{j=1}^{n} \phi(Q_i)^T \phi(K_j)} = \frac{\phi(Q_i)^T \sum_{j=1}^{n} \phi(K_j) V_j^T}{\phi(Q_i)^T \sum_{j=1}^{n} \phi(K_j)},        (3)

where \sum_{j=1}^{n} \phi(K_j) V_j^T and \sum_{j=1}^{n} \phi(K_j) can be reused for every query, therefore reducing the complexity from quadratic to linear both in terms of memory and computation.

Considering the fact that different parts of a sequence may be relevant in different ways, multi-head attention was introduced in the Transformer. Assuming there are H attention heads, this is simply the application of Eq. 2 in parallel H times, each with a different, learned linear transformation that allows specialization:

    MultiHead(Q, K, V) = Concat(Head_1, ..., Head_H) W^O                                         (4)
    Head_h(Q, K, V) = Attention(Q W_h^Q, K W_h^K, V W_h^V)                                       (5)

where W_h^Q, W_h^K ∈ R^{d×d_k}, W_h^V ∈ R^{d×d_v} are the matrices that project the queries, keys, and values into the h-th subspace, respectively; W^O ∈ R^{H d_v × d} is the matrix that computes a linear transformation of the heads, with, typically, H d_v = H d_k = d.
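To make the linearization in Eq. 3 concrete, the following is a minimal PyTorch sketch of kernelized attention (an illustration, not the authors' released code); the function names are hypothetical, and the elu + 1 feature map is one common choice from [Katharopoulos et al., 2020], whereas AdaMRA itself later uses ReLU as φ (Section 3.2).

```python
import torch

def elu_feature_map(x):
    # One common non-negative feature map phi(x) = elu(x) + 1; the paper itself uses ReLU.
    return torch.nn.functional.elu(x) + 1

def linear_attention(Q, K, V, phi=elu_feature_map, eps=1e-6):
    """Kernelized attention of Eq. 3: the sums over keys are shared by all queries,
    so time and memory scale linearly in the sequence length instead of quadratically.

    Q: (n, d_k) queries, K: (m, d_k) keys, V: (m, d_v) values.
    """
    Qf, Kf = phi(Q), phi(K)                    # non-negative features
    KV = Kf.transpose(0, 1) @ V                # (d_k, d_v): sum_j phi(K_j) V_j^T, reused for every query
    Z = Kf.sum(dim=0)                          # (d_k,):     sum_j phi(K_j), reused for every query
    num = Qf @ KV                              # (n, d_v)
    den = (Qf * Z).sum(dim=-1, keepdim=True)   # (n, 1):     phi(Q_i)^T sum_j phi(K_j)
    return num / (den + eps)

# Usage: 4096 tokens with 64-dimensional heads; no n x n attention matrix is ever formed.
Q, K, V = (torch.randn(4096, 64) for _ in range(3))
out = linear_attention(Q, K, V)                # (4096, 64)
```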
3.2 Adaptive Multi-Resolution Attention

To capture hierarchically structured information effectively, we propose AdaMRA. The main idea is to employ multi-resolution attention heads in a coarse-to-fine fashion and enable each query to choose between different resolutions of attention. This process is done independently for each layer, allowing queries in different layers to attend to contexts of different resolutions. We describe AdaMRA in the context of a single Transformer layer and omit the layer index for brevity.

In AdaMRA, the input sequence X ∈ R^{n×d} still passes through three linear layers to form the queries Q ∈ R^{n×d}, keys K ∈ R^{n×d}, and values V ∈ R^{n×d}, where n is the sequence length and d is the embedding dimension. For each attention head h, we define a compression rate c_h; a higher value indicates more fine-grained compressed information. To encode the context information, we produce compressed keys and values using a compressive operation, which can be selected from k-means clustering [Vyas et al., 2020, Roy et al., 2021], projection [Wang et al., 2020b], convolution [Rae et al., 2019], etc. For the sake of computational efficiency, we employ segment means [Xiong et al., 2021] to compress the original (n × d)-dimensional K and V into (m_h × d)-dimensional K̃^h and Ṽ^h, where h denotes the h-th head and m_h = n c_h is the number of landmarks of head h. To be more precise, given the compression rate c_h for head h, we separate the n keys/values into m_h segments. In our experiments, n is divisible by m_h. If this is not the case in practice, we can pad the inputs to a length divisible by m_h. Note that to obtain multi-resolution attention, the compression rate of each head is different. Such a compression strategy not only guarantees the preservation of important information, but also simplifies the model, since c_h is usually a small number. More details regarding the impact of the compression rate are provided in Section 4.

With the compressed keys K̃^h and values Ṽ^h at hand, the attention heads are now in different resolutions. To capture the potential relations between the query representation and clues of different attention granularities, we let the query itself choose which resolution of attention to use, conditioned on the information encoded in the query's representation. This is accomplished by adding a Router [Shazeer et al., 2017] before the attention layer (see Figure 2). The router takes a token representation q_i as input and routes it to the best-determined expert, i.e., attention head. Specifically, we adopt F(·), a parameterized function, for projecting the query q_i from d dimensions to H dimensions, where H is the number of heads. We then normalize this value via a softmax. Each query is routed to the head with the highest router probability P. In practice, we mask out the tokens that are not routed to the current head. Formally,

    P = softmax(F(Q)),   F(Q) = QW,                                                              (6)

where F(·) is a parameterized function, Q ∈ R^{n×d}, the learnable parameter W ∈ R^{d×H}, and the router probability P ∈ R^{n×H}.

Finally, in pursuit of efficiency, we adopt kernel attention [Katharopoulos et al., 2020] and calculate attention using Eq. 3,

    Attention(Q_i^h, \tilde{K}^h, \tilde{V}^h) = \frac{\sum_{j=1}^{m_h} sim(Q_i^h, \tilde{K}_j^h) \tilde{V}_j^h}{\sum_{j=1}^{m_h} sim(Q_i^h, \tilde{K}_j^h)} = \frac{\phi(Q_i^h)^T \sum_{j=1}^{m_h} \phi(\tilde{K}_j^h)(\tilde{V}_j^h)^T}{\phi(Q_i^h)^T \sum_{j=1}^{m_h} \phi(\tilde{K}_j^h)},        (7)

where Q_i^h is the i-th query that is routed to the h-th head, and K̃^h and Ṽ^h are the compressed keys and values of the h-th head.
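The two ingredients above, segment-means compression into per-head landmarks K̃^h, Ṽ^h and the router of Eq. 6, can be sketched in a few lines of PyTorch. This is a simplified illustration under the assumption that n is divisible by every m_h; the helper names and toy tensors are hypothetical, not taken from the released implementation.

```python
import torch

def segment_means(x, m):
    """Compress (n, d) keys or values into (m, d) landmarks by averaging segments of n/m tokens."""
    n, d = x.shape
    assert n % m == 0, "pad the sequence so that n is divisible by m"
    return x.view(m, n // m, d).mean(dim=1)

def route_queries(Q, W_router):
    """Eq. 6: P = softmax(Q W); each query goes to the head with the highest probability."""
    P = torch.softmax(Q @ W_router, dim=-1)   # (n, H) router probabilities
    return P.argmax(dim=-1)                   # (n,) chosen head index per query

# Toy setting mirroring Figure 2: n = 6 tokens, H = 3 heads, compression rates (1/1, 1/2, 1/3).
n, d, H = 6, 8, 3
rates = [1.0, 1 / 2, 1 / 3]
X = torch.randn(n, d)
K, V = X @ torch.randn(d, d), X @ torch.randn(d, d)

# Multi-resolution memories: head h keeps m_h = n * c_h landmark keys/values.
memories = [(segment_means(K, round(n * c)), segment_means(V, round(n * c))) for c in rates]

# Each query is routed to exactly one head and attends only to that head's compressed memory.
heads = route_queries(X @ torch.randn(d, d), torch.randn(d, H))
print([k.shape for k, _ in memories], heads)
```

In a full layer, the routing assignment acts as a mask: the kernel attention of Eq. 7 is evaluated for each head only over the queries assigned to it, against that head's landmarks.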
For our experiments, we employ ReLU as the feature function φ (see Section 4.3 for the feature function analysis). As suggested by recent works on interpreting attention head roles, separate attention heads may learn to look for various relationships between tokens [Voita et al., 2019]. Thus, in practice, we use the same strategy and split each head into multiple subheads, whose resolution is the same as that of the original head, allowing the model to jointly attend to information at different positions from different representation subspaces. For our experiments, all attention heads have the same number of subheads. Multi-head AdaMRA is thus defined as:

    AdaMRA(Q, K, V) = \Big( \sum_{h=1}^{H} Head_h \Big) W^O,   Head_h = Concat(subhead_{h_1}, ..., subhead_{h_S}),        (8)

where Q, K, V ∈ R^{n×d}, W^O ∈ R^{S d_v × d} is a learned matrix, d_v = d/S is the hidden dimension of the projection subspace, H is the number of heads, S is the number of subheads, and h_S denotes the S-th subhead of the h-th head. The s-th subhead of the h-th head is defined as:

    subhead_{h_s} = Attention(Q^h W_{h_s}^Q, \tilde{K}^h W_{h_s}^K, \tilde{V}^h W_{h_s}^V),                               (9)

where W_{h_s}^Q, W_{h_s}^K ∈ R^{d×d_k}, W_{h_s}^V ∈ R^{d×d_v} are the matrices that project the queries, keys and values into the h_s-th subspace, respectively. For our experiments, we set S d_v = S d_k = d.

3.3 Interpretation of AdaMRA

Intuitively, one can think of φ(K̃^h)^T Ṽ^h as a global description/memory of the input sequence that the query will perform attention over. As discovered by previous works, global and multi-scale representations are useful. Therefore, to combine low-level details and high-level semantics, each attention head has a different memory scale, corresponding to a different semantic aspect of the entire input. For instance, coarse and fine-scale memory could correspond to a paragraph summary and a word representation, respectively. To further enhance the expressive power, we leave the decision of which resolution of attention to use to the query. Thus, a query can choose between the different resolutions of memory based on its own representation with more flexibility.

3.4 Efficiency Advantage

We now show the efficiency advantage of AdaMRA in memory and computation. Assuming we have H heads with m_h landmarks each, the landmark selection using segment means takes O(n), where n is the sequence length. The usage of kernel attention [Katharopoulos et al., 2020] eliminates the O(n^2) terms from both the memory and computational complexities of the module. Instead, the computation of the global description/memory φ(K̃^h)^T Ṽ^h of dimensionality d and of the new values takes O(m_h d^2) and O(n d^2), respectively. Consequently, the total cost of AdaMRA scales as O(Hn + \sum_{h=1}^{H} m_h d^2 + H n d^2), i.e., linearly with respect to the sequence length n. In the following section we will show that a small \sum_{h=1}^{H} m_h (typically smaller than n) is enough for achieving good performance, which further increases the efficiency advantage of AdaMRA over the vanilla Transformer (a rough cost comparison is sketched below, just before Section 4.1).

4 Experiments

In this section, we validate AdaMRA in terms of computational cost, memory consumption, and accuracy on the long-range context tasks of the LRA benchmark [Tay et al., 2020b] and show that AdaMRA performs consistently better than the baselines, suggesting that leveraging multi-resolution attention heads is reasonable and effective. The experimental results show the superior ability of AdaMRA to model long-range context.
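Before turning to the experimental settings, the complexity expression of Section 3.4 can be sanity-checked with a back-of-the-envelope calculation. The cost model below is an illustrative assumption that ignores constant factors; the configuration (4K tokens, d = 64, four heads with rates 1/2, 1/4, 1/8, 1/16) matches settings that appear later in Table 2.

```python
# Rough cost model for the complexity terms of Section 3.4 (constants ignored);
# n^2 * d plays the analogous role for full softmax attention.
def adamra_cost(n, d, rates):
    landmarks = [round(n * c) for c in rates]             # m_h = n * c_h per head
    H = len(rates)
    return H * n + sum(m * d * d for m in landmarks) + H * n * d * d

def full_attention_cost(n, d):
    return n * n * d

n, d = 4096, 64                                           # 4K tokens, 64-dim embeddings
rates = (1 / 2, 1 / 4, 1 / 8, 1 / 16)                     # the Ours-(1/2,1/4,1/8,1/16) setting
print(f"AdaMRA ~{adamra_cost(n, d, rates):.2e}")          # ~8.3e+07, grows linearly in n
print(f"Full   ~{full_attention_cost(n, d):.2e}")         # ~1.1e+09, grows quadratically in n
```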
4.1 Experiment Settings

LRA is a suite of five general and challenging tasks designed to evaluate how well Transformers capture long-term dependencies from different modalities such as text, natural and synthetic images, and mathematical expressions requiring similarity, structural and visual-spatial reasoning. For a complete description of the objectives and datasets, we refer the reader to [Tay et al., 2020b].

Tasks The tasks used for comparison are as follows: (1) Long ListOps, designed to investigate the model's capability of reasoning over hierarchically structured data in a long-context scenario. We use a version of the ListOps dataset [Nangia and Bowman, 2018] with sequence lengths of up to 2K. (2) Byte-Level Text Classification aims to test the models' ability to deal with compositionality, as it requires composing characters into words and words into higher-level phrases. We use the IMDb reviews dataset [Maas et al., 2011] with a fixed max length of 4K. (3) Byte-Level Document Retrieval uses the AAN dataset [Radev et al., 2013] to investigate a model's ability to encode and store compressed representations. The model learns a similarity score between two documents, each with a sequence length of 4K. (4) Image Classification serves as a test of how well models are able to capture 2D spatial relations between input pixels. We use the CIFAR-10 dataset [Krizhevsky et al., 2009] for this task. (5) Pathfinder probes the model's ability to capture long-range spatial dependencies: the model makes a binary decision on whether two points are connected by a path [Linsley et al., 2018].

Baselines We base our evaluation on six recently proposed efficient Transformer models. Aside from the vanilla Transformer [Vaswani et al., 2017], we compare our model against other efficient self-attention variants, including Reformer [Kitaev et al., 2020], Linear Transformer [Katharopoulos et al., 2020], Performer [Choromanski et al., 2020], Linformer [Wang et al., 2020b], BigBird [Zaheer et al., 2020] and Nyströmformer [Xiong et al., 2021].

Table 1: Experimental results on the LRA benchmark. We report accuracy (higher is better) of the different models. The best model is in boldface, and the second-best is underlined. The numbers for Transformer, Reformer, Linformer, Performer and Nyströmformer are taken from [Xiong et al., 2021]. Aside from the Linear Transformer, our results are consistent with those reported in [Tay et al., 2020b]. The implementation of the Linear Transformer is based on the officially published code. Avg: average accuracy across all tasks. AdaMRA significantly outperforms the other Transformer models across all tasks, with +4.32 and +3.66 in average accuracy over the vanilla Transformer and the previous best performing model BigBird, respectively.

Model            ListOps   Text    Retrieval   Image   Pathfinder   Avg
Transformer      37.10     65.02   79.35       38.20   74.16        58.77
BigBird          38.55     63.90   81.50       38.30   74.89        59.43
Reformer         19.05     64.88   78.64       43.29   69.36        55.04
Linformer        37.25     55.91   79.37       37.84   67.60        55.59
Linear Trans.    37.35     64.15   81.10       38.20   70.20        58.20
Performer        18.80     63.81   78.62       37.07   69.87        53.63
Nyströmformer    37.15     65.52   79.56       41.58   70.94        58.95
Ours             40.40     68.44   84.83       46.00   75.77        63.09

Implementation Details To ensure fair comparisons, we train a two-layer Transformer for all models. The embedding dimension is 64, and the hidden dimension is 128. Our data split, preprocessing, and training procedure follow those of [Xiong et al., 2021]. For all models, we use the default PyTorch implementation.
4.2 Performance Comparison

Accuracy Comparison We compare Transformers with respect to their performance. Table 1 reports the LRA scores of several Transformer models, including BigBird (the previous best performing model in terms of LRA score [Tay et al., 2020b]), variants with linear complexity (Linformer, Linear Transformer, Performer and Nyströmformer), and Reformer. Our model brings consistent and considerable performance boosts over the baseline models on all tasks. Specifically, our model achieves an average score of 63.09 on the LRA benchmark, an increase of 4.32, 4.89, and 9.46 absolute points over the vanilla Transformer, Linear Transformer, and Performer, respectively. Besides, our model also achieves a significant performance gain compared to the previous best performing model BigBird. This might be attributed to the fact that, in contrast to BigBird, our attention layer is able to exchange information globally over the entire sequence. In addition, it is worth mentioning that our model boosts the score by 3.42 and 5.48 on the tasks Text (n=4K) and Retrieval (n=4K) compared with the vanilla Transformer, suggesting that our model is advantageous in tasks that require large sequence lengths. More importantly, we notice the performance gains especially on the Image Classification task, where we outperform the vanilla Transformer by 7.8 points, which indicates that the inductive bias of AdaMRA plays a substantial role in this task. Thus, our model has the capability to capture 2D spatial relations.

Speed and Memory Comparison To better illustrate the boosted efficiency, we compare Transformers with respect to their computational and memory requirements. We use the IMDb dataset with a batch size of 16 for all runs. Table 2 shows the peak allocated GPU memory and required time for sequence lengths {1K, 2K, 4K} for several efficient Transformer models. We benchmark all models' speed and memory on a Tesla T4 GPU with 16 GB of memory. All compared models are of the same size as those described above. The overall fastest models are the kernel-based models (Performer and Linear Transformer). The model with the smallest memory footprint is the Linear Transformer, coming in at 1741 MB compared to 10327 MB for the vanilla Transformer at 4K. Our method comes in a close second and is almost as fast as the fastest one. Similarly, our model is also relatively compact, almost as compact as the Linear Transformer and Performer.

Table 2: Comparison of training time and peak memory consumption on various input sequence lengths. Efficiency improvements in comparison with the vanilla Transformer are in brackets. The best model is in boldface, and the second-best is underlined. Performer-32 denotes a Performer self-attention module using a feature map of 32 dimensions. Nyströmformer-32 denotes a Nyströmformer self-attention module using 32 landmarks. Ours-(c_h) denotes an AdaMRA self-attention module with four attention heads using a compression rate of c_h on each head. AdaMRA offers favorable memory and time efficiency over standard self-attention and is almost as fast and compact as the kernel-based Transformers (Linear Transformer and Performer).
                             Running time (ms)                      Peak Memory Usage (MB)
Model                        1K           2K           4K           1K            2K            4K
Transformer                  82 (1×)      272 (1×)     1007 (1×)    1713 (1×)     3829 (1×)     10327 (1×)
BigBird                      104 (0.79×)  211 (1.3×)   420 (2.4×)   1815 (0.9×)   2835 (1.4×)   4921 (2.1×)
Reformer                     53 (1.5×)    103 (2.6×)   200 (5.0×)   1467 (1.2×)   2007 (1.9×)   3125 (3.3×)
Linformer                    37 (2.2×)    71 (3.8×)    139 (7.2×)   1499 (1.1×)   1978 (1.9×)   3035 (3.4×)
Linear Transformer           22 (3.7×)    38 (7.1×)    69 (14.6×)   1113 (1.5×)   1353 (2.8×)   1741 (5.9×)
Performer-32                 30 (2.7×)    54 (5.0×)    100 (10.7×)  1171 (1.5×)   1439 (2.7×)   1917 (5.4×)
Nyströmformer-32             76 (1.1×)    160 (1.7×)   233 (4.3×)   2627 (0.6×)   3381 (1.1×)   4685 (2.2×)
Ours-(1/2,1/4,1/8,1/16)      33 (2.4×)    54 (5.0×)    98 (10.3×)   1239 (1.4×)   1581 (2.4×)   2201 (4.7×)
Ours-(1/4,1/8,1/16,1/32)     32 (2.6×)    53 (5.1×)    96 (10.5×)   1201 (1.4×)   1565 (2.4×)   2117 (4.9×)

Importantly, our model speeds up over the vanilla Transformer by about 10.4× at 4K sequence length and requires only about 20% of the vanilla Transformer's memory at 4K. As the sequence length increases, the training-time speed-up and memory savings become even more dramatic. Notably, the kernel-based models are fast and compact at the cost of relatively lower quantitative performance (see Table 1). In contrast, our model is competitive in both accuracy and efficiency, as Figure 1 shows. Besides, our analysis indicates that AdaMRA's efficiency gains are especially notable on long sequences, suggesting that AdaMRA will be particularly useful in tasks that require large sequence lengths, fast training speed, or low memory footprints.

Speed-Memory-Accuracy Tradeoff Comparison In real-world scenarios, speed, memory and accuracy are three important aspects of performance. When analyzed separately, these performance variables sometimes lead to contradictory conclusions. To avoid such conflicts, we integrate all three aspects into a single measure, the speed-memory-accuracy tradeoff (SMAT), which is defined as

    SMAT = S_norm + (1 − M_norm) + Acc_norm,                                                     (10)

where S_norm, M_norm and Acc_norm are the normalized speed (examples per sec), peak memory usage and LRA score after applying a MinMaxScaler. In this experiment, we use the IMDb dataset with sequence length 4K and a batch size of 8 for all runs. As shown in Table 3, AdaMRA consistently outperforms the other variants by a large margin in terms of SMAT score, indicating that AdaMRA achieves a state-of-the-art speed-memory-accuracy trade-off.

Table 3: Comparison of SMAT score (higher is better). Normalized values are in brackets. Speed stands for examples per second. The best model is in boldface, and the second-best is underlined. AdaMRA significantly outperforms the other Transformer models.

Model                Speed          Peak Memory Usage (MB)   LRA Score      SMAT Score
Transformer          14.5  (0.00)   6645 (1.00)              58.77 (0.54)   0.54
BigBird              36.4  (0.12)   2917 (0.30)              59.43 (0.61)   1.43
Reformer             80.0  (0.35)   2023 (0.13)              55.04 (0.15)   1.37
Linformer            111.1 (0.52)   2003 (0.12)              55.59 (0.21)   1.61
Linear Transformer   200.0 (1.00)   1353 (0.00)              58.20 (0.48)   2.48
Performer-32         142.9 (0.69)   1439 (0.02)              53.63 (0.00)   1.67
Nyströmformer-32     57.1  (0.23)   2687 (0.25)              58.95 (0.56)   1.54
Ours                 160.0 (0.78)   1475 (0.02)              63.09 (1.00)   2.76
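For concreteness, the SMAT score of Eq. 10 can be reproduced from the raw columns of Table 3 with a few lines of Python. This is a sketch of the metric computation, not the authors' evaluation script; plain min-max scaling stands in for scikit-learn's MinMaxScaler.

```python
# Min-max normalize each column, then combine per Eq. 10:
# SMAT = S_norm + (1 - M_norm) + Acc_norm.
def minmax(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# (speed in examples/sec, peak memory in MB, LRA score) from Table 3.
results = {
    "Transformer":        (14.5,  6645, 58.77),
    "BigBird":            (36.4,  2917, 59.43),
    "Reformer":           (80.0,  2023, 55.04),
    "Linformer":          (111.1, 2003, 55.59),
    "Linear Transformer": (200.0, 1353, 58.20),
    "Performer-32":       (142.9, 1439, 53.63),
    "Nyströmformer-32":   (57.1,  2687, 58.95),
    "Ours":               (160.0, 1475, 63.09),
}
speeds, mems, accs = zip(*results.values())
s_n, m_n, a_n = minmax(speeds), minmax(mems), minmax(accs)
for name, s, m, a in zip(results, s_n, m_n, a_n):
    print(f"{name:20s} SMAT = {s + (1 - m) + a:.2f}")   # e.g. Transformer 0.54, Ours 2.76
```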
Table 4: Ablation studies of AdaMRA on the Byte-Level Text Classification task and the Image Classification task. H denotes the number of heads. The best model is in boldface.

ID   H   Compression Rate                            Accu (Image)   Accu (Text)
1    2   (1/4, 1/32)                                 45.3           66.8
2    3   (1/8, 1/16, 1/32)                           45.5           66.3
3    3   (1/2, 1/8, 1/32)                            46.0           68.4
4    4   (1/2, 1/4, 1/8, 1/16)                       44.9           67.8
5    4   (1/4, 1/8, 1/16, 1/32)                      45.2           68.2
6    4   (1/8, 1/16, 1/32, 1/64)                     44.0           67.3
7    6   (1/2, 1/4, 1/8, 1/16, 1/32, 1/64)           44.9           67.6
8    7   (1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128)    45.0           65.7
9    3   (1/2, 1/2, 1/2)                             38.98          64.11
10   3   (1/4, 1/4, 1/4)                             42.38          63.38

4.3 Ablation Study

Compression Rate To understand the impact of the compression rate c, we conduct ablation experiments on two tasks in the LRA benchmark, i.e., Byte-Level Text Classification and Image Classification. We experiment with various numbers of attention heads and vary the compression rate c_h of each head. We use the same c_h across all layers. As indicated by the results in Table 4, the choice of compression rate is crucial for the final performance. However, compared to the vanilla Transformer, all configurations achieve consistent improvements on both tasks (see Table 1). Besides, there are a few things to notice: i) Model 5 outperforms Models 4 and 6, and Model 3 outperforms Model 2, indicating a benefit in using a moderate compression rate, while an extremely low compression rate causes a significant performance drop. We speculate that an extremely low compression rate might lose too much information. ii) Model 3 outperforms Models 4-8, which means AdaMRA does not perform better as the number of heads H increases. This indicates that having multiple attention heads is effective, but a too large number of heads hurts. iii) When we use a relatively low compression rate, the resulting models' (Models 3 and 5) performance already outperforms all other Transformer models. This suggests that we can decrease the compression rate to a certain extent, which further increases the efficiency advantage of AdaMRA over the vanilla Transformer. iv) One can notice a significant accuracy drop when using a single resolution (Models 9 and 10), which indicates that multi-resolution attention heads are beneficial for capturing hierarchically structured information.

Architecture Design To understand the importance of each component, we conduct ablation experiments for the AdaMRA architecture. In Table 5, Rand means randomly assigning each query to an attention head; Softmax means adding the multi-resolution approach to the vanilla attention mechanism; ELU+1 [Katharopoulos et al., 2020] means employing elu + 1 as the feature function φ in Eq. 7; ReLU means using ReLU as φ. As indicated by the results, the advantage of AdaMRA comes from the learned routing and the multi-resolution attention simultaneously. Using ReLU as the feature function is also advantageous, without which we only have small gains over the vanilla model. The multi-resolution method is also compatible with other attention mechanisms (e.g., vanilla attention) to a certain extent, and we leave this for future work.

Table 5: Ablation study for the architecture.

Model         Image   Text
Transformer   38.20   65.02
Rand          41.40   64.64
Softmax       41.19   65.56
ELU+1         43.73   66.33
ReLU          46.00   68.44

5 Conclusion

Transformer models are notoriously slow to train and deploy in practice because of their quadratic time and space complexity with respect to the sequence length. In this paper, we propose a novel and efficient structure, AdaMRA.
We see a benefit to this approach in the domains of text, images, mathematical expressions, etc., with the model outperforming existing architectures. In particular, we have shown that our model achieves a state-of-the-art speed-memory-accuracy trade-off. The main limitation of this work is the additional hyperparameters (the number of heads H and the compression rate c_h of each head). However, we empirically show that multiple configurations work fairly well. The proposed method opens several research directions towards integrating multi-resolution memory into Transformers. Besides, the scalability of AdaMRA enables application to tasks that require working with large inputs, fast training speed, or low memory footprints.

References

Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.

Gonçalo M Correia, Vlad Niculae, and André FT Martins. Adaptively sparse transformers. arXiv preprint arXiv:1909.00015, 2019.

Linhao Dong, Shuang Xu, and Bo Xu. Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5884–5888. IEEE, 2018.

Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. Star-Transformer. arXiv preprint arXiv:1902.09113, 2019.

Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.

Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Drew Linsley, Junkyung Kim, Vijay Veerabadran, and Thomas Serre. Learning long-range spatial dependencies with horizontal gated-recurrent units. arXiv preprint arXiv:1805.08315, 2018.

Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, 2011.

Nikita Nangia and Samuel R Bowman. ListOps: A diagnostic dataset for latent tree learning. arXiv preprint arXiv:1804.06028, 2018.

Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International Conference on Machine Learning, pages 4055–4064. PMLR, 2018.

Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A Smith, and Lingpeng Kong. Random feature attention. arXiv preprint arXiv:2103.02143, 2021.

Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, and Jie Tang. Blockwise self-attention for long document understanding. arXiv preprint arXiv:1911.02972, 2019.
Dragomir R Radev, Pradeep Muthukrishnan, Vahed Qazvinian, and Amjad Abu-Jbara. The ACL anthology network corpus. Language Resources and Evaluation, 47(4):919–944, 2013.

Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.

Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. Adaptive attention span in transformers. arXiv preprint arXiv:1905.07799, 2019.

Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. Sparse Sinkhorn attention. In International Conference on Machine Learning, pages 9438–9447. PMLR, 2020a.

Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long Range Arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006, 2020b.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418, 2019.

Apoorv Vyas, Angelos Katharopoulos, and François Fleuret. Fast transformers with clustered attention. Advances in Neural Information Processing Systems, 33, 2020.

Chenguang Wang, Zihao Ye, Aston Zhang, Zheng Zhang, and Alexander J Smola. Transformer on a diet. arXiv preprint arXiv:2002.06170, 2020a.

Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020b.

Adina Williams, Nikita Nangia, and Samuel R Bowman. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426, 2017.

Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. arXiv preprint arXiv:2102.03902, 2021.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057. PMLR, 2015.

Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, and Zheng Zhang. BP-Transformer: Modelling long-range context via binary partitioning. arXiv preprint arXiv:1911.04070, 2019.

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062, 2020.

Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, and Xu Sun.
Explicit sparse transformer: Concentrated attention through explicit selection. arXiv preprint arXiv:1912.11637, 2019.