BP-Transformer: Modelling Long-Range Context via Binary Partitioning

Zihao Ye†, Qipeng Guo†‡∗, Quan Gan†, Xipeng Qiu‡, Zheng Zhang†§
† AWS Shanghai AI Lab   ‡ Fudan University   § New York University Shanghai
{yeziha, gqipeng, quagan, zhaz}@amazon.com, xpqiu@fudan.edu.cn
∗ Work done during internship at AWS Shanghai AI Lab.

arXiv:1911.04070v1 [cs.CL] 11 Nov 2019

Abstract

The Transformer model is widely successful on many natural language processing tasks. However, the quadratic complexity of self-attention limits its application to long text. In this paper, adopting a fine-to-coarse attention mechanism on multi-scale spans via binary partitioning (BP), we propose BP-Transformer (BPT for short). BPT yields O(k · n log(n/k)) connections, where k is a hyperparameter that controls the density of attention, and strikes a good balance between computation complexity and model capacity. A series of experiments on text classification, machine translation and language modeling shows that BPT outperforms previous self-attention models on long text. Our code, hyperparameters and CUDA kernels for sparse attention are available in PyTorch at https://github.com/yzh119/BPT.

1 Introduction

The Transformer, a self-attention based model, has achieved many impressive results on Natural Language Processing (NLP) tasks, notably machine translation (Vaswani et al., 2017), language modeling (Radford et al., 2018), and text classification (Devlin et al., 2018). However, its self-attention mechanism imposes a quadratic cost with respect to sequence length, limiting its wider application, especially to long text.

To address this problem, previous work has explored two main directions. (1) Hierarchical Transformers (Miculicich et al., 2018; Liu and Lapata, 2019) use two Transformers in a hierarchical architecture: one Transformer models the sentence representation with word-level context, and another models the document representation with sentence-level context. (2) Lightweight Transformers (Child et al., 2019; Sukhbaatar et al., 2019; Guo et al., 2019; Dai et al., 2019) reduce the complexity by restructuring the connections between tokens.

Figure 1: Different attention patterns in Transformer-like models (panels: Transformer, Transformer-XL, BPT). Solid lines denote direct attention; dashed lines denote dependency. Unrelated connections and self-loops are omitted for clarity.

Besides the computational cost, the fully-connected nature of the Transformer does not incorporate the commonsensible inductive bias of language, such as sequential or syntactic structure; the dependency relations between tokens are learned entirely from scratch. As a result, the Transformer usually performs better on huge datasets and easily overfits on small datasets (Guo et al., 2019).

The above observations motivate us to explore better structures for self-attention models, balancing capability and computation complexity. In this paper, we propose a new architecture called BP-Transformer (BPT for short), which partitions the input sequence into multi-scale spans via binary partitioning (BP). BPT incorporates an inductive bias of attending to context information from fine-grain to coarse-grain as the relative distance increases: the farther away the context is, the coarser its representation. BPT can be regarded as a graph neural network whose nodes are the multi-scale spans. A token node attends to smaller-scale spans for closer context and to larger-scale spans for more distant context, and the representations of nodes are updated with Graph Self-Attention (Velickovic et al., 2018).
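To make the span structure concrete, the following minimal Python sketch (ours, not the authors' implementation) enumerates the binary-partition spans of a sequence; each span becomes a node of the graph, and a sequence of n tokens yields 2n − 1 of them.

def bp_spans(lo, hi):
    # Recursively enumerate the binary-partition spans of tokens [lo, hi).
    spans = [(lo, hi)]
    if hi - lo > 1:
        mid = (lo + hi) // 2
        spans += bp_spans(lo, mid) + bp_spans(mid, hi)
    return spans

spans = bp_spans(0, 8)
print(len(spans))                                       # 2 * 8 - 1 = 15 spans
print(sorted(spans, key=lambda s: (s[1] - s[0], s[0])))
# 8 single tokens, 4 spans of 2 tokens, 2 spans of 4 tokens, 1 span of 8 tokens

In the full model, each token attends to only O(k · log(n/k)) of these spans on each side, picking k spans per scale and doubling the scale as the distance grows (Section 3.2).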
Moreover, to better represent the position information of the span nodes and token nodes, we generalize the notion of relative position (Shaw et al., 2018) from sequences to trees and show that it better captures position bias.

Thus, BPT incorporates the advantages of both hierarchical and lightweight Transformers: (1) it models long-range context in a hierarchical fashion, (2) it reduces the computation cost with fewer edges, and (3) it introduces coarse-to-fine connections that approximate a reasonable inductive bias of language, with the net effect of making BPT easier to train.

We evaluate BPT on a variety of sentence-level and document-level NLP tasks: language modeling, machine translation and text classification. The experimental results show that BPT consistently outperforms previous self-attention based models. We also show that the inductive bias of BPT works well on short text and scales to large datasets. Finally, we show that BPT is faster and more memory-efficient than the vanilla Transformer when dealing with long sequences.

2 Related Work

2.1 Recap: Transformer

Given a sentence with n input tokens, the Transformer model iteratively computes at layer t the d-dimensional representations of each input token, H^t ∈ R^{n×d}, where H^0 represents the initial token embeddings. The core of a Transformer step is Multi-head Self-Attention (MSA), which can be formulated as follows:

MSA(H) = [head_1, · · · , head_h] W^O,
head_i = softmax( Q_i K_i^T / √d ) V_i,
Q_i = H W_i^Q,  K_i = H W_i^K,  V_i = H W_i^V,    (1)

where h is the number of heads, and W_i^Q, W_i^K, W_i^V, W^O are learnable parameters. The Transformer then computes H^{t+1} from H^t:

Z^t = norm(H^t + MSA(H^t)),    (2)
H^{t+1} = norm(Z^t + FFN(Z^t)),    (3)

where norm is layer normalization (Ba et al., 2016) and FFN is the position-wise feed-forward network of Vaswani et al. (2017). Note that each step t has its own parameters.
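For concreteness, here is a minimal NumPy sketch of Eq. (1)–(3): a single unbatched sequence, no masking, dropout, or layer-norm gain/bias; it follows the equations literally (including the √d scaling as written), and the toy FFN and random weights are illustrative stand-ins.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(axis=-1, keepdims=True), x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def msa(H, Wq, Wk, Wv, Wo):
    # Eq. (1): every token attends to every token, an (n, n) attention matrix per head.
    d, heads = H.shape[-1], []
    for WiQ, WiK, WiV in zip(Wq, Wk, Wv):
        Q, K, V = H @ WiQ, H @ WiK, H @ WiV
        heads.append(softmax(Q @ K.T / np.sqrt(d)) @ V)
    return np.concatenate(heads, axis=-1) @ Wo

def transformer_step(H, Wq, Wk, Wv, Wo, ffn):
    Z = layer_norm(H + msa(H, Wq, Wk, Wv, Wo))    # Eq. (2)
    return layer_norm(Z + ffn(Z))                 # Eq. (3)

rng = np.random.default_rng(0)
n, d, h = 6, 8, 2
Wq, Wk, Wv = ([rng.normal(size=(d, d // h)) for _ in range(h)] for _ in range(3))
Wo = rng.normal(size=(d, d))
H1 = transformer_step(rng.normal(size=(n, d)), Wq, Wk, Wv, Wo, ffn=lambda z: np.tanh(z @ Wo))
print(H1.shape)    # (6, 8)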
2.2 Hierarchical Attention

Some previous work has explored applying self-attention to hierarchical features. HAN (Yang et al., 2016) exploits a two-level attention mechanism that first applies self-attention over word features to obtain a sentence representation, and then applies self-attention over sentence-level features to obtain a document-level feature. Shen et al. (2018) proposed a network structure called the bi-directional block self-attention network (Bi-BloSAN), which divides a sequence into blocks and sequentially applies intra-block and inter-block attention inside a layer. Miculicich et al. (2018) use a HAN structure to obtain sentence-level features in Transformers for document-level machine translation. Different from these models, our model updates hierarchical features synchronously inside a layer and updates them iteratively by stacking layers.

2.3 Lightweight Self-Attention

There have also been several recent works on reducing the computational cost of self-attention in Transformers. T-DMCA (Liu et al., 2018) reduces memory usage by first dividing the sequence into blocks of similar length and performing attention inside each block independently. The Sparse Transformer (Child et al., 2019) decomposes attention into two categories: a sequence of length n is divided into √n equal-sized blocks, and each token attends to its previous tokens inside the √n-sized block it lies in, as well as to the previous √n blocks. Compared to our model, the Sparse Transformer does not maintain representations of hierarchical features, and its computational cost is O(n√n) while ours is O(n log n). Transformer-XL (Dai et al., 2019) introduces the notion of recurrence into the Transformer: it divides the input sequence into multiple segments and recurrently attends to the hidden states of previous segments, achieving state-of-the-art results on several language modeling benchmarks. Compared to our model, Transformer-XL can only model sequences in one direction, which makes it hard to handle tasks that require bi-directional information. Sukhbaatar et al. (2019) proposed an adaptive mechanism that learns an optimal context length for each head in each layer, reducing the total computational and memory cost of the Transformer. Guo et al. (2019) suggest that the fully-connected nature of self-attention in the Transformer is not a good inductive bias; they proposed the Star-Transformer, which links adjacent words together with a central relay node to capture both local and global dependencies. With this reduction, the Star-Transformer achieves significant improvements over the standard Transformer on moderately sized datasets. However, the Star-Transformer is not suitable for auto-regressive models, in which each word should be conditioned only on its previous words, because the relay node summarizes the whole sequence.

3 Proposed Model

In this paper, we balance model capability and computation complexity by incorporating an inductive bias. The key insight is that not every token needs to be attended to for context representation. Instead, for a given input token, we can group its context into non-overlapping spans of different scales, where the scale of a span increases with its relative distance. That is, instead of attending to every token, the input token attends to spans at increasing distances in a fine-to-coarse fashion. We now describe our model as a graph neural network and detail it in the following sections.

3.1 Transformer as Graph Neural Networks

A valid perspective is to view information fusion with self-attention in the Transformer as message passing on a fully-connected graph, with input tokens as nodes and attention between nodes as edges (Battaglia et al., 2018). In particular, such a process is very similar to the Graph Attention Network (Velickovic et al., 2018). Thus, different graph structures encode different inductive biases of attention and result in different time/space complexity.

To describe the Transformer in the GNN framework, we first construct a fully-connected graph G in which each node is a token of the input sequence; all nodes in G are interconnected and each node has a self-loop edge. We extend the self-attention mechanism of the Transformer to graphs, called Graph Self-Attention (GSA). For a given node u, we update its representation according to its neighbour nodes, formulated as

h_u ← GSA(G, h_u).    (4)

Let A(u) denote the set of neighbour nodes of u in G. GSA(G, h_u) is detailed as follows:

A^u = concat({h_v | v ∈ A(u)}),
Q_i^u = h_u W_i^Q,  K_i^u = A^u W_i^K,  V_i^u = A^u W_i^V,    (5)
head_i^u = softmax( Q_i^u (K_i^u)^T / √d ) V_i^u,    (6)
GSA(G, h_u) = [head_1^u, · · · , head_h^u] W^O,    (7)

where d is the dimension of h, and W_i^Q, W_i^K, W_i^V are trainable parameters of the i-th attention head.
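Below is a single-head NumPy sketch of Eq. (5)–(7), with W^O folded into the per-node output; the neighbour sets and weights are toy values and the simplifications are ours, not the paper's implementation.

import numpy as np

def graph_self_attention(h, predecessors, Wq, Wk, Wv, Wo):
    # Each node u attends only to its neighbour set A(u) instead of all nodes.
    d = h.shape[-1]
    out = np.zeros_like(h)
    for u, A_u in predecessors.items():
        A = h[A_u]                     # representations of the neighbours of u
        q = h[u] @ Wq                  # the query comes from u itself
        K, V = A @ Wk, A @ Wv
        scores = q @ K.T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w = w / w.sum()                # softmax over the neighbours only
        out[u] = (w @ V) @ Wo
    return out

rng = np.random.default_rng(0)
d = 4
h = rng.normal(size=(3, d))            # 3 nodes
neigh = {0: [0], 1: [0, 1], 2: [0, 1, 2]}
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
print(graph_self_attention(h, neigh, Wq, Wk, Wv, Wo).shape)    # (3, 4)

With A(u) set to all token nodes this reduces to the full self-attention of Section 2.1; BPT instead chooses much smaller neighbour sets, as defined next.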
3.2 Graph Construction

3.2.1 Node Construction

To achieve the effect of fine-to-coarse attention, we partition a sequence into multi-granular spans via binary partitioning (BP). Binary partitioning is a generic process of recursively dividing a sequence into two parts until the partitions satisfy one or more requirements; in this paper we use the simple rule of stopping when a partition contains a single token. For a sequence of length n, there are 2n − 1 partitions. Figure 2 illustrates the process of binary partitioning of a sequence. Each partition can be regarded as a node in a GNN, and its representation is computed from its contained tokens.

Figure 2: Binary partitioning of a sequence (example sentence: "I was busy writing my next paper .").

The binary partitioning of a sequence constructs a perfect binary tree, in which all internal nodes have two children and all leaf nodes have the same depth. Each leaf node corresponds to an input token in the sequence. We simply divide the nodes into two types, token and span, both of which are used as nodes in our GNN construction:

Token nodes: the leaf nodes of the binary partition tree.

Span nodes: the internal nodes of the tree, each of which has at least two child nodes.

Figure 3: How the graph is built, for the example "Four score and seven years ago our fathers brought forth on this continent a new nation" (levels 0 to 4, leaves to root). Nodes at different levels are colored differently; dashed lines are edges connecting token nodes to span nodes, and solid lines are edges connecting to token nodes. The r_*^* (e.g. r^{self}, r_{j,i}^{left}, r_{j,i}^{right}, r_j^{anc}) are the relative positions assigned to edges.

3.2.2 Edge Construction

The binary partitioning generates a binary tree: for a sequence with n tokens, we have n token nodes and n − 1 span nodes. Formally, let u_{l,m} denote the m-th node at level l, where the level increases in a bottom-up fashion and the level of token nodes is 0. A span node u_{l,m} represents a partition consisting of token nodes u_{0,2^l·m+1}, · · · , u_{0,2^l·(m+1)}.

To reduce the distance of information transmission, we do not directly use the tree structure to construct the edges of the graph, since the path between two distant tokens in the tree is long. Instead, we construct two kinds of edges.

Affiliated Edges. Given a span node u_{l,m}, we add a directed edge from each of its contained token nodes u_{0,2^l·m+1}, · · · , u_{0,2^l·(m+1)}; there are 2^l such edges u_{0,2^l·m+i} → u_{l,m} (1 ≤ i ≤ 2^l). The role of affiliated edges is to shorten the path between a span node and its corresponding token nodes: with affiliated edges, the representation of a span node is computed by directly aggregating information from its contained token nodes. Although we do not adopt the tree structure directly, the affiliated edges still incorporate the inductive bias of the hierarchical linguistic structure within the sentence.
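The following sketch (ours; it assumes the sequence has already been padded to a power of two and uses consecutive integer node ids, neither of which is prescribed by the paper) builds the span nodes bottom-up and emits the affiliated edges:

def build_nodes_and_affiliated_edges(n):
    # Token nodes get ids 0..n-1; span nodes are created bottom-up.
    # Returns (num_nodes, spans, edges): spans[s] = (lo, hi) is the token range
    # covered by span node s, and each edge (t, s) is an affiliated edge from a
    # token node t to a span node s containing it.
    assert n >= 2 and n & (n - 1) == 0, "pad the sequence to a power of two first"
    spans, edges = {}, []
    level = [(i, i + 1) for i in range(n)]     # token-level ranges
    next_id = n
    while len(level) > 1:
        merged = []
        for j in range(0, len(level), 2):
            lo, hi = level[j][0], level[j + 1][1]
            spans[next_id] = (lo, hi)
            edges += [(t, next_id) for t in range(lo, hi)]
            merged.append((lo, hi))
            next_id += 1
        level = merged
    return next_id, spans, edges

num_nodes, spans, edges = build_nodes_and_affiliated_edges(8)
print(num_nodes, len(edges))    # 15 nodes (2n - 1), 24 affiliated edges (n log2 n)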
Contextual Edges. The power of the Transformer comes from relating every pair of tokens. To reduce the computation complexity while retaining the ability to capture long-range context, we model the context with a fine-to-coarse strategy. For a leaf node, we connect it to neighbouring token nodes or lower-level span nodes to model its local context, and to higher-level span nodes for long-range context.

In detail, for a leaf node u_{0,i}, we add incoming edges from nodes of different granularities. For simplicity, we describe the process of constructing edges from the right context of node u_{0,i}; the edges from the left context are constructed similarly. We use a hyper-parameter k to determine the connection density of the graph, adding k edges per level to capture information from the right context. For node u_{0,i}, its contextual nodes are

u_{0,p_0}, · · · , u_{0,p_0+k−1},    (8)
u_{1,p_1}, · · · , u_{1,p_1+k−1},    (9)
· · ·    (10)
u_{l,p_l}, · · · , u_{l,p_l+k−1},    (11)
· · ·    (12)

where p_l is the start index at level l and can be computed recursively: p_l = parent(p_{l−1} + k) with p_0 = i + 1. For the sake of computation efficiency, when the index p_l + k − 1 is odd, we also add its next node in the same layer as a contextual node; the start index at the next level is then p_{l+1} = parent(p_l + k + 1). In practice, it is easy to find the contextual nodes in a recursive fashion. Given a leaf node u, the whole procedure is described in Algorithm 1. After collecting all the contextual nodes, we add a directed edge from each contextual node to node u_{0,i}.

Algorithm 1 Finding contextual nodes
function NEIGHBORS(u, k)
    N ← {u}, l ← left(u), r ← right(u)
    repeat
        for i ← 1 to k do
            N ← N ∪ {l, r}
            l ← left(l)
            r ← right(r)
        end for
        l ← parent(l)
        r ← parent(r)
    until l and r reach the boundary
    return N
end function

Finally, for a sequence of length n, we obtain a directed graph G with O(2n) nodes and O(k · n · log(n/k)) edges. The distance between any two token nodes in G is no greater than 2, a property that enables our model to learn long-term dependencies easily.
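A runnable transcription of Algorithm 1 is given below. It encodes a node as a (level, index) pair, 0-based, where the index-j node at level l covers tokens [j·2^l, (j+1)·2^l); this encoding, the boundary test, and the omission of the parity adjustment described above are our own simplifications, not necessarily the paper's implementation.

def neighbors(u, k, n):
    # Contextual nodes of token u (0-based) in a length-n sequence, k per level per side.
    max_level = (n - 1).bit_length()              # level of the root

    def valid(node):
        level, idx = node
        nodes_at_level = -(-n // (1 << level))    # ceil(n / 2**level)
        return level <= max_level and 0 <= idx < nodes_at_level

    N = [(0, u)]                                  # the token itself
    left, right = (0, u - 1), (0, u + 1)
    while valid(left) or valid(right):
        for _ in range(k):                        # k nodes per level on each side
            for node in (left, right):
                if valid(node):
                    N.append(node)
            left = (left[0], left[1] - 1)
            right = (right[0], right[1] + 1)
        left = (left[0] + 1, left[1] // 2)        # climb to the parent level
        right = (right[0] + 1, right[1] // 2)
    return N

print(neighbors(u=11, k=2, n=64))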
3.3 Graph Update

After the graph G is constructed, we update the representations of all nodes via the Graph Self-Attention (GSA) described in Section 3.1. Since G is a directed graph, for a given node u its neighbourhood A(u) is set to all of its predecessor nodes in G; if we set A(u) to all token nodes, we recover the vanilla Transformer. Recall that the predecessors of a token node are the multi-scale spans it attends to, while the predecessors of a span node are all of its contained token nodes, as illustrated in Figure 3. Therefore, BPT connects any two tokens via at most two edges.

In our experiments, we update all nodes synchronously within a graph layer. The representations of span nodes are initialized with zeros, while the representations of token nodes are initialized with the corresponding word embeddings. We can stack multiple graph layers as in the vanilla Transformer, where each layer has its own W_i^Q, W_i^K, W_i^V and W^O. Algorithm 2 summarizes the overall update. Depending on the downstream task, we take as output either the representation of the root node in the final layer (e.g. in text classification and natural language inference) or the representations of all token nodes in the final layer (e.g. in language modeling and machine translation).

Algorithm 2 The update of the graph
Require: G = (V, E) the underlying graph, N the number of layers, H^0 the initial hidden states
1: for i := 1 to N do
2:     Z^i ← norm( H^{i−1} + GSA^{(i)}(G, H^{i−1}) )
3:     H^i ← norm( Z^i + FFN^{(i)}(Z^i) )
4: end for
5: return H^N

3.4 Relative Positional Encoding on Tree

As in Shaw et al. (2018), introducing the relative distances between words into the self-attention computation helps encode the relative order of tokens. Here we draw a similar analogy on the tree. For each node v in A(u), we consider the relative positional difference between u and v on the tree and assign a latent representation r_{v,u} to this difference:

• r_{v,u} = r^{self} if v = u;
• r_{v,u} = r_{j,i}^{left} or r_{j,i}^{right}, if v is the i-th left/right node to join the neighbourhood set of u at the j-th level in Algorithm 1 (finding contextual nodes);
• r_{v,u} = r_j^{anc}, if u is the ancestor of v in the tree at level j.

All of r^{self}, r_{j,i}^{left}, r_{j,i}^{right} and r_j^{anc} are trainable parameters. We then modify Eq. (6) to include the positional representations:

R^u = concat({r_{v,u} | v ∈ A(u)}),
head_i^u = softmax( Q_i^u (K_i^u + R^u)^T / √d ) V_i^u.    (13)

Note that the relative positional representations are shared across attention heads, as in Shaw et al. (2018), while each layer has its own set of positional representations. When k is set larger than the sentence length, our model degenerates to the vanilla Transformer with relative positional encodings. In the following sections we show that a small k (e.g. 4) is enough to achieve good performance on word-level NLP tasks.
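A single-head sketch of Eq. (13) follows. The rows of R are the r_{v,u} vectors of the predecessors of u, looked up from a trainable table keyed by the labels above ('self', (left, j, i), (right, j, i), (anc, j)); the lookup bookkeeping and the toy values are our assumptions.

import numpy as np

def gsa_with_relative_positions(h_u, A, R, Wq, Wk, Wv):
    # A stacks the neighbour representations of u, one row per v in A(u);
    # R stacks the corresponding relative-position vectors r_{v,u} in the same order.
    d = h_u.shape[-1]
    q = h_u @ Wq
    K, V = A @ Wk, A @ Wv
    scores = q @ (K + R).T / np.sqrt(d)     # relative positions are added to the keys
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ V

rng = np.random.default_rng(0)
d, m = 8, 5                                  # m = |A(u)| neighbours
out = gsa_with_relative_positions(
    rng.normal(size=d), rng.normal(size=(m, d)), rng.normal(size=(m, d)),
    *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)                             # (8,)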
4 Experiments

We measure the performance of BPT on a variety of tasks at both the sentence level and the document level. On document-level tasks, BPT achieves state-of-the-art performance on language modeling, machine translation and text classification. On sentence-level tasks, BPT performs consistently better than the vanilla Transformer and the Star-Transformer, suggesting that the inductive bias encoded by BPT is reasonable and effective for natural language. Overall, the experimental results show the superior ability of BPT to model long-range context.

4.1 Text Classification

We use the SST-5 dataset (Socher et al., 2013) and the IMDB dataset (Maas et al., 2011) to measure the performance of our model on the classification of short and long text, respectively. The former has fine-grained labels, with 215,154 phrases in 11,855 sentences and an average length of 19; the latter has positive/negative labels on 50,000 multi-sentence reviews with an average length of 294. We use pre-trained GloVe embeddings (Pennington et al., 2014) as input features and keep them fixed during training. The hidden size of all our models is set to 300. For IMDB, we apply the same training/validation split ratio (0.9) as McCann et al. (2017).

Model                                  SST-5          IMDB
BPT                                    52.71 (0.32)   92.12 (0.11)
Star-Transformer                       52.9           90.50
Transformer                            50.4           89.24
Bi-LSTM (Li et al., 2015)              49.8           –
Tree-LSTM (Socher et al., 2013)        51.0           –
QRNN (Bradbury et al., 2017)           –              91.4
BCN+Char+CoVe (McCann et al., 2017)    53.7           91.8

Table 1: Test accuracy on SST-5 and IMDB. For BPT, k = 2 on SST and k = 4 on IMDB. The last model uses word embeddings pre-trained with translation and additional character-level embeddings.

We report the average test accuracy of BPT over 10 runs in Table 1; the value in brackets is the standard deviation. On SST-5, our model outperforms Transformer- and LSTM-based models. On IMDB, it outperforms a bi-directional LSTM initialized with pre-trained character embeddings and CoVe embeddings (McCann et al., 2017), and it outperforms the Star-Transformer and the vanilla Transformer by a large margin: 1.62 and 2.88 points respectively.

To study the effect of k on the final accuracy, we tried k ∈ {1, 2, 4, 8, 16, 32, 64}. Figure 4 shows that a large k does not bring benefits, though it increases the graph density and the time/memory cost of BPT. The best performance was obtained at k = 2 for SST and k = 4 for IMDB, i.e. at small values of k.

Figure 4: Effects of the hyperparameter k (SST and IMDB test accuracy for different values of k).

4.1.1 Sensitivity to Sequence Shift

Since BPT divides the sequence in a binary fashion, a natural concern is whether a shift of the sequence affects its performance. To measure whether the output of BPT is sensitive to shift, we take the model trained on SST with the best validation loss and evaluate it in a setting different from training: we prepend n placeholder symbols to each sentence and initialize their embeddings with zeros. We vary n from 0 to 7 and find that the test accuracy changes very little, as shown in Table 2, suggesting that our model is robust to shift.

Shift Offset   Test Accuracy
0              52.71 (0.32)
1              52.50 (0.29)
2              52.81 (0.18)
3              52.56 (0.22)
4              52.18 (0.22)
5              51.90 (0.16)
6              51.85 (0.35)
7              51.55 (0.29)

Table 2: Accuracy under different sequence shifts on SST-5.

4.2 Language Modeling

To see how BPT exploits long-term dependencies, we evaluate our model on character-level language modeling datasets of moderate size: Enwik8 (LLC, 2009) and Text8 (LLC, 2009). We use bits per character (bpc; lower is better) to measure performance. Character-level tasks require more fine-grained interactions between characters, so we select a much larger k = 64 for these tasks. The baselines are multi-scale RNN-based models (Chung et al., 2017; Zilly et al., 2017; Krause et al., 2016) and Transformer-based models (Al-Rfou et al., 2018; Dai et al., 2019; Sukhbaatar et al., 2019). All Transformers use the same base setting (12 layers, d = 512, d_ff = 2048) for a fair comparison.

Model                                      Enwik8   Text8   Params
HM-LSTM (Chung et al., 2017)               –        1.29    35M
Recurrent Highway (Zilly et al., 2017)     1.27     –       45M
mLSTM (Krause et al., 2016)                1.24     1.27    45M
Transformer (Al-Rfou et al., 2018)         1.11     1.18    44M
Transformer-XL (Dai et al., 2019)          1.06     –       41M
Adaptive Span (Sukhbaatar et al., 2019)    1.02     1.11    39M
BPT (k = 64, l = 8192)                     1.02     1.11    38M

Table 3: Test bpc on Enwik8/Text8. Note that Transformer-XL can only be used for language modeling. l denotes the context length.

Table 3 shows that BPT achieves state-of-the-art performance on both datasets with a small number of parameters. To compare different sparse attention patterns, we fix the context length to l = 512 and observe how the performance of different models varies as we change the attention degree (the number of incoming edges of each token, when viewing the Transformer as a graph neural network). For BPT, we select k ∈ {1, 2, 4, 8, 16, 32, 64, 128}; for the Sparse Transformer (Child et al., 2019), we use the default setting described in the paper (c = 8, stride = 128); for the Restricted Transformer (Vaswani et al., 2017), which restricts self-attention to a neighbourhood window of size w, we select w ∈ {32, 64, 128, 256, 512}. Figure 5 suggests that BPT's fine-to-coarse sparse attention is more effective than the Restricted Transformer and the Sparse Transformer: at the same attention degree, BPT always obtains better performance.

Figure 5: Test bpc on Enwik8 for different values of k (bits per character vs. attention degree, for BPT, Restricted Attention and the Sparse Transformer).

To see how BPT exploits long-term dependency, we fix k = 64 and vary the context length over {512, 1024, 2048, 4096, 8192}. We do not try context lengths beyond 8192 because they exceed the average article length in Enwik8 and Text8. As shown in Table 4, performance improves as the context length grows.

Context Length   Enwik8   Text8
512              1.07     1.16
1024             1.05     1.14
2048             1.03     1.13
4096             1.02     1.12
8192             1.02     1.11

Table 4: Test bpc on Enwik8/Text8 with different context lengths.
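For reference, bpc is the average negative log-likelihood per character expressed in bits, so a model trained with a natural-log cross-entropy loss can report it with a one-line conversion (standard definition, not specific to this paper; the numbers below are made up for illustration):

import math

def bits_per_character(total_nll_nats, num_characters):
    # Summed natural-log negative log-likelihood -> average bits per character.
    return total_nll_nats / (num_characters * math.log(2))

print(bits_per_character(total_nll_nats=7.1e5, num_characters=1_000_000))  # ≈ 1.02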
4.3 Machine Translation

BPT can also be applied in encoder-decoder frameworks by replacing the backbone network of Vaswani et al. (2017) with BPT in both the encoder and the decoder. In this section we evaluate two settings: document-level and sentence-level machine translation.

In document-level machine translation, the self-attention in both the encoder and the decoder is applied at the document level, while the attention between encoder and decoder is applied between aligned sentences. For a mini-batch of sentence pairs with source sentence lengths {n_i} and target sentence lengths {m_i}, the number of connections is Σ_i k · n_i · log(Σ_i n_i / k) for the encoder, Σ_i k · m_i · log(Σ_i m_i / k) for the decoder, and Σ_i n_i · m_i for the attention between encoder and decoder.
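To make these counts concrete, the small calculation below plugs illustrative batch lengths into the formulas above, treating them as exact and using log base 2 (both are our assumptions; the asymptotic counts ignore constant factors):

import math

def doc_level_connections(src_lens, tgt_lens, k):
    n_total, m_total = sum(src_lens), sum(tgt_lens)
    enc = sum(k * n * math.log2(n_total / k) for n in src_lens)   # encoder self-attention
    dec = sum(k * m * math.log2(m_total / k) for m in tgt_lens)   # decoder self-attention
    cross = sum(n * m for n, m in zip(src_lens, tgt_lens))        # sentence-aligned cross-attention
    return int(enc), int(dec), cross

# e.g. a document of 16 aligned sentence pairs of ~30/35 tokens, k = 4
print(doc_level_connections([30] * 16, [35] * 16, k=4))
# full document-level self-attention would instead cost (16 * 30) ** 2 = 230400 encoder connections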
4.3.1 Document-Level Machine Translation

We conduct experiments on the TED Talks Chinese-to-English (Zh-En) dataset from IWSLT 2014 and 2015 (Cettolo et al., 2012, 2016); the average document length is 120 sentences. For each sentence, we take its preceding context of fixed length, together with the corresponding translations, as a single sample. The baseline models are HAN-NMT (Miculicich et al., 2018) and Transformer+cache (Tu et al., 2018). We follow the setting of Miculicich et al. (2018), with a vocabulary size of 30k for both Chinese and English, and use dev2010 for development and tst2010-2013 for testing. Unlike previous models, our model is trained from scratch and does not require pre-training on sentence-level translation.

Model                                   BLEU
Transformer (Vaswani et al., 2017)      16.87
Transformer+cache (Tu et al., 2018)     17.32
HAN-NMT (Miculicich et al., 2018)       17.78
Transformer (ours, single sentence)     18.91
BPT (k = 4, single sentence)            19.19
BPT (k = 4, l = 64)                     19.84

Table 5: BLEU scores on IWSLT 2015 Zh-En.

Table 5 shows that, with careful selection of hyper-parameters, a Transformer trained at the sentence level can beat the reported results of previous document-level models. BPT with k = 4 and a context length of 32 further improves the baseline by 0.93 BLEU, a significant margin. We also examine the effect of the context length and of k on the final BLEU score; the results are shown in Table 6. Similar to Tu et al. (2018) and Miculicich et al. (2018), we find that a small context length is enough for good performance on IWSLT document-level translation. However, unlike those models and the vanilla Transformer, the performance of BPT does not degrade as the context size increases, suggesting that the inductive bias encoded by BPT makes the model less likely to overfit.

Context length    0       32      64      128
Transformer       18.85   18.66   17.59   15.55
BPT (k = 4)       19.19   19.84   19.71   19.84
BPT (k = 8)       19.13   19.59   19.78   19.60

Table 6: BLEU score vs. context length for different models.

4.3.2 Sentence-Level Machine Translation

IWSLT is a relatively small dataset with 0.21M sentence pairs. To see whether BPT scales to large datasets, we train BPT on the WMT14 English-to-German dataset with 4.5M sentence pairs. We follow the same setting as Vaswani et al. (2017), but replace the Transformer encoder/decoder with a BPT encoder/decoder; the number of parameters remains the same. The baseline model is Transformer (base). We train the network for 40 epochs and average the last 10 checkpoints for decoding; the beam size is set to 5.

Model                                  BLEU
ByteNet (Kalchbrenner et al., 2016)    23.75
GNMT+RL (Wu et al., 2016)              24.6
ConvS2S (Gehring et al., 2017)         25.16
Transformer (Vaswani et al., 2017)     27.3
Transformer (our implementation)       27.2
BPT (k = 1)                            26.9
BPT (k = 2)                            27.4
BPT (k = 4)                            27.6
BPT (k = 8)                            26.7

Table 7: BLEU scores on newstest2014.

Table 7 reports the de-tokenized SacreBLEU scores (Post, 2018) of BPT and the vanilla Transformer on the newstest2014 test set (SacreBLEU signature: BLEU+c.mixed+l.en-de+#.1+s.exp+t.wmt14+tok.intl+v.1.4.1). With k = 2 and k = 4, BPT outperforms the vanilla Transformer with the same number of parameters and a sparse attention pattern. The best setting of BPT on WMT14 is k = 4, the same as the best setting on document-level machine translation (IWSLT) and on text classification (IMDB), suggesting that k = 4 is a good general setting for word-level NLP tasks, on both small and large datasets.

4.4 Throughput and GPU Memory Footprint

In theory, BPT improves the time/space complexity of the Transformer from O(d · n²) to O(d · k · n · log(n/k)), but such a speedup cannot be achieved with dense tensor-based attention operators. To address this problem, we designed a set of CUDA kernels for sparse attention (the speed of BPT could be improved further with better-optimized kernels). We compare the GPU memory footprint and throughput of BPT and the vanilla Transformer during inference under the same language-modeling setting (N = 6, d = 512, d_ff = 2048, h = 8). We set k to 1, 4, 16 and 64, covering the best settings for word-based tasks (k = 4) and character-based tasks (k = 64). We fix the number of tokens per batch to 8192 and vary the sequence length. Figures 6 and 7 show how GPU memory and throughput change as the sequence length increases.

Figure 6: GPU memory cost vs. sequence length (vanilla Transformer and BPT with k = 1, 4, 16, 64).

Figure 7: Throughput (tokens/s) vs. sequence length (vanilla Transformer and BPT with k = 1, 4, 16, 64).

BPT consistently uses less GPU memory than the Transformer, which makes it applicable to tasks that require modeling long sequences, such as time-series prediction. As for speed, BPT increases the number of nodes from n to 2n, which brings an additional overhead linear in the sequence length, so BPT is not as fast as the Transformer on short text. However, as the sequence length grows, the speed of BPT stays steady while the Transformer becomes too slow for practical use.
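As a rough sanity check on the trend in Figures 6 and 7, the ratio between the two complexity expressions above grows quickly with sequence length (treating the asymptotic counts as exact, which ignores constant factors and the extra span nodes mentioned above):

import math

d, k = 512, 64
for n in (512, 1024, 2048, 4096, 8192):
    full = d * n * n                       # O(d * n^2) for full attention
    bpt = d * k * n * math.log2(n / k)     # O(d * k * n * log(n / k)) for BPT
    print(f"n = {n:5d}   full / BPT ≈ {full / bpt:4.1f}x")
# ratios: ~2.7x at n = 512 up to ~18x at n = 8192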
5 Conclusion and Future Work

This paper introduces a hierarchical fine-to-coarse self-attention based model that is versatile and flexible for a variety of natural language processing tasks. By imposing structural inductive bias in this way, we are able to strike a balance between the power of the model and its training/computational efficiency.

This work can be extended in a number of interesting ways. The representations do not yet naturally capture syntactic and semantic meaning. Instead of only using the root and token representations, other intermediate representations could be exposed more directly.

References

Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2018. Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.

James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2017. Quasi-recurrent neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Conference of the European Association for Machine Translation, pages 261–268.

Mauro Cettolo, Niehues Jan, Stüker Sebastian, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. 2016. The IWSLT 2016 evaluation campaign. In International Workshop on Spoken Language Translation.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. URL https://openai.com/blog/sparse-transformers.

Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2017. Hierarchical multiscale recurrent neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1243–1252.

Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2019. Star-Transformer. arXiv preprint arXiv:1902.09113.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.

Ben Krause, Liang Lu, Iain Murray, and Steve Renals. 2016. Multiplicative LSTM for sequence modelling. arXiv preprint arXiv:1609.07959.

Jiwei Li, Minh-Thang Luong, Dan Jurafsky, and Eduard Hovy. 2015. When are tree structures necessary for deep learning of representations? arXiv preprint arXiv:1503.00185.

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.

Yang Liu and Mirella Lapata. 2019. Hierarchical transformers for multi-document summarization. arXiv preprint arXiv:1905.13164.

MultiMedia LLC. 2009. Large text compression benchmark.

Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 142–150. Association for Computational Linguistics.
Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305.

Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. 2018. Document-level neural machine translation with hierarchical attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2947–2954, Brussels, Belgium. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 464–468.

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. 2018. Bi-directional block self-attention for fast and memory-efficient sequence modeling. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive attention span in transformers. arXiv preprint arXiv:1905.07799.

Zhaopeng Tu, Yang Liu, Shuming Shi, and Tong Zhang. 2018. Learning to remember translation history with a continuous cache. Transactions of the Association for Computational Linguistics, 6:407–420.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.

Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, Ziyue Huang, Qipeng Guo, Hao Zhang, Haibin Lin, Junbo Zhao, Jinyang Li, Alexander J. Smola, and Zheng Zhang. 2019. Deep Graph Library: Towards efficient and scalable deep learning on graphs. CoRR, abs/1909.01315.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489. Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutnı́k, and Jürgen Schmidhuber. 2017. Recurrent highway networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 4189–4198. A Appendix A.1 Implementation Details We use Deep Graph Library (Wang et al., 2019) for building Binary Partition graphs. The following table summarizes the hyperparameters used in BPT. notation meaning Btok Bsent N M h k demb d df f pi ph pa pc avg steps epochs number of tokens in a batch number of sentences in a batch number of (encoder) layers. number of (decoder) layers. number of heads. connection density in BPT embedding size hidden size of the model filter size in FFN sublayer dropout rate on embedding layer dropout rate on hidden layers dropout rate on attention weight dropout rate before classifier model average checkpoints training steps training epochs Table 8, 9 and 10 lists the hyper-parameters we use in Text Classification, language Modeling and Machine Translation respectively. N demb d df f h pi ph pa pc Bsent epochs SST IMDB 4 300 300 600 6 0.4 0.1 0.3 0.4 1024 40 5 300 300 600 6 0.5 0.1 0.3 0.5 32 40 Table 8: Hyper-parameters for Text Classification N demb d df f h pi ph pa pc Btok steps enwik8 text8 12 512 512 2048 8 0.1 0.1 0.1 0.1 32768 400000 12 512 512 2048 8 0.1 0.1 0.1 0.1 32768 600000 Table 9: Hyper-parameters for Language Modeling N M demb d df f h pi ph pa pc Bsent avg epochs IWSLT WMT 6 6 512 512 2048 8 0.1 0.1 0.1 0.1 128 10 40 6 6 512 512 2048 8 0.1 0.1 0.1 0.1 1024 10 40 Table 10: Hyper-parameters for Machine Translation For full details please refer to the configurations in our source code: https://github.com/ yzh119/BPT/tree/master/configs.