Published as a conference paper at ICLR 2021
RANDOM FEATURE ATTENTION
Hao Peng♠∗  Nikolaos Pappas♠  Dani Yogatama♣  Roy Schwartz♥  Noah A. Smith♠♦  Lingpeng Kong♦∗
♠ Paul G. Allen School of Computer Science & Engineering, University of Washington
♣ DeepMind   ♦ Allen Institute for Artificial Intelligence
♥ School of Computer Science & Engineering, Hebrew University of Jerusalem
♦ Department of Computer Science, The University of Hong Kong
{hapeng,npappas,nasmith}@cs.washington.edu
dyogatama@google.com, roys@cs.huji.ac.il, lpk@cs.hku.hk
ABSTRACT
Transformers are state-of-the-art models for a variety of sequence modeling tasks.
At their core is an attention function which models pairwise interactions between
the inputs at every timestep. While attention is powerful, it does not scale efficiently to long sequences due to its quadratic time and space complexity in the
sequence length. We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, and explore its application in transformers. RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism. Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines. In the machine translation experiment, RFA decodes twice as fast as a vanilla transformer. Compared to existing efficient transformer variants, RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets. Our analysis shows that RFA's efficiency gains are especially notable on long sequences, suggesting that RFA will be particularly useful in tasks that require working with large inputs, fast
decoding speed, or low memory footprints.
1 INTRODUCTION
Transformer architectures (Vaswani et al., 2017) have achieved tremendous success on a variety of
sequence modeling tasks (Ott et al., 2018; Radford et al., 2018; Parmar et al., 2018; Devlin et al.,
2019; Parisotto et al., 2020, inter alia). Under the hood, the key component is attention (Bahdanau
et al., 2015), which models pairwise interactions of the inputs, regardless of their distances from each
other. This comes with quadratic time and memory costs, making the transformers computationally
expensive, especially for long sequences. A large body of research has been devoted to improving
their time and memory efficiency (Tay et al., 2020c). Although better asymptotic complexity and
prominent gains for long sequences have been achieved (Lee et al., 2019; Child et al., 2019; Beltagy
et al., 2020, inter alia), in practice, many existing approaches are less well-suited for moderate-length ones: the additional computation steps required by some approaches can overshadow the
time and memory they save (Kitaev et al., 2020; Wang et al., 2020; Roy et al., 2020, inter alia).
This work proposes random feature attention (RFA), an efficient attention variant that scales linearly in sequence length in terms of time and space, and achieves practical gains for both long and moderate-length sequences. RFA builds on a kernel perspective of softmax (Rawat et al., 2019). Using well-established random feature maps (Rahimi & Recht, 2007; Avron et al., 2016; §2), RFA approximates the dot-then-exponentiate function with a kernel trick (Hofmann et al., 2008): exp(x · y) ≈ φ(x) · φ(y). Inspired by its connections to gated recurrent neural networks (Hochreiter & Schmidhuber, 1997; Cho et al., 2014) and fast weights (Schmidhuber, 1992), we further augment RFA with an optional gating mechanism, offering a straightforward way of learning with
recency bias when locality is desired.
∗ The majority of this work was done while these authors were at DeepMind.
RFA and its gated variant (§3) can be used as a drop-in substitute for the canonical softmax attention, and increase the number of parameters by less than 0.1%. We explore its applications in transformers on language modeling, machine translation, and long text classification (§4). Our experiments show that RFA achieves comparable performance to vanilla transformer baselines in all tasks, while outperforming a recent related approach (Katharopoulos et al., 2020). The gating mechanism proves particularly useful in language modeling: the gated variant of RFA outperforms the transformer baseline on WikiText-103. RFA shines in decoding, even for shorter sequences. In our head-to-head comparison on machine translation benchmarks, RFA decodes around 2× faster than a transformer baseline, without accuracy loss. Comparisons to several recent efficient transformer variants on three long text classification datasets show that RFA is competitive in terms of both accuracy and
efficiency. Our analysis (§5) shows that more significant time and memory efficiency improvements
can be achieved for longer sequences: 12× decoding speedup with less than 10% of the memory for
2,048-length outputs.
2 BACKGROUND

2.1 ATTENTION IN SEQUENCE MODELING
The attention mechanism (Bahdanau et al., 2015) has been widely used in many sequence modeling
tasks. Its dot-product variant is the key building block for the state-of-the-art transformer architectures (Vaswani et al., 2017). Let {q_t}_{t=1}^N denote a sequence of N query vectors that attend to sequences of M key and value vectors. At each timestep, the attention linearly combines the values weighted by the outputs of a softmax:
\[
\operatorname{attn}\left(q_t, \{k_i\}, \{v_i\}\right) = \sum_i \frac{\exp\left(q_t \cdot k_i / \tau\right)}{\sum_j \exp\left(q_t \cdot k_j / \tau\right)}\, v_i^\top. \tag{1}
\]
τ is the temperature hyperparameter determining how “flat” the softmax is (Hinton et al., 2015).1
Calculating attention for a single query takes O(M ) time and space. For the full sequence of N
queries the space amounts to O(M N ). When the computation cannot be parallelized across the
queries, e.g., in autoregressive decoding, the time complexity is quadratic in the sequence length.
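As a concrete reference point for Eq. 1, the following NumPy sketch computes softmax attention for a single query (an illustrative example with arbitrary dimensions, not an excerpt from our implementation; all function names are ours):

```python
import numpy as np

def softmax_attn(q_t, K, V, tau=1.0):
    """Softmax attention for one query (Eq. 1).

    q_t: (d,) query; K: (M, d) keys; V: (M, d) values.
    Returns the values combined with weights softmax(q_t . k_i / tau).
    """
    scores = K @ q_t / tau             # (M,) dot products q_t . k_i / tau
    scores = scores - scores.max()     # shift for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()  # softmax over the M keys
    return weights @ V                 # (d,) weighted combination of values

# Toy usage: M = 5 keys/values, head size d = 4.
rng = np.random.default_rng(0)
q = rng.normal(size=4)
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
out = softmax_attn(q, K, V)
```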
2.2 RANDOM FEATURE METHODS
The theoretical backbone of this work is the unbiased estimation of the Gaussian kernel by Rahimi
& Recht (2007). Based on Bochner’s theorem (Bochner, 1955), Rahimi & Recht (2007) proposed
random Fourier features to approximate a desired shift-invariant kernel. The method nonlinearly
transforms a pair of vectors x and y using a random feature map φ; the inner product between
φ(x) and φ(y) approximates the kernel evaluation on x and y. More precisely:
Theorem 1 (Rahimi & Recht, 2007). Let φ : R^d → R^{2D} be a nonlinear transformation:
\[
\phi(x) = \sqrt{1/D}\,\left[\sin(w_1 \cdot x), \ldots, \sin(w_D \cdot x), \cos(w_1 \cdot x), \ldots, \cos(w_D \cdot x)\right]^\top. \tag{2}
\]
When d-dimensional random vectors w_i are independently sampled from N(0, σ²I_d),
\[
\mathbb{E}_{w_i}\left[\phi(x) \cdot \phi(y)\right] = \exp\left(-\|x - y\|^2 / 2\sigma^2\right). \tag{3}
\]
The variance of this estimate is inversely proportional to D (Appendix A.2; Yu et al., 2016).
Random feature methods have proved successful in speeding up kernel methods (Oliva et al., 2015; Avron et al., 2017; Sun, 2019, inter alia), and have more recently been used to efficiently approximate softmax (Rawat et al., 2019). In §3.1, we use them to derive an unbiased estimate of exp(⟨·, ·⟩) and then an efficient approximation to softmax attention.
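The construction in Theorem 1 can be sketched in a few lines of NumPy. This is an illustrative example for the unit-variance case σ = 1 only (not an excerpt from our implementation; the names and sizes are ours):

```python
import numpy as np

def random_feature_map(x, W):
    """phi(x) from Eq. 2. W has shape (D, d); its rows are the samples w_i.
    Returns the (2D,)-dimensional vector of scaled sines and cosines."""
    proj = W @ x
    return np.concatenate([np.sin(proj), np.cos(proj)]) / np.sqrt(W.shape[0])

# Monte Carlo check of Theorem 1 in the unit-variance case (sigma = 1):
# phi(x) . phi(y) should be close to exp(-||x - y||^2 / 2) for large D.
rng = np.random.default_rng(0)
d, D = 8, 20000
x = rng.normal(size=d); x /= np.linalg.norm(x)   # unit-norm inputs, as in §3
y = rng.normal(size=d); y /= np.linalg.norm(y)
W = rng.normal(size=(D, d))                      # w_i ~ N(0, I_d)
approx = random_feature_map(x, W) @ random_feature_map(y, W)
exact = np.exp(-np.sum((x - y) ** 2) / 2.0)
print(approx, exact)                             # the two numbers roughly agree
```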
3 MODEL

This section presents RFA (§3.1) and its gated variant (§3.2). In §3.3 we lay out several design choices and relate RFA to prior work. We close by analyzing RFA's complexity in practical terms (§3.4).
1 M = N in self-attention; they may differ, e.g., in the cross attention of a sequence-to-sequence model.
[Figure 1 diagram: (a) softmax attention, which forms O(MN) interactions between the queries {q_i} and keys {k_i}; (b) random feature attention, which applies φ(·) to queries and keys, with O(M) and O(N) cost.]
Figure 1: Computation graphs for softmax attention (left) and random feature attention (right). Here,
we assume cross attention with source length M and target length N .
3.1 RANDOM FEATURE ATTENTION
RFA builds on an unbiased estimate of exp(⟨·, ·⟩) from Theorem 1, which we begin with:
\[
\begin{aligned}
\exp\left(x \cdot y / \sigma^2\right) &= \exp\left(\|x\|^2/2\sigma^2 + \|y\|^2/2\sigma^2\right) \exp\left(-\|x - y\|^2/2\sigma^2\right) \\
&\approx \exp\left(\|x\|^2/2\sigma^2 + \|y\|^2/2\sigma^2\right)\, \phi(x) \cdot \phi(y).
\end{aligned} \tag{4}
\]
The last line does not have any nonlinear interaction between φ(x) and φ(y), allowing for a linear
time/space approximation to attention. For clarity we assume the query and keys are unit vectors.2
\[
\begin{aligned}
\operatorname{attn}\left(q_t, \{k_i\}, \{v_i\}\right)
&= \sum_i \frac{\exp\left(q_t \cdot k_i / \sigma^2\right)}{\sum_j \exp\left(q_t \cdot k_j / \sigma^2\right)}\, v_i^\top
\approx \sum_i \frac{\phi(q_t)^\top \phi(k_i)\, v_i^\top}{\sum_j \phi(q_t) \cdot \phi(k_j)} \\
&= \frac{\phi(q_t)^\top \sum_i \phi(k_i) \otimes v_i}{\phi(q_t) \cdot \sum_j \phi(k_j)}
= \operatorname{RFA}\left(q_t, \{k_i\}, \{v_i\}\right).
\end{aligned} \tag{5}
\]
⊗ denotes the outer product between vectors, and σ² corresponds to the temperature term τ in Eq. 1. RFA can be used as a drop-in replacement for softmax attention.
(a) The input is revealed in full to cross attention and encoder self-attention. Here RFA calculates attention using Eq. 5.
(b) In causal attention RFA attends only to the prefix.3 This allows for a recurrent computation. The tuple (S_t ∈ R^{2D×d}, z_t ∈ R^{2D}) is used as the "hidden state" at time step t to keep track of the history, similar to those in RNNs. Then RFA(q_t, {k_i}_{i≤t}, {v_i}_{i≤t}) = φ(q_t)^⊤ S_t / (φ(q_t) · z_t), where
\[
S_t = S_{t-1} + \phi(k_t) \otimes v_t, \qquad z_t = z_{t-1} + \phi(k_t). \tag{6}
\]
2D denotes the size of φ(·). Appendix A.1 summarizes the computation procedure of RFA, and Figure 1 compares it against the softmax attention. Appendix A.3 derives causal RFA in detail.
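The following NumPy sketch spells out both the non-causal computation of Eq. 5 and the recurrent causal computation of Eq. 6. It is illustrative only, written under the unit-norm assumption with made-up sizes; the function names are ours and it is not our actual implementation:

```python
import numpy as np

def phi(x, W):
    """Random feature map of Eq. 2; W is (D, d), output has size 2D."""
    proj = W @ x
    return np.concatenate([np.sin(proj), np.cos(proj)]) / np.sqrt(W.shape[0])

def rfa(q_t, K, V, W):
    """Non-causal RFA (Eq. 5): one query against M keys/values in O(M) time."""
    phi_q = phi(q_t, W)                            # (2D,)
    S = np.zeros((2 * W.shape[0], V.shape[1]))     # sum_i phi(k_i) outer v_i
    z = np.zeros(2 * W.shape[0])                   # sum_j phi(k_j)
    for k_i, v_i in zip(K, V):
        S += np.outer(phi(k_i, W), v_i)
        z += phi(k_i, W)
    return (phi_q @ S) / (phi_q @ z)               # (d,)

def causal_rfa(Q, K, V, W):
    """Causal RFA (Eq. 6): recurrent updates of the hidden state (S_t, z_t)."""
    two_D, d = 2 * W.shape[0], V.shape[1]
    S, z = np.zeros((two_D, d)), np.zeros(two_D)
    outputs = []
    for q_t, k_t, v_t in zip(Q, K, V):
        S = S + np.outer(phi(k_t, W), v_t)         # S_t = S_{t-1} + phi(k_t) (x) v_t
        z = z + phi(k_t, W)                        # z_t = z_{t-1} + phi(k_t)
        phi_q = phi(q_t, W)
        outputs.append((phi_q @ S) / (phi_q @ z))
    return np.stack(outputs)                       # (N, d)

# Toy usage with unit-norm queries and keys (d = 4, D = 16, N = 6).
rng = np.random.default_rng(0)
d, D, N = 4, 16, 6
W = rng.normal(size=(D, d))
Q = rng.normal(size=(N, d)); Q /= np.linalg.norm(Q, axis=-1, keepdims=True)
K = rng.normal(size=(N, d)); K /= np.linalg.norm(K, axis=-1, keepdims=True)
V = rng.normal(size=(N, d))
cross_out = rfa(Q[0], K, V, W)       # attend to all N positions
causal_out = causal_rfa(Q, K, V, W)  # attend only to the prefix at each step
```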
Analogously to the softmax attention, RFA has its multiheaded variant (Vaswani et al., 2017). In our experiments we use causal RFA in a transformer language model (§4.1), and both cross and causal RFA in the decoder of a sequence-to-sequence machine translation model.
3.2 RFA-GATE: LEARNING WITH RECENCY BIAS
The canonical softmax attention does not have any explicit modeling of distance or locality. In
learning problems where such inductive bias is crucial (Ba et al., 2016; Parmar et al., 2018; Miconi
et al., 2018; Li et al., 2019, inter alia), transformers heavily rely on positional encodings. In response, many approaches have been proposed, e.g., learning the attention spans (Sukhbaatar et al., 2019; Wu et al., 2020), and enhancing the attention computation with recurrent (Hao et al., 2019; Chen et al., 2019) or convolutional (Wu et al., 2019; Mohamed et al., 2019) components.
2 This can be achieved by ℓ2-normalizing the query and keys. See §3.3 for a related discussion.
3 It is also sometimes called "decoder self-attention" or "autoregressive attention."
RFA faces the same issue, but its causal attention variant (Eq. 6) offers a straightforward way of learning with recency bias. We draw inspiration from its connections to RNNs, and augment RFA with a learned gating mechanism (Hochreiter & Schmidhuber, 1997; Cho et al., 2014; Peng et al., 2018, inter alia):
\[
g_t = \operatorname{sigmoid}(w_g \cdot x_t + b_g), \quad
S_t = g_t\, S_{t-1} + (1 - g_t)\, \phi(k_t) \otimes v_t, \quad
z_t = g_t\, z_{t-1} + (1 - g_t)\, \phi(k_t). \tag{7}
\]
w_g and b_g are learned parameters, and x_t is the input representation at timestep t.4 By multiplying the learned scalar gates 0 < g_t < 1 against the hidden state (S_t, z_t), history is exponentially decayed, favoring more recent context.
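A minimal sketch of the gated recurrence in Eq. 7 (illustrative NumPy, not our implementation; here x_t is taken to be the key vector as an arbitrary stand-in, and w_g, b_g would be learned in practice):

```python
import numpy as np

def phi(x, W):
    """Random feature map of Eq. 2; W is (D, d), output has size 2D."""
    proj = W @ x
    return np.concatenate([np.sin(proj), np.cos(proj)]) / np.sqrt(W.shape[0])

def gated_causal_rfa(Q, K, V, X, W, w_g, b_g):
    """Gated causal RFA (Eq. 7): a scalar gate g_t decays the hidden state."""
    two_D, d = 2 * W.shape[0], V.shape[1]
    S, z = np.zeros((two_D, d)), np.zeros(two_D)
    outputs = []
    for q_t, k_t, v_t, x_t in zip(Q, K, V, X):
        g_t = 1.0 / (1.0 + np.exp(-(w_g @ x_t + b_g)))  # sigmoid gate in (0, 1)
        S = g_t * S + (1.0 - g_t) * np.outer(phi(k_t, W), v_t)
        z = g_t * z + (1.0 - g_t) * phi(k_t, W)
        phi_q = phi(q_t, W)
        outputs.append((phi_q @ S) / (phi_q @ z))
    return np.stack(outputs)

# Toy usage: the input representation x_t is simply the key here.
rng = np.random.default_rng(0)
d, D, N = 4, 16, 6
W, w_g, b_g = rng.normal(size=(D, d)), rng.normal(size=d), 0.0
Q = rng.normal(size=(N, d)); Q /= np.linalg.norm(Q, axis=-1, keepdims=True)
K = rng.normal(size=(N, d)); K /= np.linalg.norm(K, axis=-1, keepdims=True)
V = rng.normal(size=(N, d))
out = gated_causal_rfa(Q, K, V, K, W, w_g, b_g)
```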
The gating mechanism shows another benefit of RFA: it would otherwise be more difficult to build similar techniques into the softmax attention, where there is no clear sense of "recurrence" (Appendix A.5). It proves useful in our language modeling experiments (§4.1).
3.3 DISCUSSION
On query and key norms, and learned random feature variance. Eq. 5 assumes both the query and keys are of norm-1. It therefore approximates a softmax attention that normalizes the queries and keys before multiplying them, and then scales the logits by dividing them by σ². Empirically, this normalization step scales down the logits (Vaswani et al., 2017) and enforces that −1 ≤ q^⊤k ≤ 1. In consequence, the softmax outputs would be "flattened" if not for σ, which can be set a priori as a hyperparameter (Yu et al., 2016; Avron et al., 2017; Sun, 2019, inter alia). Here we instead learn it from data with the reparameterization trick (Kingma & Welling, 2014):
\[
\widetilde{w}_i \sim \mathcal{N}(0, I_d), \qquad w_i = \sigma \circ \widetilde{w}_i. \tag{8}
\]
I_d is the d × d identity matrix, and ◦ denotes the elementwise product between vectors. The d-dimensional vector σ is learned, but the random vectors w̃_i are not.5
This norm-1 constraint is never mandatory. Rather, we employ it for notational clarity and easier implementation. In preliminary experiments we find it has little impact on the performance when σ is set properly or learned from data. Eq. 12 in Appendix A presents RFA without imposing it.
Going beyond the Gaussian kernel. More broadly, random feature methods can be applied to a family of shift-invariant kernels, with the Gaussian kernel being one of them. In the same family, the order-1 arc-cosine kernel (Cho & Saul, 2009) can be approximated with the feature map φ_arccos(x) = √(1/D) [ReLU(w_1 · x), . . . , ReLU(w_D · x)]^⊤ (Alber et al., 2017).6 In our experiments, the Gaussian and arc-cosine variants achieve similar performance. This supplements the exploration of alternatives to softmax in attention (Tsai et al., 2019; Gao et al., 2019).
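Both choices are easy to sketch (illustrative NumPy only, not our implementation; the σ vector below is a fixed stand-in for the learned parameter, and the names are ours):

```python
import numpy as np

def sample_w(sigma, D, rng):
    """Eq. 8: w_i = sigma ∘ w~_i with w~_i ~ N(0, I_d); sigma is a d-dimensional
    (learned) vector and ∘ is the elementwise product."""
    w_tilde = rng.normal(size=(D, sigma.shape[0]))
    return sigma[None, :] * w_tilde                 # (D, d)

def phi_arccos(x, W):
    """Order-1 arc-cosine feature map: ReLU in place of sin/cos, so the output
    has size D rather than 2D."""
    proj = W @ x
    return np.maximum(proj, 0.0) / np.sqrt(W.shape[0])

rng = np.random.default_rng(0)
sigma = np.full(8, 0.5)                             # stand-in for the learned vector
W = sample_w(sigma, D=32, rng=rng)
features = phi_arccos(rng.normal(size=8), W)        # (32,)
```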
Relations to prior work. Katharopoulos et al. (2020) inspire the causal attention variant of RFA. They use a feature map based on the exponential linear unit activation (Clevert et al., 2016): elu(·) + 1. It significantly underperforms both the baseline and RFA in our controlled experiments, showing the importance of a properly-chosen feature map. Random feature approximation of attention is also explored by concurrent work (Choromanski et al., 2021), with applications in masked language modeling for proteins. They propose positive random features to approximate softmax, aiming for a lower variance in critical regions. RFA instead normalizes the queries and keys before random projection to reduce variance. Going beyond both, RFA establishes the benefits of random feature methods as a more universal substitute for softmax across all attention variants, facilitating its applications in, e.g., sequence-to-sequence learning.
4 In multihead attention (Vaswani et al., 2017), k_t and v_t are calculated from x_t using learned affine transformations.
5 This departs from Eq. 2 by lifting the isotropic assumption imposed on the Gaussian distribution: note the difference between the vector σ in Eq. 8 and the scalar σ in Eq. 3. We find this improves the performance in practice (§4), even though the same result in Theorem 1 may not directly apply.
6 Apart from replacing the sinusoid functions with ReLU, it constructs w_i in the same way as Eq. 8.
There are interesting connections between gated RFA and fast weights (Schmidhuber, 1992; 1993; Ba et al., 2016; Miconi et al., 2018, inter alia). Emphasizing recent patterns, they learn a temporal memory to store history similarly to Eq. 7. The main difference is that RFA additionally normalizes the output using φ(q_t) · z as in Eq. 6, a by-product of approximating softmax's partition function. It is intriguing to study the role of this normalization term, which we leave to future work.
3.4 COMPLEXITY ANALYSIS
Time. Scaling linearly in the sequence length, RFA needs less computation (in terms of number of operations) for long sequences. This implies speedup wherever the quadratic-time softmax attention cannot be fully parallelized across time steps. More specifically:
• Significant speedup can be expected in autoregressive decoding, both conditional (e.g.,
machine translation) and unconditional (e.g., sampling from a language model). For example, 1.9× speedup is achieved in our machine translation experiments (§4.2); and more for
longer sequences (e.g., 12× for 2,048-length ones; §5).
• Some applications (e.g., language modeling, text classification) reveal inputs to the model in full.7 When there are enough threads to parallelize softmax attention across time steps, hardly any speedup from RFA can be achieved; when there are not, typically for very long sequences (>1,000), substantial speed gain is possible. For example, RFA does not achieve any speedup when working with 512-length context (§4.1), but achieves a 5.3× speedup with 4,000-length context (§4.3).
Memory. Asymptotically, RFA has a better memory efficiency than its softmax counterpart (linear vs. quadratic). To reach a more practical conclusion, we include in our analysis the cost of the feature maps. φ's memory overhead largely depends on its size D. For example, let us consider the cross attention of a decoder. RFA uses O(4D + 2Dd) space to store φ(q_t), Σ_i φ(k_i) ⊗ v_i, and Σ_i φ(k_i) (Eq. 5; line 12 of Algo. 2).8 In contrast, softmax cross attention stores the encoder outputs with O(Md) memory, with M being the source length. In this case RFA has a lower memory overhead when 2D ≪ M. Typically D should be no less than d in order for a reasonable approximation (Yu et al., 2016); in a transformer model, d is the size of an attention head, which is usually around 64 or 128 (Vaswani et al., 2017; Ott et al., 2018). This suggests that RFA can achieve significant memory saving with longer sequences, which is supported by our empirical analysis in §5. Further, using moderate-sized feature maps is also desirable, so that their overhead does not overshadow the time and memory RFA saves. We experiment with D at d and 2d; the benefit of using D > 2d is marginal.
Appendix A.6 discusses the time and space complexity in more detail, and Appendix C.2 studies the
effect of random feature size on performance.
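A back-of-envelope illustration of the memory comparison above, counting stored elements per attention head for decoder cross attention (the sizes here are assumed for illustration only):

```python
# Per-head element counts for decoder cross attention, following §3.4.
D, d = 128, 64                     # random feature size and head size (assumed)
rfa_state = 4 * D + 2 * D * d      # phi(q_t), sum_i phi(k_i) (x) v_i, sum_i phi(k_i)
for M in (256, 512, 1024, 2048):   # source lengths
    softmax_state = M * d          # cached encoder outputs per head
    print(M, rfa_state, softmax_state, round(rfa_state / softmax_state, 2))
```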
4 EXPERIMENTS

We evaluate RFA on language modeling, machine translation, and long text classification.
4.1 LANGUAGE MODELING
Setting. We experiment with WikiText-103 (Merity et al., 2017). It is based on English Wikipedia.
Table 5 in Appendix B summarizes some of its statistics. We compare the following models:
• BASE is our implementation of the strong transformer-based language model by Baevski & Auli (2019).
• RFA builds on BASE, but replaces the softmax attention with random feature attention. We experiment with both Gaussian and arc-cosine kernel variants.
• RFA-GATE additionally learns a sigmoid gate on top of RFA (§3.2). It also has a Gaussian kernel variant and an arc-cosine kernel one.9
• φelu is a baseline to RFA. Instead of the random feature methods it uses the elu(·) + 1 feature map, as in Katharopoulos et al. (2020).
7 A causal mask is usually used to prevent the model from accessing future tokens in language models.
8 RFA never constructs the M × 2D × d tensor [φ(k_i) ⊗ v_i]_i, but sequentially processes the sequence.
9 This gating technique is specific to RFA variants, in the sense that it is less intuitive to apply it in BASE.
To ensure fair comparisons, we use comparable implementations, tuning, and training procedure. All
models use a 512 block size during both training and evaluation, i.e., they read as input a segment of
512 consecutive tokens, without access to the context from previous mini-batches. RFA variants use
64-dimensional random feature maps. We experiment with two model size settings, small (around
38M parameters) and big (around 242M parameters); they are described in Appendix B.1 along with
other implementation details.
Model                               Small Dev.  Small Test  Big Dev.  Big Test
BASE                                33.0        34.5        24.5      26.2
φelu (Katharopoulos et al., 2020)   38.4        40.1        28.7      30.2
RFA-Gaussian                        33.6        35.7        25.8      27.5
RFA-arccos                          36.0        37.7        26.4      28.1
RFA-GATE-Gaussian                   31.3        32.7        23.2      25.0
RFA-GATE-arccos                     32.8        34.0        24.8      26.3
RFA-GATE-Gaussian-Stateful          29.4        30.5        22.0      23.5

Table 1: Language model perplexity (lower is better) on the WikiText-103 development and test sets. Bolded numbers outperform BASE.
Results. Table 1 compares the models' performance in perplexity on WikiText-103 development and test data. Both kernel variants of RFA, without gating, outperform φelu by more than 2.4 and 2.1 test perplexity for the small and big model respectively, confirming the benefits from using random feature approximation.10 Yet both underperform BASE, with RFA-Gaussian having a smaller gap. Comparing RFA against its gated variants, a more than 1.8 perplexity improvement can be attributed to the gating mechanism; and the gap is larger for small models. Notably, RFA-GATE-Gaussian outperforms BASE under both size settings by at least 1.2 perplexity. In general, RFA models with Gaussian feature maps outperform their arc-cosine counterparts.11 From the analysis in §3.4 we would not expect speedup by RFA models, nor do we see any in the experiments.12
Closing this section, we explore a "stateful" variant of RFA-GATE-Gaussian. It passes the last hidden state (S_t, z_t) to the next mini-batch during both training and evaluation, a technique commonly used in RNN language models (Merity et al., 2018). This is a consequence of RFA's RNN-style computation, and is less straightforward to apply in the vanilla transformer models.13 From the last row of Table 1 we see that this brings a more than 1.5 test perplexity improvement.
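A minimal sketch of this stateful computation (illustrative NumPy with a simplified single-head causal RFA; names and sizes are ours, and this is not our actual implementation):

```python
import numpy as np

def phi(x, W):
    proj = W @ x
    return np.concatenate([np.sin(proj), np.cos(proj)]) / np.sqrt(W.shape[0])

def segment_rfa(Q, K, V, W, state=None):
    """Causal RFA over one segment, starting from the carried-over (S, z).
    `state` is the hidden state left by the previous mini-batch (None at start)."""
    two_D, d = 2 * W.shape[0], V.shape[1]
    S, z = state if state is not None else (np.zeros((two_D, d)), np.zeros(two_D))
    outputs = []
    for q_t, k_t, v_t in zip(Q, K, V):
        S = S + np.outer(phi(k_t, W), v_t)
        z = z + phi(k_t, W)
        phi_q = phi(q_t, W)
        outputs.append((phi_q @ S) / (phi_q @ z))
    return np.stack(outputs), (S, z)

# Two consecutive 8-token segments: the second resumes from the first's state.
rng = np.random.default_rng(0)
d, D = 4, 16
W = rng.normal(size=(D, d))
def unit(n):                          # unit-norm toy queries/keys
    x = rng.normal(size=(n, d))
    return x / np.linalg.norm(x, axis=-1, keepdims=True)
out1, state = segment_rfa(unit(8), unit(8), rng.normal(size=(8, d)), W)
out2, state = segment_rfa(unit(8), unit(8), rng.normal(size=(8, d)), W, state=state)
```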
4.2 MACHINE TRANSLATION
Datasets. We experiment with three standard machine translation datasets.
• WMT14 EN-DE and EN-FR (Bojar et al., 2014). Our data split and preprocessing follow those
of Vaswani et al. (2017). We share the source and target vocabularies within each language pair,
with 32,768 byte pair encoding types (BPE; Sennrich et al., 2016).
• IWSLT14 DE-EN (Cettolo et al., 2014) is based on TED talks. The preprocessing follows Edunov
et al. (2018). Separate vocabularies of 9K/7K BPE types are used for the source and target.
Table 5 in Appendix B summarizes some statistics of the datasets.
10 All models are trained for 150K steps; this could be part of the reason behind the suboptimal performance of φelu: it may need 3 times more gradient updates to reach similar performance to the softmax attention baseline (Katharopoulos et al., 2020).
11 We observe that RFA Gaussian variants are more stable and easier to train than the arc-cosine ones as well as φelu. We conjecture that this is because the outputs of the Gaussian feature maps have an ℓ2-norm of 1, which can help stabilize training. To see why, sin²(x) + cos²(x) = cos(x − x) = 1.
12 In fact, RFA trains around 15% slower than BASE due to the additional overhead from the feature maps.
13 Some transformer models use a text segment from the previous mini-batch as a prefix (Baevski & Auli, 2019; Dai et al., 2019). Unlike RFA, this gives the model access to only a limited amount of context, and significantly increases the memory overhead.
Setting. We compare the RFA variants described in §4.1. They build on a BASE model that is our implementation of the base-sized transformer (Vaswani et al., 2017). All RFA models apply random feature attention in decoder cross and causal attention, but use softmax attention in encoders. This setting yields the greatest decoding time and memory savings (§3.4). We use 128/64 for D in cross/causal attention. RFA-GATE learns sigmoid gates in the decoder causal attention. The φelu baseline uses the same setting and applies the feature map in both decoder cross and causal attention, but not in the encoders. Further details are described in Appendix B.2.
Model                               WMT14 EN-DE  WMT14 EN-FR  IWSLT14 DE-EN  Speed
BASE                                28.1         39.0         34.6           1.0×
φelu (Katharopoulos et al., 2020)   21.3         34.0         29.9           2.0×
RFA-Gaussian                        28.0         39.2         34.5           1.8×
RFA-arccos                          28.1         38.9         34.4           1.9×
RFA-GATE-Gaussian                   28.1         39.0         34.6           1.8×
RFA-GATE-arccos                     28.2         39.2         34.4           1.9×

Table 2: Machine translation test set BLEU. The decoding speed (last column) is relative to BASE. All models are tested on a single TPU v2 accelerator, with batch size 32.
Results. Table 2 compares the models' test set BLEU on three machine translation datasets. Overall both Gaussian and arc-cosine variants of RFA achieve similar performance to BASE on all three datasets, significantly outperforming Katharopoulos et al. (2020). Differently from the trends in the language modeling experiments, here the gating mechanism does not lead to substantial gains. Notably, all RFA variants decode more than 1.8× faster than BASE.
4.3 LONG TEXT CLASSIFICATION
We further evaluate RFA's accuracy and efficiency when used as text encoders on three NLP tasks from the recently proposed Long Range Arena benchmark (Tay et al., 2021), designed to evaluate efficient transformer variants on tasks that require processing long sequences.14
Experimental setting and datasets. We compare RFA against baselines on the following datasets:
• ListOps (LO; Nangia & Bowman, 2018) aims to diagnose the capability of modeling hierarchically structured data. Given a sequence of operations on single-digit integers, the model predicts the solution, also a single-digit integer. It is formulated as a 10-way classification task. We follow Tay et al. (2021) and consider sequences with 500–2,000 symbols.
• Character-level text classification with the IMDb movie review dataset (Maas et al., 2011).
This is a binary sentiment classification task.
• Character-level document retrieval with the ACL Anthology Network (AAN; Radev et al.,
2009) dataset. The model classifies whether there is a citation between a pair of papers.
To ensure fair comparisons, we implement RFA on top of the transformer baseline by Tay et al. (2021), and closely follow their preprocessing, data split, model size, and training procedure. Speed and memory are evaluated on the IMDb dataset. For our RFA model, we use D = 64 for the IMDb dataset, and D = 128 for the others. We refer the readers to Tay et al. (2021) for further details.
Results. From Table 3 we can see that RFA outperforms the transformer baseline on two out of the three datasets, achieving the best performance on IMDb with 66% accuracy. Averaging across the three datasets, RFA outperforms the transformer by 0.3% accuracy, second only to Zaheer et al. (2020) with a 0.1% accuracy gap. In terms of time and memory efficiency, RFA is among the strongest. RFA speeds up over the transformer by 1.1–5.3×, varying by sequence length. Importantly, compared to the only two baselines that perform comparably to the baseline transformer model (Tay et al., 2020a; Zaheer et al., 2020), RFA has a clear advantage in both speed and memory efficiency, and is the only model that is competitive in both accuracy and efficiency.
14 https://github.com/google-research/long-range-arena
Model                        LO    IMDb  AAN   Avg.   Speed (1K/2K/3K/4K)   Memory (1K/2K/3K/4K)
Transformer                  36.4  64.3  57.5  52.7   1.0/1.0/1.0/1.0       1.00/1.00/1.00/1.00
Wang et al. (2020)           35.7  53.9  52.3  47.3   1.2/1.9/3.7/5.5       0.44/0.21/0.18/0.10
Kitaev et al. (2020)         37.3  56.1  53.4  48.9   0.5/0.4/0.7/0.8       0.56/0.37/0.28/0.24
Tay et al. (2020b)           17.1  63.6  59.6  46.8   1.1/1.6/2.9/3.8       0.55/0.31/0.20/0.16
Tay et al. (2020a)           37.0  61.7  54.7  51.1   1.1/1.2/2.9/1.4       0.76/0.75/0.74/0.74
Zaheer et al. (2020)         36.0  64.0  59.3  53.1   0.9/0.8/1.2/1.1       0.90/0.56/0.40/0.30
Katharopoulos et al. (2020)  16.1  65.9  53.1  45.0   1.1/1.9/3.7/5.6       0.44/0.22/0.14/0.11
Choromanski et al. (2021)    18.0  65.4  53.8  45.7   1.2/1.9/3.8/5.7       0.44/0.22/0.15/0.11
RFA-Gaussian (This work)     36.8  66.0  56.1  53.0   1.1/1.7/3.4/5.3       0.53/0.30/0.21/0.16
Table 3: Accuracy (higher is better) of different models on LO, IMDb, and AAN, along with their
speed (higher is better) and peak memory consumption (lower is better) varying sequence lengths
(1–4K). Speed and memory are evaluated on the IMDb dataset and relative to the transformer’s.
Bold font indicates the best performance in each column, and underlined numbers outperform the
transformer in accuracy. Transformer’s and previous works’ numbers are due to Tay et al. (2021).
[Figure 2 plots: (a) decoding speed (tokens/s) vs. sequence length (2^3–2^11); (b) memory (GB) vs. sequence length; curves for Softmax, RFA-arccos, and RFA-Gaussian.]
Figure 2: Conditional decoding speed (left) and memory overhead (right) varying the output lengths.
All models are tested on a single TPU v2 accelerator, with greedy decoding and batch size 16.
5 ANALYSIS
Decoding time and memory varying by sequence length. §3.4 shows that RFA can potentially achieve more significant speedup and memory savings for longer sequences, which we now explore. We use a simulated conditional generation experiment to compare RFA's sequence-to-sequence
decoding speed and memory overhead against the baseline’s. Here we assume the input and output
sequences are of the same length. The compared models are of the same size as those described in
§4.2, with 6-layer encoders and decoders. Other hyperparameters are summarized in Appendix B.2.
All models are tested using greedy decoding with the same batch size of 16, on a TPU v2 accelerator.
From Figures 2 (a) and (b) we observe clear trends. Varying the lengths, both RFA variants achieve consistent decoding speed with nearly-constant memory overhead. In contrast, the baseline decodes slower for longer sequences, taking an increasing amount of memory. Notably, for 2,048-length sequences, RFA decodes around 12× faster than the baseline while using less than 10% of the memory. RFA-arccos slightly outperforms RFA-Gaussian in terms of speed and memory efficiency. This is because, when using the same D (as we do here), φ_arccos is half the size of φ_Gaussian. These results suggest that RFA can be particularly useful in sequence-to-sequence tasks with longer sequences, e.g., document-level machine translation (Miculicich et al., 2018).
Figure 3 in Appendix C.1 compares the speed and memory consumption in unconditional decoding
(e.g., sampling from a language model). The overall trends are similar to those in Figure 2.
Notes on decoding speed. With a lower memory overhead, RFA can use a larger batch size than the baseline. As noted by Katharopoulos et al. (2020) and Kasai et al. (2021), if we had used mini-batches as large as the hardware allows, RFA could have achieved a more significant speed gain. Nonetheless, we control for batch size even though it is not the most favorable setting for RFA, since the conclusion translates better to common applications where one generates a single sequence at a time (e.g., instantaneous machine translation). For the softmax attention baseline, we follow Ott et al. (2018) and cache previously computed query/key/value representations, which significantly improves its decoding speed (over not caching).
Further analysis results. RFA achieves comparable performance to softmax attention. Appendix C.3 empirically shows that this cannot be attributed to RFA learning a good approximation to softmax: when we train with one attention but evaluate with the other, the performance is hardly better than that of randomly initialized, untrained models. Yet, an RFA model initialized from a pretrained softmax transformer achieves decent training loss after a moderate amount of finetuning steps (Appendix C.4). This suggests some potential applications, e.g., transferring knowledge from a pretrained transformer (e.g., GPT-3; Brown et al., 2020) to an RFA model that is more efficient to sample from.
6 RELATED WORK
One common motivation across the following studies, which is shared by this work and the research we have already discussed, is to scale transformers to long sequences. Note that there are plenty of orthogonal choices for improving efficiency, such as weight sharing (Dehghani et al., 2019), quantization (Shen et al., 2020), knowledge distillation (Sanh et al., 2020), and adapters (Houlsby et al., 2019). For a detailed overview we refer the reader to Tay et al. (2020c).
Sparse attention patterns. The idea behind these methods is to limit the receptive field of the attention computation. It motivated earlier attempts at improving attention's efficiency, and still receives substantial interest. The sparse patterns can be set a priori (Liu et al., 2018; Qiu et al., 2020; Ho et al., 2020; You et al., 2020, inter alia) or learned from data (Sukhbaatar et al., 2019; Roy et al., 2020, inter alia). For most of these approaches, it is yet to be empirically verified that they are suitable for large-scale sequence-to-sequence learning; few of them have recorded decoding speed benefits.
Compressed context. Wang et al. (2020) compress the context along the timesteps so that the effective sequence length for attention computation is reduced. Another line of work aims to store past
context into a memory module with limited size (Lee et al., 2019; Ainslie et al., 2020; Rae et al.,
2020, inter alia), so that accessing longer history only moderately increases the overhead. Reminiscent of RNN language models, RFA attends beyond a fixed context window through a stateful
computation, without increasing time or memory overhead.
7 CONCLUSION
We presented random feature attention (RFA). It views the softmax attention through the lens of kernel methods, and approximates it with random feature methods. With an optional gating mechanism, RFA provides a straightforward way of learning with recency bias. RFA's time and space complexity is linear in the sequence length. We use RFA as a drop-in substitute for softmax attention in transformer models. On language modeling, machine translation, and long text classification benchmarks, RFA achieves comparable or better performance than strong baselines. In the machine translation experiment, RFA decodes twice as fast. Further time and memory efficiency improvements can be achieved for longer sequences.
ACKNOWLEDGMENTS
We would like to thank Phil Blunsom, Chris Dyer, Nando de Freitas, Jungo Kasai, Adhiguna Kuncoro, Dianqi Li, Ofir Press, Lianhui Qin, Swabha Swayamdipta, Sam Thomson, the language team at
DeepMind and the ARK group at the University of Washington for their helpful feedback. We also
thank Tay Yi for helping run the Long Range Arena experiments, Richard Tanburn for the advice
on implementations, and the anonymous reviewers for their thoughtful comments. This work was
supported in part by NSF grant 1562364 and a Google Fellowship. Nikolaos Pappas was supported
by the Swiss National Science Foundation under grant number P400P2 183911 “UNISON.”
REFERENCES
Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham,
Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. ETC: Encoding long and structured
inputs in transformers. In Proc. of EMNLP, 2020.
Maximilian Alber, Pieter-Jan Kindermans, Kristof Schütt, Klaus-Robert Müller, and Fei Sha. An
empirical study on the properties of random bases for kernel methods. In Proc. of NeurIPS, 2017.
Haim Avron, Vikas Sindhwani, Jiyan Yang, and Michael W. Mahoney. Quasi-Monte Carlo feature
maps for shift-invariant kernels. Journal of Machine Learning Research, 17(120):1–38, 2016.
Haim Avron, Kenneth L. Clarkson, and David P. Woodruff. Faster kernel ridge regression using sketching and preconditioning. SIAM J. Matrix Analysis Applications, 2017.
Jimmy Ba, Geoffrey E Hinton, Volodymyr Mnih, Joel Z Leibo, and Catalin Ionescu. Using fast
weights to attend to the recent past. In Proc. of NeurIPS, 2016.
Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. In
Proc. of ICLR, 2019.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
learning to align and translate. In Proc. of ICLR, 2015.
Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer.
arXiv: 2004.05150, 2020.
S. Bochner. Harmonic Analysis and the Theory of Probability. University of California Press, 1955.
Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia,
and Aleš Tamchyna. Findings of the 2014 workshop on statistical machine translation. In Proc.
of WMT, 2014.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal,
Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M.
Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,
Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford,
Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. arXiv: 2005.14165,
2020.
Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. Report on
the 11th IWSLT evaluation campaign. In Proc. of IWSLT, 2014.
Kehai Chen, Rui Wang, Masao Utiyama, and Eiichiro Sumita. Recurrent positional embedding for
neural machine translation. In Proc. of EMNLP, 2019.
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse
transformers. arXiv: 1904.10509, 2019.
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder
for statistical machine translation. In Proc. of EMNLP, 2014.
Youngmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Proc. of NeurIPS,
2009.
Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea
Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser,
David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking attention with performers. In Proc. of ICLR, 2021.
Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network
learning by exponential linear units (ELUs). In Proc. of ICLR, 2016.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov.
Transformer-XL: Attentive language models beyond a fixed-length context. In Proc. of ACL,
2019.
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal
transformers. In Proc. of ICLR, 2019.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. In Proc. of NAACL, 2019.
Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc’Aurelio Ranzato. Classical
structured prediction losses for sequence to sequence learning. In Proc. of NAACL, 2018.
Yingbo Gao, Christian Herold, Weiyue Wang, and Hermann Ney. Exploring kernel functions in the
softmax layer for contextual word classification. In International Workshop on Spoken Language
Translation, 2019.
Jie Hao, Xing Wang, Baosong Yang, Longyue Wang, Jinfeng Zhang, and Zhaopeng Tu. Modeling
recurrence for transformer. In Proc. of NAACL, 2019.
Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In
NeurIPS Deep Learning and Representation Learning Workshop, 2015.
Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. arXiv: 1912.12180, 2020.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):
1735–1780, 1997.
Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel methods in machine learning. Annals of Statistics, 36(3):1171–1220, 2008.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for
NLP. In Proc. of ICML, 2019.
Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, and Noah A. Smith. Deep encoder, shallow
decoder: Reevaluating the speed-quality tradeoff in machine translation. In Proc. of ICLR, 2021.
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Francois Fleuret. Transformers are
RNNs: Fast autoregressive transformers with linear attention. In Proc. of ICML, 2020.
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proc. of ICLR,
2015.
Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Proc. of ICLR, 2014.
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In Proc.
of ICLR, 2020.
Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set
transformer: A framework for attention-based permutation-invariant neural networks. In Proc. of
ICML, 2019.
Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng
Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series
forecasting. In Proc. of NeurIPS, 2019.
Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam
Shazeer. Generating wikipedia by summarizing long sequences. In Proc. of ICLR, 2018.
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher
Potts. Learning word vectors for sentiment analysis. In Proc. of ACL, 2011.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture
models. In Proc. of ICLR, 2017.
Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and Optimizing LSTM
Language Models. In Proc. of ICLR, 2018.
Thomas Miconi, Kenneth Stanley, and Jeff Clune. Differentiable plasticity: training plastic neural
networks with backpropagation. In Proc. of ICML, 2018.
Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. Document-level neural
machine translation with hierarchical attention networks. In Proc. of EMNLP, 2018.
Abdelrahman Mohamed, Dmytro Okhonko, and Luke Zettlemoyer. Transformers with convolutional
context for ASR. arXiv: 1904.11660, 2019.
Nikita Nangia and Samuel Bowman. ListOps: A diagnostic dataset for latent tree learning. In Proc.
of NAACL Student Research Workshop, 2018.
Junier Oliva, William Neiswanger, Barnabas Poczos, Eric Xing, Hy Trac, Shirley Ho, and Jeff
Schneider. Fast function to function regression. In Proc. of AISTATS, 2015.
Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation.
In Proc. of WMT, 2018.
Emilio Parisotto, H. Francis Song, Jack W. Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant M.
Jayakumar, Max Jaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, Matthew M.
Botvinick, Nicolas Heess, and Raia Hadsell. Stabilizing transformers for reinforcement learning. In Proc. of ICML, 2020.
Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and
Dustin Tran. Image transformer. In Proc. of ICML, 2018.
Hao Peng, Roy Schwartz, Sam Thomson, and Noah A. Smith. Rational recurrences. In Proc. of
EMNLP, 2018.
Hao Peng, Roy Schwartz, Dianqi Li, and Noah A. Smith. A mixture of h − 1 heads is better than h
heads. In Proc. of ACL, 2020.
Matt Post. A call for clarity in reporting BLEU scores. In Proc. of WMT, 2018.
Jiezhong Qiu, Hao Ma, Omer Levy, Wen-tau Yih, Sinong Wang, and Jie Tang. Blockwise self-attention for long document understanding. In Findings of EMNLP, 2020.
Dragomir R. Radev, Pradeep Muthukrishnan, and Vahed Qazvinian. The ACL Anthology network.
In Proc. of the Workshop on Text and Citation Analysis for Scholarly Digital Libraries, 2009.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
models are unsupervised multitask learners, 2018.
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap.
Compressive transformers for long-range sequence modelling. In Proc. of ICLR, 2020.
Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Proc. of
NeurIPS, 2007.
Ankit Singh Rawat, Jiecao Chen, Felix Xinnan X Yu, Ananda Theertha Suresh, and Sanjiv Kumar.
Sampled softmax with random Fourier features. In Proc. of NeurIPS, 2019.
Aurko Roy, Mohammad Taghi Saffar, David Grangier, and Ashish Vaswani. Efficient content-based
sparse attention with routing transformers. arXiv: 2003.05997, 2020.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version
of BERT: smaller, faster, cheaper and lighter. arXiv: 1910.01108, 2020.
J. Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent
networks. Neural Computation, 4(1):131–139, 1992.
J. Schmidhuber. Reducing the ratio between learning complexity and number of time varying variables in fully recurrent nets. In Proc. of ICANN, 1993.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with
subword units. In Proc. of ACL, 2016.
Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney,
and Kurt Keutzer. Q-BERT: Hessian based ultra low precision quantization of BERT. In Proc. of
AAAI, 2020.
Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. Adaptive attention
span in transformers. In Proc. of ACL, 2019.
Yitong Sun. Random Features Methods in Supervised Learning. PhD thesis, The University of
Michigan, 2019.
Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer:
Rethinking self-attention in transformer models. arXiv: 2005.00743, 2020a.
Yi Tay, Dara Bahri, Liu Yang, Don Metzler, and Da-Cheng Juan. Sparse sinkhorn attention. In Proc.
of ICML, 2020b.
Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. arXiv:
2009.06732, 2020c.
Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao,
Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena: A benchmark for efficient
transformers. In Proc. of ICLR, 2021.
Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan
Salakhutdinov. Transformer dissection: An unified understanding for transformer’s attention via
the lens of kernel. In Proc. of EMNLP, 2019.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. of NeurIPS, 2017.
Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention
with linear complexity. arXiv: 2006.04768, 2020.
Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent
neural networks. Neural Computation, 1:270–280, 1989.
Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. Pay less attention with
lightweight and dynamic convolutions. In Proc. of ICLR, 2019.
Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. Lite transformer with long-short range
attention. In Proc. of ICLR, 2020.
Weiqiu You, Simeng Sun, and Mohit Iyyer. Hard-coded Gaussian attention for neural machine
translation. In Proc. of ACL, 2020.
Felix Xinnan X Yu, Ananda Theertha Suresh, Krzysztof M Choromanski, Daniel N Holtmann-Rice,
and Sanjiv Kumar. Orthogonal random features. In Proc. of NeurIPS, 2016.
Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon,
Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big bird: Transformers
for longer sequences. arXiv: 2007.14062, 2020.
Appendices
A RANDOM FEATURE ATTENTION IN MORE DETAIL
A.1 DETAILED COMPUTATION PROCEDURE
Algorithms 1 and 2 describe causal and cross random feature attention’s computation procedures.
Algorithm 1 Causal random feature attention.
 1: procedure RFA-CAUSAL({q_i}_{i=1}^N, {k_i}_{i=1}^N, {v_i}_{i=1}^N)
 2:     ⊲ S is a D × d matrix
 3:     ⊲ z is a D-dimensional vector
 4:     S, z ← 0, 0
 5:     for i = 1 to N do
 6:         q̃_i, k̃_i ← φ(q_i), φ(k_i)    ⊲ Random feature maps
 7:         S ← S + k̃_i ⊗ v_i
 8:         z ← z + k̃_i
 9:         h_i^⊤ ← q̃_i^⊤ S / (q̃_i · z)
10:     end for
11:     return {h_i}_{i=1}^N
12: end procedure

Algorithm 2 Cross random feature attention.
 1: procedure RFA-CROSS({q_i}_{i=1}^N, {k_i}_{i=1}^M, {v_i}_{i=1}^M)
 2:     ⊲ S is a D × d matrix
 3:     ⊲ z is a D-dimensional vector
 4:     S, z ← 0, 0
 5:     for i = 1 to M do
 6:         k̃_i ← φ(k_i)    ⊲ Random feature map
 7:         S ← S + k̃_i ⊗ v_i
 8:         z ← z + k̃_i
 9:     end for
10:     for i = 1 to N do
11:         q̃_i ← φ(q_i)    ⊲ Random feature map
12:         h_i^⊤ ← q̃_i^⊤ S / (q̃_i · z)
13:     end for
14:     return {h_i}_{i=1}^N
15: end procedure
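For concreteness, the following is a minimal NumPy sketch of the two procedures (our illustration, not the released JAX implementation). It assumes the queries and keys have already been passed through the random feature map φ(·) of §2.2, so the inputs phi_q and phi_k are D-dimensional feature vectors; all names are ours.

    import numpy as np

    def rfa_causal(phi_q, phi_k, v):
        # Algorithm 1: phi_q, phi_k are [N, D]; v is [N, d]; returns [N, d].
        N, D = phi_k.shape
        d = v.shape[1]
        S = np.zeros((D, d))          # running sum of phi(k_i) outer v_i
        z = np.zeros(D)               # running sum of phi(k_i)
        outputs = np.empty((N, d))
        for i in range(N):
            S += np.outer(phi_k[i], v[i])
            z += phi_k[i]
            outputs[i] = (phi_q[i] @ S) / (phi_q[i] @ z)
        return outputs

    def rfa_cross(phi_q, phi_k, v):
        # Algorithm 2: phi_q is [N, D]; phi_k is [M, D]; v is [M, d]; returns [N, d].
        S = phi_k.T @ v               # sum_i phi(k_i) outer v_i, a [D, d] matrix
        z = phi_k.sum(axis=0)         # sum_i phi(k_i), a [D] vector
        return (phi_q @ S) / (phi_q @ z)[:, None]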
A.2 VARIANCE OF RANDOM FOURIER FEATURES

The following result is due to Yu et al. (2016). Using the same notation as in §2.2:

    Var(φ(x) · φ(y)) = (1/(2D)) (1 − e^{−z²})²,    (9)

where z = ‖x − y‖/σ.
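Eq. 9 can be verified numerically. The sketch below (ours) assumes the standard random Fourier feature construction φ(x) = √(1/D) [sin(w_1 · x), …, sin(w_D · x), cos(w_1 · x), …, cos(w_D · x)]^⊤ with w_i drawn i.i.d. from N(0, I/σ²) (cf. §2.2), which matches the z = ‖x − y‖/σ parameterization in Eq. 9, and compares the empirical variance over draws of the random matrix with the closed form.

    import numpy as np

    def phi(x, W):
        # Random Fourier feature map: sqrt(1/D) [sin(Wx); cos(Wx)], with W of shape [D, d].
        D = W.shape[0]
        wx = W @ x
        return np.sqrt(1.0 / D) * np.concatenate([np.sin(wx), np.cos(wx)])

    rng = np.random.default_rng(0)
    d, D, sigma = 16, 64, 1.0
    x = rng.normal(size=d); x /= np.linalg.norm(x)    # unit vectors, as in §3.1
    y = rng.normal(size=d); y /= np.linalg.norm(y)

    samples = []
    for _ in range(20000):
        W = rng.normal(scale=1.0 / sigma, size=(D, d))   # rows w_i ~ N(0, I / sigma^2)
        samples.append(phi(x, W) @ phi(y, W))

    z = np.linalg.norm(x - y) / sigma
    print("empirical variance:", np.var(samples))
    print("Eq. 9:             ", (1 - np.exp(-z ** 2)) ** 2 / (2 * D))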
A.3 DERIVATION OF CAUSAL RFA
This section presents a detailed derivation of causal RFA as in §3.1. Following Eq. 5 but changing
the attended keys and values to the prefix:

    RFA(q_t, {k_i}_{i≤t}, {v_i}_{i≤t}) = [φ(q_t)^⊤ Σ_{i≤t} φ(k_i) ⊗ v_i] / [φ(q_t) · Σ_{j≤t} φ(k_j)].    (10)

Let S_t ≜ Σ_{i≤t} φ(k_i) ⊗ v_i and z_t ≜ Σ_{i≤t} φ(k_i); both can be calculated recurrently. Assuming
S_0 = 0 and z_0 = 0:

    S_t = S_{t−1} + φ(k_t) ⊗ v_t,    z_t = z_{t−1} + φ(k_t),    t ≥ 1.    (11)
This completes the derivation of causal R FA as in §3.1.
A.4 RFA WITHOUT NORM-1 CONSTRAINTS

§3.1 assumes that the queries and keys are unit vectors. This norm-1 constraint is not a must. Here
we present an RFA variant without this constraint. Let C(x) = exp(‖x‖²/2σ²). From Eq. 4 we have

    attn(q_t, {k_i}, {v_i}) = Σ_i [exp(q_t · k_i/σ²) / Σ_j exp(q_t · k_j/σ²)] v_i^⊤
                            ≈ Σ_i [C(q_t) C(k_i) φ(q_t)^⊤ φ(k_i) / Σ_j C(q_t) C(k_j) φ(q_t) · φ(k_j)] v_i^⊤    (12)
                            = [φ(q_t)^⊤ Σ_i C(k_i) φ(k_i) ⊗ v_i] / [φ(q_t) · Σ_j C(k_j) φ(k_j)].

The attention computation is otherwise the same as in §3.1. In sum, lifting the norm-1 constraint
brings in an additional scalar term C(·).
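The following NumPy sketch (ours, purely illustrative) spells out the final line of Eq. 12: each φ(k_i) is scaled by C(k_i) before being accumulated, while C(q_t) cancels between the numerator and the denominator; `phi` stands for whichever random feature map is in use.

    import numpy as np

    def rfa_unnormalized(q_t, keys, values, phi, sigma=1.0):
        # RFA without the norm-1 constraint (Eq. 12); C(x) = exp(||x||^2 / (2 sigma^2)).
        C = lambda x: np.exp(np.sum(x ** 2) / (2.0 * sigma ** 2))
        phi_q = phi(q_t)
        S = sum(C(k) * np.outer(phi(k), v) for k, v in zip(keys, values))   # [D, d]
        z = sum(C(k) * phi(k) for k in keys)                                # [D]
        return (phi_q @ S) / (phi_q @ z)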
A.5 RELATING RFA-GATE TO SOFTMAX ATTENTION

Drawing inspiration from gated RNNs, §3.2 introduces a gated variant of RFA. Now we study its
"softmax counterpart":
    k̃_i = k_i (1 − g_i) ∏_{j=i+1}^{t} g_j,    ṽ_i = v_i (1 − g_i) ∏_{j=i+1}^{t} g_j,    i = 1, …, t;    (13)
    h_t = attn(q_t, {k̃_i}_{i≤t}, {ṽ_i}_{i≤t}).

h_t is the output at timestep t and is used for onward computation.
At each step, all prefix keys and values are decayed by a gate value before calculating the attention. This implies that the attention computation for qt+1 cannot start until that of qt is finished.
Combined with the linear complexity of softmax normalization, this amounts to quadratic time in
sequence length, even for language modeling training.
Without the RFA perspective, the model above is less intuitive and more expensive in practice. This
illustrates one benefit RFA brings to developing new attention models.
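The quadratic cost is apparent in a direct implementation. Below is a minimal NumPy sketch (ours, for illustration only) of Eq. 13: at every timestep the whole prefix is re-decayed by the gates and softmax attention (as in Eq. 4) is recomputed from scratch, so no state can be carried across steps.

    import numpy as np

    def softmax_attn(q_t, K, V, sigma=1.0):
        # Softmax attention for a single query over keys K and values V.
        scores = K @ q_t / sigma ** 2
        w = np.exp(scores - scores.max())
        return (w / w.sum()) @ V

    def gated_softmax_counterpart(queries, keys, values, gates, sigma=1.0):
        # Eq. 13: decay the prefix keys/values by the gates, then attend.
        # The prefix must be re-decayed at every step, hence O(N^2) overall time.
        outputs = []
        for t in range(len(queries)):
            g = gates[: t + 1]
            suffix = np.concatenate([np.cumprod(g[:0:-1])[::-1], [1.0]])  # prod_{j>i} g_j
            decay = (1.0 - g) * suffix
            k_tilde = keys[: t + 1] * decay[:, None]
            v_tilde = values[: t + 1] * decay[:, None]
            outputs.append(softmax_attn(queries[t], k_tilde, v_tilde, sigma))
        return np.stack(outputs)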
A.6 DETAILED COMPLEXITY ANALYSIS
Table 4 considers a sequence-to-sequence model and breaks the comparison down into training (with
teacher forcing; Williams & Zipser, 1989) and autoregressive decoding. Here we assume enough
threads to fully parallelize softmax attention across timesteps when the inputs are revealed to the
model in full. RFA has a lower space complexity, since it never explicitly populates the attention
matrices. As for time, RFA trains in linear time, and so does softmax attention: in teacher-forcing
training a standard transformer decoder parallelizes the attention computation across timesteps. The
comparison differs during decoding: when only one output token is produced at a time, RFA decodes
in time linear in the output length, while softmax attention requires quadratic time.
                                    Time Complexity                    Space Complexity
Setting                  Model      Encoder   Cross      Causal       Encoder   Cross      Causal
Training w/              softmax    O(M)      O(M)       O(N)         O(M²)     O(MN)      O(N²)
teacher forcing          RFA        O(M)      O(M)       O(N)         O(M)      O(M+N)     O(N)
Decoding                 softmax    O(M)      O(MN)      O(N²)        O(M²)     O(MN)      O(N²)
                         RFA        O(M)      O(M+N)     O(N)         O(M)      O(M+N)     O(N)

Table 4: Time and space complexity comparisons between RFA and its softmax counterpart in
a sequence-to-sequence attentive model, assuming an infinite amount of available threads. M
and N denote the lengths of the source and target sequences respectively. Teacher forcing training
(Williams & Zipser, 1989) and autoregressive decoding are assumed. RFA asymptotically outperforms
softmax attention in every space column and in the cross and causal time columns during decoding.
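The decoding-time contrast can be made concrete with a per-step sketch (ours, illustrative only): softmax attention must attend over a cache that grows with the step index t, whereas RFA only updates a fixed-size state, as in Algorithm 1.

    import numpy as np

    def softmax_decode_step(q_t, k_t, v_t, cache):
        # One causal decoding step with softmax attention: O(t) time and memory.
        # cache = {"K": [], "V": []} before the first step.
        cache["K"].append(k_t)
        cache["V"].append(v_t)
        K, V = np.stack(cache["K"]), np.stack(cache["V"])
        scores = K @ q_t
        w = np.exp(scores - scores.max())
        return (w / w.sum()) @ V

    def rfa_decode_step(phi_q_t, phi_k_t, v_t, state):
        # One causal decoding step with RFA: O(1) time and memory w.r.t. the step t.
        # state = {"S": np.zeros((D, d)), "z": np.zeros(D)} before the first step.
        state["S"] += np.outer(phi_k_t, v_t)
        state["z"] += phi_k_t
        return (phi_q_t @ state["S"]) / (phi_q_t @ state["z"])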
Data             Train   Dev.   Test   Vocab.
WikiText-103     103M    218K   246K   268K
WMT14 EN-DE      4.5M    3K     3K     32K
WMT14 EN-FR      4.5M    3K     3K     32K
IWSLT14 DE-EN    160K    7K     7K     9K/7K

Table 5: Some statistics for the datasets. WikiText-103 split sizes are in number of tokens, while
others are in number of instances.
B EXPERIMENTAL DETAILS
Table 5 summarizes some statistics of the datasets used in our experiments. Our implementation is
based on JAX.15
# Random Matrices    1      50     100    200
BLEU                 24.0   25.7   25.8   25.8

Table 6: WMT14 EN-DE development set performance varying the number of random matrices to
sample from during training. No beam search or checkpoint averaging is used.
During training, we sample a different random projection matrix for each attention head. Preliminary
experiments suggest that this performs better than using the same random projection throughout
training (Table 6). Our conjecture is that it helps keep the attention heads from "over committing"
to any particular random projection (Peng et al., 2020). To avoid the overhead of sampling from a
Gaussian during training, we do this offline: before training we construct a pool of random matrices
(typically 200), and at each training step we draw from this pool. At test time each attention head
uses the same random projection, since we observe no accuracy benefit from using different ones
for different test instances.
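A minimal NumPy sketch of this offline scheme (ours; the pool size, shapes, and Gaussian scale are illustrative assumptions, with D and d the feature and input dimensions as in §2.2):

    import numpy as np

    def build_projection_pool(pool_size, D, d, sigma=1.0, seed=0):
        # Sampled once before training; each entry is a [D, d] random projection matrix.
        rng = np.random.default_rng(seed)
        return rng.normal(scale=1.0 / sigma, size=(pool_size, D, d))

    def draw_projections(pool, num_heads, rng):
        # At every training step, each head gets its own draw from the pool.
        idx = rng.integers(0, pool.shape[0], size=num_heads)
        return pool[idx]                       # [num_heads, D, d]

    pool = build_projection_pool(pool_size=200, D=64, d=64)
    step_rng = np.random.default_rng(1)
    train_projections = draw_projections(pool, num_heads=8, rng=step_rng)
    test_projections = pool[:8]                # fixed across test instances, one per head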
B.1 LANGUAGE MODELING
We compare the models using two model size settings, summarized in Table 7. We use the fixed
sinusoidal position embeddings by Vaswani et al. (2017). All models are trained for up to 150K
gradient steps using the Adam optimizer (Kingma & Ba, 2015). No ℓ2 -regularization is used. We
apply early stopping based on development set perplexity. All models are trained using 16 TPU v3
accelerators, and tested using a single TPU v2 accelerator.
15 https://github.com/google/jax.
Hyperparams.               Small                                  Big
# Layers                   6                                      16
# Heads                    8                                      16
Embedding Size             512                                    1024
Head Size                  64                                     64
FFN Size                   2048                                   4096
Batch Size                 64                                     64
Learning Rate              [1 × 10^−4, 2.5 × 10^−4, 5 × 10^−4]
Warmup Steps               6000                                   6000
Gradient Clipping Norm     0.25                                   0.25
Dropout                    [0.05, 0.1]                            [0.2, 0.25, 0.3]
Random Feature Map Size    64                                     64

Table 7: Hyperparameters used in the language modeling experiments.
B.2 MACHINE TRANSLATION

WMT14. We use the fixed sinusoidal position embeddings of Vaswani et al. (2017). For both
EN-DE and EN-FR experiments, we train the models with the Adam optimizer (β1 = 0.1, β2 = 0.98,
and ε = 10^−9) for up to 350K gradient steps. We use a batch size of 1,024 instances for
EN-DE, and 4,096 for the much larger EN-FR dataset. The learning rate schedule follows Vaswani
et al. (2017). Early stopping is applied based on development set BLEU. No ℓ2 regularization or
gradient clipping is used. All models are trained using 16 TPU v3 accelerators, and tested using a
single TPU v2 accelerator. Following standard practice, we average the 10 most recent checkpoints at
test time. We evaluate the models using SacreBLEU (Post, 2018).16 A beam search with beam size
4 and length penalty 0.6 is used. Other hyperparameters are summarized in Table 8.
Hyperparams.                      WMT14   IWSLT14
# Layers                          6       6
# Heads                           8       8
Embedding Size                    512     512
Head Size                         64      64
FFN Size                          2048    2048
Warmup Steps                      6000    4000
Dropout                           0.1     0.3
Cross Attention Feature Map       128     128
Causal Attention Feature Map      64      64

Table 8: Hyperparameters used in the machine translation experiments.
C MORE ANALYSIS RESULTS

C.1 MORE RESULTS ON DECODING SPEED AND MEMORY OVERHEAD

Figure 3 compares RFA's unconditional decoding speed and memory overhead against softmax
attention. The setting is the same as that in §5, except that here the models do not have an encoder.
This experiment aims to simulate applications such as sampling from a language model.
C.2 EFFECT OF RANDOM FEATURE SIZE

This section studies how the size of φ(·) affects performance. Table 9 summarizes RFA-Gaussian's
performance on the WMT14 EN-DE development set. The model and training are the same as those
used in §4.2, except for the random feature size. Recall from §2.2 that the size of φ(·) is 2D for
RFA-Gaussian.
16 https://github.com/mjpost/sacrebleu
[Figure 3: Unconditional decoding speed (left; tokens per second) and memory overhead (right; GB) varying the output length (2^3 to 2^11), comparing softmax, RFA-arccos, and RFA-Gaussian. All models are tested on a single TPU v2 accelerator, with greedy decoding and batch size 16.]
When the size of φ(·) is too small (32 or 64 for cross attention, 32 for causal attention), training
does not converge. We observe accuracy improvements when the random feature size is sufficiently
large (256 for cross attention and 128 for causal attention); beyond that, the benefit is marginal.
(a) Varying cross attention φ sizes while fixing that of causal attention to be 128:

φ Size    32     64     128    256    512
BLEU      N/A    N/A    24.9   25.8   26.0

(b) Varying causal attention φ sizes while fixing that of cross attention to be 256:

φ Size    32     64     128    256    512
BLEU      N/A    25.3   25.8   25.8   25.6

Table 9: WMT14 EN-DE development set performance of RFA-Gaussian (the size of φ is 2D; §2.2)
varying the random feature sizes. N/A indicates training does not converge. No beam search or
checkpoint averaging is used.
C.3 TRAIN AND EVALUATE WITH DIFFERENT ATTENTION FUNCTIONS

RFA achieves performance comparable to its softmax counterpart. Does this imply that it learns a
good approximation to softmax attention? To answer this question, we consider:

(i) an RFA-Gaussian model initialized from a pretrained softmax-transformer;
(ii) a softmax-transformer initialized from a pretrained RFA-Gaussian model.

If RFA's good performance could be attributed to learning a good approximation to softmax, both,
without finetuning, should perform similarly to the pretrained models. However, this is not the
case on IWSLT14 DE-EN. Both pretrained models achieve more than 35.2 development set BLEU.
In contrast, (i) and (ii) respectively get 2.3 and 1.1 BLEU without finetuning, hardly beating a
randomly initialized untrained model. This result aligns with the observation by Choromanski et al.
(2021), and suggests that RFA does not perform well merely by imitating softmax attention's outputs.
C.4 KNOWLEDGE TRANSFER FROM SOFTMAX ATTENTION TO RFA

We first supplement the observation in Appendix C.3 by finetuning (i) on the same pretraining data.
Figure 4 plots the learning curves. It takes RFA roughly 1,500 steps to reach a training loss similar
to that of the pretrained model. As a baseline, "RFA Reset" resets the multihead attention parameters
(i.e., those for the query, key, value, and output projections) to randomly initialized ones. Its learning
curve is similar to that of (i), suggesting that the pretrained multihead attention parameters are no
more useful to RFA than randomly initialized ones.
[Figure 4: Training loss against the number of finetuning steps (0 to 2,500) for RFA, RFA Reset, and Softmax Reset, when finetuning an RFA-Gaussian model with its parameters initialized from a pretrained softmax-transformer. "Reset" indicates resetting the multihead attention parameters to randomly initialized ones. The dashed line indicates the training loss of the pretrained model.]
To further confirm this observation, "softmax Reset" resets the multihead attention parameters without
changing the attention function. It converges to the pretraining loss in fewer than 200 steps.
Takeaway. The above results on IWSLT14 show that pretrained knowledge in a softmax transformer
cannot be directly transferred to an RFA model. However, from Figure 4 and a much larger-scale
experiment by Choromanski et al. (2021), we do observe that RFA can recover the pretraining loss,
at a finetuning cost much lower than that of training from scratch. This suggests some potential
applications. For example, one might initialize an RFA language model from a softmax transformer
pretrained on large-scale data (e.g., GPT-3; Brown et al., 2020), and finetune it at a low cost. The
outcome would be an RFA model that retains most of the pretraining knowledge but is much faster
and more memory-friendly to sample from. We leave such exploration to future work.