Under review as a conference paper at ICLR 2022

A DOT PRODUCT ATTENTION FREE TRANSFORMER

Anonymous authors
Paper under double-blind review

ABSTRACT

We introduce the Dot Product Attention Free Transformer (DAFT), an efficient variant of Transformers (Vaswani et al., 2017) that eliminates the query-key dot product in self attention. The core idea is to construct a decomposable attention map for each dimension of the query, key and value. This compositionality enables an implementation where the attention tensor does not need to be computed or stored explicitly. A DAFT layer has a memory complexity that is linear w.r.t. both the context size and the feature dimension, making it compatible with both large inputs and large model sizes. We also introduce DAFT-conv, a model variant that takes advantage of locality and spatial weight sharing while maintaining global connectivity. We conduct experiments on ImageNet-1K classification, as well as CIFAR10 and Enwik8, two autoregressive modeling tasks. We show that DAFT demonstrates competitive performance on all the benchmarks, while providing excellent efficiency at the same time.

1 INTRODUCTION

Self attention mechanisms, represented by Transformers (Vaswani et al., 2017), have driven the advancement of various machine learning problems, including language understanding (Devlin et al., 2018; Radford et al.) and computer vision applications (Chen et al.; Dosovitskiy et al., 2020; Touvron et al., 2020). Different from classic model architectures such as Convolutional Neural Nets (CNNs) or Recurrent Neural Nets (RNNs), Transformers enable direct interaction between every pair of elements within a sequence, which makes them especially powerful at capturing long term dependencies.

Transformers build on top of Multi-head Attention (MHA), which performs dot product attention on an input h times and concatenates the outputs along the channel dimension. This introduces a space complexity of O(hT^2), where T is the input sequence length. It is thus challenging for Transformers to scale to long sequences and large model sizes at the same time. A number of recent works have been dedicated to addressing the scalability issue of Transformers (Child et al., 2019; Kitaev et al., 2020; Rae et al., 2020; Wang et al., 2020b; Katharopoulos et al., 2020; Tay et al., 2020a; Choromanski et al., 2020). The common idea is to approximate the full attention operation, with techniques ranging from sparsity (Child et al., 2019; Zaheer et al., 2020) and locality sensitive hashing (Kitaev et al., 2020) to low rank decomposition (Wang et al., 2020b) and kernel approximation (Katharopoulos et al., 2020; Choromanski et al., 2020).

In this paper, we propose a computational module that does not use or approximate the standard dot product attention. We hence name our model the Dot product Attention Free Transformer (DAFT). DAFT explores two simple ideas: 1) decrease the computation of an attention head by making it additive and decomposable; 2) increase the number of heads to be the same as the input dimension, with minimal cost. In our implementation, DAFT rearranges the computational graph such that the key and value are multiplied element-wise, reduced along the time dimension (with a set of learned pairwise position biases), and finally multiplied with the query element-wise. This removes the explicit computation of the attention matrices, which results in a space complexity of O(Td).
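As a rough illustration of this gap (our back-of-the-envelope numbers, using the CIFAR10 setting of Sec. 5.2 where T = 3072, d = 512 and h = 4): per layer and per example, standard MHA materializes hT^2 = 4 x 3072^2, roughly 3.8 x 10^7 attention entries, whereas DAFT stores on the order of Td = 3072 x 512, roughly 1.6 x 10^6 values, about a 24x reduction before accounting for the batch size.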
A closely related line of work to DAFT is the "linear attention" Transformers (LAT) (Katharopoulos et al., 2020; Choromanski et al., 2020; Peng et al., 2021; Bello, 2021). LAT can be interpreted as a special case of dot product attention, where the nonlinearity is changed from exp to identity. This modification allows for a similar rearrangement of the computation, where the context reduction of the keys and values happens before the interaction with the query. There are two important differences between DAFT and LAT: 1) DAFT maintains the same exp nonlinearity as standard MHA, which LAT gets rid of. Empirically, the removal of the exp nonlinearity often causes performance degradation. 2) LAT introduces a space complexity of O(Tdd'), where d' is the dimension of the query and key projections. This makes LAT expensive for large model sizes. DAFT, on the other hand, enjoys a space complexity of O(Td), friendly to both large inputs and large models. See Table 1 for a summarized comparison.

Table 1: Comparison of DAFT with the standard Transformer and linear attention Transformers: Linear Transformer (Katharopoulos et al., 2020) and Performer (Choromanski et al., 2020). Here T, d denote the sequence length and feature dimension, respectively. DAFT has linear space complexity w.r.t. both T and d, does not perform the query-key dot product, keeps the exp nonlinearity, and does not compute explicit attention. Both Linear Transformer and Performer have at least quadratic complexity in d, perform the query-key dot product but without the exp nonlinearity, and do not compute explicit attention.

Model | Time | Space | Dot Product? | exp Nonlinearity? | Explicit Attention?
Transformer | O(T^2 d + T d^2) | O(T^2 + T d) | ✓ | ✓ | ✓
Linear Transformer | O(T d^2) | O(T d + d^2) | ✓ | ✗ | ✗
Performer | O(T d^2 log d) | O(T d log d + d^2 log d) | ✓ | ✗ | ✗
DAFT | O(T^2 d + T d^2) | O(T d) | ✗ | ✓ | ✗

In addition to the basic DAFT formulation, we also propose DAFT-conv, a variant that exploits locality and weight sharing in the position biases. DAFT-conv inherits the benefits of both Transformers and ConvNets. It is wired similarly to a Transformer, maintaining the global connectivity and multiplicative interactions among the query, key and value; it also enjoys the parameter efficiency, sparse computation and translational equivariance of ConvNets. We show that DAFT-conv provides excellent performance and efficiency gains over the basic version on all tasks we tested.

We perform experiments with DAFT on image classification, image autoregressive modeling and character level language modeling tasks. We show that DAFT provides competitive performance, often matching or beating standard Transformers and other variants, while providing excellent efficiency. We also provide ablation studies on several design choices of DAFT, and discuss its unique properties such as compatibility with Transformers, robustness to the choice of kernel size, and the ability to handle variable sized inputs.

2 BACKGROUND

2.1 MULTI-HEAD ATTENTION

We now introduce the Multi-Head Attention (MHA) operation in the mode of self attention. We denote an input sequence as X ∈ R^{T×d}, with T, d, h being the sequence length, the feature dimension and the number of heads, respectively. We use the superscript i to denote the i-th attention head, and the subscript t to denote the t-th location. MHA performs a scaled dot product attention for each head i, defined as:

Y_t^i = \frac{\sum_{t'=1}^{T} \exp(Q_t^{i\top} K_{t'}^i) V_{t'}^i}{\sum_{t'=1}^{T} \exp(Q_t^{i\top} K_{t'}^i)}, \quad \text{s.t.} \quad Q^i = X W_Q^i, \; K^i = X W_K^i, \; V^i = X W_V^i.    (1)

Here W_Q^i ∈ R^{d×d_k}, W_K^i ∈ R^{d×d_k}, W_V^i ∈ R^{d×d_v} are linear transformations for head i; Y_t^i ∈ R^{d_v} is the output of the i-th attention head for the t-th query location. d_k, d_v are the dimensions of the key and value, respectively. MHA concatenates the outputs of the h attention heads along the channel dimension, resulting in feature dimension h d_v. Unless otherwise mentioned, we assume d_k = d_v and h = d / d_k. This means the query, key and value have the same dimension within each head, and the output dimension matches that of the input.
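For concreteness, a minimal PyTorch sketch of Eq. (1) for a single head (our illustration, not the authors' code); the usual 1/sqrt(d_k) scaling is omitted to match Eq. (1), and the (T, T) attention matrix is materialized explicitly:

import torch

def single_head_attention(X, W_Q, W_K, W_V):
    # X: (T, d); W_Q, W_K: (d, d_k); W_V: (d, d_v)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T                      # (T, T) pairwise dot products Q_t . K_t'
    att = torch.softmax(scores, dim=-1)   # exp, normalized over t', as in Eq. (1)
    return att @ V                        # (T, d_v) output of this head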
2.2 LINEAR ATTENTION

One can remove the exp nonlinearity in Eq. 1 to give rise to the linear attention used in (Katharopoulos et al., 2020; Choromanski et al., 2020). This special form of dot product attention enables one to rearrange the order of computation as:

Y_t^i = \underbrace{\frac{\sum_{t'=1}^{T} (Q_t^{i\top} K_{t'}^i) V_{t'}^i}{\sum_{t'=1}^{T} Q_t^{i\top} K_{t'}^i}}_{\text{explicit linear dot product attention}} = \underbrace{\frac{Q_t^{i\top} \sum_{t'=1}^{T} K_{t'}^i V_{t'}^{i\top}}{Q_t^{i\top} \sum_{t'=1}^{T} K_{t'}^i}}_{\text{implicit attention}}.    (2)

Here we allow Q and K to have dimension R^{T×d'}, where d' can be selected independently of d (e.g., in Performer (Choromanski et al., 2020), it is recommended to let d' = O(d log d)); we also assume that Q, K contain additional linear projections and nonlinearities when required.

3 METHODOLOGY

3.1 DOT PRODUCT ATTENTION FREE TRANSFORMER

We now define the Dot product Attention Free Transformer (DAFT), which is a plugin replacement of MHA without the need to change other architectural aspects of Transformers. The basic idea is to simplify the dot product attention to a form of additive attention (Bahdanau et al., 2014) that admits an efficient implementation. We start by considering an attention head defined in the following way:

\text{att}_{t,t'}^i \propto \exp(Q_t^i + K_{t'}^i + w_{t,t'}), \quad i = 1, 2, ..., d,    (3)

where w ∈ R^{T×T} is a set of learned pairwise position biases. Here the attention score (before exp) becomes an additive composition of the query, the key and a "static attention" score (w). By doing so, we also automatically make the number of attention heads equal to the feature dimension. One can then interpret this new formulation as trading the expressivity of a single attention head for more heads. However, because of the normalizing mechanism of attention, the query term cancels out between the numerator and denominator. We thus make a slight modification:

\text{att}_{t,t'}^i = \frac{\sigma_q(Q_t^i) \exp(K_{t'}^i + w_{t,t'})}{\sum_{t'=1}^{T} \exp(K_{t'}^i + w_{t,t'})}, \quad i = 1, 2, ..., d,    (4)

where σ_q is a nonlinearity defaulting to sigmoid. By doing so, we take Q_t^i out of the denominator, which makes its role similar to that of an output gate in LSTMs (Hochreiter & Schmidhuber, 1997). We can now define the full computation of DAFT as:

Y_t = \sigma_q(Q_t) \odot \frac{\sum_{t'=1}^{T} \exp(K_{t'} + w_{t,t'}) \odot V_{t'}}{\sum_{t'=1}^{T} \exp(K_{t'} + w_{t,t'})},    (5)

where ⊙ is the element-wise product and Q_t, K_{t'}, V_{t'} ∈ R^d.

Efficient Implementation. The question that remains is how to implement DAFT efficiently. Explicitly computing the attention in Eq. 4 as done in Transformers is infeasible, as it requires storing a tensor of size T^2 d. Luckily, the compositionality of DAFT allows for a rearrangement of the computation similar to that in linear attention. In essence, we take advantage of the simple equality exp(x + y) = exp(x) exp(y), and decompose the computation into a series of element-wise multiplications/divisions and matrix-tensor multiplications.
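To make the rearrangement concrete, the following small check (illustrative code, not from the paper) verifies numerically that the explicit attention of Eqs. (4)-(5) and the rearranged, attention-free computation produce the same output:

import torch

torch.manual_seed(0)
T, d = 8, 4
Q, K, V = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
w = torch.randn(T, T)                                  # pairwise position biases

# Explicit route: materialize the (T, T, d) attention of Eq. (4), then weight V as in Eq. (5).
att = torch.exp(K.unsqueeze(0) + w.unsqueeze(-1))      # exp(K_{t'} + w_{t,t'}), shape (T, T, d)
att = att / att.sum(dim=1, keepdim=True)               # normalize over t'
Y_explicit = torch.sigmoid(Q) * torch.einsum('tpd,pd->td', att, V)

# Rearranged route: exp(K + w) = exp(w) exp(K), so reduce over t' first (cf. Algorithm 1).
num = torch.exp(w) @ (torch.exp(K) * V)                # (T, d)
den = torch.exp(w) @ torch.exp(K)                      # (T, d)
Y_fast = torch.sigmoid(Q) * num / den

print(torch.allclose(Y_explicit, Y_fast, atol=1e-5))   # True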
In addition, a naive parameterization of w is parameter intensive. We thus adopt a factorized form of w:

w = pe_0 \, pe_1^\top, \quad \text{s.t.} \quad pe_0 ∈ R^{T×n}, \; pe_1 ∈ R^{T×n},    (6)

where n is a small embedding dimension (e.g., 128). This simple factorization not only greatly reduces the parameter count (2Tn vs. T^2), but also empirically improves the model's performance in both training and testing. We denote this version as DAFT-full, and show PyTorch-style pseudo code in Algorithm 1.

Algorithm 1: Pseudo code of the DAFT-full attention layer in a PyTorch-like style.

def daft_attn(q, k, v, pe0, pe1):
    # B: batch size; T: sequence length; d: feature dimension
    # q, k, v: query, key and value, each of shape (B, T, d)
    # pe0, pe1: position embeddings of shape (T, n)
    # y: output of shape (B, T, d)
    # einsum refers to torch.einsum
    q = q.sigmoid()
    k = k.exp()
    kv = k * v
    w = pe0.mm(pe1.T)                          # compute pairwise position bias (T, T)
    w = w.exp()
    numerator = einsum('lt,btd->bld', w, kv)   # context aggregation
    denominator = einsum('lt,btd->bld', w, k)
    y = q * numerator / denominator            # normalization and output gating
    return y

Algorithm 2: Pseudo code of the DAFT-conv attention layer in a PyTorch-like style.

def daft_conv_attn(q, k, v, r):
    # B: batch size; T: 1d sequence length; (H, W): 2d input size
    # s: half of the window size; d: feature dimension
    # q, k, v: query, key and value, each of shape (B, d, T) or (B, d, H, W)
    # r: depth-wise separable convolutional filters of shape (d, 2s+1) or (d, 2s+1, 2s+1)
    #    (for torch.nn.functional.conv1d/conv2d, viewed as (d, 1, 2s+1) or (d, 1, 2s+1, 2s+1))
    # y: output of shape (B, d, T) or (B, d, H, W)
    q = q.sigmoid()
    k = k.exp()
    kv = k * v
    r = r.exp() - 1   # assuming r is already reparameterized as in Eq. (9)
    numerator = conv1d(kv, r, groups=d, padding='same') + kv.sum(dim=2, keepdim=True)
    # for 2d inputs:
    # numerator = conv2d(kv, r, groups=d, padding='same') + kv.sum(dim=[2, 3], keepdim=True)
    denominator = conv1d(k, r, groups=d, padding='same') + k.sum(dim=2, keepdim=True)
    # for 2d inputs:
    # denominator = conv2d(k, r, groups=d, padding='same') + k.sum(dim=[2, 3], keepdim=True)
    y = q * numerator / denominator   # normalization and output gating
    return y

3.2 DAFT-CONV

In DAFT-full, w encodes the static bias between two positions. It is thus straightforward to conceive a "multi-head" version where one learns h sets of position biases {w^i}_{i=1,...,h}, each interacting with a subset of feature dimensions in the same way as in MHA. However, as w is learned as free parameters, increasing h to large values poses an immediate challenge to parameter efficiency. We propose a simple solution to this dilemma, where we simultaneously make w^i local and share parameters across positions. To be more exact, we exploit the native structure of the input (1d or 2d), and assign a non-zero value to a position pair (t, t') based on their relative position, if they are within a specified window size 2s + 1 (assuming the window size is an odd number). In the 1d case (the 2d case can be derived similarly), this becomes

w_{t,t'}^i = \begin{cases} r_{t'-t+s}^i, & \text{if } |t - t'| \le s, \\ 0, & \text{otherwise.} \end{cases}    (7)

Here r^i ∈ R^{2s+1}, i = 1, 2, ..., h are the relative position vectors.[1]

[1] One can construct the pairwise matrix w^i ∈ R^{T×T} by tiling r^i along the spatial dimension.

We can now introduce DAFT-conv, a multi-head, convolutional version of DAFT. DAFT-conv can be implemented with the convolution primitive as (again using 1d convolution as an example):

Y_t^i = \sigma_q(Q_t^i) \odot \frac{\text{dw-conv}(\exp(K^i) \odot V^i, \exp(r^i) - 1)_t + \sum_{t'=1}^{T} \exp(K_{t'}^i) \odot V_{t'}^i}{\text{dw-conv}(\exp(K^i), \exp(r^i) - 1)_t + \sum_{t'=1}^{T} \exp(K_{t'}^i)}, \quad i = 1, 2, ..., h.    (8)

Here h is the number of heads; dw-conv(x, r) denotes the depth-wise separable 1d convolution, with x ∈ R^{T×d} being the input signal and r ∈ R^{2s+1} the convolutional filter, convolved along the first dimension with "same" padding.
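As an illustration of why the convolutional form is exact rather than an approximation, the following check (ours, not from the paper) confirms on random tensors that the local bias of Eq. (7), plugged into the generic numerator of Eq. (5), equals the dw-conv plus global sum numerator of Eq. (8); the denominator works identically with V removed:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d, s = 12, 4, 2                          # sequence length, dim (= number of heads here), half window
K, V = torch.randn(T, d), torch.randn(T, d)
r = torch.randn(d, 2 * s + 1)               # per-head relative position biases

# Dense route: tile r into w as in Eq. (7) (cf. footnote 1), then sum exp(w) * exp(K) * V over t'.
w = torch.zeros(d, T, T)
for t in range(T):
    for tp in range(max(0, t - s), min(T, t + s + 1)):
        w[:, t, tp] = r[:, tp - t + s]
kv = torch.exp(K) * V                                      # (T, d)
num_dense = torch.einsum('itp,pi->ti', torch.exp(w), kv)   # exp(0) = 1 outside the window

# Convolutional route: depth-wise conv with filters exp(r) - 1, plus the global sum.
conv = F.conv1d(kv.T.unsqueeze(0), (torch.exp(r) - 1).unsqueeze(1), groups=d, padding=s)
num_conv = (conv + kv.sum(dim=0).view(1, d, 1)).squeeze(0).T

print(torch.allclose(num_dense, num_conv, atol=1e-5))      # True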
Compared with DAFT-full, DAFT-conv provides both parameter and computational efficiency. Moreover, DAFT-conv enjoys the same locality and translational equivariance inductive biases as ConvNets, which have proven useful in many domains. A distinctive property of DAFT-conv is that global connectivity between any two positions is maintained regardless of the choice of window size. We verify experimentally that this enables DAFT-conv to achieve strong performance even with very small window sizes.

Implementation. During training, we adopt a re-parameterization of the convolutional filters r, where for each head i we let

r^i = \gamma^i \, \frac{r^i - \text{mean}(r^i)}{\text{std}(r^i)} + \beta^i,    (9)

where γ ∈ R^d, β ∈ R^d are learnable gain and bias parameters, both initialized as 0. We found that this accelerates learning and gives a consistent boost to performance. We show PyTorch-style pseudo code in Algorithm 2, where we assume h = d for simplicity.
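A minimal sketch of the reparameterization of Eq. (9) (the shapes and the small eps are our assumptions; gamma and beta would be registered as learnable parameters initialized to zero):

import torch

def reparameterize(r: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    # r: (h, 2s+1) raw filters; gamma, beta: (h, 1) learnable gain and bias
    mean = r.mean(dim=-1, keepdim=True)
    std = r.std(dim=-1, keepdim=True)
    return gamma * (r - mean) / (std + 1e-8) + beta   # eps added here for numerical safety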
4 RELATED WORK

Since the Transformer was introduced, there have been numerous attempts to address the major source of inefficiency in the architecture, the quadratic cost of the attention operation. Improving this operation can enable larger context sizes and more efficient implementations. For a comprehensive, recent survey of efficient transformers, see (Tay et al., 2020c).

Approximating the dot product. Katharopoulos et al. (2020); Choromanski et al. (2020); Peng et al. (2021) propose to approximate the exponential kernel with an inner product of projections, which leads to a linearized attention operation of complexity O(Td^2). The d^2 term of these models, however, makes it difficult to scale with model size, which is not a problem for DAFT. Reformer (Kitaev et al., 2020) applies LSH as an approximation to the dot product, whereas DAFT gets rid of it completely.

Sparse, local attention. Sparse Transformers (Child et al., 2019) and the Image Transformer (Parmar et al., 2018) propose to use fixed sparse or local context patterns. Attention models in vision tasks (often combined with convolutions) use image structure to help handcraft relevant spatial patterns to attend to (Wang et al., 2020a; Huang et al., 2019b; Zhu et al., 2019; Huang et al., 2019a; Ramachandran et al., 2019). DAFT-conv also borrows the locality idea, but uses it as a bias rather than a hard constraint. This allows DAFT-conv to take advantage of the full context, rather than relying only on a subset.

Context compression. Other approaches try to learn context patterns. Adaptive-Span Transformers (Sukhbaatar et al., 2019) learn a range for each attention head within which to attend. Routing Transformers (Roy et al., 2020) use clustering to compute dot-product attention only over a subset of elements within the same cluster. The Linformer (Wang et al., 2020b) reduces the length of the context by compressing the keys and values with a linear layer. Compressive Transformers (Rae et al., 2020) compute and update reduced representations of inputs that are far enough back in the sequence, and attend to those compressed representations. DAFT is largely complementary to these approaches, as our focus is to improve the complexity of the attention operation itself.

Eliminating dot product attention. Instead of limiting the number of comparisons, other methods change the operation used to compute attention. The Synthesizer (Tay et al., 2020a) uses attention weights predicted from inputs, rather than derived from dot-product interactions. The LightConv module introduced in (Wu et al., 2019) replaces dot product self-attention with a dynamic lightweight depthwise convolution, where the weights are normalized across the temporal dimension. The Sinkhorn Transformer (Tay et al., 2020b) uses a differentiable sorting operation to identify relevant comparisons that may not be local in the original sequence order. DAFT offers a different approach along this line, while highlighting strong empirical performance and efficiency.

MLPs for vision. Recent works (Tolstikhin et al., 2021; Liu et al., 2021) explore the use of MLPs in place of the attention operation for vision tasks. While DAFT can be viewed in a similar way, it is also equipped with a more sophisticated gating mechanism. In particular, the weighting of the values is composed of both the keys and the position biases, which are normalized to non-negative values (similar to attention). This allows DAFT to be a plugin module to existing Transformers without any architectural changes or extra tuning. Besides, DAFT-conv inherits the valuable properties of CNNs, allowing it to achieve excellent parameter efficiency, strong performance, as well as the ability to handle variable sized inputs.

Figure 1: Left: the 197 × 197 position bias from the 3rd layer of DAFT-full. Right: relative position biases learned by DAFT-conv. Each row represents a layer (with layer index ranging over {0, 2, 4, 6, 8, 10}); each column represents a head.

Table 2: ImageNet-1K classification results with the Transformer architecture from DeiT (Touvron et al., 2020), crop size 224. Speed and memory consumption are measured in inference mode on a V100 GPU with batch size 256.

Model | Kernel | Heads | Epochs | Top-1 Acc | #Params (MB) | Images/Sec | Mem (GB)
ResNet50 | 3 | - | 90 | 76.9 | 25.6 | 1257 | 6.5
DeiT-S | - | 6 | 300 | 79.9 | 22.1 | 1010 | 2.9
MLP Mixer | - | - | 300 | 75.0 | 18.6 | 1033 | 2.4
Linear Trans. | - | 6 | 300 | 74.9 | 22.1 | 1089 | 2.6
Performer | - | 6 | 300 | 79.0 | 22.1 | 1024 | 2.9
DAFT-full | - | 1 | 300 | 79.8 | 22.6 | 1011 | 2.6
DAFT-conv | 11 | 16 | 300 | 80.2 | 20.3 | 989 | 2.5
DAFT-conv | 11 | 384 | 300 | 80.8 | 23.0 | 936 | 3.2
DAFT-conv | 11 | 384 | 200 | 80.1 | 23.0 | 936 | 3.2

5 EXPERIMENTS

We conduct experiments on three tasks: image classification (Sec. 5.1), image autoregressive modeling (Sec. 5.2) and character level language modeling (Sec. 5.3). All experiments are designed in a plug and play fashion: we take a baseline Transformer architecture for the specific task and replace its attention module with a DAFT module (see the sketch below). Hyperparameters such as initialization and learning rate scheduling are also directly inherited from the Transformer counterparts. Unless otherwise mentioned, all experiments are conducted on 8×V100 GPU machines.
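The sketch below illustrates this plug and play protocol (a hypothetical pre-norm block of our own; the DAFT module is assumed to wrap daft_attn or daft_conv_attn from Algorithms 1-2 together with its Q/K/V projections):

import torch.nn as nn

class Block(nn.Module):
    # A standard pre-norm Transformer block; self.attn can be any module mapping (B, T, d) to (B, T, d).
    def __init__(self, d: int, attn: nn.Module, ffn_mult: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)
        self.attn = attn   # swap multi-head attention for a DAFT module here, keep everything else
        self.ffn = nn.Sequential(
            nn.Linear(d, ffn_mult * d), nn.GELU(), nn.Linear(ffn_mult * d, d))

    def forward(self, x):                   # x: (B, T, d)
        x = x + self.attn(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x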
5.1 IMAGE CLASSIFICATION

We first test the encoder version of DAFT, focusing on an image classification task. We adopt the Vision Transformer architecture (Dosovitskiy et al., 2020), and perform experiments on the ImageNet-1K classification dataset. We adopt the training settings and hyperparameters (batch size, data augmentation, regularization and learning rate scheduling) from DeiT (Touvron et al., 2020). In a nutshell, a ViT splits an image into 16 × 16 non-overlapping patches, then linearly projects each patch with shared weights to the equivalent of token embeddings. A learned class token is appended to the resulting representation, giving a sequence of length T = 1 + (H/16) × (W/16). A linear classification head is attached to the final layer's class token to obtain the final outputs. See (Dosovitskiy et al., 2020) for more details of the model configuration. All experiments are conducted on the ImageNet-1K dataset, without using extra data.

Figure 2: Effectiveness of the global connectivity of DAFT-conv, with varying kernel size on ImageNet (top-1 accuracy vs. kernel sizes from 3 to 21, global vs. local connectivity). DAFT-conv's performance is robust to the kernel size. When removing the global connectivity, small kernel sizes lead to severe performance degradation.

Since the sequence length is relatively small in this task (T = 197 for input size 224 × 224), we first experiment with DAFT-full. The hidden dimension of the factorized position bias is set to n = 128. We also experiment with DAFT-conv. In this setting, we remove the position embedding and the class token, and apply global average pooling after the final layer's output, which is then fed into the classification linear layer. This modification not only simplifies the model design, but also makes DAFT-conv fully convolutional.

We adopt the DeiT "small" configuration (L=12, d=384, h=6, denoted DeiT-S) as the baseline. We also compare with MLP-Mixer (Tolstikhin et al., 2021), where we replace the attention layer with an MLP of hidden layer size D_S = 4T. This roughly matches the number of parameters in the corresponding MHA layer. We also compare with two linear attention models, Linear Transformer (Katharopoulos et al., 2020) and Performer (Choromanski et al., 2020). We keep the Q, K dimensions the same as in DeiT-S in both cases, and adopt the 1+elu and relu nonlinearities, as recommended in the respective papers.

Our results are shown in Table 2. We first see that DAFT-full achieves performance comparable to the baseline Transformer DeiT-S, with a better memory footprint and similar speed. DAFT-conv significantly improves the top-1 accuracy in both of its configurations, with similar or smaller parameter counts. DAFT also outperforms Linear Transformer and Performer w.r.t. top-1 accuracy. Empirically, we also observe that DAFT converges faster than Transformers. In fact, we are able to match the performance of DeiT-S by training a DAFT-conv model for only 2/3 of the total epochs.

Visualization. We also visualize the position biases (exp(w) − 1, to be precise) learned by DAFT-full and DAFT-conv, as shown in Figure 1. Note that interesting local, symmetric, sparse patterns emerge. We show in the Appendix that we can regularize the position biases to achieve more sparsity. We also show an extreme version of DAFT-conv, where each head is assigned a single non-zero context point, while still keeping good accuracy. This effectively transforms convolution into indexing.
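For reference, the visualized quantity can be recovered from a trained DAFT-full layer as follows (a small sketch; variable names are assumptions, cf. Eq. (6) and Algorithm 1):

import torch

def position_bias_map(pe0: torch.Tensor, pe1: torch.Tensor) -> torch.Tensor:
    # pe0, pe1: (T, n) factorized position embeddings of one layer
    w = pe0 @ pe1.T           # (T, T) pairwise position biases
    return torch.exp(w) - 1   # the map shown in Figure 1 (left) and Figure 3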
Variable size inputs. DAFT-conv is fully convolutional, which means that it can handle input sizes different from those seen in training. We tested a DAFT-conv model (second to last row of Table 2, trained with crop size 224) on a larger crop size of 384. This results in an improved accuracy of 81.6, compared with the original 80.8. This makes DAFT-conv well suited to the pretraining-finetuning workflows often seen in vision tasks.

Table 3: Finetuning DAFT-conv for 100 epochs from a pretrained "DeiT base" on 384 × 384 crops. "ft" and "rand" stand for finetuning and random initialization, respectively. See the Appendix for more details.

Model | Kernel | Heads | Top-1 Acc | #Params (MB) | Images/Sec | Mem (GB)
DeiT base | - | 12 | 82.9 | 86.9 | 89.6 | 13.6
DAFT-conv ft | 25 | 32 | 83.4 | 79.7 | 98.5 | 8.9
DAFT-conv rand | 25 | 32 | 81.6 | 79.7 | 98.5 | 8.9

Table 4: NLL results on CIFAR10, evaluated in bits/dim (lower is better). Speed and memory are measured during training, with a batch size of 32 across 8 V100 GPUs.

Method | L | d | h | Train loss | Test loss | Iters/Sec | GB/GPU
PixelCNN | - | - | - | - | 3.14 | - | -
PixelSNAIL | - | - | - | - | 2.85 | - | -
Sparse Transformer strided | 128 | 256 | 2 | - | 2.80 | - | -
Image Transformer local2d | 12 | 512 | 4 | 3.08 | 2.90 | 1.61 | 22.3
Transformer | 12 | 512 | 4 | 2.90 | 2.88 | 1.35 | 30.6
Synthesizer | 12 | 512 | 4 | 2.79 | 2.91 | 1.45 | 23.0
Reformer | 12 | 512 | 4 | 2.88 | 2.93 | 1.41 | 14.3
DAFT-full | 12 | 512 | 1 | 2.78 | 2.89 | 1.68 | 10.2
DAFT-conv-256 | 12 | 512 | 512 | 2.73 | 2.81 | 1.63 | 9.0

Compatibility with Transformers. Although DAFT is not designed to directly approximate MHA, the two share considerable similarity in that the value vectors are aggregated with learned non-negative weightings in both models. We hypothesize that representations learned by one model can be transferred to the other. To test this, we obtain a pretrained "DeiT base" model with crop size 384. We then train a DAFT-conv model by initializing its weights with those of the DeiT model, excluding the position embeddings, the class token, and the key and query projections. We use a batch size of 64 and train the model for 100 epochs. As a control, we also train a randomly initialized DAFT-conv for the same number of epochs. The results are shown in Table 3. Interestingly, we see that the finetuned version of DAFT-conv achieves significantly higher accuracy than the randomly initialized version. The resulting model is also more accurate, faster and more memory efficient than the original DeiT model.

Global connectivity. DAFT-conv maintains global connectivity regardless of the local kernel size, which distinguishes it from sparse and local attention works. To see the benefit of this design, we trained a degenerate variant of DAFT-conv, where we modify Equation 7 to assign −∞ values to w_{t,t'} outside the local window (zero weights after exponentiation). We then train a set of DAFT-conv models with varying kernel sizes; the results are shown in Fig. 2. We see that DAFT-conv's performance is extremely robust to the choice of kernel size. Without the global connectivity, small kernel sizes lead to severe performance degradation.

5.2 IMAGE AUTOREGRESSIVE MODELING

In the next set of experiments, we consider the problem of image autoregressive modeling by minimizing the negative log likelihood (NLL). Similar to (Parmar et al., 2018), we represent an RGB image as a sequence of length H × W × 3, with H, W being the height and width, respectively. Each sub-pixel is represented as a 256-way discrete variable. We use CIFAR10 as the benchmarking dataset.
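A minimal sketch (ours, not the paper's data pipeline) of this raster-scan tokenization, where each sub-pixel becomes one 256-way token:

import torch

def image_to_tokens(img: torch.Tensor) -> torch.Tensor:
    # img: uint8 tensor of shape (H, W, 3) with values in [0, 255]
    assert img.dtype == torch.uint8 and img.dim() == 3 and img.shape[-1] == 3
    return img.reshape(-1).long()   # (H * W * 3,) integer tokens in [0, 255]

# For CIFAR10, H = W = 32, so each image becomes a sequence of 32 * 32 * 3 = 3072 tokens.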
Our reference Transformer design largely follows that of (Chen et al.): a 12 layer, 512 dimensional, 4 head architecture. We use AdamW (Loshchilov & Hutter, 2019), and follow a standard warmup learning rate schedule as in (Vaswani et al., 2017). We use an initial learning rate of 1 × 10^{-3}, a weight decay of 0.1 applied to all linear transformation weights, and a dropout rate of 0.1. We adopt simple data augmentation: during training, we first randomly flip each image horizontally, then add or subtract a value in the range [−10, 10] from all of its subpixels, and clip the resulting pixel values to [0, 255]. We use the cross entropy loss, and a default batch size of 32 for 100 training epochs.

We train two DAFT models: DAFT-full with a factorized position bias (hidden dimension 256), and DAFT-conv with a 1d kernel size of 256. Note that in this case, the 1d DAFT-conv does not take advantage of the true 2d structure of pixels; its locality and parameter sharing mainly serve as a way of saving computation.

Comparing with the state of the art. CIFAR10 is a crowded benchmark for image autoregressive modeling, and we compare with a few competitive baselines, as shown in Table 4. Note that CIFAR10 has an unrolled sequence length of 3072, which is already prohibitive for training a full Transformer of reasonable size. Another baseline is the Image Transformer (Parmar et al., 2018), which restricts attention to local2d windows of size 256. We also compare to the Synthesizer (Tay et al., 2020a) and Reformer (Kitaev et al., 2020). From Table 4, we see that DAFT-full yields performance similar to the baseline Transformer, and outperforms the Image Transformer, Synthesizer and Reformer. Interestingly, DAFT-conv provides even stronger performance than DAFT-full, despite not utilizing the 2d local structure of pixels. Efficiency-wise, both DAFT variants are faster than the standard Transformer and the other baselines, while consuming only half of the memory.[2]
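As a side note (a standard conversion, not specific to this paper): if a model reports an average per-sub-pixel cross entropy in nats, the bits/dim figure of Table 4 (and likewise the bits per character in Sec. 5.3) is obtained by dividing by ln 2:

import math

def nats_to_bits(avg_nll_nats: float) -> float:
    return avg_nll_nats / math.log(2.0)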
5.3 LANGUAGE MODELING

We apply DAFT to character level language modeling on Enwik8 (Mahoney, 2011), another popular benchmark for autoregressive modeling. We follow the standard preprocessing procedures and training/validation/test splits of (Dai et al., 2019). Our base Transformer reference is a 12 layer, 512 dimensional, 8 head architecture with 2048 feed forward dimensions.

Table 5: Enwik8 results, measured in bits per character (bpc); lower is better. Baselines compared are Reformer (Kitaev et al., 2020), Synthesizer (Tay et al., 2020a) (its best performing dense version), Linear Transformer (Katharopoulos et al., 2020) and Performer (Choromanski et al., 2020). L, d, h, T denote the number of blocks (depth), feature dimension, number of heads, and sequence length, respectively. Speed and memory are measured during training, with a batch size of 128 on an 8 V100 GPU node. Both Linear Transformer and Performer are implemented with customized CUDA kernels (github.com/idiap/fast-transformers); all other models are implemented in native PyTorch.

Method | L | d | h | T | Train bpc | Test bpc | Iters/Sec | GB/GPU
Transformer | 12 | 512 | 8 | 1024 | 0.977 | 1.137 | 1.42 | 29.4
Reformer | 12 | 512 | 8 | 1024 | 1.04 | 1.195 | 1.05 | 20.9
Synthesizer | 12 | 512 | 8 | 1024 | 0.994 | 1.298 | 1.49 | 29.9
Linear Transformer | 12 | 512 | 8 | 1024 | 0.981 | 1.207 | 1.46 | 10.6
Performer | 12 | 512 | 8 | 1024 | 1.002 | 1.199 | 1.44 | 10.1
DAFT-full | 12 | 512 | 1 | 1024 | 0.88 | 1.201 | 1.85 | 11.3
DAFT-conv-64 | 12 | 512 | 512 | 1024 | 0.854 | 1.180 | 1.77 | 11.1

For the first set of experiments, we use a sequence length of 1024. Our training protocol is largely the same as in the previous experiment, except that we increase the weight decay to 0.5 and train for 100 epochs with batch size 128. We evaluate DAFT-full and DAFT-conv with a kernel size of 64. We also compare to several efficient Transformer baselines, namely Reformer (Kitaev et al., 2020), Synthesizer (Tay et al., 2020a), Linear Transformer (Katharopoulos et al., 2020) and Performer (Choromanski et al., 2020). From Table 5, we see that both DAFT models achieve the lowest training bits per character (bpc), an indicator of high model capacity. Their test performance is slightly worse than that of the basic Transformer, but outperforms all the other Transformer variants.

6 CONCLUSIONS

We have introduced the Dot product Attention Free Transformer, which replaces dot product attention with an efficient new operation. We have demonstrated strong results on a set of standard benchmarks with excellent efficiency. We believe that our model opens a new design space for Transformer-like models, and will see impact in various areas where self attention is needed.

[2] A fair speed/memory comparison against the Sparse Transformer is infeasible, as it relies on a set of advanced implementation tricks such as mixed precision and gradient checkpointing, whereas DAFT is implemented with standard PyTorch utilities run in full precision.

REFERENCES

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Irwan Bello. LambdaNetworks: Modeling long-range interactions without attention. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=xTJEN-ggl1b.

Mark Chen, Alec Radford, Rewon Child, Jeff Wu, and Heewoo Jun. Generative pretraining from pixels.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019. URL http://arxiv.org/abs/1904.10509.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking attention with performers, 2020.

Zihang Dai, Z. Yang, Yiming Yang, J. Carbonell, Quoc V. Le, and R. Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. ArXiv, abs/1901.02860, 2019.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

Lang Huang, Y. Yuan, Jianyuan Guo, Chao Zhang, X. Chen, and Jingdong Wang. Interlaced sparse self-attention for semantic segmentation. ArXiv, abs/1907.12273, 2019a.

Zilong Huang, Xinggang Wang, Lichao Huang, C. Huang, Yunchao Wei, and Wenyu Liu. CCNet: Criss-cross attention for semantic segmentation. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 603–612, 2019b.
Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax, 2017.

A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning (ICML), 2020. URL https://fleuret.org/papers/katharopoulos-et-al-icml2020.pdf.

Nikita Kitaev, L. Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. ArXiv, abs/2001.04451, 2020.

Hanxiao Liu, Zihang Dai, David R. So, and Quoc V. Le. Pay attention to MLPs, 2021.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.

Matt Mahoney. Large text compression benchmark, 2011.

Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. arXiv preprint arXiv:1802.05751, 2018.

Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong. Random feature attention. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=QtTKTdVrFBB.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training.

Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, and T. Lillicrap. Compressive transformers for long-range sequence modelling. ArXiv, abs/1911.05507, 2020.

Prajit Ramachandran, Niki Parmar, Ashish Vaswani, I. Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. ArXiv, abs/1906.05909, 2019.

Aurko Roy, M. Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. ArXiv, abs/2003.05997, 2020.

Sainbayar Sukhbaatar, E. Grave, P. Bojanowski, and Armand Joulin. Adaptive attention span in transformers. In ACL, 2019.

Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer: Rethinking self-attention in transformer models, 2020a.

Yi Tay, Dara Bahri, L. Yang, Donald Metzler, and D. Juan. Sparse Sinkhorn attention. ArXiv, abs/2002.11296, 2020b.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey, 2020c.

Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. MLP-Mixer: An all-MLP architecture for vision, 2021.

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Huiyu Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and Liang-Chieh Chen. Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation. ArXiv, abs/2003.07853, 2020a.

Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. ArXiv, abs/2006.04768, 2020b.

Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and M. Auli. Pay less attention with lightweight and dynamic convolutions. ArXiv, abs/1901.10430, 2019.
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, C. Alberti, S. Ontañón, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and A. Ahmed. Big Bird: Transformers for longer sequences. ArXiv, abs/2007.14062, 2020.

Zhen Zhu, Mengdu Xu, Song Bai, Tengteng Huang, and X. Bai. Asymmetric non-local neural networks for semantic segmentation. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 593–602, 2019.

A APPENDIX

B ADDITIONAL ABLATIONS

We conducted additional experiments in the ImageNet-1K classification setting.

Factorization of w. We first verify the importance of the factorized parameterization of DAFT-full. As shown in Table 6, the non-factorized parameterization of DAFT-full achieves worse training and test performance than the factorized version.

Reparameterization of r. For DAFT-conv, we by default apply the reparameterization described in Sec. 3.2. We verify that this design effectively improves the model's performance, as shown in Table 7.

Figure 3: Exponentiated position biases learned by DAFT-full, trained on ImageNet-1K, shown for layers 1, 2, ..., 12, arranged from top left to bottom right. Each image is of size 197 × 197, where the first element corresponds to the class token, and the remaining 196 correspond to the 14 × 14 positions. We see that local, sparse patterns are learned without explicit supervision.

Figure 4: Image completion with DAFT-conv trained on the CIFAR10 autoregressive modeling task. Top: masked images from the test set. Bottom: completed images.

Figure 5: Training dynamics of DAFT compared with baseline models (DeiT-S, Linear Transformer) on ImageNet: training loss and top-1 accuracy over 300 training epochs. DAFT-conv enjoys faster convergence than the others.

Figure 6: Top-1 test accuracy when finetuning a pretrained DeiT model into DAFT-conv (DAFT-conv-ft), compared with random initialization (DAFT-conv-rand). DAFT-conv benefits greatly from a pretrained DeiT model, indicating the compatibility between the two families of models.

Table 6: The effect of the factorized parameterization of DAFT-full.

| Non Factorized | Factorized (default)
Train loss | 3.17 | 3.08
Top-1 Acc | 78.2 | 79.8

Table 7: The effect of the reparameterization of DAFT-conv (kernel size 7 × 7).

| Naive param | Reparameterized (default)
Train loss | 3.11 | 2.94
Top-1 Acc | 79.4 | 80.8

Convergence. We found that DAFT variants enjoy quick convergence compared to Transformer-like models. We show the training curves in Fig. 5. In addition, we also show the training curve for the finetuning experiment corresponding to Table 3 in Fig. 6.

Model scaling. Next we investigate the performance of DeiT vs. DAFT-conv as we scale the model size. We train models with different feature dimensions, while keeping the dimension of each attention head fixed for DeiT, and the kernel size fixed for DAFT-conv. The result is shown in Fig. 7. We see that, throughout the range of model sizes considered, DAFT-conv consistently outperforms the DeiT model of similar size.

Contribution of the query. The query term contributes only a small fraction of DAFT's computation, but it contributes significantly to DAFT's performance.
We conducted an additional experiment with DAFT-conv (384 heads, kernel sizes 11 × 11 and 15 × 15) where we remove the query term. The results are shown in Table 8.

Visualizing the keys. The keys play a central role in DAFT, as they provide content dependent reweighting for effective context reduction. In order to understand their behavior, we visualize the feature maps of the keys for a DAFT-conv model on randomly sampled images from the validation set of ImageNet-1K, as shown in Figs. 11, 12, 13 and 14. Interestingly, we see that the keys gradually evolve into "object detectors" as the layer level goes up.

Table 8: Top-1 accuracy of DAFT-conv without the query term (w/o q). Removing the query results in significant performance drops.

Kernel | with q (default) | w/o q
11 | 80.8 | 79.3
15 | 80.9 | 79.5

Figure 7: Performance of DeiT vs. DAFT-conv as the model size scales (top-1 accuracy vs. #params in MB).

Figure 8: Exponentiated position biases learned by DAFT-conv, trained on ImageNet-1K. Each row corresponds to a layer, each column to a head (the first 16 are shown). This model has a top-1 accuracy of 80.8%.

Figure 9: Exponentiated position biases learned by DAFT-conv (kernel size 11 × 11) with sparsity regularization, trained on ImageNet-1K. Each row corresponds to a layer, each column to a head (the first 16 are shown). This model has a top-1 accuracy of 80.9%.

C SPARSITY

The position biases learned by DAFT-conv (kernel size 11 × 11), shown in Figure 8, demonstrate interesting sparsity patterns, which suggests great potential for quantization and pruning. To this end, we experimented with a simple sparsity promoting regularization term:

\text{reg}(w) = \sum_{i=1}^{h} H(w^i), \quad H(w^i) = \text{entropy}(\text{softmax}(w^i)),    (10)

where we simply minimize the entropy for each head, with the softmax distribution using w^i as the logits. We combine reg(w) with the cross entropy loss using a small weighting (0.001), and train DAFT-conv with kernel size 11 and 384 heads. This results in a slight improvement in accuracy (due to its regularization effect) of 80.9 vs. 80.8, as well as sparser looking position biases. The visualization is shown in Fig. 9. We see that the position biases are much more sparsely distributed, as expected.

Encouraged by this, we continued to push the sparsity to an extreme form, where for each head we assign a learned relative position bias to only a single position. To do this, during training we multiply the position biases w for each layer and each head with a sample from its corresponding Gumbel softmax distribution (Jang et al., 2017):

w^i = w^i * \text{gumbel}(w^i; \tau),    (11)

where τ is the temperature term of the Gumbel softmax, which we set to 0.5; gumbel(w^i; τ) produces a (sparse) sample with the same shape as w^i. During inference, the Gumbel softmax is replaced with a hard max, i.e., a one hot vector is returned. This results in a model with a top-1 accuracy of 79.9, less than a 1 point drop compared with the unregularized model. The position biases are visualized in Fig. 10. This extreme model variant makes it possible to implement the context reduction of K, V with a combination of global average pooling and indexing operations.
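The two ingredients above can be sketched as follows (illustrative code; the shapes and per-head flattening are our assumptions, and F.gumbel_softmax is the standard PyTorch implementation of (Jang et al., 2017)):

import torch
import torch.nn.functional as F

def sparsity_reg(w: torch.Tensor) -> torch.Tensor:
    # Eq. (10): per-head entropy of softmax(w^i), summed over heads.
    # w: per-head position biases, e.g. of shape (h, 2s+1, 2s+1) for DAFT-conv.
    logits = w.reshape(w.shape[0], -1)
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-12)).sum(dim=-1).sum()   # added to the loss with weight 0.001

def gumbel_mask(w: torch.Tensor, tau: float = 0.5, hard: bool = False) -> torch.Tensor:
    # Eq. (11): multiply each head's biases by a Gumbel softmax sample of themselves;
    # hard=True (inference) returns a one-hot mask, keeping a single position per head.
    logits = w.reshape(w.shape[0], -1)
    g = F.gumbel_softmax(logits, tau=tau, hard=hard, dim=-1).reshape_as(w)
    return w * g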
Figure 10: Exponentiated position biases learned by DAFT-conv (kernel size 11 × 11) with Gumbel softmax sampling, trained on ImageNet-1K. Each row corresponds to a layer, each column to a head (the first 16 are shown). This model has a top-1 accuracy of 79.9%.

Figure 11: Top: sample image from the validation set of ImageNet-1K. Bottom: visualization of the keys for DAFT-conv, with each row corresponding to a layer and each column to a head.

Figure 12: Top: sample image from the validation set of ImageNet-1K. Bottom: visualization of the keys for DAFT-conv, with each row corresponding to a layer and each column to a head.

Figure 13: Top: sample image from the validation set of ImageNet-1K. Bottom: visualization of the keys for DAFT-conv, with each row corresponding to a layer and each column to a head.

Figure 14: Top: sample image from the validation set of ImageNet-1K. Bottom: visualization of the keys for DAFT-conv, with each row corresponding to a layer and each column to a head.