Efficiently Modeling Long Sequences with Structured State Spaces

Albert Gu, Karan Goel, and Christopher Ré
Department of Computer Science, Stanford University
{albertgu,krng}@stanford.edu, chrismre@cs.stanford.edu

Abstract

A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of 10000 or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) x′(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t), and showed that for appropriate choices of the state matrix A, this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning A with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet; (ii) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation 60× faster; and (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors.¹

¹ Code is publicly available at https://github.com/HazyResearch/state-spaces.

1 Introduction

A central problem in sequence modeling is efficiently handling data that contains long-range dependencies (LRDs). Real-world time-series data often requires reasoning over tens of thousands of time steps, while few sequence models address even thousands of time steps. For instance, results from the long-range arena (LRA) benchmark [37] highlight that sequence models today perform poorly on LRD tasks, including one (Path-X) where no model performs better than random guessing.

Since LRDs are perhaps the foremost challenge for sequence models, all standard model families such as continuous-time models (CTMs), RNNs, CNNs, and Transformers include many specialized variants designed to address them. Modern examples include orthogonal and Lipschitz RNNs [1, 13] to combat vanishing gradients, dilated convolutions to increase context size [3, 25], and an increasingly vast family of efficient Transformers that reduce the quadratic dependence on sequence length [8, 20]. Despite being designed for LRDs, these solutions still perform poorly on challenging benchmarks such as LRA [37] or raw audio classification [18].

An alternative approach to LRDs was recently introduced based on the state space model (SSM) (Fig. 1). SSMs are a foundational scientific model used in fields such as control theory, computational neuroscience, and many more, but have not been applicable to deep learning for concrete theoretical reasons. In particular, Gu et al.
[18] showed that deep SSMs actually struggle even on simple tasks, but can perform exceptionally well when equipped with special state matrices A recently derived to solve a problem of continuous-time memorization [16, 42]. Their Linear State Space Layer (LSSL) conceptually unifies the strengths of CTM, RNN and CNN models, and provides a proof of concept that deep SSMs can address LRDs in principle.

Figure 1: (Left) State Space Models (SSM) parameterized by matrices A, B, C, D map an input signal u(t) to output y(t) through a latent state x(t), via x′ = Ax + Bu, y = Cx + Du. (Center) Recent theory on continuous-time memorization derives special A matrices that allow SSMs to capture LRDs mathematically and empirically. (Right) SSMs can be computed either as a recurrence ($x_k = \overline{A}x_{k-1} + \overline{B}u_k$) or a convolution ($y = \overline{K} * u$). However, materializing these conceptual views requires utilizing different representations of its parameters (red, blue, green) which are very expensive to compute. S4 introduces a novel parameterization that efficiently swaps between these representations, allowing it to handle a wide range of tasks, be efficient at both training and inference, and excel at long sequences.

Unfortunately, the LSSL is infeasible to use in practice because of prohibitive computation and memory requirements induced by the state representation. For state dimension N and sequence length L, computing the latent state requires O(N^2 L) operations and O(NL) space – compared to a Ω(L + N) lower bound for both. Thus for reasonably sized models (e.g. N = 256 in Gu et al. [18]), the LSSL uses orders of magnitude more memory than comparably-sized RNNs or CNNs. Although theoretically efficient algorithms for the LSSL were proposed, we show that these are numerically unstable. In particular, the special A matrix is highly non-normal in the linear algebraic sense, which prevents the application of conventional algorithmic techniques. Consequently, although the LSSL showed that SSMs have strong performance, they are currently computationally impractical as a general sequence modeling solution.

In this work, we introduce the Structured State Space (S4) sequence model based on the SSM that solves the critical computational bottleneck in previous work. Technically, S4 reparameterizes the structured state matrices A appearing in Gu et al. [16], Voelker et al. [42] by decomposing them as the sum of a low-rank and skew-symmetric term. Additionally, instead of expanding the standard SSM in coefficient space, we compute its truncated generating function in frequency space, which can be simplified into a multipole-like evaluation. Combining these two ideas, we show that the low-rank term can be corrected by the Woodbury identity while the skew-symmetric term can be diagonalized stably, ultimately reducing to a well-studied and theoretically stable Cauchy kernel [26, 27]. This results in Õ(N + L) computation and O(N + L) memory usage, which is essentially tight for sequence models. Compared to the LSSL, S4 is up to 30× faster with 400× less memory usage, while exceeding the LSSL's performance empirically.

Empirically, S4 significantly advances the state-of-the-art for LRD. On the LRA benchmark for efficient sequence models, S4 is as fast as all baselines while outperforming them by 20+ points on average.
S4 is the first model to solve the difficult LRA Path-X task (length-16384), achieving 88% accuracy compared to 50% random guessing for all prior work. On speech classification with length-16000 sequences, S4 halves the test error (1.7%) of specialized Speech CNNs – by contrast, all RNN and Transformer baselines fail to learn (≥ 70% error). Towards a general-purpose sequence model. Beyond LRD, a broad goal of machine learning is to develop a single model that can be used across a wide range of problems. Models today are typically 2 specialized to solve problems from a particular domain (e.g. images, audio, text, time-series), and enable a narrow range of capabilities (e.g. efficient training, fast generation, handling irregularly sampled data). This specialization is typically expressed via domain-specific preprocessing, inductive biases, and architectures. Sequence models provide a general framework for solving many of these problems with reduced specialization – e.g. Vision Transformers for image classification with less 2D information [12]. However, most models such as Transformers generally still require substantial specialization per task to achieve high performance. Deep SSMs in particular have conceptual strengths that suggest they may be promising as a general sequence modeling solution. These strengths include a principled approach to handling LRDs, as well as the ability to move between continuous-time, convolutional, and recurrent model representations, each with distinct capabilities (Fig. 1). Our technical contributions enable SSMs to be applied successfully to a varied set of benchmarks with minimal modification: • Large-scale generative modeling. On CIFAR-10 density estimation, S4 is competitive with the best autoregressive models (2.85 bits per dim). On WikiText-103 language modeling, S4 substantially closes the gap to Transformers (within 0.8 perplexity), setting SoTA for attention-free models. • Fast autoregressive generation. Like RNNs, S4 can use its latent state to perform 60× faster pixel/token generation than standard autoregressive models on CIFAR-10 and WikiText-103. • Sampling resolution change. Like specialized CTMs, S4 can adapt to changes in time-series sampling frequency without retraining, e.g. at 0.5× frequency on speech classification. • Learning with weaker inductive biases. With no architectural changes, S4 surpasses Speech CNNs on speech classification, outperforms the specialized Informer model on time-series forecasting problems, and matches a 2-D ResNet on sequential CIFAR with over 90% accuracy. 2 Background: State Spaces Sections 2.1 to 2.4 describe the four properties of SSMs in Fig. 1: the classic continuous-time representation, addressing LRDs with the HiPPO framework, the discrete-time recurrent representation, and the parallelizable convolution representation. In particular, Section 2.4 introduces the SSM convolution kernel K, which is the focus of our theoretical contributions in Section 3. 2.1 State Space Models: A Continuous-time Latent State Model The state space model is defined by the simple equation (1). It maps a 1-D input signal u(t) to an N -D latent state x(t) before projecting to a 1-D output signal y(t). x′ (t) = Ax(t) + Bu(t) y(t) = Cx(t) + Du(t) (1) SSMs are broadly used in many scientific disciplines and related to latent state models such as Hidden Markov Models (HMM). Our goal is to simply use the SSM as a black-box representation in a deep sequence model, where A, B, C, D are parameters learned by gradient descent. 
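To make the black-box view concrete, the following minimal sketch simulates equation (1) for a toy system, mapping a 1-D input u(t) through an N-D latent state to a 1-D output y(t). It uses a crude forward-Euler step purely for illustration (the paper itself discretizes with the bilinear method of Section 2.3), and all names and parameter values below are hypothetical stand-ins for the learned A, B, C, D.

```python
import numpy as np

# A toy instantiation of equation (1): a 1-D input u(t) is lifted to an N-D
# latent state x(t) and projected back to a 1-D output y(t).
rng = np.random.default_rng(0)
N, dt, T = 4, 1e-3, 5000
A = rng.normal(size=(N, N))
A = A - A.T - np.eye(N)          # skew-symmetric minus identity: a comfortably stable toy A
B, C, D = rng.normal(size=N), rng.normal(size=N), 0.0

t = dt * np.arange(T)
u = np.sin(2 * np.pi * t)        # an example input signal u(t)
x = np.zeros(N)
y = np.empty(T)
for k in range(T):
    x = x + dt * (A @ x + B * u[k])   # crude forward-Euler step of x' = Ax + Bu
    y[k] = C @ x + D * u[k]           # readout y = Cx + Du
```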
For the remainder of this paper, we will omit the parameter D for exposition (or equivalently, assume D = 0) because the term Du can be viewed as a skip connection and is easy to compute.

2.2 Addressing Long-Range Dependencies with HiPPO

Prior work found that the basic SSM (1) actually performs very poorly in practice. Intuitively, one explanation is that linear first-order ODEs solve to an exponential function, and thus may suffer from gradients scaling exponentially in the sequence length (i.e., the vanishing/exploding gradients problem [30]). To address this problem, the LSSL leveraged the HiPPO theory of continuous-time memorization [16]. HiPPO specifies a class of certain matrices $A \in \mathbb{R}^{N \times N}$ that when incorporated into (1), allows the state x(t) to memorize the history of the input u(t). The most important matrix in this class is defined by equation (2), which we will call the HiPPO matrix. For example, the LSSL found that simply modifying an SSM from a random matrix A to equation (2) improved its performance on the sequential MNIST benchmark from 60% to 98%.

$$
\text{(HiPPO Matrix)} \qquad
A_{nk} = -
\begin{cases}
(2n+1)^{1/2}(2k+1)^{1/2} & \text{if } n > k \\
n + 1 & \text{if } n = k \\
0 & \text{if } n < k
\end{cases}
\tag{2}
$$

2.3 Discrete-time SSM: The Recurrent Representation

To be applied on a discrete input sequence (u_0, u_1, ...) instead of continuous function u(t), (1) must be discretized by a step size ∆ that represents the resolution of the input. Conceptually, the inputs u_k can be viewed as sampling an implicit underlying continuous signal u(t), where u_k = u(k∆).

To discretize the continuous-time SSM, we follow prior work in using the bilinear method [40], which converts the state matrix A into an approximation $\overline{A}$. The discrete SSM is

$$
\begin{aligned}
x_k &= \overline{A} x_{k-1} + \overline{B} u_k &\qquad \overline{A} &= (I - \Delta/2 \cdot A)^{-1}(I + \Delta/2 \cdot A) \\
y_k &= \overline{C} x_k &\qquad \overline{B} &= (I - \Delta/2 \cdot A)^{-1} \Delta B \qquad \overline{C} = C .
\end{aligned}
\tag{3}
$$

Equation (3) is now a sequence-to-sequence map u_k ↦ y_k instead of function-to-function. Moreover the state equation is now a recurrence in x_k, allowing the discrete SSM to be computed like an RNN. Concretely, $x_k \in \mathbb{R}^N$ can be viewed as a hidden state with transition matrix $\overline{A}$.

Notationally, throughout this paper we use $\overline{A}, \overline{B}, \ldots$ to denote discretized SSM matrices defined by (3). Note that these matrices are a function of both A as well as a step size ∆; we suppress this dependence for notational convenience when it is clear.

2.4 Training SSMs: The Convolutional Representation

The recurrent SSM (3) is not practical for training on modern hardware due to its sequentiality. Instead, there is a well-known connection between linear time-invariant (LTI) SSMs such as (1) and continuous convolutions. Correspondingly, (3) can actually be written as a discrete convolution.

For simplicity let the initial state be x_{-1} = 0. Then unrolling (3) explicitly yields

$$
\begin{aligned}
x_0 &= \overline{B} u_0 & x_1 &= \overline{A}\,\overline{B} u_0 + \overline{B} u_1 & x_2 &= \overline{A}^2 \overline{B} u_0 + \overline{A}\,\overline{B} u_1 + \overline{B} u_2 & \quad \dots \\
y_0 &= \overline{C}\,\overline{B} u_0 & y_1 &= \overline{C}\,\overline{A}\,\overline{B} u_0 + \overline{C}\,\overline{B} u_1 & y_2 &= \overline{C}\,\overline{A}^2 \overline{B} u_0 + \overline{C}\,\overline{A}\,\overline{B} u_1 + \overline{C}\,\overline{B} u_2 & \quad \dots
\end{aligned}
$$

This can be vectorized into a convolution (4) with an explicit formula for the convolution kernel (5).

$$
y_k = \overline{C}\,\overline{A}^k \overline{B} u_0 + \overline{C}\,\overline{A}^{k-1} \overline{B} u_1 + \cdots + \overline{C}\,\overline{A}\,\overline{B} u_{k-1} + \overline{C}\,\overline{B} u_k , \qquad y = \overline{K} * u .
\tag{4}
$$

$$
\overline{K} \in \mathbb{R}^L := \mathcal{K}_L(\overline{A}, \overline{B}, \overline{C}) := \left( \overline{C}\,\overline{A}^i \overline{B} \right)_{i \in [L]} = \left( \overline{C}\,\overline{B},\ \overline{C}\,\overline{A}\,\overline{B},\ \dots,\ \overline{C}\,\overline{A}^{L-1} \overline{B} \right) .
\tag{5}
$$

In other words, equation (4) is a single (non-circular) convolution and can be computed very efficiently with FFTs, provided that $\overline{K}$ is known. However, computing $\overline{K}$ in (5) is non-trivial and is the focus of our technical contributions in Section 3. We call $\overline{K}$ the SSM convolution kernel or filter.
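The following is a minimal numerical sketch (not the S4 algorithm) of the two equivalent views just described: it discretizes a toy SSM with the bilinear method (3), materializes the kernel naively as in (5), and checks that the FFT-based convolution (4) matches the step-by-step recurrence. All parameter values and helper names are illustrative.

```python
import numpy as np

def discretize_bilinear(A, B, dt):
    """Bilinear (Tustin) discretization from equation (3)."""
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - dt / 2 * A)
    return inv @ (I + dt / 2 * A), inv @ (dt * B)      # A-bar, B-bar

def ssm_kernel(Ab, Bb, C, L):
    """Naive O(N^2 L) materialization of K = (CB, CAB, ..., CA^{L-1}B) from (5)."""
    K, x = [], Bb
    for _ in range(L):
        K.append(C @ x)
        x = Ab @ x
    return np.array(K)

def causal_conv(K, u):
    """Single non-circular convolution y = K * u via FFT, as in equation (4)."""
    L = len(u)
    n = 2 * L                                           # zero-pad to avoid wrap-around
    return np.fft.irfft(np.fft.rfft(K, n) * np.fft.rfft(u, n))[:L]

# Check: recurrence (3) and convolution (4) give the same output.
rng = np.random.default_rng(0)
N, L, dt = 8, 64, 0.1
A = rng.normal(size=(N, N))
A = A - A.T - np.eye(N)                                 # a stable toy state matrix
B, C = rng.normal(size=N), rng.normal(size=N)
u = rng.normal(size=L)

Ab, Bb = discretize_bilinear(A, B, dt)
y_conv = causal_conv(ssm_kernel(Ab, Bb, C, L), u)

x, y_rec = np.zeros(N), []
for uk in u:
    x = Ab @ x + Bb * uk
    y_rec.append(C @ x)
assert np.allclose(y_conv, np.array(y_rec), atol=1e-6)
```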
3 Method: Structured State Spaces (S4)

Our technical results focus on developing the S4 parameterization and showing how to efficiently compute all views of the SSM (Section 2): the continuous representation (A, B, C) (1), the recurrent representation $(\overline{A}, \overline{B}, \overline{C})$ (3), and the convolutional representation $\overline{K}$ (4).

Section 3.1 motivates our approach, which is based on the linear algebraic concepts of conjugation and diagonalization, and discusses why the naive application of this approach does not work. Section 3.2 gives an overview of the key technical components of our approach and formally defines the S4 parameterization. Section 3.3 sketches the main results, showing that S4 is asymptotically efficient (up to log factors) for sequence models. Proofs are in Appendices B and C.

3.1 Motivation: Diagonalization

The fundamental bottleneck in computing the discrete-time SSM (3) is that it involves repeated matrix multiplication by $\overline{A}$. For example, computing (5) naively as in the LSSL involves L successive multiplications by $\overline{A}$, requiring O(N^2 L) operations and O(NL) space.

To overcome this bottleneck, we use a structural result that allows us to simplify SSMs.

Lemma 3.1. Conjugation is an equivalence relation on SSMs: $(A, B, C) \sim (V^{-1} A V, V^{-1} B, C V)$.

Proof. Write out the two SSMs with state denoted by x and x̃ respectively:

$$
\begin{aligned}
x' &= Ax + Bu &\qquad \tilde{x}' &= V^{-1} A V \tilde{x} + V^{-1} B u \\
y &= Cx &\qquad y &= C V \tilde{x}
\end{aligned}
$$

After multiplying the right side SSM by V, the two SSMs become identical with x = V x̃. Therefore these compute the exact same operator u ↦ y, but with a change of basis by V in the state x.

Lemma 3.1 motivates putting A into a canonical form by conjugation (note that although we ultimately require $\overline{A}$, conjugation commutes with discretization, so we refer to A), which is ideally more structured and allows faster computation. For example, if A were diagonal, the resulting computations become much more tractable. In particular, the desired $\overline{K}$ (equation (4)) would be a Vandermonde product which theoretically only needs O((N + L) log^2(N + L)) arithmetic operations [26].

Unfortunately, the naive application of diagonalization does not work due to numerical issues. First, Vandermonde multiplication is itself a famously ill-conditioned problem [29]. Furthermore, we derive the explicit diagonalization for the HiPPO matrix (2) and show it has entries exponentially large in the state size N, rendering the diagonalization numerically infeasible (e.g. CV in Lemma 3.1 would not be computable). We note that Gu et al. [18] proposed a different (unimplemented) algorithm to compute $\overline{K}$ faster than the naive algorithm. In Appendix B, we prove that it is also numerically unstable for related reasons.

Lemma 3.2. The HiPPO matrix A in equation (2) is diagonalized by the matrix $V_{ij} = \binom{i+j}{i-j}$. In particular, $V_{3i,i} = \binom{4i}{2i} \approx 2^{4i}$. Therefore V has entries of magnitude up to $2^{4N/3}$.
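A quick numerical check of Lemma 3.1: conjugating a toy SSM by an arbitrary well-conditioned change of basis V leaves its convolution kernel (and hence the operator u ↦ y) unchanged. The specific matrices below are illustrative only.

```python
import numpy as np

def ssm_kernel(A, B, C, L, dt=0.1):
    """Bilinear discretization (3) followed by the naive kernel (5)."""
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - dt / 2 * A)
    Ab, Bb = inv @ (I + dt / 2 * A), inv @ (dt * B)
    K, x = [], Bb
    for _ in range(L):
        K.append(C @ x)
        x = Ab @ x
    return np.array(K)

# Lemma 3.1: (A, B, C) and (V^{-1} A V, V^{-1} B, C V) define the same operator.
rng = np.random.default_rng(0)
N, L = 6, 32
A = rng.normal(size=(N, N))
A = A - A.T - np.eye(N)                       # a stable toy state matrix
B, C = rng.normal(size=N), rng.normal(size=N)
V = rng.normal(size=(N, N)) + 3 * np.eye(N)   # any well-conditioned change of basis

K1 = ssm_kernel(A, B, C, L)
K2 = ssm_kernel(np.linalg.solve(V, A @ V), np.linalg.solve(V, B), C @ V, L)
assert np.allclose(K1, K2)                    # identical kernels, different state bases
```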
3.2 The S4 Parameterization: Normal Plus Low-Rank

The previous discussion implies that we should only conjugate by well-conditioned matrices V. The ideal scenario is when the matrix A is diagonalizable by a perfectly conditioned (i.e., unitary) matrix. By the Spectral Theorem of linear algebra, this is exactly the class of normal matrices. However, this class of matrices is restrictive; in particular, it does not contain the HiPPO matrix (2).

We make the observation that although the HiPPO matrix is not normal, it can be decomposed as the sum of a normal and low-rank matrix. However, this is still not useful by itself: unlike a diagonal matrix, powering up this sum (in (5)) is still slow and not easily optimized. We overcome this bottleneck by simultaneously applying three new techniques.

• Instead of computing $\overline{K}$ directly, we compute its spectrum by evaluating its truncated generating function $\sum_{j=0}^{L-1} \overline{K}_j \zeta^j$ at the roots of unity ζ. $\overline{K}$ can then be found by applying an inverse FFT.

• This generating function is closely related to the matrix resolvent, and now involves a matrix inverse instead of power. The low-rank term can now be corrected by applying the Woodbury identity which reduces $(A + PQ^*)^{-1}$ in terms of $A^{-1}$, truly reducing to the diagonal case.

• Finally, we show that the diagonal matrix case is equivalent to the computation of a Cauchy kernel $\frac{1}{\omega_j - \zeta_k}$, a well-studied problem with stable near-linear algorithms [27, 28].

Our techniques apply to any matrix that can be decomposed as Normal Plus Low-Rank (NPLR).

Theorem 1. All HiPPO matrices from [16] have a NPLR representation

$$
A = V \Lambda V^* - P Q^\top = V \left( \Lambda - (V^* P)(V^* Q)^* \right) V^*
\tag{6}
$$

for unitary $V \in \mathbb{C}^{N \times N}$, diagonal Λ, and low-rank factorization $P, Q \in \mathbb{R}^{N \times r}$. These matrices HiPPO-LegS, LegT, LagT all satisfy r = 1 or r = 2. In particular, equation (2) is NPLR with r = 1.

3.3 S4 Algorithms and Computational Complexity

By equation (6), note that NPLR matrices can be conjugated into diagonal plus low-rank (DPLR) form (now over ℂ instead of ℝ). Theorems 2 and 3 describe the complexities of SSMs where A is in DPLR form. S4 is optimal or near-optimal for both recurrent and convolutional representations.

Theorem 2 (S4 Recurrence). Given any step size ∆, computing one step of the recurrence (3) can be done in O(N) operations where N is the state size.

Theorem 2 follows from the fact that the inverse of a DPLR matrix is also DPLR (e.g. also by the Woodbury identity). This implies that the discretized matrix $\overline{A}$ is the product of two DPLR matrices and thus has O(N) matrix-vector multiplication. Appendix C.2 computes $\overline{A}$ in closed DPLR form.

Theorem 3 (S4 Convolution). Given any step size ∆, computing the SSM convolution filter $\overline{K}$ can be reduced to 4 Cauchy multiplies, requiring only $\widetilde{O}(N + L)$ operations and O(N + L) space.

Appendix C, Definition 3 formally defines Cauchy matrices, which are related to rational interpolation problems. Computing with Cauchy matrices is an extremely well-studied problem in numerical analysis, with both fast arithmetic and numerical algorithms based on the famous Fast Multipole Method (FMM) [26, 27, 28]. The computational complexities of these algorithms under various settings are described in Appendix C, Proposition 5.

We reiterate that Theorem 3 is our core technical contribution, and its algorithm is the very motivation of the NPLR S4 parameterization. This algorithm is formally sketched in Algorithm 1.

Algorithm 1  S4 Convolution Kernel (Sketch)
Input: S4 parameters $\Lambda, P, Q, B, C \in \mathbb{C}^N$ and step size ∆
Output: SSM convolution kernel $\overline{K} = \mathcal{K}_L(\overline{A}, \overline{B}, \overline{C})$ for $A = \Lambda - PQ^*$ (equation (5))
1: $\tilde{C} \leftarrow (I - \overline{A}^L)^* C$  ▷ Truncate SSM generating function (SSMGF) to length L
2: $\begin{bmatrix} k_{00}(\omega) & k_{01}(\omega) \\ k_{10}(\omega) & k_{11}(\omega) \end{bmatrix} \leftarrow [\tilde{C} \;\; Q]^* \left( \tfrac{2}{\Delta} \tfrac{1-\omega}{1+\omega} - \Lambda \right)^{-1} [B \;\; P]$  ▷ Black-box Cauchy kernel
3: $\hat{K}(\omega) \leftarrow \tfrac{2}{1+\omega} \left[ k_{00}(\omega) - k_{01}(\omega)\,(1 + k_{11}(\omega))^{-1}\, k_{10}(\omega) \right]$  ▷ Woodbury Identity
4: $\hat{K} = \{ \hat{K}(\omega) : \omega = \exp(2\pi i \tfrac{k}{L}) \}$  ▷ Evaluate SSMGF at all roots of unity $\omega \in \Omega_L$
5: $\overline{K} \leftarrow \mathrm{iFFT}(\hat{K})$  ▷ Inverse Fourier Transform
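Below is a hedged NumPy sketch of Algorithm 1 for the rank-1 case, written for readability rather than efficiency. It materializes the dense $\overline{A}$ only to form $\tilde{C}$ and to verify the result against the naive kernel (5); the actual method never instantiates $\overline{A}$ (in practice $\tilde{C}$ is simply learned directly, as noted in Appendix C.3). The Cauchy step here is the naive O(NL) evaluation, the roots of unity are taken as $\exp(-2\pi i k/L)$ so that the final step is a plain inverse FFT, and all helper names are ours, not those of the released implementation.

```python
import numpy as np

def dplr_kernel(Lambda, P, Q, B, C, dt, L):
    """Sketch of Algorithm 1 for A = Lambda - P Q^* with a rank-1 correction.

    Steps: fold the length-L truncation into C-tilde, evaluate the four Cauchy
    terms at the roots of unity, apply the Woodbury correction, invert the DFT.
    """
    N = Lambda.shape[0]
    A = np.diag(Lambda) - np.outer(P, Q.conj())
    I = np.eye(N)
    Ab = np.linalg.inv(I - dt / 2 * A) @ (I + dt / 2 * A)      # dense A-bar, didactic only
    Ct = C.conj() @ (I - np.linalg.matrix_power(Ab, L))        # C-tilde^* as a row vector

    z = np.exp(-2j * np.pi * np.arange(L) / L)                 # roots of unity
    g = (2.0 / dt) * (1 - z) / (1 + z)                         # nodes of the resolvent R(z)
    cauchy = 1.0 / (g[:, None] - Lambda[None, :])              # naive (L, N) Cauchy kernel

    k00 = cauchy @ (Ct * B)                                    # C~* R(z) B
    k01 = cauchy @ (Ct * P)                                    # C~* R(z) P
    k10 = cauchy @ (Q.conj() * B)                              # Q*  R(z) B
    k11 = cauchy @ (Q.conj() * P)                              # Q*  R(z) P

    Khat = (2.0 / (1 + z)) * (k00 - k01 * k10 / (1 + k11))     # Woodbury correction
    return np.fft.ifft(Khat, L)                                # recover K from its spectrum

# Sanity check against the naive kernel (C B-bar, C A-bar B-bar, ...).
rng = np.random.default_rng(1)
N, L, dt = 16, 256, 0.01
Lambda = -0.5 * rng.random(N) + 1j * np.pi * rng.normal(size=N)
P, Q, B, C = (rng.normal(size=N) + 1j * rng.normal(size=N) for _ in range(4))

K = dplr_kernel(Lambda, P, Q, B, C, dt, L)

A = np.diag(Lambda) - np.outer(P, Q.conj())
I = np.eye(N)
Ab = np.linalg.inv(I - dt / 2 * A) @ (I + dt / 2 * A)
Bb = np.linalg.inv(I - dt / 2 * A) @ (dt * B)
x, K_naive = Bb, []
for _ in range(L):
    K_naive.append(C.conj() @ x)
    x = Ab @ x
assert np.allclose(K, np.array(K_naive), atol=1e-6)
```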
Table 1: Complexity of various sequence models in terms of sequence length (L), batch size (B), and hidden dimension (H); tildes denote log factors. Metrics are parameter count, training computation, training space requirement, training parallelizability, and inference computation (for 1 sample and time-step). For simplicity, the state size N of S4 is tied to H. Bold denotes model is theoretically best for that metric. Convolutions are efficient for training while recurrence is efficient for inference, while SSMs combine the strengths of both.

|            | Convolution*  | Recurrence | Attention        | S4               |
| Parameters | LH            | H^2        | H^2              | H^2              |
| Training   | L̃H(B + H)     | BLH^2      | B(L^2 H + LH^2)  | BH(H̃ + L̃) + BL̃H |
| Space      | BLH           | BLH        | B(L^2 + HL)      | BLH              |
| Parallel   | Yes           | No         | Yes              | Yes              |
| Inference  | LH^2          | H^2        | L^2 H + H^2 L    | H^2              |

(* Refers to global (in the sequence length) and depthwise-separable convolutions, similar to the convolution version of S4.)

3.4 Architecture Details of the Deep S4 Layer

Concretely, an S4 layer is parameterized as follows. First initialize a SSM with A set to the HiPPO matrix (2). By Lemma 3.1 and Theorem 1, this SSM is unitarily equivalent to some $(\Lambda - PQ^*, B, C)$ for some diagonal Λ and vectors $P, Q, B, C \in \mathbb{C}^{N \times 1}$. These comprise S4's 5N trainable parameters.

The overall deep neural network (DNN) architecture of S4 is similar to prior work. As defined above, S4 defines a map from $\mathbb{R}^L \to \mathbb{R}^L$, i.e. a 1-D sequence map. Typically, DNNs operate on feature maps of size H instead of 1. S4 handles multiple features by simply defining H independent copies of itself, and then mixing the H features with a position-wise linear layer for a total of O(H^2) + O(HN) parameters per layer. Nonlinear activation functions are also inserted between these layers. Overall, S4 defines a sequence-to-sequence map of shape (batch size, sequence length, hidden dimension), exactly the same as related sequence models such as Transformers, RNNs, and CNNs.

Note that the core S4 module is a linear transformation, but the addition of non-linear transformations through the depth of the network makes the overall deep SSM non-linear. This is analogous to a vanilla CNN, since convolutional layers are also linear. The broadcasting across H hidden features described in this section is also analogous to depthwise-separable convolutions. Thus, the overall deep S4 model is closely related to a depthwise-separable CNN but with global convolution kernels.

Finally, we note that follow-up work found that this version of S4 can sometimes suffer from numerical instabilities when the A matrix has eigenvalues on the right half-plane [14]. It introduced a slight change to the NPLR parameterization for S4 from $\Lambda - PQ^*$ to $\Lambda - PP^*$ that corrects this potential problem.

Table 1 compares the complexities of the most common deep sequence modeling mechanisms.
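To illustrate how the H independent SSM copies and the position-wise mixing layer fit together, here is a shapes-only sketch of a single S4 block's forward pass, assuming the per-feature kernels have already been computed (e.g. by Algorithm 1). The particular activation, normalization, residual connections, and dropout used in practice are omitted, and all names are illustrative rather than taken from the released implementation.

```python
import numpy as np

def s4_block_forward(u, kernels, D, W_mix, b_mix):
    """Forward pass of one deep S4 block (sketch; shapes only, no training logic).

    u            : (B, L, H) input sequence
    kernels      : (H, L)    one global convolution kernel per feature (H independent SSMs)
    D            : (H,)      feed-through term, the `D u` skip connection of equation (1)
    W_mix, b_mix : (H, H), (H,)  position-wise linear layer mixing the H features
    """
    Bsz, L, H = u.shape
    n = 2 * L                                     # zero-pad for a causal, non-circular conv
    u_f = np.fft.rfft(u, n=n, axis=1)             # (B, n/2+1, H)
    k_f = np.fft.rfft(kernels.T, n=n, axis=0)     # (n/2+1, H)
    y = np.fft.irfft(u_f * k_f[None], n=n, axis=1)[:, :L]   # depthwise global convolution
    y = y + u * D                                 # skip connection
    y = np.maximum(y, 0.0)                        # nonlinearity between the SSM and the mixing layer
    y = y @ W_mix + b_mix                         # position-wise feature mixing: O(H^2) params
    return y                                      # (B, L, H), same shape as the input

# Shapes-only usage example with random stand-in parameters.
B, L, H = 2, 128, 16
rng = np.random.default_rng(0)
out = s4_block_forward(
    rng.normal(size=(B, L, H)),
    rng.normal(size=(H, L)) / L,                  # stand-ins for kernels computed by Algorithm 1
    rng.normal(size=H),
    rng.normal(size=(H, H)) / np.sqrt(H),
    np.zeros(H),
)
assert out.shape == (B, L, H)
```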
4 Experiments

Section 4.1 benchmarks S4 against the LSSL and efficient Transformer models. Section 4.2 validates S4 on LRDs: the LRA benchmark and raw speech classification. Section 4.3 investigates whether S4 can be used as a general sequence model to perform effectively and efficiently in a wide variety of settings including image classification, image and text generation, and time series forecasting.

4.1 S4 Efficiency Benchmarks

We verify that S4 can be trained quickly and efficiently, compared both to the LSSL and to efficient Transformer variants designed for long-range sequence modeling. As outlined in Section 3, S4 is theoretically much more efficient than the LSSL, and Table 2 confirms that S4 is orders of magnitude more speed- and memory-efficient for practical layer sizes. In fact, S4's speed and memory use is competitive with the most efficient Transformer variants benchmarked by Tay et al. [37]—Linear Transformer [20] and Performer [8]—in a parameter-matched setting (Table 3, following the protocol of Tay et al. [37]).

Table 2: Deep SSMs: The S4 parameterization with Algorithm 1 is asymptotically more efficient than the LSSL.

|       | Training Step (ms)        | Memory Alloc. (MB)       |
| Dim.  | 128     256     512       | 128     256     512      |
| LSSL  | 9.32    20.6    140.7     | 222.1   1685    13140    |
| S4    | 4.77    3.07    4.75      | 5.3     12.6    33.5     |
| Ratio | 1.9×    6.7×    29.6×     | 42.0×   133×    392×     |

Table 3: Benchmarks vs. efficient Transformers

|               | Length 1024       | Length 4096       |
|               | Speed    Mem.     | Speed    Mem.     |
| Transformer   | 1×       1×       | 1×       1×       |
| Performer     | 1.23×    0.43×    | 3.79×    0.086×   |
| Linear Trans. | 1.58×    0.37×    | 5.35×    0.067×   |
| S4            | 1.58×    0.43×    | 5.19×    0.091×   |

4.2 Learning Long Range Dependencies

As described in Sections 2.2 and 3.1, S4 uses a principled approach to address LRDs based on the HiPPO theory of continuous-time memorization. Our goal in this section is to validate that S4 achieves high performance on difficult tasks that require long-range reasoning. We focus here on two problems: (i) the Long-Range Arena, a well-known benchmark designed to test efficient sequence models on LRDs, and (ii) a speech classification problem as a real-world test of LRDs.

Long Range Arena (LRA). LRA [37] contains 6 tasks with lengths 1K-16K steps, encompassing modalities and objectives that require similarity, structural, and visuospatial reasoning. Table 4 compares S4 against the 11 Transformer variants from Tay et al. [37] as well as follow-up work. S4 substantially advances the SoTA, outperforming all baselines on all tasks and averaging 80.48% compared to less than 60% for every baseline. Notably, S4 solves the Path-X task, an extremely challenging task that involves reasoning about LRDs over sequences of length 128 × 128 = 16384. All previous models have failed (i.e. random guessing) due to memory or computation bottlenecks, or simply being unable to learn such long dependencies.

Table 4: (Long Range Arena) Accuracy on full suite of LRA tasks. (Top) Original Transformer variants in LRA. Full results in Appendix D.2. (Bottom) Other models reported in the literature.

| Model         | ListOps | Text  | Retrieval | Image | Pathfinder | Path-X | Avg   |
| Transformer   | 36.37   | 64.27 | 57.46     | 42.44 | 71.40      | ✗      | 53.66 |
| Reformer      | 37.27   | 56.10 | 53.40     | 38.07 | 68.50      | ✗      | 50.56 |
| BigBird       | 36.05   | 64.02 | 59.29     | 40.83 | 74.87      | ✗      | 54.17 |
| Linear Trans. | 16.13   | 65.90 | 53.09     | 42.34 | 75.30      | ✗      | 50.46 |
| Performer     | 18.01   | 65.40 | 53.82     | 42.77 | 77.05      | ✗      | 51.18 |
| FNet          | 35.33   | 65.11 | 59.61     | 38.67 | 77.80      | ✗      | 54.42 |
| Nyströmformer | 37.15   | 65.52 | 79.56     | 41.58 | 70.94      | ✗      | 57.46 |
| Luna-256      | 37.25   | 64.57 | 79.29     | 47.38 | 77.72      | ✗      | 59.37 |
| S4            | 58.35   | 76.02 | 87.09     | 87.26 | 86.05      | 88.10  | 80.48 |

Figure 2: Visualizations of a trained S4 model on LRA Path-X. SSM convolution kernels $\overline{K} \in \mathbb{R}^{16384}$ are reshaped into a 128 × 128 image. (Left) Example from the Path-X task, which involves deducing if the markers are connected by a path. (Top) Filters from the first layer. (Bottom) Filters from the last layer.

We analyze S4's performance on Path-X by visualizing its learned representations, in particular 1-D convolution kernels $\overline{K}$ which are the focus of our technical results in Section 3. Fig. 2 shows that S4 learns a variety of filters that display spatially consistent structure and demonstrate awareness of the 2-D nature of the data. In particular, the lower layers learn simple kernels that extract features from just a few rows of local context while ignoring the rest of the image. On the other hand, higher layers aggregate information globally across full columns of the image at varying spatial frequencies.
Filters in these higher layers span the entire context (16384 pixels), confirming S4’s ability to learn LRDs. Raw Speech Classification. Speech is a typical real-world time series domain, involving signals sampled from an underlying physical process at high frequency. We perform speech classification using the Speech Commands dataset [44]. While most sequence models for speech rely on extensive preprocessing (e.g. to MFCC features), we classify raw speech (length-16000) following Romero et al. [33]. S4 achieves 98.3% accuracy, higher than all baselines that use the 100× shorter MFCC features, and validates that a powerful LRD model is able to extract more information from the raw data and outperform hand-crafted pre-processing. Additionally, we include a baseline CNN specifically designed for raw speech, the discriminator from the WaveGAN model [11], which performs worse than S4 while having 90× more parameters and incorporating many more architectural heuristics (Appendix D.2). 4.3 S4 as a General Sequence Model A key goal of sequence modeling research is to develop a single model that can be applied in many domains (e.g. images, audio, text, time-series) with a broad range of capabilities (e.g. efficient training, fast generation, handling irregularly sampled data). As a fundamental scientific model, SSMs are a promising candidate that come with a range of capabilities, and S4’s strong results on LRD benchmarks spanning images, text, and speech are evidence of S4’s potential as a general sequence model. In this section, we focus on understanding this question in more depth by highlighting key strengths of S4 in settings that usually require specialized models. The tasks we focus on (generative modeling, image classification, time-series forecasting) are Table 5: (Speech classification) Transformer, CTM, RNN, CNN, and SSM models. (MFCC ) Standard preprocessed MFCC features (length-161). (Raw ) Unprocessed signals (length-16000). (0 .5 ×) Frequency change at test time. ✗ denotes not applicable or computationally infeasible on single GPU. Table 6: (Pixel-level 1-D image classification) Transformer, RNN, CNN, and SSM models. Extended results + citations in Appendix D. sMNIST pMNIST sCIFAR Transformer 98.9 97.9 62.2 LSTM r-LSTM UR-LSTM UR-GRU HiPPO-RNN LMU-FFT LipschitzRNN 98.9 98.4 99.28 99.27 98.9 99.4 95.11 95.2 96.96 96.51 98.3 98.49 96.3 63.01 72.2 71.00 74.4 61.1 64.2 MFCC Raw 0.5× Transformer Performer 90.75 80.85 ✗ 30.77 ✗ 30.68 ODE-RNN NRDE 65.9 89.8 ✗ 16.49 ✗ 15.12 ExpRNN LipschitzRNN 82.13 88.38 11.6 ✗ 10.8 ✗ CKConv WaveGAN-D 95.3 ✗ 71.66 96.25 65.96 ✗ TCN TrellisNet CKConv 99.0 99.20 99.32 97.2 98.13 98.54 73.42 63.74 LSSL S4 93.58 93.96 ✗ 98.32 ✗ 96.30 LSSL S4 99.53 99.63 98.76 98.70 84.65 91.13 9 Table 7: (CIFAR-10 density estimation) As a generic Table 8: (WikiText-103 language modeling) S4 apsequence model, S4 is competitive with previous autoregressive proaches the performance of Transformers with much models (in bits per dim.) while incorporating no 2D inductive faster generation. (Top) Transformer baseline which our bias, and has fast generation through its recurrence mode. implementation is based on, with attention replaced by S4. (Bottom) Attention-free models (RNNs and CNNs). Model bpd 2D bias Images / sec Transformer Linear Transf. PixelCNN Row PixelRNN PixelCNN++ Image Transf. PixelSNAIL Sparse Transf. 3.47 3.40 3.14 3.00 2.92 2.90 2.85 2.80 None None 2D conv. 2D BiLSTM 2D conv. 2D local attn. 2D conv. + attn. 2D sparse attn. 
0.32 (1×) 17.85 (56×) 19.19 (59.97×) 0.54 (1.7×) 0.13 (0.4×) - S4 (base) S4 (large) 2.92 2.85 None None 20.84 (65.1×) 3.36 (10.5×) Model Params Test ppl. Tokens / sec Transformer 247M 20.51 0.8K (1×) GLU CNN AWD-QRNN LSTM + Hebb. TrellisNet Dynamic Conv. TaLK Conv. S4 229M 151M 180M 255M 240M 249M 37.2 33.0 29.2 29.19 25.0 23.3 21.28 48K (60×) considered as LRD tasks in the literature, and serve as additional validation that S4 handles LRDs efficiently. Large-scale generative modeling. We investigate two well-studied image and text benchmarks to validate the scalability, flexibility, and efficiency of S4. These tasks require much larger models than our previous tasks – up to 250M parameters. First, CIFAR density estimation is a popular benchmark for autoregressive models, where images are flattened into a sequence of 3072 RGB subpixels that are predicted one by one. Table 7 shows that with no 2D inductive bias, S4 is competitive with the best models designed for this task. Second, WikiText-103 is an established benchmark for language modeling, an important task for large-scale sequence models where tokens are predicted sequentially based on past context. Although RNNs were the model of choice for many years, Transformers are now the dominant model in such applications that contain data that is inherently discrete. We show that alternative models to Transformers can still be competitive in these settings. By simply taking a strong Transformer baseline [2] and replacing the self-attention layers, S4 substantially closes the gap to Transformers (within 0.8 ppl), setting SoTA for attention-free models by over 2 ppl. Fast autoregressive inference. A prominent limitation of autoregressive models is inference speed (e.g. generation), since they require a pass over the full context for every new sample. Several methods have been specifically crafted to overcome this limitation such as the Linear Transformer, a hybrid Transformer/RNN that switches to a stateful, recurrent view at inference time for speed. As a stateful model, SSMs automatically have this ability (Fig. 1). By switching to its recurrent representation (Section 2.3), S4 requires constant memory and computation per time step – in contrast to standard autoregressive models which scale in the context length. On both CIFAR-10 and WikiText-103, we report the throughput of various models at generation time, with S4 around 60× faster than a vanilla Transformer on both tasks (details in Appendix D.3.3). Sampling resolution change. As a continuous-time model, S4 automatically adapts to data sampled at different rates, a challenging setting for time series with a dedicated line of work [10, 33, 34]. Without re-training, S4 achieves 96.3% accuracy at 0.5× the frequency on Speech Commands (Table 5), simply by changing its internal step size ∆ (Section 2.3). Learning with weaker inductive bias. Beyond our results on speech (Section 4.2), we further validate that S4 can be applied with minimal modifications on two domains that typically require specialized domainspecific preprocessing and architectures. First, we compare S4 to the Informer [47], a new Transformer architecture that uses a complex encoder-decoder designed for time-series forecasting problems. A simple application of S4 that treats forecasting as a masked sequence-to-sequence transformation (Fig. 5) outperforms the Informer and other baselines on 40/50 settings across 5 forecasting tasks. Notably, S4 is better on the longest setting in each task, e.g. 
reducing MSE by 37% when forecasting 30 days of weather data (Table 9). 10 Finally, we evaluate S4 on pixel-level sequential image classification tasks (Table 6), popular benchmarks which were originally LRD tests for RNNs [1]. Beyond LRDs, these benchmarks point to a recent effort of the ML community to solve vision problems with reduced domain knowledge, in the spirit of models such as Vision Transformers [12] and MLP-Mixer [38] which involve patch-based models that without 2-D inductive bias. Sequential CIFAR is a particularly challenging dataset where outside of SSMs, all sequence models have a gap of over 25% to a simple 2-D CNN. By contrast, S4 is competitive with a larger ResNet18 (7.9M vs. 11.0M parameters), both with (93.16% vs. 95.62%) or without (91.12% vs. 89.46%) data augmentation. Moreover, it is much more robust to other architectural choices (e.g. 90.46% vs. 79.52% when swapping BatchNorm for LayerNorm). 4.4 SSM Ablations: the Importance of HiPPO A critical motivation of S4 was the use of the HiPPO matrices to initialize an SSM. We consider several simplifications of S4 to ablate the importance of each of these components, including: (i) how important is the HiPPO initialization? (ii) how important is training the SSM on top of HiPPO? (iii) are the benefits of S4 captured by the NPLR parameterization without HiPPO? As a simple testbed, all experiments in this section were performed on the sequential CIFAR-10 task, whicih we found transferred well to other settings. Models were constrained to at most 100K trainable parameters and trained with a simple plateau learning rate scheduler and no regularization. Unconstrained SSMs. We first investigate generic SSMs with various initializations. We consider a random Gaussian initialization (with variance scaled down until it did not NaN), and the HiPPO initialization. We also consider a random diagonal Gaussian matrix as a potential structured method; parameterizing A as a diagonal matrix would allow substantial speedups without going through the complexity of S4’s NPLR parameterization. We consider both freezing the A matrix and training it. Fig. 3 shows both training and validation curves, from which we can make several observations. First, training the SSM improved all methods, particularly the randomly initialized ones. For all methods, training the SSM led to improvements in both training and validation curves. Second, a large generalization gap exists between the initializations. In particular, note that when A is trained, all initializations are able to reach perfect training accuracy. However, their validation accuracies are separated by over 15%. NPLR SSMs. The previous experiment validates the importance of HiPPO in SSMs. This was the main motivation of the NPLR algorithm in S4, which utilizes structure of the HiPPO matrix (2) to make SSMs computationally feasible. Fig. 4a shows that random NPLR matrices still do not perform well, which validates that S4’s effectiveness primarily comes from the HiPPO initialization, not the NPLR parameterization. Finally, Fig. 4b considers the main ablations considered in this section (with trainable SSMs) and adds minor regularization. With 0.1 Dropout, the same trends still hold, and the HiPPO initialization—in other words, the full S4 method—achieves 84.27% test accuracy with just 100K parameters. Table 9: Univariate long sequence time-series forecasting results. Full results in Appendix D.3.5. 
ETTh1 ETTh2 ETTm1 Weather ECL S4 Informer LogTrans Reformer LSTMa DeepAR ARIMA Prophet MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE 0.116 0.271 0.187 0.358 0.292 0.466 0.245 0.375 0.432 0.497 0.269 0.435 0.277 0.431 0.512 0.644 0.359 0.466 0.582 0.608 0.273 0.463 0.303 0.493 0.598 0.702 0.388 0.499 0.624 0.645 2.112 1.436 2.030 1.721 1.793 1.528 2.087 1.534 7.019 5.105 0.683 0.768 0.640 0.681 1.064 0.873 0.866 0.809 1.545 1.006 0.658 0.707 0.429 0.580 2.437 1.352 0.499 0.596 0.657 0.683 0.659 0.766 2.878 1.044 0.639 0.697 1.062 0.943 1.370 0.982 2.735 3.253 3.355 4.664 2.747 1.174 3.859 1.144 6.901 4.264 11 Figure 3: CIFAR-10 classification with unconstrained, real-valued SSMs with various initializations. (Left) Train accuracy. (Right) Validation accuracy. (a) (b) Figure 4: CIFAR-10 validation accuracy of SSMs with different initializations and parameterizations. (Left) NPLR parameterization with random versus HiPPO initialization. (Right) All methods considered in this section, including minor Dropout regularization. S4 achieves SotA accuracy with just 100K parameters. 5 Conclusion We introduce S4, a sequence model that uses a new parameterization for the state space model’s continuoustime, recurrent, and convolutional views to efficiently model LRDs in a principled manner. Results across established benchmarks evaluating a diverse range of data modalities and model capabilities suggest that S4 has the potential to be an effective general sequence modeling solution. Acknowledgments We thank Aditya Grover and Chris Cundy for helpful discussions about earlier versions of the method. We thank Simran Arora, Sabri Eyuboglu, Bibek Paudel, and Nimit Sohoni for valuable feedback on earlier drafts of this work. This work was done with the support of Google Cloud credits under HAI proposals 540994170283 and 578192719349. We gratefully acknowledge the support of NIH under No. U54EB020405 (Mobilize), NSF under Nos. CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); ONR under No. N000141712266 (Unifying Weak Supervision); ONR N00014-20-1-2480: Understanding and Applying Non-Euclidean Geometry in Machine Learning; N000142012275 (NEPTUNE); the Moore Foundation, NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, the Okawa Foundation, American Family Insurance, Google Cloud, Salesforce, Total, the HAI-AWS Cloud Credits for Research program, the Stanford Data Science Initiative (SDSI), and members of the Stanford DAWN project: Facebook, Google, and VMWare. The Mobilize Center is a Biomedical Technology Resource Center, funded by the NIH National Institute of Biomedical Imaging and Bioengineering through Grant P41EB027060. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of 12 the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of NIH, ONR, or the U.S. Government. References [1] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In The International Conference on Machine Learning (ICML), pages 1120–1128, 2016. [2] Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853, 2018. 
[3] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018. [4] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Trellis networks for sequence modeling. In The International Conference on Learning Representations (ICLR), 2019. [5] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark Hasegawa-Johnson, and Thomas S Huang. Dilated recurrent neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2017. [6] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019. [7] Narsimha Chilkuri and Chris Eliasmith. Parallelizing legendre memory unit training. The International Conference on Machine Learning (ICML), 2021. [8] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In The International Conference on Learning Representations (ICLR), 2020. [9] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In International conference on machine learning, pages 933–941. PMLR, 2017. [10] Edward De Brouwer, Jaak Simm, Adam Arany, and Yves Moreau. Gru-ode-bayes: Continuous modeling of sporadically-observed time series. In Advances in Neural Information Processing Systems (NeurIPS), 2019. [11] Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. In ICLR, 2019. [12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. [13] N Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, and Michael W Mahoney. Lipschitz recurrent neural networks. In International Conference on Learning Representations, 2021. [14] Karan Goel, Albert Gu, Chris Donahue, and Christopher Ré. It’s raw! audio generation with state-space models. arXiv preprint arXiv:2202.09729, 2022. [15] Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU press, 2013. [16] Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. Hippo: Recurrent memory with optimal polynomial projections. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/ 102f0bb6efb3a6128a3c750dd16729be-Abstract.html. 13 [17] Albert Gu, Caglar Gulcehre, Tom Le Paine, Matt Hoffman, and Razvan Pascanu. Improving the gating mechanism of recurrent neural networks. In The International Conference on Machine Learning (ICML), 2020. [18] Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with the structured learnable linear state space layer. In Advances in Neural Information Processing Systems (NeurIPS), 2021. [19] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. 
[20] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020. [21] Mario Lezcano-Casado and David Martı́nez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. In The International Conference on Machine Learning (ICML), 2019. [22] Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, and Yanbo Gao. Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5457–5466, 2018. [23] Vasileios Lioutas and Yuhong Guo. Time-aware large kernel convolutions. In International Conference on Machine Learning, pages 6172–6183. PMLR, 2020. [24] Stephen Merity, Nitish Shirish Keskar, James Bradbury, and Richard Socher. Scalable language modeling: Wikitext-103 on a single gpu in 12 hours. SysML, 2018. [25] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016. [26] Victor Pan. Structured matrices and polynomials: unified superfast algorithms. Springer Science & Business Media, 2001. [27] Victor Pan. Fast approximate computations with cauchy matrices and polynomials. Mathematics of Computation, 86(308):2799–2826, 2017. [28] Victor Y Pan. Transformations of matrix structures work again. Linear Algebra and Its Applications, 465:107–138, 2015. [29] Victor Y Pan. How bad are vandermonde matrices? SIAM Journal on Matrix Analysis and Applications, 37(2):676–694, 2016. [30] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318, 2013. [31] Jack Rae, Chris Dyer, Peter Dayan, and Timothy Lillicrap. Fast parametric learning with activation memorization. The International Conference on Machine Learning (ICML), 2018. [32] Prajit Ramachandran, Tom Le Paine, Pooya Khorrami, Mohammad Babaeizadeh, Shiyu Chang, Yang Zhang, Mark A Hasegawa-Johnson, Roy H Campbell, and Thomas S Huang. Fast generation for convolutional autoregressive models. arXiv preprint arXiv:1704.06001, 2017. [33] David W Romero, Anna Kuzina, Erik J Bekkers, Jakub M Tomczak, and Mark Hoogendoorn. Ckconv: Continuous kernel convolution for sequential data. arXiv preprint arXiv:2102.02611, 2021. [34] Yulia Rubanova, Tian Qi Chen, and David K Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. In Advances in Neural Information Processing Systems, pages 5321–5331, 2019. 14 [35] T Konstantin Rusch and Siddhartha Mishra. Unicornn: A recurrent model for learning very long time dependencies. The International Conference on Machine Learning (ICML), 2021. [36] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017. [37] Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena : A benchmark for efficient transformers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum? id=qVyeW-grC2k. 
[38] Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, et al. Mlp-mixer: An all-mlp architecture for vision. arXiv preprint arXiv:2105.01601, 2021. [39] Trieu H Trinh, Andrew M Dai, Minh-Thang Luong, and Quoc V Le. Learning longer-term dependencies in RNNs with auxiliary losses. In The International Conference on Machine Learning (ICML), 2018. [40] Arnold Tustin. A method of analysing the behaviour of linear systems in terms of time series. Journal of the Institution of Electrical Engineers-Part IIA: Automatic Regulators and Servo Mechanisms, 94(1): 130–142, 1947. [41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017. [42] Aaron Voelker, Ivana Kajić, and Chris Eliasmith. Legendre memory units: Continuous-time representation in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 15544–15553, 2019. [43] Aaron Russell Voelker. Dynamical systems in spiking neuromorphic hardware. PhD thesis, University of Waterloo, 2019. [44] Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. abs/1804.03209, 2018. ArXiv, [45] Max A Woodbury. Inverting modified matrices. Memorandum report, 42:106, 1950. [46] Felix Wu, Angela Fan, Alexei Baevski, Yann N Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. In The International Conference on Learning Representations (ICLR), 2019. [47] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, volume 35, pages 11106– 11115. AAAI Press, 2021. 15 A Discussion Related Work. Our work is most closely related to a line of work originally motivated by a particular biologically-inspired SSM, which led to mathematical models for addressing LRDs. Voelker et al. [42], Voelker [43] derived a non-trainable SSM motivated from approximating a neuromorphic spiking model, and Chilkuri and Eliasmith [7] showed that it could be sped up at train time with a convolutional view. Gu et al. [16] extended this special case to a general continuous-time function approximation framework with several more special cases of A matrices designed for long-range dependencies. However, instead of using a true SSM, all of these works fixed a choice of A and built RNNs around it. Most recently, Gu et al. [18] used the full (1) explicitly as a deep SSM model, exploring new conceptual views of SSMs, as well as allowing A to be trained. As mentioned in Section 1, their method used a naive instantiation of SSMs that suffered from an additional factor of N in memory and N 2 in computation. Beyond this work, our technical contributions (Section 3) on the S4 parameterization and algorithms are applicable to a broader family of SSMs including these investigated in prior works, and our techniques for working with these models may be of independent interest. Implementation. The computational core of S4’s training algorithm is the Cauchy kernel discussed in Sections 3.2 and 3.3 and Appendix C.3. 
As described in Appendix C.3 Proposition 5, there are many algorithms for it with differing computational complexities and sophistication. Our current implementation of S4 actually uses the naive O(N L) algorithm which is easily parallelized on GPUs and has more easily accessible libraries allowing it to be implemented; we leverage the pykeops library for memory-efficient kernel operations. However, this library is a much more general library that may not be optimized for the Cauchy kernels used here, and we believe that a dedicated CUDA implementation can be more efficient. Additionally, as discussed in this work, there are asymptotically faster and numerically stable algorithms for the Cauchy kernel (Proposition 5). However, these algorithms are currently not implemented for GPUs due to a lack of previous applications that require them. We believe that more efficient implementations of these self-contained computational kernels are possible, and that S4 (and SSMs at large) may have significant room for further improvements in efficiency. Limitations and Future Directions. In this work, we show that S4 can address a wide variety of data effectively. However, it may not necessarily be the most suitable model for all types of data. For example, Table 8 still found a gap compared to Transformers for language modeling. An interesting future direction is exploring combinations of S4 with other sequence models to complement their strengths. We are excited about other directions, including continuing to explore the benefits of S4 on audio data (e.g. pre-training or generation settings), and generalizing HiPPO and S4 to higher-dimensional data for image and video applications. B Numerical Instability of LSSL This section proves the claims made in Section 3.1 about prior work. We first derive the explicit diagonalization of the HiPPO matrix, confirming its instability because of exponentially large entries. We then discuss the proposed theoretically fast algorithm from [18] (Theorem 2) and show that it also involves exponentially large terms and thus cannot be implemented. 16 B.1 HiPPO Diagonalization Proof of Lemma 3.2. The HiPPO matrix (2) is equal, up to sign and conjugation by a diagonal matrix, to   1 −1 2      1 −3 3   −1 3 −5 4      1 −3 5 −7 5 A=  −1 3 −5 7 −9  6     1 −3 5 −7 9 −11 7    −1 3 −5 7 −9 11 −13 8   .. .. . .  n−k  (2k + 1) n > k (−1) Ank = k + 1 n=k.   0 n k 1 2 1   A = − 1 1  .. . Adding the matrix of all 12 , which is rank 1, yields  − 21 1  2 − 1 1 2 1 2 1 2 1 2 1 1 1 − 21 − 21 2 1 2 1 2 ... 1 2 .. .     .    − 12 − 12  . −1 2 This matrix is now skew-symmetric. Skew-symmetric matrices are a particular case of normal matrices with pure-imaginary eigenvalues. Gu et al. [16] also consider a case of HiPPO corresponding to the generalized Laguerre polynomials that generalizes the above HiPPO-LagT case. In this case, the matrix A (up to conjugation by a diagonal matrix) ends up being close to the above matrix, but with a different element on the diagonal. After adding the rank-1 correction, it becomes the above skew-symmetric matrix plus a multiple of the identity. Thus after diagonalization by the same matrix as in the LagT case, it is still reduced to diagonal plus low-rank (DPLR) form, where the diagonal is now pure imaginary plus a real constant. 19 HiPPO-LegS. We restate the formula from equation (2) for convenience.  1/2 1/2  if n > k (2n + 1) (2k + 1) Ank = − n + 1 if n = k . 
  0 if n < k Adding 12 (2n + 1)1/2 (2k + 1)1/2 to the whole matrix gives  1 1/2 1/2   2 (2n + 1) (2k + 1) 1 − 2   1 − 2 (2n + 1)1/2 (2k + 1)1/2 if n > k if n = k if n < k Note that this matrix is not skew-symmetric, but is 12 I + S where S is a skew-symmetric matrix. This is diagonalizable by the same unitary matrix that diagonalizes S. HiPPO-LegT. Up to the diagonal scaling, the LegT matrix is  1 1   A = − 1 1  .. . −1 1 1 1 1 −1 1 1 By adding −1 to this matrix and then the matrix  2   2 the matrix becomes  2   2  −1 . . .  1   −1 .  1  .. .     2 2 −2 −2 −2 2     which is skew-symmetric. In fact, this matrix is the inverse of the Chebyshev Jacobi. An alternative way to see this is as follows. The LegT matrix is the inverse of the matrix   −1 1 0 −1  1    −1 1 −1 −1 This can obviously be converted to a skew-symmetric matrix by adding a rank 2 term. The inverses of these matrices are also rank-2 differences from each other by the Woodbury identity. A final form is   1 0 −1 1 −1 1 −1 −1 1 −1 0 1    −1 −1 −1 1  + 1 0 −1 −1 −1 −1 0 1  1 0 1 0    0 0 1 0 1  1 1 0   = −1 0   0 0 −1 0 1 −1 0 −1 0 1 This has the advantage that the rank-2 correction is symmetric (like the others), but the normal skewsymmetric matrix is now 2-quasiseparable instead of 1-quasiseparable. 20 C.2 Computing the S4 Recurrent View We prove Theorem 2 showing the efficiency of the S4 parameterization for computing one step of the recurrent representation (Section 2.3). Recall that without loss of generality, we can assume that the state matrix A = Λ − P Q∗ is diagonal plus low-rank (DPLR), potentially over . Our goal in this section is to explicitly write out a closed form for the discretized matrix A. ❈ Recall from equation (3) that A = (I − ∆/2 · A)−1 (I + ∆/2 · A) B = (I − ∆/2 · A)−1 ∆B. We first simplify both terms in the definition of A independently. Forward discretization. The first term is essentially the Euler discretization motivated in Section 2.3. I+ ∆ ∆ A = I + (Λ − P Q∗ ) 2 2  ∆ 2 ∗ I + (Λ − P Q ) = 2 ∆ ∆ = A0 2 where A0 is defined as the term in the final brackets. Backward discretization. The second term is known as the Backward Euler’s method. Although this inverse term is normally difficult to deal with, in the DPLR case we can simplify it using Woodbury’s Identity (Proposition 4).  I− ∆ A 2 −1  −1 ∆ (Λ − P Q∗ ) 2 −1  2 2 ∗ = − Λ + PQ ∆ ∆ i 2 h −1 D − DP (I + Q∗ DP ) Q∗ D = ∆ 2 = A1 ∆ = I− −1 2 −Λ where D = ∆ and A1 is defined as the term in the final brackets. Note that (1 + Q∗ DP ) is actually a scalar in the case when the low-rank term has rank 1. S4 Recurrence. Finally, the full bilinear discretization can be rewritten in terms of these matrices as A = A1 A0 2 B = A1 ∆B = 2A1 B. ∆ The discrete-time SSM (3) becomes xk = Axk−1 + Buk = A1 A0 xk−1 + 2A1 Buk yk = Cxk . Note that A0 , A1 are accessed only through matrix-vector multiplications. Since they are both DPLR, they have O(N ) matrix-vector multiplication, showing Theorem 2. 21 C.3 Computing the Convolutional View The most involved part of using SSMs efficiently is computing K. This algorithm was sketched in Section 3.2 and is the main motivation for the S4 parameterization. In this section, we define the necessary intermediate quantities and prove the main technical result. The algorithm for Theorem 3 falls in roughly three stages, leading to Algorithm 1. 
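Before the convolutional reductions below, here is a small numerical sketch of the O(N) recurrent step just derived in Appendix C.2 (Theorem 2), for the rank-1 case, checked against the dense bilinear discretization. Variable names are ours and the example parameters are arbitrary.

```python
import numpy as np

def dplr_step_setup(Lambda, P, Q, dt):
    """Precompute the pieces of A0, A1 from Appendix C.2 (rank-1 case).

    A0 = 2/dt + (Lambda - P Q^*),  A1 = D - D P (1 + Q^* D P)^{-1} Q^* D,
    with D = diag(1 / (2/dt - Lambda)); the discrete recurrence is
    x_k = A1 A0 x_{k-1} + 2 A1 B u_k, and every factor is diagonal plus rank 1.
    """
    D = 1.0 / (2.0 / dt - Lambda)                  # diagonal of (2/dt - Lambda)^{-1}
    qdp = 1.0 + np.sum(Q.conj() * D * P)           # scalar 1 + Q^* D P
    return D, qdp

def apply_A1(v, P, Q, D, qdp):
    """O(N) matrix-vector product with A1 (diagonal plus rank 1, via Woodbury)."""
    Dv = D * v
    return Dv - D * P * (np.sum(Q.conj() * Dv) / qdp)

def dplr_recurrent_step(x, u, Lambda, P, Q, B, dt, D, qdp):
    """One step of x_k = A-bar x_{k-1} + B-bar u_k in O(N) time (B-bar could be precomputed)."""
    a0x = (2.0 / dt) * x + Lambda * x - P * np.sum(Q.conj() * x)   # A0 x
    return apply_A1(a0x, P, Q, D, qdp) + 2.0 * apply_A1(B, P, Q, D, qdp) * u

# Sanity check against the dense bilinear discretization of A = Lambda - P Q^*.
rng = np.random.default_rng(0)
N, dt = 8, 0.1
Lambda = -rng.random(N) + 1j * rng.normal(size=N)
P, Q, B = (rng.normal(size=N) + 1j * rng.normal(size=N) for _ in range(3))
A = np.diag(Lambda) - np.outer(P, Q.conj())
I = np.eye(N)
Ab = np.linalg.inv(I - dt / 2 * A) @ (I + dt / 2 * A)
Bb = np.linalg.inv(I - dt / 2 * A) @ (dt * B)

D, qdp = dplr_step_setup(Lambda, P, Q, dt)
x, u = rng.normal(size=N) + 0j, 0.7
assert np.allclose(dplr_recurrent_step(x, u, Lambda, P, Q, B, dt, D, qdp), Ab @ x + Bb * u)
```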
C.3 Computing the Convolutional View

The most involved part of using SSMs efficiently is computing K̄. This algorithm was sketched in Section 3.2 and is the main motivation for the S4 parameterization. In this section, we define the necessary intermediate quantities and prove the main technical result.

The algorithm for Theorem 3 falls in roughly three stages, leading to Algorithm 1. Assuming A has been conjugated into diagonal plus low-rank form, we successively simplify the problem of computing K̄ by applying the techniques outlined in Section 3.2.

Remark C.1. We note that for the remainder of this section, we transpose C to be a column vector of shape ℂ^N or ℂ^{N×1}, instead of a matrix or row vector ℂ^{1×N} as in (1). In other words the SSM is
$$
x'(t) = Ax(t) + Bu(t), \qquad y(t) = C^*x(t) + Du(t). \tag{8}
$$
This convention is made so that C has the same shape as B, P, Q and simplifies the implementation of S4.

Reduction 0: Diagonalization. By Lemma 3.1, we can switch the representation by conjugating with any unitary matrix. For the remainder of this section, we can assume that A is (complex) diagonal plus low-rank (DPLR).

Note that unlike diagonal matrices, a DPLR matrix does not lend itself to efficient computation of K̄. The reason is that K̄ is built from terms of the form C*A^iB, which involve powers of the matrix A. These are trivially computable when A is diagonal, but this is no longer possible for even simple modifications of diagonal matrices such as DPLR.

Reduction 1: SSM Generating Function. To address the problem of computing powers of A, we introduce another technique. Instead of computing the SSM convolution filter K̄ directly, we introduce a generating function on its coefficients and compute evaluations of it.

Definition 2 (SSM Generating Function). We define the following quantities:
• The SSM convolution function is K(A, B, C) = (C*B, C*AB, ...) and the (truncated) SSM filter of length L is
$$
K_L(A, B, C) = (C^*B, C^*AB, \dots, C^*A^{L-1}B) \in \mathbb{R}^L . \tag{9}
$$
• The SSM generating function at node z is
$$
\hat{K}(z; A, B, C) \in \mathbb{C} := \sum_{i=0}^{\infty} C^*A^iB z^i = C^*(I - Az)^{-1}B \tag{10}
$$
and the truncated SSM generating function at node z is
$$
\hat{K}_L(z; A, B, C) \in \mathbb{C} := \sum_{i=0}^{L-1} C^*A^iB z^i = C^*(I - A^L z^L)(I - Az)^{-1}B . \tag{11}
$$
• The truncated SSM generating function at nodes Ω ∈ ℂ^M is
$$
\hat{K}_L(\Omega; A, B, C) \in \mathbb{C}^M := \left(\hat{K}_L(\omega_k; A, B, C)\right)_{k \in [M]} . \tag{12}
$$

Intuitively, the generating function essentially converts the SSM convolution filter from the time domain to the frequency domain. Importantly, it preserves the same information, and the desired SSM convolution filter can be recovered from evaluations of its generating function.

Lemma C.2. The SSM function K_L(A, B, C) can be computed from the SSM generating function K̂_L(Ω; A, B, C) at the roots of unity Ω = {exp(−2πi k/L) : k ∈ [L]} stably in O(L log L) operations.

Proof. For convenience define K = K_L(A, B, C), K̂ = K̂_L(Ω; A, B, C), and K̂(z) = K̂_L(z; A, B, C). Note that
$$
\hat{K}_j = \sum_{k=0}^{L-1} K_k \exp\left(-2\pi i \frac{jk}{L}\right).
$$
This is exactly the Discrete Fourier Transform (DFT): K̂ = F_L K. Therefore K can be recovered from K̂ with a single inverse DFT, which requires O(L log L) operations with the Fast Fourier Transform (FFT) algorithm.

Reduction 2: Woodbury Correction. The primary motivation of Definition 2 is that it turns powers of A into a single inverse of A (equation (10)). While DPLR matrices cannot be powered efficiently due to the low-rank term, they can be inverted efficiently by the well-known Woodbury identity.

Proposition 4 (Binomial Inverse Theorem or Woodbury matrix identity [15, 45]). Over a commutative ring R, let A ∈ R^{N×N} and U, V ∈ R^{N×p}. Suppose A and A + UV* are invertible. Then I_p + V*A^{−1}U ∈ R^{p×p} is invertible and
$$
(A + UV^*)^{-1} = A^{-1} - A^{-1}U\left(I_p + V^*A^{-1}U\right)^{-1}V^*A^{-1} .
$$

With this identity, we can convert the SSM generating function on a DPLR matrix A into one on just its diagonal component.
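As a quick numerical sanity check of Proposition 4, the following NumPy snippet verifies the identity for a diagonal A and a rank-2 correction. The example uses real matrices, so the conjugate transpose V* is simply V^T; it is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 6, 2
A = np.diag(np.arange(1.0, N + 1.0))            # well-conditioned diagonal A
U = 0.1 * rng.standard_normal((N, p))
V = 0.1 * rng.standard_normal((N, p))

Ainv = np.diag(1.0 / np.diag(A))                # inverting the diagonal part is O(N)
small = np.eye(p) + V.T @ Ainv @ U              # p x p system, cheap for small rank p
woodbury = Ainv - Ainv @ U @ np.linalg.solve(small, V.T @ Ainv)
assert np.allclose(woodbury, np.linalg.inv(A + U @ V.T))
```

The point of the identity in our setting is that the only full-size inverse needed is that of the diagonal part, while the low-rank term contributes only a small p × p solve.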
Lemma C.3. Let A = Λ − PQ* be a diagonal plus low-rank representation. Then for any root of unity z ∈ Ω, the truncated generating function satisfies
$$
\hat{K}(z) = \frac{2}{1+z}\left[\tilde{C}^*R(z)B - \tilde{C}^*R(z)P\left(1 + Q^*R(z)P\right)^{-1}Q^*R(z)B\right],
\qquad
\tilde{C}^* = C^*\left(I - \bar{A}^L\right), \quad
R(z; \Lambda) = \left(\frac{2}{\Delta}\frac{1-z}{1+z} - \Lambda\right)^{-1}.
$$

Proof. Directly expanding Definition 2 yields
$$
\hat{K}_L(z; \bar{A}, \bar{B}, C) = C^*\bar{B} + C^*\bar{A}\bar{B}z + \dots + C^*\bar{A}^{L-1}\bar{B}z^{L-1}
= C^*\left(I - \bar{A}^L\right)\left(I - \bar{A}z\right)^{-1}\bar{B}
= \tilde{C}^*\left(I - \bar{A}z\right)^{-1}\bar{B},
$$
where $\tilde{C}^* = C^*\left(I - \bar{A}^L\right)$ and we used that $z^L = 1$ for z a root of unity.

We can now explicitly expand the discretized SSM matrices Ā and B̄ back in terms of the original SSM parameters A and B. Lemma C.4 provides an explicit formula, which allows further simplifying
$$
\tilde{C}^*\left(I - \bar{A}z\right)^{-1}\bar{B}
= \frac{2}{1+z}\,\tilde{C}^*\left(\frac{2}{\Delta}\frac{1-z}{1+z} - A\right)^{-1}B
= \frac{2}{1+z}\,\tilde{C}^*\left(\frac{2}{\Delta}\frac{1-z}{1+z} - \Lambda + PQ^*\right)^{-1}B
= \frac{2}{1+z}\left[\tilde{C}^*R(z)B - \tilde{C}^*R(z)P\left(1 + Q^*R(z)P\right)^{-1}Q^*R(z)B\right].
$$
The last line applies the Woodbury Identity (Proposition 4), where $R(z) = \left(\frac{2}{\Delta}\frac{1-z}{1+z} - \Lambda\right)^{-1}$.

The previous proof used the following self-contained result to back out the original SSM matrices from the discretization.

Lemma C.4. Let Ā, B̄ be the SSM matrices A, B discretized by the bilinear discretization with step size ∆. Then
$$
C^*\left(I - \bar{A}z\right)^{-1}\bar{B} = \frac{2\Delta}{1+z}\,C^*\left(2\frac{1-z}{1+z} - \Delta A\right)^{-1}B .
$$

Proof. Recall that the bilinear discretization that we use (equation (3)) is
$$
\bar{A} = \left(I - \frac{\Delta}{2}A\right)^{-1}\left(I + \frac{\Delta}{2}A\right), \qquad \bar{B} = \left(I - \frac{\Delta}{2}A\right)^{-1}\Delta B .
$$
The result follows from algebraic manipulations:
$$
C^*\left(I - \bar{A}z\right)^{-1}\bar{B}
= C^*\left(I - \left(I - \frac{\Delta}{2}A\right)^{-1}\left(I + \frac{\Delta}{2}A\right)z\right)^{-1}\left(I - \frac{\Delta}{2}A\right)^{-1}\Delta B
= C^*\left[\left(I - \frac{\Delta}{2}A\right) - \left(I + \frac{\Delta}{2}A\right)z\right]^{-1}\Delta B
$$
$$
= C^*\left[I(1-z) - \frac{\Delta}{2}A(1+z)\right]^{-1}\Delta B
= \frac{\Delta}{1-z}\,C^*\left[I - \frac{\Delta}{2}\frac{1+z}{1-z}A\right]^{-1}B
= \frac{2\Delta}{1+z}\,C^*\left[2\frac{1-z}{1+z} - \Delta A\right]^{-1}B .
$$

Note that in the S4 parameterization, instead of constantly computing $\tilde{C}^* = C^*\left(I - \bar{A}^L\right)$, we can simply reparameterize our parameters to learn C̃ directly instead of C, saving a minor computation cost and simplifying the algorithm.

Reduction 3: Cauchy Kernel. We have reduced the original problem of computing K̄ to the problem of computing the SSM generating function K̂_L(Ω; Ā, B̄, C) in the case that A is a diagonal matrix. We show that this is exactly the same as a Cauchy kernel, which is a well-studied problem with fast and stable numerical algorithms.

Definition 3. A Cauchy matrix or kernel on nodes Ω = (ω_i) ∈ ℂ^M and Λ = (λ_j) ∈ ℂ^N is
$$
M \in \mathbb{C}^{M \times N} = M(\Omega, \Lambda) = (M_{ij})_{i \in [M], j \in [N]}, \qquad M_{ij} = \frac{1}{\omega_i - \lambda_j} .
$$
The computation time of a Cauchy matrix-vector product of size M × N is denoted by C(M, N).

Computing with Cauchy matrices is an extremely well-studied problem in numerical analysis, with both fast arithmetic algorithms and fast numerical algorithms based on the famous Fast Multipole Method (FMM) [26, 27, 28].

Proposition 5 (Cauchy). A Cauchy kernel requires O(M + N) space, and operation count
$$
C(M, N) =
\begin{cases}
O\left(MN\right) & \text{naively} \\
O\left((M+N)\log^2(M+N)\right) & \text{in exact arithmetic} \\
O\left((M+N)\log(M+N)\log\frac{1}{\varepsilon}\right) & \text{numerically to precision } \varepsilon .
\end{cases}
$$

Corollary C.5. Evaluating Q*R(Ω; Λ)P (defined in Lemma C.3) for any set of nodes Ω ∈ ℂ^L, diagonal matrix Λ, and vectors P, Q can be computed in C(L, N) operations and O(L + N) space, where C(L, N) = Õ(L + N) is the cost of a Cauchy matrix-vector multiplication.

Proof. For any fixed ω ∈ Ω, we want to compute $\sum_j \frac{q_j^* p_j}{\omega - \lambda_j}$. Computing this over all ω_i is therefore exactly a Cauchy matrix-vector multiplication.

This completes the proof of Theorem 3. In Algorithm 1, note that the work is dominated by Step 2, which has a constant number of calls to a black-box Cauchy kernel, with complexity given by Proposition 5.
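Putting the three reductions together, the following NumPy sketch materializes the length-L convolution kernel for a rank-1 DPLR matrix: it evaluates the truncated generating function of Lemma C.3 at the roots of unity using a naive O(NL) Cauchy product (the first case of Proposition 5) and recovers K̄ with an inverse FFT (Lemma C.2). It assumes C̃ is learned directly, as discussed above. The names and the handling of the z = −1 node are ours; the released implementation differs in details.

```python
import numpy as np

def s4_kernel_dplr(lam, P, Q, B, C_tilde, dt, L):
    """Sketch of the convolution view: K_hat(z) = 2/(1+z) * [ c~* R b - c~* R p (1 + q* R p)^{-1} q* R b ]
    evaluated at the L-th roots of unity, followed by an inverse DFT."""
    omega = np.exp(-2j * np.pi * np.arange(L) / L)        # nodes z_k on the unit circle
    good = ~np.isclose(omega, -1.0)                       # the z = -1 node (even L) is handled by its limit
    w = omega[good]
    r = (2.0 / dt) * (1.0 - w) / (1.0 + w)                # bilinear map of the nodes
    cauchy = 1.0 / (r[:, None] - lam[None, :])            # naive O(NL) Cauchy matrix

    def dot(a, b):                                        # a^* R(z) b for all nodes at once
        return cauchy @ (np.conj(a) * b)

    k00, k01 = dot(C_tilde, B), dot(C_tilde, P)
    k10, k11 = dot(Q, B), dot(Q, P)

    K_hat = np.empty(L, dtype=complex)
    K_hat[good] = (2.0 / (1.0 + w)) * (k00 - k01 * k10 / (1.0 + k11))
    # limit of the formula at z = -1 under the bilinear discretization: (dt/2) * C~* B
    K_hat[~good] = (dt / 2.0) * np.sum(np.conj(C_tilde) * B)

    # Lemma C.2: K_hat is the DFT of the kernel, so one inverse FFT recovers it.
    # S4 keeps parameters in conjugate pairs, in which case the result is real.
    return np.fft.ifft(K_hat)
```

Replacing the dense `cauchy` matrix with a fused kernel (e.g. pykeops, as in our implementation) or with one of the fast algorithms of Proposition 5 changes only the `dot` helper; the surrounding reductions are unchanged.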
D Experiment Details and Full Results

This section contains full experimental procedures, extended results, and citations for our experimental evaluation in Section 4. Appendix D.1 corresponds to the benchmarking results in Section 4.1, Appendix D.2 corresponds to the LRD experiments (LRA and Speech Commands) in Section 4.2, and Appendix D.3 corresponds to the general sequence modeling experiments (generation, image classification, forecasting) in Section 4.3.

D.1 Benchmarking

Benchmarking results from Table 2 and Table 3 were tested on a single A100 GPU.

Benchmarks against LSSL. For a given dimension H, a single LSSL or S4 layer was constructed with H hidden features. For LSSL, the state size N was set to H as done in [18]. For S4, the state size was set to parameter-match the LSSL, which resulted in a state size of N/4 due to differences in the parameterization. Table 2 benchmarks a single forward+backward pass of a single layer.

Table 10: Full results for the Long Range Arena (LRA) benchmark for long-range dependencies in sequence models. (Top): Original Transformer variants in LRA. (Bottom): Other models reported in the literature.

Model            | ListOps | Text  | Retrieval | Image | Pathfinder | Path-X | Avg
Random           | 10.00   | 50.00 | 50.00     | 10.00 | 50.00      | 50.00  | 36.67
Transformer      | 36.37   | 64.27 | 57.46     | 42.44 | 71.40      | ✗      | 53.66
Local Attention  | 15.82   | 52.98 | 53.39     | 41.46 | 66.63      | ✗      | 46.71
Sparse Trans.    | 17.07   | 63.58 | 59.59     | 44.24 | 71.71      | ✗      | 51.03
Longformer       | 35.63   | 62.85 | 56.89     | 42.22 | 69.71      | ✗      | 52.88
Linformer        | 35.70   | 53.94 | 52.27     | 38.56 | 76.34      | ✗      | 51.14
Reformer         | 37.27   | 56.10 | 53.40     | 38.07 | 68.50      | ✗      | 50.56
Sinkhorn Trans.  | 33.67   | 61.20 | 53.83     | 41.23 | 67.45      | ✗      | 51.23
Synthesizer      | 36.99   | 61.68 | 54.67     | 41.61 | 69.45      | ✗      | 52.40
BigBird          | 36.05   | 64.02 | 59.29     | 40.83 | 74.87      | ✗      | 54.17
Linear Trans.    | 16.13   | 65.90 | 53.09     | 42.34 | 75.30      | ✗      | 50.46
Performer        | 18.01   | 65.40 | 53.82     | 42.77 | 77.05      | ✗      | 51.18
FNet             | 35.33   | 65.11 | 59.61     | 38.67 | 77.80      | ✗      | 54.42
Nyströmformer    | 37.15   | 65.52 | 79.56     | 41.58 | 70.94      | ✗      | 57.46
Luna-256         | 37.25   | 64.57 | 79.29     | 47.38 | 77.72      | ✗      | 59.37
S4               | 58.35   | 76.02 | 87.09     | 87.26 | 86.05      | 88.10  | 80.48

Benchmarks against Efficient Transformers. Following [37], the Transformer models had 4 layers, hidden dimension 256 with 4 heads, query/key/value projection dimension 128, and batch size 32, for a total of roughly 600k parameters. The S4 model was parameter-matched while keeping the depth and hidden dimension constant (leading to a state size of N = 256). We note that the relative orderings of these methods can vary depending on the exact hyperparameter settings.

D.2 Long-Range Dependencies

This section includes information for reproducing our experiments on the Long-Range Arena and Speech Commands long-range dependency tasks.

Long Range Arena. Table 10 contains the extended results table with all 11 methods considered in [37]. For the S4 model, hyperparameters for all datasets are reported in Table 11. For all datasets, we used the AdamW optimizer with a constant learning rate schedule with decay on validation plateau. However, the learning rate on the HiPPO parameters (in particular Λ, P, Q, B, C, ∆) was reduced to a maximum starting LR of 0.001, which improves stability since the HiPPO equation is crucial to performance. The S4 state size was always fixed to N = 64.

As S4 is a sequence-to-sequence model with output shape (batch, length, dimension) and LRA tasks are classification, mean pooling along the length dimension was applied after the last layer.

We note that most of these results were trained for far longer than was necessary to achieve SotA results (e.g., the Image task reaches SotA in 1 epoch), and results often keep improving with longer training times.
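The per-parameter learning-rate treatment described above can be expressed with standard PyTorch parameter groups. The sketch below is illustrative only: the substrings used to identify the SSM parameters are placeholders for whatever naming a particular implementation uses, and turning off weight decay for those parameters is a choice made here for the sketch, not something specified in the text.

```python
import torch

def configure_optimizer(model, lr=0.004, ssm_lr=0.001, weight_decay=0.01):
    """Group parameters so the SSM (HiPPO) parameters get a smaller learning rate."""
    ssm_params, other_params = [], []
    for name, param in model.named_parameters():
        # placeholder name matching for Lambda, P, Q, B, C, and the log-step dt
        if any(key in name for key in ("Lambda", "P", "Q", "B", "C", "log_dt")):
            ssm_params.append(param)
        else:
            other_params.append(param)
    return torch.optim.AdamW([
        {"params": other_params, "lr": lr, "weight_decay": weight_decay},
        {"params": ssm_params, "lr": ssm_lr, "weight_decay": 0.0},  # assumption: no WD on SSM params
    ])
```

The constant schedule with decay on validation plateau mentioned above corresponds to wrapping the returned optimizer in `torch.optim.lr_scheduler.ReduceLROnPlateau` with the patience values listed in Table 11.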
Hardware. All models were run on a single GPU. Some tasks used an A100 GPU (notably, the Path-X experiments), which has a larger maximum memory of 40GB. To reproduce these on smaller GPUs, the batch size can be reduced or gradients can be accumulated over two batches.

Path-X. We remark that an earlier version of this paper reported a higher score for Path-X. This earlier version used a different variant of the dataset, due to a misunderstanding of the properties of the dataset. More specifically, we found that S4 scored 93.68% accuracy on a version that involved taking the 256 × 256 resolution version of the Pathfinder dataset and averaging every 2 × 2 square; we erroneously thought that this version of the dataset was equivalent to the original Path-X. After discussions with the LRA authors, we discovered that this is not equivalent to the 128 × 128 resolution Pathfinder dataset (the correct Path-X), which is in fact much harder. In fact, Path-X is so difficult that a 2D CNN without a global receptive field (e.g. ResNet-18 or ResNet-34) also cannot achieve above chance. This fact led to the original misunderstanding, as we could not solve this image classification task even with a ResNet and thought the data might have errors.

Table 11: The values of the best hyperparameters found for classification datasets; LRA (top) and images/speech (bottom). LR is learning rate and WD is weight decay. BN and LN refer to Batch Normalization and Layer Normalization.

Dataset                | Depth | Features H | Norm | Pre-norm | Dropout | LR     | Batch Size | Epochs | WD   | Patience
ListOps                | 6     | 128        | BN   | False    | 0       | 0.01   | 100        | 50     | 0.01 | 5
Text                   | 4     | 64         | BN   | True     | 0       | 0.001  | 50         | 20     | 0    | 5
Retrieval              | 6     | 256        | BN   | True     | 0       | 0.002  | 64         | 20     | 0    | 20
Image                  | 6     | 512        | LN   | False    | 0.2     | 0.004  | 50         | 200    | 0.01 | 20
Pathfinder             | 6     | 256        | BN   | True     | 0.1     | 0.004  | 100        | 200    | 0    | 10
Path-X                 | 6     | 256        | BN   | True     | 0.0     | 0.0005 | 32         | 100    | 0    | 20
CIFAR-10               | 6     | 1024       | LN   | False    | 0.25    | 0.01   | 50         | 200    | 0.01 | 20
Speech Commands (MFCC) | 4     | 256        | LN   | False    | 0.2     | 0.01   | 100        | 50     | 0    | 5
Speech Commands (Raw)  | 6     | 128        | BN   | True     | 0.1     | 0.01   | 20         | 150    | 0    | 10

Speech Commands. We provide details of the sweeps run for baseline methods run by us; numbers for all other methods are taken from Gu et al. [18]. The best hyperparameters used for S4 are included in Table 11.

Transformer [41] For MFCC, we swept the number of model layers {2, 4}, dropout {0, 0.1}, and learning rates {0.001, 0.0005}. We used 8 attention heads, model dimension 128, prenorm, positional encodings, and trained for 150 epochs with a batch size of 100. For Raw, the Transformer model's memory usage made training impossible.

Performer [8] For MFCC, we swept the number of model layers {2, 4}, dropout {0, 0.1}, and learning rates {0.001, 0.0005}. We used 8 attention heads, model dimension 128, prenorm, positional encodings, and trained for 150 epochs with a batch size of 100. For Raw, we used a model dimension of 128, 4 attention heads, prenorm, and a batch size of 16. We reduced the number of model layers to 4 so the model would fit on a single GPU. We trained for 100 epochs with a learning rate of 0.001 and no dropout.

ExpRNN [21] For MFCC, we swept hidden sizes {256, 512} and learning rates {0.001, 0.002, 0.0005}. Training was run for 200 epochs, with a single-layer model using a batch size of 100. For Raw, we swept hidden sizes {32, 64} and learning rates {0.001, 0.0005} (however, ExpRNN failed to learn).

Lipschitz RNN [13] For MFCC, we swept hidden sizes {256, 512} and learning rates {0.001, 0.002, 0.0005}. Training was run for 150 epochs, with a single-layer model using a batch size of 100.
For Raw, we found that LipschitzRNN was too slow to train on a single GPU (requiring a full day for 1 epoch of training alone).

WaveGAN Discriminator [11] The WaveGAN-D in Table 5 is actually our improved version of the discriminator network from the recent WaveGAN model for speech [11]. This CNN did not work well out-of-the-box, and we added several features to help it perform better. The final model is highly specialized compared to our model, and includes:
• Downsampling or pooling between layers, induced by strided convolutions, that decreases the sequence length between layers.
• A global fully-connected output layer; thus the model only works for one input sequence length and does not work on MFCC features or the frequency-shift setting in Table 5.
• Batch Normalization is essential, whereas S4 works equally well with either Batch Normalization or Layer Normalization.
• Almost 90× as many parameters as the S4 model (26.3M vs. 0.3M).

D.3 General Sequence Modeling

This subsection corresponds to the experiments in Section 4.3. Because of the number of experiments in this section, we use subsubsection dividers for the different tasks to make them easier to follow: CIFAR-10 density estimation (Appendix D.3.1), WikiText-103 language modeling (Appendix D.3.2), autoregressive generation (Appendix D.3.3), sequential image classification (Appendix D.3.4), and time-series forecasting (Appendix D.3.5).

D.3.1 CIFAR Density Estimation

This task used a different backbone than the rest of our experiments. We used blocks of alternating S4 layers and position-wise feed-forward layers (in the style of Transformer blocks). Each feed-forward intermediate dimension was set to 2× the hidden size of the incoming S4 layer. Similar to Salimans et al. [36], we used a UNet-style backbone consisting of B identical blocks followed by a downsampling layer. The downsampling rates were 3, 4, 4 (the 3 chosen because the sequence consists of RGB pixels). The base model had B = 8 with starting hidden dimension 128, while the large model had B = 16 with starting hidden dimension 192. We experimented with both the mixture-of-logistics loss from [36] as well as a simpler 256-way categorical loss. We found they performed comparably and ended up using the simpler softmax loss along with input embeddings. We used the LAMB optimizer with learning rate 0.005. The base model had no dropout, while the large model had dropout 0.1 before the linear layers inside the S4 and FF blocks.

D.3.2 WikiText-103 Language Modeling

The RNN baselines included in Table 8 are the AWD-QRNN [24], an efficient linear gated RNN, and the LSTM + Cache + Hebbian + MbPA [31], the best performing pure RNN in the literature. The CNN baselines are the CNN with GLU activations [9], the TrellisNet [4], Dynamic Convolutions [46], and TaLK Convolutions [23]. The Transformer baseline is [2], which uses Adaptive Inputs with a tied Adaptive Softmax. This model is a standard high-performing Transformer baseline on this benchmark, used for example by Lioutas and Guo [23] and many others.

Our S4 model uses the same Transformer backbone as in [2]. The model consists of 16 blocks of S4 layers alternated with position-wise feedforward layers, with a feature dimension of 1024. Because our S4 layer has around 1/4 the number of parameters of a self-attention layer with the same dimension, we made two modifications to match the parameter count better: (i) we used a GLU activation after the S4 linear layer (Section 3.4), and (ii) we used two S4 layers per block. Blocks use Layer Normalization in the pre-norm position.
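For concreteness, the block structure described above can be sketched as follows in PyTorch. `S4Layer` is treated as a black box mapping (batch, length, d_model) to the same shape; the feed-forward activation and other small details are our choices for illustration, and for brevity the sketch shows one S4 sub-block per residual block rather than the two used in the WikiText-103 model.

```python
import torch
from torch import nn

class S4Block(nn.Module):
    """Pre-norm residual block: LayerNorm -> S4 -> GLU, alternated with a
    position-wise feed-forward layer (Transformer-style)."""
    def __init__(self, s4_layer, d_model, d_ff, dropout=0.25):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.s4 = s4_layer                                   # assumed sequence mixer, not defined here
        self.glu = nn.Sequential(nn.Linear(d_model, 2 * d_model), nn.GLU(dim=-1))
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(                             # activation choice is illustrative
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Dropout(dropout), nn.Linear(d_ff, d_model)
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                                    # x: (batch, length, d_model)
        x = x + self.drop(self.glu(self.s4(self.norm1(x))))  # S4 sub-block, pre-norm residual
        x = x + self.drop(self.ff(self.norm2(x)))            # position-wise feed-forward sub-block
        return x
```

For example, `S4Block(nn.Identity(), 1024, 4096)` wires up the block with a placeholder mixer for shape checking; stacking 16 such blocks, together with the embedding and output layers described next, gives the overall backbone.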
The embedding and softmax layers were the Adaptive Embedding from [2] with standard cutoffs 20000, 40000, 200000. Evaluation was performed similarly to the basic setting in [2], Table 5, which involves sliding non-overlapping windows of width 1024 tokens. Other settings reported in [2] include more context at training and evaluation time and improve the score. Because such evaluation protocols are orthogonal to the basic model, we do not consider them and report the base score from [2], Table 5.

Instead of SGD+Momentum with multiple cosine learning rate annealing cycles, our S4 model was trained with the simpler AdamW optimizer with a single cosine learning rate cycle and a maximum of 800000 steps. The initial learning rate was set to 0.0005. We used 8 A100 GPUs with a batch size of 8 per GPU and context size 1024. We used no gradient clipping and a weight decay of 0.1. Unlike [2], which specified different dropout rates for different parameters, we used a constant dropout rate of 0.25 throughout the network, including before every linear layer and on the residual branches.

D.3.3 Autoregressive Generation Speed

Protocol. To account for different model sizes and memory requirements for each method, we benchmark generation speed by throughput, measured in images per second (Table 7) or tokens per second (Table 8). Each model generates images on a single A100 GPU, maximizing the batch size that fits in memory. (For CIFAR-10 generation we limited memory to 16GB, to be more comparable to the Transformer and Linear Transformer results reported from [20].)

Baselines. The Transformer and Linear Transformer baselines reported in Table 7 are the results reported directly from Katharopoulos et al. [20]. Note that the Transformer number is the one in their Appendix, which implements the optimized cached implementation of self-attention. For all other baseline models, we used open source implementations of the models to benchmark generation speed. For PixelCNN++, we used the fast cached version by Ramachandran et al. [32], which sped up generation by orders of magnitude over the naive implementation. This code was only available in TensorFlow, which may have slight differences compared to the rest of the baselines, which were implemented in PyTorch. We were unable to run the Sparse Transformer [6] model due to issues with their custom CUDA implementation of the sparse attention kernel, which we were unable to resolve. The Transformer baseline from Table 8 was run using a modified GPT-2 backbone from the HuggingFace repository, configured to recreate the architecture reported in [2]. These numbers are actually slightly favorable to the baseline, as we did not include the timing of the embedding or softmax layers, whereas the number reported for S4 is for the full model.

D.3.4 Pixel-Level Sequential Image Classification

Our models were trained with the AdamW optimizer for up to 200 epochs. Hyperparameters for the CIFAR-10 model are reported in Table 11.

For our comparisons against ResNet-18, the main difference between the base models is that S4 uses LayerNorm by default while ResNet uses BatchNorm. The last ablation in Section 4.3 swaps the normalization type, using BatchNorm for S4 and LayerNorm for ResNet, to ablate this architectural difference. The experiments with augmentation take the base model and train with mild data augmentation: horizontal flips and random crops (with symmetric padding).
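The mild augmentation mentioned above corresponds to standard torchvision transforms; the 4-pixel padding amount below is our choice for illustration, as the text does not specify it.

```python
from torchvision import transforms

# Horizontal flips and random crops with symmetric padding, as described above.
augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4, padding_mode="symmetric"),  # padding amount assumed
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```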
D.3.5 Time Series Forecasting compared to Informer

We include a simple figure (Fig. 5) contrasting the architecture of S4 against that of the Informer [47]. In Fig. 5, the goal is to forecast a contiguous range of future predictions (green, length F) given a range of past context (blue, length C). We simply concatenate the entire context with a sequence of masks set to the length of the forecast window. This input is a single sequence of length C + F that is run through the same simple deep S4 model used throughout this work, which maps to an output of length C + F. We then use just the last F features as the forecasted predictions.

Tables 13 and 14 contain full results on all 50 settings considered by Zhou et al. [47]. S4 sets the best results on 40 out of 50 of these settings.

D.4 Visualizations

We visualize the convolutional filters K̄ learned by S4 for the Pathfinder and CIFAR-10 tasks in the figures below (e.g. Fig. 6 for Pathfinder).

Table 12: (Pixel-level image classification.) Citations refer to the original model; an additional citation indicates the work from which this baseline result is reported.

Model                | sMNIST | pMNIST | sCIFAR
Transformer [39, 41] | 98.9   | 97.9   | 62.2
CKConv [33]          | 99.32  | 98.54  | 63.74
TrellisNet [4]       | 99.20  | 98.13  | 73.42
TCN [3]              | 99.0   | 97.2   | -
LSTM [17, 19]        | 98.9   | 95.11  | 63.01
r-LSTM [39]          | 98.4   | 95.2   | 72.2
Dilated GRU [5]      | 99.0   | 94.6   | -
Dilated RNN [5]      | 98.0   | 96.1   | -
IndRNN [22]          | 99.0   | 96.0   | -
expRNN [21]          | 98.7   | 96.6   | -
UR-LSTM              | 99.28  | 96.96  | 71.00
UR-GRU [17]          | 99.27  | 96.51  | 74.4
LMU [42]             | -      | 97.15  | -
HiPPO-RNN [16]       | 98.9   | 98.3   | 61.1
UNIcoRNN [35]        | -      | 98.4   | -
LMUFFT [7]           | -      | 98.49  | -
LipschitzRNN [13]    | 99.4   | 96.3   | 64.2
S4                   | 99.63  | 98.70  | 91.13

Figure 5: Comparison of S4 and specialized time-series models for forecasting tasks. (Top Left) The forecasting task involves predicting future values of a time-series given past context. (Bottom Left) We perform simple forecasting using a sequence model such as S4 as a black box. (Right) Informer uses an encoder-decoder architecture designed specifically for forecasting problems involving a customized attention module (figure taken from Zhou et al. [47]).
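To make the masked-concatenation setup of Appendix D.3.5 (the bottom-left panel of Fig. 5) concrete, here is a minimal sketch of the input construction; the mask value and tensor shapes are our assumptions for illustration.

```python
import torch

def forecast_inputs(context, horizon, mask_value=0.0):
    """Build the single sequence of length C + F fed to the sequence model:
    the context is concatenated with `horizon` masked positions, and the
    model's last `horizon` outputs are read off as the forecast."""
    batch, C, d = context.shape
    masks = torch.full((batch, horizon, d), mask_value)
    return torch.cat([context, masks], dim=1)        # (batch, C + F, d)

# usage with any sequence-to-sequence model mapping (B, L, d) -> (B, L, d):
#   y = model(forecast_inputs(x_context, F)); forecast = y[:, -F:]
```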
Table 13: Univariate long sequence time-series forecasting results on four datasets (five cases). Each entry is MSE / MAE for the given prediction horizon; the final row counts the settings on which each method performs best.

ETTh1     | 24            | 48            | 168           | 336           | 720
S4        | 0.061 / 0.191 | 0.079 / 0.220 | 0.104 / 0.258 | 0.080 / 0.229 | 0.116 / 0.271
Informer  | 0.098 / 0.247 | 0.158 / 0.319 | 0.183 / 0.346 | 0.222 / 0.387 | 0.269 / 0.435
Informer† | 0.092 / 0.246 | 0.161 / 0.322 | 0.187 / 0.355 | 0.215 / 0.369 | 0.257 / 0.421
LogTrans  | 0.103 / 0.259 | 0.167 / 0.328 | 0.207 / 0.375 | 0.230 / 0.398 | 0.273 / 0.463
Reformer  | 0.222 / 0.389 | 0.284 / 0.445 | 1.522 / 1.191 | 1.860 / 1.124 | 2.112 / 1.436
LSTMa     | 0.114 / 0.272 | 0.193 / 0.358 | 0.236 / 0.392 | 0.590 / 0.698 | 0.683 / 0.768
DeepAR    | 0.107 / 0.280 | 0.162 / 0.327 | 0.239 / 0.422 | 0.445 / 0.552 | 0.658 / 0.707
ARIMA     | 0.108 / 0.284 | 0.175 / 0.424 | 0.396 / 0.504 | 0.468 / 0.593 | 0.659 / 0.766
Prophet   | 0.115 / 0.275 | 0.168 / 0.330 | 1.224 / 0.763 | 1.549 / 1.820 | 2.735 / 3.253

ETTh2     | 24            | 48            | 168           | 336           | 720
S4        | 0.095 / 0.234 | 0.191 / 0.346 | 0.167 / 0.333 | 0.189 / 0.361 | 0.187 / 0.358
Informer  | 0.093 / 0.240 | 0.155 / 0.314 | 0.232 / 0.389 | 0.263 / 0.417 | 0.277 / 0.431
Informer† | 0.099 / 0.241 | 0.159 / 0.317 | 0.235 / 0.390 | 0.258 / 0.423 | 0.285 / 0.442
LogTrans  | 0.102 / 0.255 | 0.169 / 0.348 | 0.246 / 0.422 | 0.267 / 0.437 | 0.303 / 0.493
Reformer  | 0.263 / 0.437 | 0.458 / 0.545 | 1.029 / 0.879 | 1.668 / 1.228 | 2.030 / 1.721
LSTMa     | 0.155 / 0.307 | 0.190 / 0.348 | 0.385 / 0.514 | 0.558 / 0.606 | 0.640 / 0.681
DeepAR    | 0.098 / 0.263 | 0.163 / 0.341 | 0.255 / 0.414 | 0.604 / 0.607 | 0.429 / 0.580
ARIMA     | 3.554 / 0.445 | 3.190 / 0.474 | 2.800 / 0.595 | 2.753 / 0.738 | 2.878 / 1.044
Prophet   | 0.199 / 0.381 | 0.304 / 0.462 | 2.145 / 1.068 | 2.096 / 2.543 | 3.355 / 4.664

ETTm1     | 24            | 48            | 96            | 288           | 672
S4        | 0.024 / 0.117 | 0.051 / 0.174 | 0.086 / 0.229 | 0.160 / 0.327 | 0.292 / 0.466
Informer  | 0.030 / 0.137 | 0.069 / 0.203 | 0.194 / 0.372 | 0.401 / 0.554 | 0.512 / 0.644
Informer† | 0.034 / 0.160 | 0.066 / 0.194 | 0.187 / 0.384 | 0.409 / 0.548 | 0.519 / 0.665
LogTrans  | 0.065 / 0.202 | 0.078 / 0.220 | 0.199 / 0.386 | 0.411 / 0.572 | 0.598 / 0.702
Reformer  | 0.095 / 0.228 | 0.249 / 0.390 | 0.920 / 0.767 | 1.108 / 1.245 | 1.793 / 1.528
LSTMa     | 0.121 / 0.233 | 0.305 / 0.411 | 0.287 / 0.420 | 0.524 / 0.584 | 1.064 / 0.873
DeepAR    | 0.091 / 0.243 | 0.219 / 0.362 | 0.364 / 0.496 | 0.948 / 0.795 | 2.437 / 1.352
ARIMA     | 0.090 / 0.206 | 0.179 / 0.306 | 0.272 / 0.399 | 0.462 / 0.558 | 0.639 / 0.697
Prophet   | 0.120 / 0.290 | 0.133 / 0.305 | 0.194 / 0.396 | 0.452 / 0.574 | 2.747 / 1.174

Weather   | 24            | 48            | 168           | 336           | 720
S4        | 0.125 / 0.254 | 0.181 / 0.305 | 0.198 / 0.333 | 0.300 / 0.417 | 0.245 / 0.375
Informer  | 0.117 / 0.251 | 0.178 / 0.318 | 0.266 / 0.398 | 0.297 / 0.416 | 0.359 / 0.466
Informer† | 0.119 / 0.256 | 0.185 / 0.316 | 0.269 / 0.404 | 0.310 / 0.422 | 0.361 / 0.471
LogTrans  | 0.136 / 0.279 | 0.206 / 0.356 | 0.309 / 0.439 | 0.359 / 0.484 | 0.388 / 0.499
Reformer  | 0.231 / 0.401 | 0.328 / 0.423 | 0.654 / 0.634 | 1.792 / 1.093 | 2.087 / 1.534
LSTMa     | 0.131 / 0.254 | 0.190 / 0.334 | 0.341 / 0.448 | 0.456 / 0.554 | 0.866 / 0.809
DeepAR    | 0.128 / 0.274 | 0.203 / 0.353 | 0.293 / 0.451 | 0.585 / 0.644 | 0.499 / 0.596
ARIMA     | 0.219 / 0.355 | 0.273 / 0.409 | 0.503 / 0.599 | 0.728 / 0.730 | 1.062 / 0.943
Prophet   | 0.302 / 0.433 | 0.445 / 0.536 | 2.441 / 1.142 | 1.987 / 2.468 | 3.859 / 1.144

ECL       | 48            | 168           | 336           | 720           | 960
S4        | 0.222 / 0.350 | 0.331 / 0.421 | 0.328 / 0.422 | 0.428 / 0.494 | 0.432 / 0.497
Informer  | 0.239 / 0.359 | 0.447 / 0.503 | 0.489 / 0.528 | 0.540 / 0.571 | 0.582 / 0.608
Informer† | 0.238 / 0.368 | 0.442 / 0.514 | 0.501 / 0.552 | 0.543 / 0.578 | 0.594 / 0.638
LogTrans  | 0.280 / 0.429 | 0.454 / 0.529 | 0.514 / 0.563 | 0.558 / 0.609 | 0.624 / 0.645
Reformer  | 0.971 / 0.884 | 1.671 / 1.587 | 3.528 / 2.196 | 4.891 / 4.047 | 7.019 / 5.105
LSTMa     | 0.493 / 0.539 | 0.723 / 0.655 | 1.212 / 0.898 | 1.511 / 0.966 | 1.545 / 1.006
DeepAR    | 0.204 / 0.357 | 0.315 / 0.436 | 0.414 / 0.519 | 0.563 / 0.595 | 0.657 / 0.683
ARIMA     | 0.879 / 0.764 | 1.032 / 0.833 | 1.136 / 0.876 | 1.251 / 0.933 | 1.370 / 0.982
Prophet   | 0.524 / 0.595 | 2.725 / 1.273 | 2.246 / 3.077 | 4.243 / 1.415 | 6.901 / 4.264

Count: S4 22, Informer 5, Informer† 0, LogTrans 0, Reformer 0, LSTMa 0, DeepAR 2, ARIMA 0, Prophet 0.
Table 14: Multivariate long sequence time-series forecasting results on four datasets (five cases). Each entry is MSE / MAE for the given prediction horizon; the final row counts the settings on which each method performs best.

ETTh1     | 24            | 48            | 168           | 336           | 720
S4        | 0.525 / 0.542 | 0.641 / 0.615 | 0.980 / 0.779 | 1.407 / 0.910 | 1.162 / 0.842
Informer  | 0.577 / 0.549 | 0.685 / 0.625 | 0.931 / 0.752 | 1.128 / 0.873 | 1.215 / 0.896
Informer† | 0.620 / 0.577 | 0.692 / 0.671 | 0.947 / 0.797 | 1.094 / 0.813 | 1.241 / 0.917
LogTrans  | 0.686 / 0.604 | 0.766 / 0.757 | 1.002 / 0.846 | 1.362 / 0.952 | 1.397 / 1.291
Reformer  | 0.991 / 0.754 | 1.313 / 0.906 | 1.824 / 1.138 | 2.117 / 1.280 | 2.415 / 1.520
LSTMa     | 0.650 / 0.624 | 0.702 / 0.675 | 1.212 / 0.867 | 1.424 / 0.994 | 1.960 / 1.322
LSTnet    | 1.293 / 0.901 | 1.456 / 0.960 | 1.997 / 1.214 | 2.655 / 1.369 | 2.143 / 1.380

ETTh2     | 24            | 48            | 168           | 336           | 720
S4        | 0.871 / 0.736 | 1.240 / 0.867 | 2.580 / 1.255 | 1.980 / 1.128 | 2.650 / 1.340
Informer  | 0.720 / 0.665 | 1.457 / 1.001 | 3.489 / 1.515 | 2.723 / 1.340 | 3.467 / 1.473
Informer† | 0.753 / 0.727 | 1.461 / 1.077 | 3.485 / 1.612 | 2.626 / 1.285 | 3.548 / 1.495
LogTrans  | 0.828 / 0.750 | 1.806 / 1.034 | 4.070 / 1.681 | 3.875 / 1.763 | 3.913 / 1.552
Reformer  | 1.531 / 1.613 | 1.871 / 1.735 | 4.660 / 1.846 | 4.028 / 1.688 | 5.381 / 2.015
LSTMa     | 1.143 / 0.813 | 1.671 / 1.221 | 4.117 / 1.674 | 3.434 / 1.549 | 3.963 / 1.788
LSTnet    | 2.742 / 1.457 | 3.567 / 1.687 | 3.242 / 2.513 | 2.544 / 2.591 | 4.625 / 3.709

ETTm1     | 24            | 48            | 96            | 288           | 672
S4        | 0.426 / 0.487 | 0.580 / 0.565 | 0.699 / 0.649 | 0.824 / 0.674 | 0.846 / 0.709
Informer  | 0.323 / 0.369 | 0.494 / 0.503 | 0.678 / 0.614 | 1.056 / 0.786 | 1.192 / 0.926
Informer† | 0.306 / 0.371 | 0.465 / 0.470 | 0.681 / 0.612 | 1.162 / 0.879 | 1.231 / 1.103
LogTrans  | 0.419 / 0.412 | 0.507 / 0.583 | 0.768 / 0.792 | 1.462 / 1.320 | 1.669 / 1.461
Reformer  | 0.724 / 0.607 | 1.098 / 0.777 | 1.433 / 0.945 | 1.820 / 1.094 | 2.187 / 1.232
LSTMa     | 0.621 / 0.629 | 1.392 / 0.939 | 1.339 / 0.913 | 1.740 / 1.124 | 2.736 / 1.555
LSTnet    | 1.968 / 1.170 | 1.999 / 1.215 | 2.762 / 1.542 | 1.257 / 2.076 | 1.917 / 2.941

Weather   | 24            | 48            | 168           | 336           | 720
S4        | 0.334 / 0.385 | 0.406 / 0.444 | 0.525 / 0.527 | 0.531 / 0.539 | 0.578 / 0.578
Informer  | 0.335 / 0.381 | 0.395 / 0.459 | 0.608 / 0.567 | 0.702 / 0.620 | 0.831 / 0.731
Informer† | 0.349 / 0.397 | 0.386 / 0.433 | 0.613 / 0.582 | 0.707 / 0.634 | 0.834 / 0.741
LogTrans  | 0.435 / 0.477 | 0.426 / 0.495 | 0.727 / 0.671 | 0.754 / 0.670 | 0.885 / 0.773
Reformer  | 0.655 / 0.583 | 0.729 / 0.666 | 1.318 / 0.855 | 1.930 / 1.167 | 2.726 / 1.575
LSTMa     | 0.546 / 0.570 | 0.829 / 0.677 | 1.038 / 0.835 | 1.657 / 1.059 | 1.536 / 1.109
LSTnet    | 0.615 / 0.545 | 0.660 / 0.589 | 0.748 / 0.647 | 0.782 / 0.683 | 0.851 / 0.757

ECL       | 48            | 168           | 336           | 720           | 960
S4        | 0.255 / 0.352 | 0.283 / 0.373 | 0.292 / 0.382 | 0.289 / 0.377 | 0.299 / 0.387
Informer  | 0.344 / 0.393 | 0.368 / 0.424 | 0.381 / 0.431 | 0.406 / 0.443 | 0.460 / 0.548
Informer† | 0.334 / 0.399 | 0.353 / 0.420 | 0.381 / 0.439 | 0.391 / 0.438 | 0.492 / 0.550
LogTrans  | 0.355 / 0.418 | 0.368 / 0.432 | 0.373 / 0.439 | 0.409 / 0.454 | 0.477 / 0.589
Reformer  | 1.404 / 0.999 | 1.515 / 1.069 | 1.601 / 1.104 | 2.009 / 1.170 | 2.141 / 1.387
LSTMa     | 0.486 / 0.572 | 0.574 / 0.602 | 0.886 / 0.795 | 1.676 / 1.095 | 1.591 / 1.128
LSTnet    | 0.369 / 0.445 | 0.394 / 0.476 | 0.419 / 0.477 | 0.556 / 0.565 | 0.605 / 0.599

Count: S4 18, Informer 5, Informer† 6, LogTrans 0, Reformer 0, LSTMa 0, LSTnet 0.

Figure 6: (Convolutional filters on Pathfinder) A random selection of filters learned by S4 in the first layer (top 2 rows) and last layer (bottom 2 rows) of the best model.