RWKV: Reinventing RNNs for the Transformer Era

Bo Peng1∗ Eric Alcaide2,3,4∗ Quentin Anthony2,5∗ Alon Albalak2,6 Samuel Arcadinho2,7 Huanqi Cao8 Xin Cheng9 Michael Chung10 Matteo Grella11 Kranthi Kiran GV12 Xuzheng He2 Haowen Hou13 Przemysław Kazienko14 Jan Kocoń14 Jiaming Kong15 Bartłomiej Koptyra14 Hayden Lau2 Krishna Sri Ipsit Mantri16 Ferdinand Mom17,18 Atsushi Saito2,19 Xiangru Tang20 Bolun Wang27 Johan S. Wind21 Stanisław Woźniak14 Ruichong Zhang8 Zhenyuan Zhang2 Qihang Zhao22,23 Peng Zhou27 Jian Zhu24 Rui-Jie Zhu25,26

1 RWKV Foundation 2 EleutherAI 3 University of Barcelona 4 Charm Therapeutics 5 Ohio State University 6 University of California, Santa Barbara 7 Zendesk 8 Tsinghua University 9 Peking University 10 Storyteller.io 11 Crisis24 12 New York University 13 National University of Singapore 14 Wroclaw University of Science and Technology 15 Databaker Technology Co. Ltd 16 Purdue University 17 Criteo AI Lab 18 Epita 19 Nextremer Co. Ltd. 20 Yale University 21 University of Oslo 22 University of Science and Technology of China 23 Kuaishou Technology Co. Ltd 24 University of British Columbia 25 University of California, Santa Cruz 26 University of Electronic Science and Technology of China 27 RuoxinTech

Abstract

Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, which parallelizes computations during training and maintains constant computational and memory complexity during inference, leading to the first non-transformer architecture to be scaled to tens of billions of parameters. Our experiments reveal that RWKV performs on par with similarly sized Transformers, suggesting that future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling the trade-offs between computational efficiency and model performance in sequence processing tasks.

∗ Equal first authorship. Others listed alphabetically.
Code at: https://github.com/BlinkDL/RWKV-LM

1 Introduction

Deep learning techniques have made significant strides in artificial intelligence, playing a pivotal role in various scientific and industrial applications. These applications often involve complex sequential data processing tasks that include natural language understanding, conversational AI, time-series analysis, and even indirect modalities that can be reframed as sequences, such as images and graphs (Brown et al., 2020; Ismail Fawaz et al., 2019; Wu et al., 2020; Albalak et al., 2022). Predominant among these techniques are RNNs, convolutional neural networks (CNNs), and the Transformer models (Vaswani et al., 2017).

Each of these has distinct drawbacks that restrict their efficiency in certain scenarios. RNNs suffer from the vanishing gradient problem, making them difficult to train for long sequences.
Additionally, they cannot be parallelized in the time dimension during training, which restricts their scalability (Hochreiter, 1998; Le and Zuidema, 2016). CNNs, on the other hand, are only adept at capturing local patterns, which limits their capacity to deal with long-range dependencies, crucial to many sequence processing tasks (Bai et al., 2018).

Transformer models emerged as a powerful alternative due to their ability to handle both local and long-range dependencies and their capability for parallelized training (Tay et al., 2022). Recent models such as GPT-3 (Brown et al., 2020), ChatGPT (OpenAI, 2022; Kocoń et al., 2023), GPT-4 (OpenAI, 2023), LLaMA (Touvron et al., 2023), and Chinchilla (Hoffmann et al., 2022) exemplify the capability of this architecture, pushing the frontiers of what's possible in NLP. Despite these significant advancements, the self-attention mechanism inherent to Transformers poses unique challenges, primarily due to its quadratic complexity. This complexity renders the architecture computationally expensive and memory-intensive for tasks involving long input sequences or in resource-constrained situations. These limitations have spurred a wealth of research aiming to improve the scaling properties of Transformers, often at the expense of some of the properties that make it so effective (Wang et al., 2020; Zaheer et al., 2020; Dao et al., 2022a).

Model                 Time                Space
Transformer           O(T^2 d)            O(T^2 + T d)
Reformer              O(T log T · d)      O(T log T + T d)
Linear Transformers   O(T d^2)            O(T d + d^2)
Performer             O(T d^2 log d)      O(T d log d + d^2 log d)
AFT-full              O(T^2 d)            O(T d)
MEGA                  O(c T d)            O(c T d)
RWKV (ours)           O(T d)              O(d)

Table 1: Complexity comparison with different Transformers: Reformer (Kitaev et al., 2020), Linear Transformer (Katharopoulos et al., 2020), Performer (Choromanski et al., 2020), AFT (Zhai et al., 2021), MEGA (Ma et al., 2023). Here T denotes the sequence length, d the feature dimension, and c is MEGA's chunk size of quadratic attention.

To tackle these challenges, we introduce the Receptance Weighted Key Value (RWKV) model, a novel architecture that effectively combines the strengths of RNNs and Transformers while circumventing key drawbacks. RWKV is carefully designed to alleviate the memory bottleneck and quadratic scaling associated with Transformers (Katharopoulos et al., 2020) with a more efficient linear scaling, while still preserving the rich, expressive properties that make the Transformer a dominant architecture in the field.

One of the defining characteristics of RWKV is its ability to offer parallelized training and robust scalability, similar to Transformers. Moreover, we have reformulated the attention mechanism in RWKV to introduce a variant of linear attention, eschewing the traditional dot-product token interaction in favor of more effective channel-directed attention. This approach contrasts significantly with the traditional Transformer architecture, where specific token interactions predominantly drive attention. The implementation of linear attention in RWKV is carried out without approximation, which offers a considerable improvement in efficiency and enhances the scalability, see Table 1.

The overarching motivation behind developing RWKV is to bridge the gap between computational efficiency and expressive capacity in neural network architectures. It offers a promising and viable solution for handling tasks involving large-scale models with billions of parameters, exhibiting competitive performance at a fraction of the computational cost.
Our experimental results suggest that RWKV could be a valuable tool for addressing the ongoing challenges in scaling and deploying AI models across various domains, particularly those involving sequential data processing. Thus, RWKV paves the way for the next generation of more sustainable and computationally efficient AI models for sequence processing tasks.

Our contributions in this paper are as follows:

• We introduce the RWKV network architecture, which combines the advantages of RNNs and Transformers while mitigating their known limitations.
• We propose a new attention mechanism reformulation that results in linear attention, eschewing the quadratic complexity associated with standard Transformer models.
• We conduct a comprehensive series of experiments on benchmark datasets to showcase the performance, efficiency and scaling of RWKV in managing tasks involving large-scale models and long-range dependencies.
• We release pretrained models ranging in size from 169 million to 14 billion parameters trained on the Pile (Gao et al., 2020), available at https://huggingface.co/RWKV.

2 Related Work

Recently, a number of techniques have been proposed to address the limitations of transformers.

Optimizing Attention Mechanism Many transformer variants ("x-formers") have been introduced to reduce the complexity of transformers (Tay et al., 2022), including sparse attention (Beltagy et al., 2020; Kitaev et al., 2020; Guo et al., 2022), approximating the full attention matrix (Wang et al., 2020; Ma et al., 2021; Choromanski et al., 2020), combining chunked attention with gating (Ma et al., 2023) and other efficient methods (Katharopoulos et al., 2020; Jaegle et al., 2021). Some recent works like FlashAttention (Dao et al., 2022a) and others (Rabe and Staats, 2022; Jang et al., 2019) share similarities with RWKV's chunked computation scheme. Despite being memory-efficient, their time complexity remains quadratic or contains chunk size as a hidden factor. In contrast, RWKV achieves better space and time complexity during inference by formulating linear attention as an RNN.

Attention Free Models Another line of research replaces the attention mechanism with other modules to scale to long sequences. MLP-Mixer and others (Tolstikhin et al., 2021; Liu et al., 2021) proposed the replacement of attention by Multi-Layer Perceptrons (MLPs) in computer vision tasks. The Attention Free Transformer (AFT) (Zhai et al., 2021) replaces dot-product self-attention with a computationally efficient alternative which can be seen as a multi-head attention where each feature dimension corresponds to a head. Inspired by AFT, RWKV takes a similar approach but modifies the interaction weights for simplicity such that it can be transformed into an RNN. In parallel, RNN-style (Hochreiter and Schmidhuber, 1997; Chung et al., 2014) recursive components have also been modified to increase context length, such as the Recurrent Memory Transformer (Bulatov et al., 2022, 2023) and Linear Recurrent Units (Orvieto et al., 2023). State space models (SSM) like S4 (Gu et al., 2022) and its variants (Dao et al., 2022b; Poli et al., 2023) are also proposed.

Notably, the Quasi-Recurrent Neural Network (QRNN) (Bradbury et al., 2017) uses both convolutional layers and recurrent pooling functions across timesteps and channels. While QRNN utilizes convolutional filters with fixed sizes, RWKV employs a time-mixing module as an attention mechanism with time-decaying factors. Different from the element-wise pooling in QRNN, RWKV includes a parametrized channel-mixing module (see the green blocks in Fig. 1c) that is parallelizable.
3 Background

Here we briefly review the fundamentals of RNNs and Transformers.

3.1 Recurrent Neural Networks (RNNs)

Popular RNN architectures such as LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Chung et al., 2014) are characterized by the following formulation (shown for LSTM; others can be reasoned similarly):

f_t = σ_g(W_f x_t + U_f h_{t−1} + b_f),   (1)
i_t = σ_g(W_i x_t + U_i h_{t−1} + b_i),   (2)
o_t = σ_g(W_o x_t + U_o h_{t−1} + b_o),   (3)
c̃_t = σ_c(W_c x_t + U_c h_{t−1} + b_c),   (4)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t,          (5)
h_t = o_t ⊙ σ_h(c_t).                      (6)

The data flow of RNNs is shown in Fig. 1a. Although RNNs can be factored into two linear blocks (W and U) and an RNN-specific block (1)–(6), as noted by Bradbury et al. (2017), the data dependency relying on previous time steps prohibits parallelizing these typical RNNs.

3.2 Transformers and AFT

Introduced by Vaswani et al. (2017), Transformers are a class of neural networks that have become the dominant architecture for several NLP tasks. Instead of operating on sequences step-by-step like RNNs, Transformers rely on attention mechanisms to capture relationships between all input and all output tokens:

Attn(Q, K, V) = softmax(QK^⊤)V,   (7)

where the multi-headness and scaling factor 1/√d_k is omitted for convenience. The core QK^⊤ multiplication is an ensemble of pairwise attention scores between each token in a sequence, which can be decomposed as vector operations:

Attn(Q, K, V)_t = (Σ_{i=1}^{T} e^{q_t^⊤ k_i} v_i) / (Σ_{i=1}^{T} e^{q_t^⊤ k_i}).   (8)

In AFT (Zhai et al., 2021), this is alternately formulated as

Attn^+(W, K, V)_t = (Σ_{i=1}^{t} e^{w_{t,i} + k_i} v_i) / (Σ_{i=1}^{t} e^{w_{t,i} + k_i}),   (9)

where {w_{t,i}} ∈ R^{T×T} is the learned pair-wise position biases, and each w_{t,i} is a scalar.

Inspired by AFT, we let each w_{t,i} in RWKV be a channel-wise time decay vector multiplied by the relative position, traced backwards from the current time as it decays:

w_{t,i} = −(t − i)w,   (10)

where w ∈ (R_{≥0})^d, with d the number of channels. We require w to be non-negative to ensure that e^{w_{t,i}} ≤ 1 and the per-channel weights decay backwards in time.
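To make the contrast between Eq. (8) and Eq. (9) concrete, the following is a minimal, hedged sketch that evaluates both forms for a single (causal) query position: standard attention builds scalar scores from q·k products, while the AFT-style form drops q entirely and uses only keys plus the per-channel position bias of Eq. (10). It is an illustration under assumed shapes and tensor names, not code from the RWKV or AFT repositories.

```python
import torch

T, d = 8, 16                       # sequence length, number of channels
q, k, v = (torch.randn(T, d) for _ in range(3))
w = torch.rand(d)                  # per-channel decay, w >= 0 as required by Eq. (10)

t = T - 1                          # evaluate the last position, causal setting

# Eq. (8): standard dot-product attention for position t (one scalar score per pair)
scores = q[t] @ k[:t + 1].T                       # shape (t+1,)
alpha = torch.softmax(scores, dim=0)
attn_out = alpha @ v[:t + 1]                      # shape (d,)

# Eq. (9) + (10): AFT-style attention with channel-wise bias w_{t,i} = -(t - i) * w
i = torch.arange(t + 1).unsqueeze(1)              # shape (t+1, 1)
bias = -(t - i) * w                               # shape (t+1, d), decays into the past
weights = torch.exp(bias + k[:t + 1])             # no q·k product, only k plus position bias
aft_out = (weights * v[:t + 1]).sum(0) / weights.sum(0)   # shape (d,)

print(attn_out.shape, aft_out.shape)
```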
Figure 1: Computation structure of the RWKV in comparison to QRNN and RNN (Vanilla, LSTM, GRU, etc.) architectures. Panels: (a) RNN, (b) QuasiRNN (Bradbury et al., 2017), (c) RWKV. Color codes: orange indicates time-mixing, convolutions or matrix multiplications, and the continuous block indicates that these computations can proceed simultaneously; blue signifies parameterless functions that operate concurrently along the channel or feature dimension (element-wise). Green indicates channel-mixing.

4 The Receptance Weighted Key Value (RWKV) Model

The RWKV architecture derives its name from the four primary model elements used in the time-mixing and channel-mixing blocks:

• R: Receptance vector acting as the acceptance of past information.
• W: Weight is the positional weight decay vector. A trainable model parameter.
• K: Key is a vector analogous to K in traditional attention.
• V: Value is a vector analogous to V in traditional attention.

Interactions between the main elements for every timestep are multiplicative, as illustrated in Fig. 2.

4.1 High-Level Summary

The RWKV architecture is comprised of a series of stacked residual blocks, each formed by a time-mixing and a channel-mixing sub-block with recurrent structures. The recurrence is formulated both as a linear interpolation between the current input and the input at the previous time step (a technique we refer to as time-shift mixing or token shift, indicated by the diagonal lines in Fig. 3), which can be adjusted independently for every linear projection of the input embedding (e.g., R, K, V in time-mixing, and R, K in channel-mixing), and as the time-dependent update of the WKV which is formalized in equation 14. The WKV computation is similar to AFT (Zhai et al., 2021), but W is now a channel-wise vector multiplied by relative position rather than a pairwise matrix in AFT. We also introduce a vector U for separately attending to the current token in order to compensate for potential degeneration of W (see Appendix G for more details).

Figure 2: RWKV block elements (left) and RWKV residual block with a final head for language modeling (right) architectures.

Figure 3: RWKV architecture for language modelling.

The time-mixing block is given by:

r_t = W_r · (µ_r x_t + (1 − µ_r) x_{t−1}),   (11)
k_t = W_k · (µ_k x_t + (1 − µ_k) x_{t−1}),   (12)
v_t = W_v · (µ_v x_t + (1 − µ_v) x_{t−1}),   (13)
wkv_t = (Σ_{i=1}^{t−1} e^{−(t−1−i)w + k_i} v_i + e^{u + k_t} v_t) / (Σ_{i=1}^{t−1} e^{−(t−1−i)w + k_i} + e^{u + k_t}),   (14)
o_t = W_o · (σ(r_t) ⊙ wkv_t),   (15)

where the WKV computation, wkv_t, plays the role of Attn(Q, K, V) in Transformers without incurring a quadratic cost, as interactions are between scalars. Intuitively, as time t increases, the vector o_t is dependent on a long history, represented by the summation of an increasing number of terms. For the target position t, RWKV performs a weighted summation in the positional interval of [1, t], and then multiplies with the receptance σ(r). Therefore, interactions are multiplicative inside a given timestep and summed over different timesteps.

Further, the channel-mixing block is given by:

r_t = W_r · (µ_r x_t + (1 − µ_r) x_{t−1}),   (16)
k_t = W_k · (µ_k x_t + (1 − µ_k) x_{t−1}),   (17)
o_t = σ(r_t) ⊙ (W_v · max(k_t, 0)^2),   (18)

where we adopt squared ReLU activation (So et al., 2021). Note that in both time-mixing and channel-mixing, by taking the sigmoid of the receptance, we're intuitively using it as a "forget gate" to eliminate unnecessary historical information.

4.2 Transformer-like Parallelization

RWKV can be efficiently parallelized in what we call a time-parallel mode, reminiscent of Transformers. The time complexity of processing a batch of sequences in a single layer is O(BTd^2), which mainly consists of matrix multiplications W_□, □ ∈ {r, k, v, o} (assuming B sequences, T maximum tokens and d channels). Meanwhile, updating attention scores wkv_t requires a serial scan (see Appendix B for more detail) and has complexity O(BTd).

The matrix multiplications can be parallelized akin to W_□, □ ∈ {Q, K, V, O} in typical Transformers. The element-wise WKV computation is time-dependent, but can be readily parallelized along the other two dimensions (Lei et al., 2018). If the sequence is very long, more sophisticated methods such as Martin and Cundy (2017) that parallelize over sequence length could be used. Additionally, token shift is implemented as a simple offset in the temporal dimension at each block using the PyTorch (Paszke et al., 2019) library as nn.ZeroPad2d((0,0,1,-1)).
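As a concrete illustration of Eqs. (11)–(18), here is a minimal, unoptimized PyTorch sketch of the two sub-blocks. It evaluates Eq. (14) with an explicit per-position loop rather than the custom CUDA kernel or the numerically safe recursion of Appendix B, and the class, parameter, and wiring choices (including the schematic residual at the end) are our own assumptions, not the official RWKV-LM implementation.

```python
import torch
import torch.nn as nn

class RWKVTimeMix(nn.Module):
    """Naive time-mixing sub-block implementing Eqs. (11)-(15)."""
    def __init__(self, d):
        super().__init__()
        self.mu_r, self.mu_k, self.mu_v = (nn.Parameter(torch.rand(d)) for _ in range(3))
        self.w = nn.Parameter(torch.rand(d))     # per-channel decay (kept >= 0 here)
        self.u = nn.Parameter(torch.zeros(d))    # "bonus" applied to the current token
        self.Wr, self.Wk, self.Wv, self.Wo = (nn.Linear(d, d, bias=False) for _ in range(4))

    def forward(self, x):                        # x: (T, d); the shifted input at t=0 is zero
        x_prev = torch.cat([torch.zeros_like(x[:1]), x[:-1]])   # token shift
        r = self.Wr(self.mu_r * x + (1 - self.mu_r) * x_prev)
        k = self.Wk(self.mu_k * x + (1 - self.mu_k) * x_prev)
        v = self.Wv(self.mu_v * x + (1 - self.mu_v) * x_prev)
        T = x.shape[0]
        out = torch.zeros_like(x)
        for t in range(T):                       # Eq. (14): weighted sum over [0, t]
            decay = -(t - 1 - torch.arange(t)).unsqueeze(1) * self.w   # (t, d)
            e = torch.exp(decay + k[:t])
            num = (e * v[:t]).sum(0) + torch.exp(self.u + k[t]) * v[t]
            den = e.sum(0) + torch.exp(self.u + k[t])
            out[t] = num / den
        return self.Wo(torch.sigmoid(r) * out)   # Eq. (15): receptance gate

class RWKVChannelMix(nn.Module):
    """Channel-mixing sub-block implementing Eqs. (16)-(18)."""
    def __init__(self, d):
        super().__init__()
        self.mu_r, self.mu_k = (nn.Parameter(torch.rand(d)) for _ in range(2))
        self.Wr, self.Wk, self.Wv = (nn.Linear(d, d, bias=False) for _ in range(3))

    def forward(self, x):
        x_prev = torch.cat([torch.zeros_like(x[:1]), x[:-1]])   # token shift
        r = self.Wr(self.mu_r * x + (1 - self.mu_r) * x_prev)
        k = self.Wk(self.mu_k * x + (1 - self.mu_k) * x_prev)
        return torch.sigmoid(r) * self.Wv(torch.relu(k) ** 2)   # squared ReLU, Eq. (18)

x = torch.randn(10, 64)
block_out = RWKVChannelMix(64)(x + RWKVTimeMix(64)(x))   # residual wiring is schematic only
print(block_out.shape)
```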
4.3 RNN-like Sequential Decoding

It is common in recurrent networks to use the output at state t as the input at state t + 1. This is especially evident in the autoregressive decoding inference of a language model, which requires each token to be computed before being fed into the next step, making it possible for RWKV to take advantage of its RNN-like structure, referred to as time-sequential mode. In such circumstances, RWKV can be conveniently formulated recursively for decoding during inference, as shown in Appendix B, which leverages the advantage that each output token is dependent only on the latest state, which is of constant size, irrespective of the sequence length.

It then behaves as an RNN decoder, yielding constant speed and memory footprint with respect to the sequence length, enabling the processing of longer sequences more efficiently. In contrast, self-attention typically requires a KV cache growing linearly with respect to the sequence length, resulting in degraded efficiency and increasing memory footprint and time as the sequence grows longer.

4.4 Software Implementation

RWKV is originally implemented using the PyTorch Deep Learning Library (Paszke et al., 2019) and a custom CUDA kernel for the WKV computation explained in Section 4.7. Although RWKV is a general recurrent network, its current implementation focuses on the task of language modeling (RWKV-LM). The model architecture is comprised of an embedding layer, for which we follow the setup described in Section 4.7, and several identical residual blocks applied sequentially, as seen in Fig. 2 and 3, following the principles outlined in Section 4.6. After the last block, a simple output projection head composed of a LayerNorm (Ba et al., 2016) and a linear projection is used to obtain the logits to be used in the next-token prediction task and calculate the cross entropy loss during training. Both the embeddings generated after the last residual block and the logits could also be used later for downstream NLP tasks. Training is performed in time-parallel mode (Section 4.2), while autoregressive inference and a potential chat interface (https://github.com/BlinkDL/ChatRWKV) leverage the time-sequential mode (Section 4.3).
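The constant-memory decoding described above can be sketched as a single time-sequential step that carries only a fixed-size state: the previous (shifted) input plus the numerator/denominator pair of Appendix B, Eqs. (20)–(22), in their plain, non-exponent-shifted form. The function name, parameter packing, and random placeholders below are ours for illustration, not the ChatRWKV inference code.

```python
import torch

def time_mix_step(x_t, state, params):
    """One RNN-style decoding step; the state size does not depend on how many tokens came before."""
    x_prev, a, b = state                          # shifted input, numerator, denominator
    mu_r, mu_k, mu_v, w, u, Wr, Wk, Wv, Wo = params
    r = Wr @ (mu_r * x_t + (1 - mu_r) * x_prev)
    k = Wk @ (mu_k * x_t + (1 - mu_k) * x_prev)
    v = Wv @ (mu_v * x_t + (1 - mu_v) * x_prev)
    wkv = (a + torch.exp(u + k) * v) / (b + torch.exp(u + k))   # Eq. (20)
    a = torch.exp(-w) * a + torch.exp(k) * v                    # Eq. (21)
    b = torch.exp(-w) * b + torch.exp(k)                        # Eq. (22)
    out = Wo @ (torch.sigmoid(r) * wkv)
    return out, (x_t, a, b)

d = 64
params = (*(torch.rand(d) for _ in range(3)), torch.rand(d), torch.zeros(d),
          *(torch.randn(d, d) / d**0.5 for _ in range(4)))
state = (torch.zeros(d), torch.zeros(d), torch.zeros(d))
for token_embedding in torch.randn(5, d):         # decode 5 steps; memory stays constant
    y, state = time_mix_step(token_embedding, state, params)
print(y.shape, [s.shape for s in state])
```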
4.5 Gradient Stability and Layer Stacking

The RWKV architecture has been designed as a fusion of both Transformers and RNNs, offering the advantage of stable gradients and deeper architectures of Transformers compared to traditional RNNs, while being efficient in inference.

Previous work has sought to tackle the problem of gradient stability in RNNs with a variety of techniques, including using non-saturated activation functions (Chandar et al., 2019), gating mechanisms (Gu et al., 2019), gradient clipping (Pascanu et al., 2012), and adding constraints (Kanai et al., 2017; Miller and Hardt, 2018). While these techniques have seen little success, RWKV avoids the problem inherently by utilizing softmax in conjunction with RNN-style updates.

The RWKV model features a single-step process for updating attention-like scores, which includes a time-dependent softmax operation that helps numerical stability and guards against vanishing gradients (for rigorous proof, see Appendix F). Intuitively, this operation ensures the gradient is propagated along the most relevant path. Layer normalization (Ba et al., 2016) is another key aspect of the architecture which enhances the training dynamics of deep neural networks by stabilizing gradients, addressing both vanishing and exploding gradient issues.

These design elements not only contribute to the RWKV architecture's stability and learning capabilities but enable the stacking of multiple layers in a manner that surpasses the capabilities of any existing RNN. In doing so, the model is able to capture more complex patterns across various levels of abstraction (see also Appendix G).

4.6 Harnessing Temporal Structure for Sequential Data Processing

RWKV captures and propagates sequential information through the combination of three mechanisms: recurrence, time decay and token shift.

The recurrence in the time-mixing block of RWKV is the basis for the model's capacity to capture intricate relationships between sequence elements and to propagate locality information through time.

The time decay mechanism (e^{−w} and e^{u} in equation 14) maintains sensitivity to the positional relationship between sequence elements. By gradually diminishing the influence of past information over time, the model preserves a sense of temporal locality and progression, which is essential for sequential processing. This treatment of positional information in sequential data exhibits similarities to the Attention with Linear Biases (ALiBi) model (Press et al., 2022), where the linear biases facilitate input length extrapolation. In this context, the RWKV architecture can be perceived as a trainable version of ALiBi, seamlessly incorporating positional information without the necessity for explicit encoding. It can also be seen as an extension of the gated convolution introduced in Zhai et al. (2021) to the full sequence length until a given step.

The token shift, or time-shift mixing (diagonal arrows in Figure 3), also contributes to the model's adaptation to sequential data. By linearly interpolating between the current input and the previous time step input, the model naturally aggregates and gates information in the input channels. The overall structure of time-shift mixing bears resemblance to the causal convolution with no dilations in WaveNet (van den Oord et al., 2016), which is a classical architecture used for forecasting time series data.

4.7 Additional Optimizations

Custom Kernels To address inefficiencies in the WKV computation due to the sequential nature of the task when using standard deep learning frameworks, we implement a custom CUDA kernel so as to launch a single compute kernel on training accelerators. All other parts of the model are matrix multiplications and point-wise operations that can already be efficiently parallelized.

FFN with R gate Prior research (Tolstikhin et al., 2021; Liu et al., 2021; Yu et al., 2022) suggests that self-attention may not be as essential in Transformer-based vision tasks as previously thought. Although it provided us with some insights, replacing self-attention entirely in natural language tasks could be too drastic. In our study, we partially dismantle the attention mechanism by replacing the fixed QKV formula with KV and introducing a new time-decaying factor W. This approach enables us to incorporate token and channel-mixing components akin to MLP-Mixer (Tolstikhin et al., 2021) and a gating unit R similar to gMLP (Liu et al., 2021), which enhance the performance of our RWKV model.

Small Init Embedding During the initial stage of training a transformer model (Vaswani et al., 2017), we observe that the embedding matrix undergoes slow changes, which pose a challenge for the model to deviate from its initial noisy embedding state. To mitigate this issue, we propose an approach that involves initializing the embedding matrix with small values and subsequently applying an additional LayerNorm operation. By implementing this technique, we accelerate and stabilize the training process, enabling the training of deep architectures with post-LN components. The effectiveness of this approach is demonstrated in Figure 8, where it is shown to facilitate improved convergence by allowing the model to quickly transition away from the initially small embedding. This is achieved through small changes following a single step, which in turn lead to substantial alterations in directions and subsequently significant changes after the LayerNorm operation.
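A minimal sketch of the small-initialization embedding just described: the embedding weights start in U(−1e-4, 1e-4) and an extra LayerNorm is applied right after the lookup. The class name and default hyperparameters are illustrative assumptions, not the released training configuration.

```python
import torch
import torch.nn as nn

class SmallInitEmbedding(nn.Module):
    def __init__(self, vocab_size=50277, d_model=768, init_range=1e-4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        nn.init.uniform_(self.emb.weight, -init_range, init_range)  # tiny initial values
        self.ln = nn.LayerNorm(d_model)        # extra LayerNorm right after the embedding

    def forward(self, token_ids):
        # Because the embeddings start tiny, even a small update changes their direction a lot,
        # and the LayerNorm amplifies that change for the rest of the network.
        return self.ln(self.emb(token_ids))

tokens = torch.randint(0, 50277, (4, 128))
print(SmallInitEmbedding()(tokens).shape)      # (4, 128, 768)
```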
Custom Initialization Building on principles from previous works (He et al., 2016; Jumper et al., 2021), we initialize parameters to values as similar as possible to an identity mapping while breaking symmetry so there is a clean information path. Most weights are initialized to zero. No biases are used for linear layers. Specific formulas are given in Appendix D. We find the choice of initialization to be significant in convergence speed and quality (see Appendix E).

5 Evaluations

In this section, we focus on evaluating to answer the following questions:

• RQ1: Is RWKV competitive against quadratic transformer architectures with an equal number of parameters and training tokens?
• RQ2: When increasing the number of parameters, does RWKV remain competitive against quadratic transformer architectures?
• RQ3: Does increasing parameters of RWKV yield better language modeling loss, when RWKV models are trained for context lengths that most open-sourced quadratic transformers cannot efficiently process?

Addressing RQ1 and RQ2, from Fig. 4, we can see that RWKV is very competitive on six benchmarks (Winogrande, PIQA, ARC-C, ARC-E, LAMBADA, and SciQ) against major open source quadratic complexity transformer models: Pythia (Biderman et al., 2023), OPT (Zhang et al., 2022) and BLOOM (Scao et al., 2022). RWKV even outperforms Pythia and GPT-Neo (Black et al., 2022) in four tasks: PIQA, OBQA, ARC-E, and COPA (see details in Appendix H).

Figure 4: Zero-Shot Performance on (a) Winogrande, (b) PIQA, (c) ARC-Challenge, (d) ARC-Easy, (e) LAMBADA, and (f) SciQ. The horizontal axis is the number of parameters and the vertical axis is accuracy.

For RQ3, Fig. 5 shows that increasing context length leads to lower test loss on the Pile, an indication that RWKV can make effective use of long contextual information.

Figure 5: Increasing context length contributes to lower test loss on the Pile (Gao et al., 2020).

6 Inference Experiments

We benchmark inference requirements according to size and family. Specifically, we evaluate text generation speed and memory requirements on typical compute platforms including CPU (x86) and GPU (NVIDIA A100 80GB). For all our experiments we use float32 precision. We include all model parameters in the parameter count, including both embedding and non-embedding layers. Performance under different quantization setups is left to further work. See Appendix I for more results.

Figure 6: Cumulative time during text generation for different LLMs.

Additionally, we carried out comparative studies on RWKV-4 and ChatGPT / GPT-4 (see Appendix J). They revealed that RWKV-4 is very sensitive to prompt engineering. When the prompts were adjusted from the ones used for GPT to ones more suitable for RWKV, the F1-measure performance even increased from 44.2% to 74.8%.
7 Future Work

There are several promising directions for future work on the RWKV architecture:

• Increasing model expressivity with enhanced time-decay formulations and exploring initial model states while maintaining efficiency.
• Further improving RWKV computational efficiency by applying a parallel scan in the wkv_t step to reduce the computational cost to O(B log(T) d).
• Investigating the application of RWKV to encoder-decoder architectures and potential replacement of the cross-attention mechanism. This could have applicability in seq2seq or multimodal settings, enhancing efficiency both in training and inference.
• Leveraging RWKV's state (or context) for interpretability, predictability in sequence data and safety. Manipulating the hidden state could also guide behavior and allow greater customizability through prompt tuning.
• Exploring fine-tuned models in specific settings for enhanced interaction with humans (Ouyang et al., 2022). Particularly interesting would be the performance under different datasets and specific use cases.
• Adapting parameter-efficient fine-tuning methods such as LoRA (Hu et al., 2022) and characterizing behavior under different quantization schemes for the proposed architecture.

8 Conclusions

We introduced RWKV, a new approach to RNN models exploiting the potential of time-based mixing components. RWKV introduces several key strategies which allow it to capture locality and long-range dependencies, while addressing limitations of current architectures by: (1) replacing the quadratic QK attention by a scalar formulation with linear cost, (2) reformulating recurrence and sequential inductive biases to unlock efficient training parallelization and efficient inference, and (3) enhancing training dynamics using custom initializations.

We benchmark the proposed architecture in a wide variety of NLP tasks and show comparable performance to SoTA with reduced cost. Further experiments on expressivity, interpretability, and scaling showcase the model capabilities and draw parallels in behavior between RWKV and other LLMs. RWKV opens a new door to scalable and efficient architectures to model complex relationships in sequential data. While many alternatives to Transformers have been proposed with similar claims, ours is the first to back up those claims with pretrained models with tens of billions of parameters.

9 Limitations

While our proposed RWKV model has demonstrated promising results regarding training and memory efficiency during inference, some limitations should be acknowledged and addressed in future work. First, the linear attention of RWKV leads to significant efficiency gains, but it may also limit the model's performance on tasks that require recalling minutiae information over very long contexts. This is due to the funneling of information through a single vector representation over many time steps, compared with the full information maintained by the quadratic attention of standard Transformers. In other words, the model's recurrent architecture inherently limits its ability to "look back" at previous tokens, as opposed to traditional self-attention mechanisms. While learned time decay helps prevent the loss of information, it is mechanistically limited compared to full self-attention.

Another limitation of this work is the increased importance of prompt engineering in comparison to standard Transformer models.
The linear attention mechanism used in RWKV limits the information from the prompt that will be carried over to the model’s continuation. As a result, carefully designed prompts may be even more crucial for the model to perform well on tasks. Acknowledgements We acknowledge EleutherAI and StabilityAI for compute access and technical support in development of RWKV. We also acknowledge the members of the RWKV Discord server for their help and work on further extending the applicability of RWKV to different domains. Finally, we thank Stella Biderman for feedback on the paper. References Alon Albalak, Yi-Lin Tuan, Pegah Jandaghi, Connor Pryor, Luke Yoffe, Deepak Ramachandran, Lise Getoor, Jay Pujara, and William Yang Wang. 2022. FETA: A benchmark for few-sample task transfer in open-domain dialogue. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10936–10953, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa Anke, and Leonardo Neves. 2020. TweetEval: Unified benchmark and comparative evaluation for tweet classification. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1644–1650, Online. Association for Computational Linguistics. Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150. Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023. Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373. Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence. Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2022. Gpt-neo: Large scale autoregressive language modeling with mesh-tensorflow, 2021. URL: https://doi. org/10.5281/zenodo, 5297715. James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2017. Quasi-recurrent neural networks. In ICLR. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901. Aydar Bulatov, Yuri Kuratov, and Mikhail S. Burtsev. 2023. Scaling transformer to 1m tokens and beyond with rmt. Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. 2022. Recurrent memory transformer. Advances in Neural Information Processing Systems, 35:11079– 11091. A. P. Sarath Chandar, Chinnadhurai Sankar, Eugene Vorontsov, Samira Ebrahimi Kahou, and Yoshua Bengio. 2019. Towards non-saturating recurrent units for modelling long-term dependencies. In AAAI Conference on Artificial Intelligence. Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. 2020. Rethinking attention with performers. 
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Deep Learning and Representation Learning Workshop. Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. In arXiv:1803.05457. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. In arXiv, volume abs/2110.14168. Tri Dao, Daniel Y Fu, Stefano Ermon, Atri Rudra, and Christopher Re. 2022a. Flashattention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems. Tri Dao, Daniel Y Fu, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. 2022b. Hungry hungry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052. Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan S. Cowen, Gaurav Nemade, and Sujith Ravi. 2020. Goemotions: A dataset of finegrained emotions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 4040–4054. Association for Computational Linguistics. Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. 2020. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027. Albert Gu, Karan Goel, and Christopher Ré. 2022. Efficiently modeling long sequences with structured state spaces. In The International Conference on Learning Representations (ICLR). Albert Gu, Çaglar Gülçehre, Tom Le Paine, Matthew W. Hoffman, and Razvan Pascanu. 2019. Improving the gating mechanism of recurrent neural networks. ArXiv, abs/1910.09890. Mandy Guo, Joshua Ainslie, David C Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, and Yinfei Yang. 2022. Longt5: Efficient text-to-text transformer for long sequences. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 724–736. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In International Conference on Learning Representations. Sepp Hochreiter. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. 2022. Training compute-optimal large language models. Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations. 
Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. 2019. Deep learning for time series classification: a review. Data mining and knowledge discovery, 33(4):917–963. Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. 2021. Perceiver: General perception with iterative attention. In International conference on machine learning, pages 4651–4664. PMLR. Hanhwi Jang, Joonsung Kim, Jae-Eon Jo, Jaewon Lee, and Jangwoo Kim. 2019. Mnnfast: A fast and scalable system architecture for memory-augmented neural networks. In Proceedings of the 46th International Symposium on Computer Architecture, pages 250–263. Matt Gardner Johannes Welbl Nelson F. Liu. 2017. Crowdsourcing multiple choice science questions. In DOI:10.18653/v1/W17-4413. Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In ACL. John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, and et al. 2021. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589. Sekitoshi Kanai, Yasuhiro Fujiwara, and Sotetsu Iwamura. 2017. Preventing gradient explosions in gated recurrent units. In NIPS. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are rnns: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR. Nikita Kitaev, L. Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. ArXiv, abs/2001.04451. Jan Kocoń, Igor Cichecki, Oliwier Kaszyca, Mateusz Kochanek, Dominika Szydło, Joanna Baran, Julita Bielaniewicz, Marcin Gruza, Arkadiusz Janz, Kamil Kanclerz, Anna Kocoń, Bartłomiej Koptyra, Wiktoria Mieleszczenko-Kowszewicz, Piotr Miłkowski, Marcin Oleksy, Maciej Piasecki, Łukasz Radliński, Konrad Wojtasik, Stanisław Woźniak, and Przemysław Kazienko. 2023. Chatgpt: Jack of all trades, master of none. Jan Kocoń, Piotr Miłkowski, and Monika ZaśkoZielińska. 2019. Multi-level sentiment analysis of polemo 2.0: Extended corpus of multi-domain consumer reviews. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 980–991. Phong Le and Willem Zuidema. 2016. Quantifying the vanishing gradient and long distance dependency problem in recursive neural networks and recursive lstms. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 87–93. Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, and Yoav Artzi. 2018. Simple recurrent units for highly parallelizable recurrence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4470–4481, Brussels, Belgium. Association for Computational Linguistics. Hanxiao Liu, Zihang Dai, David R. So, and Quoc V. Le. 2021. Pay attention to mlps. Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, and Luke Zettlemoyer. 2021. Luna: Linear unified nested attention. Advances in Neural Information Processing Systems, 34:2441–2453. 
Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. 2023. Mega: Moving average equipped gated attention. In ICLR. Eric Martin and Chris Cundy. 2017. Parallelizing linear recurrent neural nets over sequence length. ArXiv, abs/1709.04057. Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 36. Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP. John Miller and Moritz Hardt. 2018. Stable recurrent models. arXiv: Learning. Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849. OpenAI. 2022. Introducing chatgpt. https://openai. com/blog/chatgpt. OpenAI. 2023. Gpt-4 technical report. Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. 2023. Resurrecting recurrent neural networks for long sequences. arXiv preprint arXiv:2303.06349. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1525–1534, Berlin, Germany. Association for Computational Linguistics. Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. Pytorch: An imperative style, high-performance deep learning library. Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. 2023. Hyena hierarchy: Towards larger convolutional language models. arXiv preprint arXiv:2302.10866. Ofir Press, Noah A. Smith, and Mike Lewis. 2022. Train short, test long: Attention with linear biases enables input length extrapolation. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. Ilan Price, Jordan Gifford-Moore, Jory Flemming, Saul Musker, Maayan Roichman, Guillaume Sylvain, Nithum Thain, Lucas Dixon, and Jeffrey Sorensen. 2020. Six attributes of unhealthy conversations. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 114–124, Online. 
Association for Computational Linguistics.

Markus N. Rabe and Charles Staats. 2022. Self-attention does not need O(n²) memory.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2018. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

Ramsha Siddiqui. 2019. SARCASMANIA: Sarcasm Exposed! http://www.kaggle.com/rmsharks4/sarcasmania-dataset. [Online; accessed 02-February-2023].

David R. So, Wojciech Manke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V. Le. 2021. Primer: Searching for efficient transformers for language modeling. CoRR, abs/2109.08668.

Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. 2020. Synthesizer: Rethinking self-attention in transformer models.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2022. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28.

Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. 2021. Mlp-mixer: An all-mlp architecture for vision. CoRR, abs/2105.01601.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models.

Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. ArXiv, abs/1609.03499.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

David Vilares and Carlos Gómez-Rodríguez. 2019. Head-qa: A healthcare dataset for complex reasoning. In ACL.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity.

Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4–24.

Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex machina: Personal attacks seen at scale. In Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 3-7, 2017, pages 1391–1399. ACM.
Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. 2022. Metaformer is actually what you need for vision.

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In ACL.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2020. Winogrande: An adversarial winograd schema challenge at scale. In ACL.

Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang Zhang, and Josh Susskind. 2021. An attention free transformer.

Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. Record: Bridging the gap between human and machine commonsense reading comprehension. In arXiv:1810.12885.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

A Author Contributions

Bo Peng Original RWKV idea, original code, performance optimizations, original experiments, and trained RWKV models from 0.1B to 14B.

Eric Alcaide Manuscript (initial draft sections 1, 2; sections 4, 7 and 8; revision and proofreading; final version). Figures (2, 3, 4, 7). Experiments section 6. Appendices D, I. Contributions to Appendix K.

Quentin Anthony Led writing the paper. Manuscript (initial draft sections 1, 2, 3; revision and proofreading; final version).

Zhenyuan Zhang Manuscript (revision and proofreading). Figure 3. Experiments Appendix G. Contributions to Appendices B and K.

Kranthi Kiran GV Manuscript (sections 2 and 5; contributions to section 3; revision and proofreading). Tables 3 and 4. Appendix C.

Xiangru Tang Manuscript (sections 2 and 3; contributions to abstract; revision and proofreading). Contributions to Appendix K.

Matteo Grella Manuscript (sections 4.5, 4.6, 8; contributions to sections 1, 7 and 9; proofreading and revision). Contributions to Appendix B.

Ferdinand Mom Manuscript (contributions to section 1, 2, 4.3, 4.6; proofreading and revision). Contributions to Appendix B.

Atsushi Saito Manuscript (sections 3 and 5; contributions to section 2). Figures 1a, 1b, 1c. Contributions to Appendix H.

Krishna Sri Ipsit Mantri Figure 4.

Rui-Jie Zhu Tables 1 and 5. Experiments for table 5.

Peng Zhou Contributions to Table 5.

Qihang Zhao Manuscript (proofreading and revision). Contributions to Table 5.

Xuzheng He Manuscript (contributions to section 3; proofreading and revision). Contributions to Figures 1, 7. Appendix G. Contributions to Appendix F.

Hayden Lau Manuscript (contributions to section 1; proofreading and revision). Contributions to Appendix K.

Michael Chung Manuscript (contributions to section 4.6; proofreading and revision).

Haowen Hou Figure 8. Appendix E.

Jiaming Kong Manuscript (revision and proofreading). Appendix F.

Johan S. Wind RWKV performance optimizations (CUDA). Contributions to Appendix C.

Jian Zhu Manuscript (section 2; proofreading and revision). Figures 3 and 5.

Huanqi Cao Manuscript (contributions to 4.2 and 4.3; proofreading and revision). Experiments for Appendix G.

Samuel Arcadinho Contributions to Figures 6, 10, and 11. Contributions to Appendix I.
Xin Cheng Manuscript (proofreading and revision). Contributions to Appendices K, H.

Alon Albalak Manuscript (abstract and sections 1, 9; proofreading and revision).

Jan Kocoń Manuscript (section 1; proofreading and revision). Contributions to Appendix J.

Przemysław Kazienko Manuscript (section 6; proofreading and revision). Contributions to Appendix J.

Stanisław Woźniak Appendix J.

Bartłomiej Koptyra Contributions to Appendix J.

Ruichong Zhang Manuscript (proofreading and revision). Contributions to Figure 5 and Appendix K.

B Time-Mixing Block as an RNN Cell

As stated in 4.3, the RWKV time-mixing block can be formulated as an RNN, as the WKV computation can be written in such a recursive form:

a_0, b_0 = 0,   (19)
wkv_t = (a_{t−1} + e^{u + k_t} v_t) / (b_{t−1} + e^{u + k_t}),   (20)
a_t = e^{−w} a_{t−1} + e^{k_t} v_t,   (21)
b_t = e^{−w} b_{t−1} + e^{k_t}.   (22)

The dataflow of the RNN-like time-mixing is shown in Fig. 7, where the hidden state h is the numerator-denominator tuple (a, b).

Figure 7: RWKV time-mixing block formulated as an RNN cell. Color codes: yellow (µ) denotes the token shift, red (1) denotes the denominator, blue (2) denotes the numerator, pink (3) denotes the fraction computations in 14. h denotes the numerator-denominator tuple (a, b).

To avoid overflow in calculating e^{k_t}, a numerical trick is used in the official implementation. Note that

a_1 = e^{−w} a_0 + e^{k_0} v_0 = e^{k_0} v_0,   (23)
b_1 = e^{−w} b_0 + e^{k_0} = e^{k_0},   (24)

and we set a'_1 = v_0, b'_1 = 1, p_0 = k_0, where p_{t−1} stores the shared exponents of a_t and b_t. Now the above recursion can be converted into a numerically safe version, for each time step t > 1:

q := max(p_{t−1}, u + k_t),   (25)
a*_t = e^{p_{t−1} − q} a'_{t−1} + e^{u + k_t − q} v_t,   (26)
b*_t = e^{p_{t−1} − q} b'_{t−1} + e^{u + k_t − q},   (27)
wkv_t = a*_t / b*_t.   (28)

The update to a'_t, b'_t and their shared exponent are also carried out in similar fashion:

q := max(p_{t−1} − w, k_t),   (29)
a'_t = e^{p_{t−1} − w − q} a'_{t−1} + e^{k_t − q} v_t,   (30)
b'_t = e^{p_{t−1} − w − q} b'_{t−1} + e^{k_t − q},   (31)
p_t = q.   (32)
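A hedged Python sketch of the exponent-shifted recursion above (Eqs. 25–32). The function and variable names follow the equations rather than the official CUDA kernel, and the random inputs are placeholders; the large key values merely illustrate that the shifted form stays finite.

```python
import torch

def stable_wkv(k, v, w, u):
    """Numerically safe WKV scan: carry (a', b') together with a shared exponent p."""
    T, d = k.shape
    wkv = torch.zeros(T, d)
    wkv[0] = v[0]                                   # wkv_1 = v_0 (a'_1 = v_0, b'_1 = 1)
    a, b, p = v[0].clone(), torch.ones(d), k[0].clone()   # p stores the shared exponent
    for t in range(1, T):
        # Output at step t, Eqs. (25)-(28)
        q = torch.maximum(p, u + k[t])
        num = torch.exp(p - q) * a + torch.exp(u + k[t] - q) * v[t]
        den = torch.exp(p - q) * b + torch.exp(u + k[t] - q)
        wkv[t] = num / den
        # State update with decay, Eqs. (29)-(32)
        q = torch.maximum(p - w, k[t])
        a = torch.exp(p - w - q) * a + torch.exp(k[t] - q) * v[t]
        b = torch.exp(p - w - q) * b + torch.exp(k[t] - q)
        p = q
    return wkv

T, d = 16, 8
out = stable_wkv(torch.randn(T, d) * 10, torch.randn(T, d), torch.rand(d), torch.zeros(d))
print(out.shape, torch.isfinite(out).all())
```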
C Parameter and FLOP Count for the RWKV Models

The following section provides an overview of the different RWKV model architectures along with their respective parameter and FLOP counts in Table 2.

Name    Layers   Model Dimension   Parameters      FLOPs per token
169 M   12       768               1.693 × 10^8    2.613 × 10^8
430 M   24       1024              4.304 × 10^8    7.573 × 10^8
1.5 B   24       2048              1.515 × 10^9    2.823 × 10^9
3 B     32       2560              2.985 × 10^9    5.710 × 10^9
7 B     32       4096              7.393 × 10^9    1.437 × 10^10
14 B    40       5120              1.415 × 10^10   2.778 × 10^10

Table 2: RWKV model architectures and associated FLOP counts.

The number of parameters for each model is computed using the formula #parameters = 2VD + 13D²L + D(11L + 4), where V = 50277 is the vocabulary size, D represents the Model Dimension and L corresponds to the number of layers.

FLOPs is for a forward pass for one token. It was calculated as 6(VD + 13D²L), which is twice (add and multiply) the number of parameters in linear layers. The backwards pass FLOPs can be approximated as twice that of the forward pass, so the total is 6(VD + 13D²L) per token for training (3x forward FLOPs). It is noteworthy that FLOPs are independent of the context length, unlike regular transformers. The FLOP approximations in this paper are in line with the methodology used by Kaplan et al. (2020).

Alternative approximations for FLOPs include doubling the parameters, which yields similar results within 2% for the 14B variant and a 30% discrepancy for the 169M variant. Another approximation is based on the number of non-embedding parameters multiplied by 2. This gives 2(VD + 13D²L + D(11L + 4)), resulting in 1.6% more FLOPs for the 14B model and 8% more FLOPs for the 169M model.

D Parameter initializations

We describe the specific parameter initializations below and motivate the design choices. Parameters belonging to residual blocks are often adjusted by layer depth and total number of layers. Let V denote the vocabulary size, s denote the embedding dimension, d denote the hidden size (we use d = 4s), L the number of layers, and l the layer index (from 0 to L−1). We use the following initializations:

• Embeddings are initialized to U(±1e-4), as explained in Section 4.7.
• For the channel-mixing blocks (16)–(18), µ_{k_i} and µ_{r_i} are initialized to (i/s)^{1 − l/L}.
• For the time-mixing blocks (11)–(13), the initializations are µ_{k_i} = (i/s)^{1 − l/L}, µ_{v_i} = (i/s)^{1 − l/L} + 0.3l/(L−1), and µ_{r_i} = 0.5(i/s)^{1 − l/L}.
• w_i (14), also known as "time decay", is initialized to −5 + 8 · (i/(d−1))^{0.7 + 1.3l/(L−1)}. Intuitively, it is the discount factor applied to previous tokens over time.
• u_i (14), also known as "bonus", is set to 0.5(((i + 1) mod 3) − 1) + log 0.3. It is the special weighting applied to the current token in equation 14. The alternating zigzag pattern initially creates subtle variations in the tensor elements, which are intended to help the model treat different dimensions of the embedding distinctively.
• W_o (15) (time-mixing) and W_v (channel-mixing) are initialized to N(0, √(d/s) = 2).
• All W_r, W_k, W_v weights are initialized to 0 so the model can start learning from the beginning without noisy signals.
• All LayerNorm weights start from 1 and biases from 0.
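The formulas above can be written out directly. The helper below is our own illustrative sketch, not the released initialization code: in particular it builds every vector over the embedding dimension s and treats the "d" in the time-decay formula as that same dimension, which is an assumption on our part.

```python
import torch

def rwkv_layer_init(s, L, l):
    """Illustrative per-channel init values for layer l (0-indexed) of L, following Appendix D."""
    i = torch.arange(s, dtype=torch.float32)
    ratio = (i / s) ** (1 - l / L)                            # shared (i/s)^{1 - l/L} factor
    mu_k = ratio                                              # keys (time- and channel-mixing)
    mu_v = ratio + 0.3 * l / (L - 1)                          # time-mixing values
    mu_r = 0.5 * ratio                                        # receptance
    w = -5 + 8 * (i / (s - 1)) ** (0.7 + 1.3 * l / (L - 1))   # "time decay" (d taken as s here)
    u = 0.5 * (((i + 1) % 3) - 1) + torch.log(torch.tensor(0.3))  # "bonus" zigzag pattern
    return mu_k, mu_v, mu_r, w, u

print([vec[:4] for vec in rwkv_layer_init(s=768, L=12, l=0)])
```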
E Small Init Embedding

This section presents experimental validation of the small initialization embedding. The experimental setup is as follows. In the baseline configuration, the parameters are initialized using a normal distribution with a mean of 0.0 and a standard deviation of 0.02, a commonly used initialization in models like BERT and GPT. In the small-initialization-embedding (small init emb) experiment, the parameters are initialized using a uniform distribution with a range of 1e-4, which is slightly different from RWKV, where a normal distribution with a standard deviation of 1e-4 is used. However, this difference is negligible and does not affect our conclusions. The experiments were conducted with a batch size of 400. As depicted in Figure 8, the loss curve for the small init emb exhibits a faster rate of decrease and convergence compared to the traditional initialization using a normal distribution.

Figure 8: Effect of small initialization embedding (training loss vs. step for the baseline and small-init-emb runs).

F Gradient Stability in RWKV

In this section, we present a mathematical description of the gradient stability property in RWKV, focusing specifically on the time-mixing block. By gradient stability we mean that if the inputs $x_t$ are bounded and the model parameters are fixed, then the gradients with respect to $W_k$ and $W_v$ are uniformly bounded for all $T$ (thus not exploding). Consequently, we can control the amount each $x_t$ contributes to the gradient at $T$ in a naturally decaying fashion via the weight-decay mechanism $w$ (thus not vanishing unless desired).

First, we make the simplification that there are no token shifts; this does not affect the final conclusion. In this scenario, $wkv_T$ can be written as

$wkv_T = \dfrac{\sum_{t=1}^{T} K^e_t v_t}{\sum_{t=1}^{T} K^e_t} = E(v_t) = \dfrac{S(v_t)}{S(1)},$  (33)

where $v_t = W_v x_t$, $K^e_t = e^{W_k x_t + w_{T,t}}$,

$\dfrac{\partial (v_t)_i}{\partial (W_v)_{i,j}} = (x_t)_j, \qquad \dfrac{\partial (K^e_t)_i}{\partial (W_k)_{i,j}} = (x_t)_j (K^e_t)_i,$

and $S(\cdot)$ and $E(\cdot)$ are shorthand for sums and weighted averages over the weights $K^e_t$.

The loss function at position $T$ can be written as

$L_T = l(f(wkv_T), y_T).$  (34)

Because $wkv_T$ relates to $(W_k)_{i,j}$ and $(W_v)_{i,j}$ only through the $i$-th channel $(wkv_T)_i$, we have

$\dfrac{\partial L_T}{\partial (W_v)_{i,j}} = \dfrac{\partial L_T}{\partial (wkv_T)_i}\,\dfrac{\partial (wkv_T)_i}{\partial (W_v)_{i,j}}.$  (35)

The first factor of the above equation involves only trivial operations such as output layers and other time-mixing layers, whose boundedness can be shown inductively. The second factor can be bounded as

$\left|\dfrac{\partial (wkv_T)_i}{\partial (W_v)_{i,j}}\right| = \left|\dfrac{\partial E_i[(v_t)_i]}{\partial (W_v)_{i,j}}\right| = \big|E_i[(x_t)_j]\big| \le \max_t \big|(x_t)_j\big|,$  (36)

which is independent of $T$. Similarly,

$\dfrac{\partial (wkv_T)_i}{\partial (W_k)_{i,j}} = \dfrac{\partial}{\partial (W_k)_{i,j}}\,\dfrac{S_i[(v_t)_i]}{S_i(1)} = \dfrac{S_i[(x_t)_j (v_t)_i]}{S_i(1)} - \dfrac{S_i[(x_t)_j]\,S_i[(v_t)_i]}{S_i(1)^2} = E_i[(x_t)_j (v_t)_i] - E_i[(x_t)_j]\,E_i[(v_t)_i] = \operatorname{cov}_i\big((x_t)_j, (v_t)_i\big)$  (37)

can also be bounded. Note that wkv's softmax operation contains at least two non-zero terms ($u$ and $w$), so the above "covariance" will not degenerate into 0.
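The boundedness claim is easy to probe numerically. The sketch below (our own check, not from the paper's code) builds $wkv_T$ for bounded random inputs under the simplification above, dropping the bonus $u$ for brevity and assuming decay weights of the form $w_{T,t} = -(T-1-t)\,w$; it then prints the largest gradient entry of $W_k$ for growing $T$. The values should stay of comparable magnitude rather than grow with $T$.

    import torch

    def max_wk_grad(T, d=8, seed=0):
        # Illustrative check of Appendix F: the gradient of wkv_T w.r.t. W_k
        # stays bounded as T grows when the inputs x_t are bounded.
        torch.manual_seed(seed)
        x = torch.rand(T, d)                          # bounded inputs in [0, 1]
        Wk = torch.randn(d, d, requires_grad=True)
        Wv = torch.randn(d, d, requires_grad=True)
        w = torch.rand(d)                             # positive per-channel decay (assumption)
        decay = -torch.arange(T - 1, -1, -1, dtype=torch.float32).unsqueeze(1) * w
        k = x @ Wk.T + decay                          # W_k x_t + w_{T,t}
        v = x @ Wv.T                                  # W_v x_t
        wkv_T = (torch.softmax(k, dim=0) * v).sum(0)  # E(v_t) weighted by e^{k}, as in eq. (33)
        wkv_T.sum().backward()
        return Wk.grad.abs().max().item()

    for T in (16, 256, 4096):
        print(T, max_wk_grad(T))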
G Model Behavior Visualization

In Figure 9, we present visualizations of some behaviors of the RWKV model.

Figure 9: Model behavior visualizations of the RWKV model. Top: time decay (sorted along the channel axis) for layers 1-12 of RWKV-169M. Bottom: information propagation path in RWKV-430M for the prompt "The Eiffel Tower is located in the city of", measured as the log-probability of "Paris".

The top plot illustrates the time decays ($e^{-w}$) in each layer of the RWKV-169M model, sorted along the channel axis. Notably, several decays in the last layers are very close or equal to one, implying that certain information is preserved and propagated throughout the model's temporal context. Meanwhile, many decays in the initial layer are close to zero, which corresponds to local operations in wkv (14), likely associated with tasks such as text parsing or lexical analysis. (Note that the local operations in wkv arise from the extra parameter $u$ when $e^{-w}$ degenerates to 0.) These patterns of time decays are partly learned, but also come from the parameter initialization, as it speeds up training.

The bottom plot shows the information retrieval and propagation path in the RWKV-430M model. The experiment follows the causal trace method introduced by Meng et al. (2022), where we

1. Run the model once, and record all states and activations of each layer during the computation;

2. Corrupt the input embeddings of the subject using noise ("The Eiffel Tower" in this example);

3. Restore the states and activations of a certain layer at a certain token during the computation, and record the log-probability of the model outputting the correct answer ("Paris").

Unlike transformers, RWKV relies on recursive propagation of information in the time dimension. In this case, the fact that "the Eiffel Tower is located in Paris" is retrieved in layer 4. It is then passed down to the subsequent layers. Around layer 20, the information is mostly propagated through time until it reaches the token where it is needed. Finally, it is passed down to the last layer for outputting the answer.

H Evaluation Details

The results for the following tasks are in Tables 3 and 4.

Tasks:

• LAMBADA (Paperno et al., 2016). A benchmark dataset that evaluates the model's contextual reasoning and language comprehension abilities by presenting context-target pairs, where the objective is to predict the most probable target token.

• PIQA (Bisk et al., 2020). A benchmark for physical common sense reasoning, which consists of a binary-choice task that can be understood as a set of (Goal, Solution) pairs.

• HellaSwag (Zellers et al., 2019). A benchmark for commonsense Natural Language Inference (NLI) built by adversarial filtering against transformer models.

• Winogrande (Zellers et al., 2020). A dataset designed to evaluate the acquisition of common sense reasoning by neural language models, aiming to determine whether we are accurately assessing the true capabilities of machine common sense.

• StoryCloze (Mostafazadeh et al., 2016). A benchmark presenting a novel approach to assess comprehension of narratives, narrative generation, and script acquisition, focusing on commonsense reasoning.

• ARC Challenge (Clark et al., 2018). A dataset for multiple-choice question answering, encompassing science exam questions ranging from third grade to ninth grade.

• ARC Easy. An easy subset of ARC.

• HeadQA (Vilares and Gómez-Rodríguez, 2019). A benchmark consisting of graduate-level questions encompassing various fields such as medicine, nursing, biology, chemistry, psychology, and pharmacology.

• OpenBookQA (Mihaylov et al., 2018). A QA dataset that evaluates comprehension of a subject by incorporating open-book facts, scientific knowledge, and perceptual common sense, drawing inspiration from open-book exams.

• SciQ (Welbl et al., 2017). A multiple-choice QA dataset created using an innovative approach to gather well-crafted multiple-choice questions focused on a specific domain.

• TriviaQA (Joshi et al., 2017). A QA-IR dataset consisting of question-answer-evidence triples, with independently collected evidence documents (six per question on average) providing reliable sources.

• ReCoRD (Zhang et al., 2018). A benchmark for evaluating commonsense reasoning in reading comprehension, generating queries from CNN/Daily Mail news articles and requiring text-span answers from the corresponding summarizing passages.

• COPA (Roemmele et al., 2018). A dataset to evaluate achievement in open-domain commonsense causal reasoning.

• MMLU (Hendrycks et al., 2021). A multitask dataset of 57 tasks covering elementary mathematics, US history, computer science, law, and more.

I Inference results

Figures 10 and 11 illustrate the memory (RAM, VRAM) and time (s) requirements for LLM inference in float32 precision. We benchmark the following model families and sizes:

• RWKV: 169m, 430m, 1.4b, 3b, 7b, 14b
• Bloom (Scao et al., 2022): 560m, 1b, 3b
• OPT (Zhang et al., 2022): 125m, 350m, 1.3b, 2.7b, 6.7b, 13b
• GPT-Neo (Black et al., 2022): 125m, 1.3b, 2.7b
• Pythia (Biderman et al., 2023): 160m, 410m, 1.4b, 2.8b, 6.7b, 12b

Missing models are due to out-of-memory (OOM) errors. A comparison at 512 tokens is shown in Figure 11, as some large transformer models produced an OOM error when inferencing longer sequences. For GPU experiments, we use an NVIDIA A100 with 80 GB of VRAM. For CPU experiments, we use an AMD EPYC processor with 30 CPU cores and 200 GiB of RAM.

Figure 10: Text generation inference memory (CPU RAM, GPU VRAM) for LLMs. Model parameters are not accounted for.

Figure 11: Text generation inference time for LLMs.
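The paper does not spell out the measurement harness; an illustrative way to collect comparable numbers (our own sketch, using the Hugging Face transformers API, with the model identifier left as a placeholder) is to time greedy generation and read the peak GPU allocation:

    import time
    import torch
    from transformers import AutoModelForCausalLM

    def benchmark(model_name, prompt_len=512, gen_len=100, device="cuda"):
        # Rough illustration only; not the script used for Figures 10 and 11.
        model = AutoModelForCausalLM.from_pretrained(
            model_name, torch_dtype=torch.float32).to(device).eval()
        ids = torch.randint(0, model.config.vocab_size, (1, prompt_len), device=device)
        torch.cuda.reset_peak_memory_stats(device)
        start = time.time()
        with torch.no_grad():
            model.generate(ids, max_new_tokens=gen_len, do_sample=False)
        seconds = time.time() - start
        peak_gib = torch.cuda.max_memory_allocated(device) / 2**30
        return seconds, peak_gib

    # Example call; the checkpoint name is a placeholder to be replaced as appropriate:
    # print(benchmark("RWKV/rwkv-4-169m-pile"))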
| Model | Params (B) | PIQA acc | StoryCloze acc | HellaSwag acc_norm | WinoGrande acc | ARC-e acc | ARC-c acc_norm | OBQA acc_norm |
|---|---|---|---|---|---|---|---|---|
| RWKV-4 | 0.17 | 65.07 | 58.79 | 32.26 | 50.83 | 47.47 | 24.15 | 29.60 |
| Pythia | 0.16 | 62.68 | 58.47 | 31.63 | 52.01 | 45.12 | 23.81 | 29.20 |
| GPT-Neo | 0.16 | 63.06 | 58.26 | 30.42 | 50.43 | 43.73 | 23.12 | 26.20 |
| RWKV-4 | 0.43 | 67.52 | 63.87 | 40.90 | 51.14 | 52.86 | 25.17 | 32.40 |
| Pythia | 0.40 | 66.70 | 62.64 | 39.10 | 53.35 | 50.38 | 25.77 | 30.00 |
| GPT-Neo | 0.40 | 65.07 | 61.04 | 37.64 | 51.14 | 48.91 | 25.34 | 30.60 |
| RWKV-4 | 1.5 | 72.36 | 68.73 | 52.48 | 54.62 | 60.48 | 29.44 | 34.00 |
| Pythia | 1.4 | 71.11 | 67.66 | 50.82 | 56.51 | 57.74 | 28.58 | 30.80 |
| GPT-Neo | 1.4 | 71.16 | 67.72 | 48.94 | 54.93 | 56.19 | 25.85 | 33.60 |
| RWKV-4 | 3.0 | 74.16 | 70.71 | 59.89 | 59.59 | 65.19 | 33.11 | 37.00 |
| Pythia | 2.8 | 73.83 | 70.71 | 59.46 | 61.25 | 62.84 | 32.25 | 35.20 |
| GPT-Neo | 2.8 | 72.14 | 69.54 | 55.82 | 57.62 | 61.07 | 30.20 | 33.20 |
| RWKV-4 | 7.4 | 76.06 | 73.44 | 65.51 | 61.01 | 67.80 | 37.46 | 40.20 |
| Pythia | 6.9 | 74.54 | 72.96 | 63.92 | 61.01 | 66.79 | 35.07 | 38.00 |
| GPT-J | 6.1 | 75.41 | 74.02 | 66.25 | 64.09 | 66.92 | 36.60 | 38.20 |
| RWKV-4 | 14.2 | 77.48 | 76.06 | 70.65 | 63.85 | 70.24 | 38.99 | 41.80 |
| GPT-level∗ | 14.2 | 76.49 | 74.97 | 68.72 | 65.14 | 70.77 | 37.99 | 39.27 |
| Pythia (c.f.) | 11.8 | 75.90 | 74.40 | 67.38 | 64.72 | 69.82 | 36.77 | 38.80 |
| GPT-NeoX (c.f.) | 20.6 | 77.69 | 76.11 | 71.42 | 65.98 | 72.69 | 40.44 | 40.20 |

Table 3: Zero-shot performance of the models on common sense reasoning tasks. ∗ Interpolation of Pythia and GPT-Neo models.

| Model | Params (B) | LAMBADA ppl | LAMBADA acc | headQA acc_norm | sciq acc | triviaQA acc | ReCoRD em | COPA acc |
|---|---|---|---|---|---|---|---|---|
| RWKV-4 | 0.17 | 29.33 | 32.99 | 25.78 | 77.50 | 1.26 | 62.03 | 66.00 |
| Pythia | 0.16 | 24.38 | 38.97 | 25.82 | 76.50 | 1.31 | 66.32 | 62.00 |
| GPT-Neo | 0.16 | 30.27 | 37.36 | 25.16 | 76.60 | 1.18 | 64.92 | 64.00 |
| RWKV-4 | 0.43 | 13.04 | 45.16 | 27.32 | 80.30 | 2.35 | 70.48 | 65.00 |
| Pythia | 0.40 | 11.58 | 50.44 | 25.09 | 81.50 | 2.03 | 75.05 | 67.00 |
| GPT-Neo | 0.40 | 13.88 | 47.29 | 26.00 | 81.10 | 1.38 | 73.79 | 65.00 |
| RWKV-4 | 1.5 | 7.04 | 56.43 | 27.64 | 85.00 | 5.65 | 76.97 | 77.00 |
| Pythia | 1.4 | 6.58 | 60.43 | 27.02 | 85.50 | 5.52 | 81.43 | 73.00 |
| GPT-Neo | 1.4 | 7.5 | 57.25 | 27.86 | 86.00 | 5.24 | 80.62 | 69.00 |
| RWKV-4 | 3.0 | 5.25 | 63.96 | 28.45 | 86.50 | 11.68 | 80.87 | 82.00 |
| Pythia | 2.8 | 4.93 | 65.36 | 28.96 | 87.70 | 9.63 | 85.10 | 77.00 |
| GPT-Neo | 2.8 | 5.63 | 62.22 | 27.17 | 89.30 | 4.82 | 83.80 | 80.00 |
| RWKV-4 | 7.4 | 4.38 | 67.18 | 31.22 | 88.80 | 18.30 | 83.68 | 85.00 |
| Pythia | 6.9 | 4.3 | 67.98 | 28.59 | 90.00 | 15.42 | 86.44 | 85.00 |
| GPT-J | 6.1 | 4.1 | 68.31 | 28.67 | 91.50 | 16.74 | 87.71 | 83.00 |
| RWKV-4 | 14.2 | 3.86 | 70.83 | 32.64 | 90.40 | 24.58 | 85.67 | 85.00 |
| GPT-level∗ | 14.2 | 3.81 | 70.94 | 31.03 | 92.20 | 22.37 | 87.89 | 82.66 |
| Pythia (c.f.) | 11.8 | 3.89 | 70.44 | 30.74 | 91.80 | 20.57 | 87.58 | 82.00 |
| GPT-NeoX (c.f.) | 20.6 | 3.64 | 71.94 | 31.62 | 93.00 | 25.99 | 88.52 | 84.00 |

Table 4: Zero-shot performance of various models on different tasks. ∗ Interpolation of Pythia and GPT-Neo models.

| Method | L | d | T | Train bpc | Test bpc | Time Complexity | Space Complexity |
|---|---|---|---|---|---|---|---|
| Transformer | 12 | 512 | 1024 | 0.977 | 1.137 | O(T²d) | O(T² + Td) |
| Transformer | 24 | 256 | 1024 | 1.039 | 1.130 | O(T²d) | O(T² + Td) |
| Reformer | 12 | 512 | 1024 | 1.040 | 1.195 | O(T log T · d) | O(T log T + Td) |
| Synthesizer | 12 | 512 | 1024 | 0.994 | 1.298 | O(T²d) | O(T² + Td) |
| Linear Transformer | 12 | 512 | 1024 | 0.981 | 1.207 | O(Td²) | O(Td + d²) |
| Performer | 12 | 512 | 1024 | 1.002 | 1.199 | O(Td² log d) | O(Td log d + d² log d) |
| AFT-simple | 12 | 512 | 1024 | 0.854 | 1.180 | O(Td) | O(Td) |
| RWKV-RNN | 6 | 512 | 1024 | 0.720 | – | O(Td) | O(d) |

Table 5: Enwik8 results, measured in bits per character (bpc): the lower the better. Baseline comparisons are made with Reformer (Kitaev et al., 2020), Synthesizer (Tay et al., 2020) (the best performing dense version), Linear Transformer (Katharopoulos et al., 2020), and Performer (Choromanski et al., 2020). L, d, and T denote the number of blocks (network depth), dimension of features, and sequence length, respectively. Both Linear Transformer and Performer are implemented with customized CUDA kernels (github.com/idiap/fast-transformers), and all other models are implemented in native PyTorch.
| Task Name | Measure type | ChatGPT [%] | GPT-4 [%] | RWKV-4 (GPT prompt) [%] | RWKV-4 (changed prompt) [%] | SOTA [%] |
|---|---|---|---|---|---|---|
| RTE | F1 Macro | 88.1 | 91.3 | 44.2 | 74.8 | 92.1 |
| WNLI | Accuracy | 81.7 | 91.6 | 47.9 | 49.3 | 97.9 |
| GoEmotions | F1 Macro | 25.6 | 23.1 | 7.9 | 7.9 | 52.8 |
| PolEmo2 | F1 Macro | 44.1 | 41.0 | 38.2 | 40.9 | 76.4 |

Table 6: ChatGPT, GPT-4 and RWKV-4-Raven-14B reasoning performance comparison on the RTE (Wang et al., 2019), WNLI (Wang et al., 2018), GoEmotions (Demszky et al., 2020), and PolEmo2 (Kocoń et al., 2019) benchmarks. SOTA is provided as a supplementary reference.

J Importance of prompt construction and comparison to GPT models

Inspired by Kocoń et al. (2023), we compared the zero-shot performance of RWKV-4-Raven-14B with ChatGPT (accessed in February 2023) and GPT-4 on several well-known NLP tasks: recognizing textual entailment (RTE), Winograd Natural Language Inference (WNLI), and recognizing emotions elicited in readers (GoEmotions and PolEmo2). Each model received the same prompts, manually chosen to elicit proper responses from the ChatGPT model. As shown in Tab. 6, RWKV performs significantly worse than ChatGPT and GPT-4 on these tasks. We suspect that this disparity is largely caused by the choice of prompts used to generate the answers: the prompts are in natural language and do not take into account that RWKV is an RNN, which cannot look back inside an instruction. When the instruction style was adapted to respect that RNNs are not capable of retrospective processing, quality on some datasets increased significantly (e.g., for RTE (Wang et al., 2019), F1 Macro increased from 44.2% to 74.8%).

We hypothesize that RWKV models are more sensitive to the position of the components in the context, as RNN-based architectures cannot look back and readjust the weight of previous information. For better performance, the desired information should come after the question.

Example of the ChatGPT prompt for RTE:

    Having premise judge if the following hypothesis are logically connected with the premise? Answer "entailment" if yes, or "not_entailment" if no.

RWKV prompt taking into account the characteristics of the RNN:

    Can you tell me if the hypothesis is entailment or is not entailment to the premise?
    premise:
    hypothesis:

| Task Name | Measure type | ChatGPT [%] | RWKV-4 adapted [%] | SOTA [%] |
|---|---|---|---|---|
| Aggression | F1 Macro | 69.10 | 56.66 | 74.45 |
| MathQA | Accuracy | 71.40 | 80.69 | 83.20 |
| Sarcasm | F1 Macro | 49.88 | 50.96 | 53.57 |
| TweetSent | F1 Macro | 63.32 | 52.50 | 72.07 |
| Unhealthy | F1 Macro | 45.21 | 43.30 | 50.96 |

Table 7: ChatGPT and RWKV-4-Raven-14B performance comparison on the Aggression (Wulczyn et al., 2017), Sarcasm (Siddiqui, 2019), Unhealthy (Price et al., 2020), MathQA (Cobbe et al., 2021), and TweetSent (Barbieri et al., 2020) benchmarks. SOTA is provided as a supplementary reference.

While separating the instruction from the input is relatively easy to do, other aspects of prompt engineering are harder to quantify.
Testing the approach of stating the input after the question on multiple other tasks, shown in Tab. 7, suggests that better prompts might reduce the disparity between the models. Raven achieves results comparable to ChatGPT on unhealthy conversation detection and even surpasses it on the sarcasm detection dataset. While this approach to prompting looks necessary, it alone is not enough to replace the capability of having free access to the whole context. Therefore, prompt engineering seems to be of significantly more importance to RNN models than to standard transformers. It is entirely possible that good prompts for RNN models do not mean additional restrictions, but should simply be constructed using completely different guidelines. While the authors of the aforementioned paper (Kocoń et al., 2023) use chain-of-thought prompting to improve results on the MathQA dataset, trying the same approach on the Raven model achieved a very low accuracy of 48.11%. Despite that, using a very similar prompt without the chain-of-thought outperformed ChatGPT on this dataset. It would not be surprising if, thanks to the hidden state, RNN models do not need additional steps during inference, and adding those only makes the model forget the information it has analyzed and deemed important. This further emphasizes the importance of the order of the information the model receives.

The template used to prompt the Raven model on MathQA with chain-of-thought prompting:

    Write the reasoning and highlight the answer to the question at the end in format: 'Answer: '. The question is:

The template used to prompt the Raven model on MathQA without chain-of-thought prompting:

    Write the answer to the math question in the format: 'Answer: '. The question is:

K Cases

In this part, we present a few instances of outputs produced by the RWKV model using a chat interface [5, 6].

[5] https://github.com/BlinkDL/ChatRWKV
[6] https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio

RWKV-4-Raven-14B

Tell me about ravens.

Ravens are large, black birds with a distinctive white head and neck. They are found in most parts of the world, including North America, Europe, Asia, and Australia. Ravens are known for their intelligence and problem-solving abilities. They are also considered to be symbols of death and transformation in many cultures. Ravens are often associated with the afterlife or death because they have been known to eat carrion or even other birds. In some cultures, ravens are also believed to be messengers of the gods or guardians of treasure.

RWKV-4-Raven-14B

Explain the concept of machine learning to a 10-year-old.

Machine learning is like a super-smart computer that can learn on its own. It can recognize patterns and make decisions based on what it has learned. Just like how you might remember something from the past, the computer can remember what it's learned and use it to make predictions or decisions in the future. It's like having a smart friend who always knows what to do!

RWKV-4-Raven-14B

Write a short poem describing nature's beauty in springtime.

Nature's beauty in springtime is rare,
A symphony of colors, a gentle surprise.
The flowers bloom, the birds sing and sing,
As the sun shines down, warming our skin.
The trees sway in the gentle breeze,
As we take in their beauty with ease.
The sky is blue, the clouds are white,
A sight to behold, it's truly divine.
RWKV-4-Raven-14B

Below is the numerical stable formula for updating the state variable in RWKV:

$a^*_t = e^{p-q} a_{t-1} + e^{u+k_t-q} v_t$
$b^*_t = e^{p-q} b_{t-1} + e^{u+k_t-q}$
$wkv_t = \dfrac{a^*_t}{b^*_t}$

Please write out the corresponding pytorch code to do so, where both at and bt are torch.Tensor([n])

Here is the PyTorch code to update the state variable in RWKV using the given formula:

    import torch

    def update_state(a, b, u, v, p, q):
        a_t = torch.exp(p - q) * a + torch.exp(u + k - q) * v
        b_t = torch.exp(p - q) * b + torch.exp(u + k - q)
        return a_t, b_t

    a, b = torch.randn(2, 3)
    u = torch.randn(2)
    v = torch.randn(2)
    wkv = torch.randn(2)
    a_t, b_t = update_state(a, b, u, v, wkv, torch.randn(2))
    print("a_t:", a_t)
    print("b_t:", b_t)

This code defines a function update_state that takes in the state variables a and b, the control variables u and v, and the parameters wkv and torch.randn. It then uses the PyTorch torch.exp function to compute the exponential of the control variables and the log of the normalization factor. Finally, it uses these values to update the state.

RWKV-4-Raven-14B

Explain the following metaphor: Life is like cats.

The metaphor "Life is like cats" suggests that life can be unpredictable and full of unexpected twists and turns, just like a cat's behavior. It implies that we cannot fully control or predict the outcome of our lives, just as we cannot control the movements of a cat. The metaphor also implies that life is complex and multifaceted, with many different variables and factors at play. Just as a cat can have multiple personalities or moods, life can have many twists and turns that are difficult to predict or understand. Overall, the metaphor suggests that life is complex and unpredictable, much like a cat's behavior.

RWKV-4-Raven-14B

Test the convergence of $\sum_{i=0}^{\infty} 2^{-i}$.

The given series is a geometric series with the first term $a_0 = 2^0 = 1$ and common ratio $r = \frac{1}{2}$. We can find the sum of this series using the formula for a geometric series:

$\sum_{i=0}^{\infty} a_i = \dfrac{a_0}{1-r} = \dfrac{1}{1-\frac{1}{2}} = 2$

So, the sum of the given series is 2.

RWKV-4-Raven-14B

I would like to cook some Chinese food at home. Do you have any suggestions on the meal I could choose?

Yes, of course! If you are not very experienced in cooking Chinese food, you could start with making noodles or rice, then make a dish using one of those. I also think making stir-fry with some meat is an excellent idea.

I want to make something more special. What about red braised pork belly?

Yes, red braised pork belly is also very popular and tasty.

Tell me the steps to cook.

You will need a few ingredients such as pork belly, star anise, sugar, ginger, scallions and wine. Put all of these in a pot, cook on medium heat and simmer until the meat is tender. After that, remove the meat from the pot and reserve the sauce. Slice the meat and add some ginger slices into it, then boil until it's almost done.

What do I need to be careful with the most during the cooking process?

To cook pork belly properly, it is very important to monitor the heat of the stove and be patient with the cooking process. Overcooking it can cause the pork to be tough, dry and difficult to chew. You can add a few ingredients like soy sauce or black beans to give the pork more flavor.

Tell me more about the proper heat and timing.
If the temperature of the water is too high, it can damage the collagen and connective tissues, resulting in tough pork belly. It is essential to control the temperature carefully, especially when braising the pork. Additionally, keep the water simmering but not boiling to maintain a low simmer, as high heat will dry out the pork belly and result in tough and unpalatable meat. Lastly, monitor the heat during cooking and make sure the heat is medium, as this is what is most appropriate for slow cooking pork belly.

I see. Thank you for your advice.

You're welcome! It was a pleasure chatting with you. Let me know if you need any further assistance in the future.

RWKV-4-Raven-14B

If all apples are fruits and all fruits are sweet, are all apples sweet? Please explain your answer in detail.

Yes, all apples are fruits, and all fruits are sweet. Therefore, it is safe to assume that all apples are sweet.

If birds can fly. Penguins are birds. Can penguins fly? Please explain your answer in detail.

No, penguins cannot fly. Penguins are birds, but they are not capable of flight like other birds. They have wings and can glide for short distances, but they cannot take off or fly long distances like other birds.

The day before two days after the day before tomorrow is Saturday. What day is it today? Please explain your answer in detail.

Today is Friday. Two days after tomorrow is Sunday. Tomorrow is Saturday.

If five cats can catch five mice in five minutes, how long will it take one cat to catch one mouse? Please explain your answer in detail.

If five cats can catch five mice in five minutes, it means that each cat can catch one mouse in one minute. To find out how long it will take one cat to catch one mouse, we can use the concept of proportionality. Let's assume that it takes x minutes for a cat to catch a mouse. Then, the number of mice caught by a cat in one minute is equal to the number of cats multiplied by the number of mice caught by each cat: x = 5 * 1 = 5. So, it will take 5 minutes for one cat to catch one mouse.

RWKV-4-Raven-14B

Summarize the following paragraph.

The paper presents a novel architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. The model is able to scale to tens of billions of parameters and exhibits linear computational complexity during training and inference, making it a promising alternative to Transformers for sequence processing tasks.

Please further summarize it in one sentence.

The researchers have developed a new architecture that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs, which allows for linear scaling in memory and computational requirements.