[Paper] RWKV: Reinventing RNNs for the Transformer Era

Interesting paper out of EleutherAI demonstrating a novel architecture with near-linear-time token generation.



Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.

Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training and maintaining constant computational and memory complexity during inference. We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers, suggesting future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence processing tasks.
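To make the scaling claim concrete, here is a back-of-the-envelope sketch (my own illustration, not from the paper; the head and hidden-size numbers are made-up GPT-2-ish values) comparing the memory footprint of a Transformer's T×T attention score matrix with a fixed-size recurrent state:

```python
# Toy illustration (not from the paper): attention score memory grows
# quadratically with sequence length T, while an RNN-style recurrent
# state stays constant no matter how long the sequence is.

def attention_score_memory(seq_len: int, n_heads: int = 12) -> int:
    """Elements in the T x T attention score matrices, per layer."""
    return n_heads * seq_len * seq_len

def rnn_state_memory(hidden_size: int = 768) -> int:
    """Elements in a fixed-size recurrent state, per layer (independent of T)."""
    return hidden_size

for T in (1_024, 4_096, 16_384):
    attn = attention_score_memory(T)
    rnn = rnn_state_memory()
    print(f"T={T:6d}  attention scores: {attn:>14,d} elems  rnn state: {rnn:,d} elems")
```

Quadrupling the context length multiplies the attention-score memory by sixteen, while the recurrent state does not grow at all.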


I came across this preprint a few months ago.
I find it interesting as well, but it’s not clear to me how it is superior in practical terms (not just theoretically) to architectures like GPT, as there are few implementations.

Here you can find a GitHub repository and a demonstration of the RWKV model hosted on Hugging Face.

I’m not that knowledgeable in this field, but I hope this can be of some reference to you.


For me, the most interesting parts are:

  • Table 1
  • Figure 7

which show hugely reduced demands for inference.

There are still open questions, of course, but if this develops into something genuinely equivalent to Transformers, well… that will be, as they say, huge if true.

The net result should be models which have far lower VRAM requirements and are blazingly fast.

This would have two immediate consequences:

  1. It will be much easier to self-host larger, more powerful models.
  2. Even larger, even more powerful models will cost much less to run at scale, causing API prices to plunge.

Imagine GPT-4 API calls at 1/4 the cost of GPT-3.5-Turbo…


I totally agree.

The characteristic that computational complexity increases only linearly with input length is very interesting.

Recently, lightweight language models that can be hosted on PCs have started to appear.

There are also language models that work minimally well on mobile devices (although these seem to use the Transformer architecture).

I am also intrigued by how RWKV will be utilized in practice, although it is unclear whether RWKV and GPT target entirely the same applications, especially given the ongoing progress in making the GPT architecture lighter.

The traditional belief that large language models must consume large amounts of computational resources may be overturned by models like this.

We’ll have to keep an eye on how future technology develops!


Just wanted to mention, I came across the official RWKV site.

It provides a lot of detailed info on their technology, which could help clarify some of the points we’ve been discussing.

If you haven’t seen it yet, it may be worth a look.


What's weird is that, months ago, I was seeing RWKV described as essentially deprecated based on reading other papers, like Hyena.

I’ve seen RWKV outshined in various contexts.

The proof is not always raw performance, but also what other things RWKV brings to the table.

So much research blasting out of the firehose these days :rofl:


Great find!

Thank you for coming back and sharing this.

One thing I learned today, following the link to their Twitter from the project homepage you shared, was this idea of an “uncheatable eval,” which I find really interesting.

The basic idea is to just take a sample of 1,000 recent papers from arXiv, grab the first few thousand characters, tokenize them, and compute the log probs for the sequence.

In general, a model might be considered “good” if it considers the sequence highly probable.

The claim is that Model A is “better” than Model B if the total log probabilities from Model A are closer to zero than the total log probabilities from Model B.
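In code, that comparison is just summing log-probabilities and seeing which total is closer to zero (a toy sketch with made-up per-token probabilities; a real eval would take these from each model's softmax over the vocabulary):

```python
import math

def total_log_prob(token_probs):
    """Sum of the log-probabilities a model assigned to each ground-truth token.

    token_probs: per-token probabilities from some model (toy values below).
    Each p is in (0, 1], so each log term is <= 0 and the total is <= 0.
    """
    return sum(math.log(p) for p in token_probs)

# Made-up numbers: model A assigns higher probability to the ground truth.
model_a = [0.4, 0.3, 0.5, 0.2]
model_b = [0.1, 0.2, 0.3, 0.1]

lp_a = total_log_prob(model_a)
lp_b = total_log_prob(model_b)
print(f"A: {lp_a:.3f}  B: {lp_b:.3f}")
print("A is 'better'" if lp_a > lp_b else "B is 'better'")  # closer to zero wins
```

Because every term is non-positive, "closer to zero" is the same as "larger total," i.e. the model that found the ground truth more probable.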

This idea makes some intuitive sense as we would normally think that a model which is more likely to generate the ground truth is better than a model which is less likely to generate the ground truth, but…

It really depends on what the “quality” of that ground truth is, right?

For instance, take any of my random posts on this forum. If one model is more likely than another to generate one of my exact posts, does that make it “better?”

I don’t know. I write a lot of dumb, random, crazy, and wrong things, so maybe a model that is more likely to generate one of my posts is actually a worse model if we’re interested in generating high quality text…

Perhaps a better “uncheatable” eval would be,

Given the abstracts of 1,000 random arXiv papers, what are the total log probabilities of the conclusions from those same papers?

I don’t know.

I don’t have an answer, I just thought it was interesting. Maybe I’ll split this out into a new topic and solicit feedback.


While I lack a deep understanding of this field and may not grasp many details, my understanding of ground truth in the context of language model testing is as follows:

Model testing:
Ground-truth data is used as test data, against which the trained model's accuracy is measured.

The concept of eval published in this GitHub repository and your idea seem to use the beginnings (or abstracts) of arXiv papers to see if the models can correctly predict the remainder (or the conclusions).

However, when it comes to language models, I wonder if it’s appropriate to use the ability to correctly complete the content of pre-reviewed papers as a benchmark.

This is because the evaluation of language model outputs is subject to human judgment, and the content of arXiv papers is too volatile to be used as ground truth.

Moreover, as the term “likelihood” suggests, language models fundamentally generate text based on the concept of likelihood.

And I think that the evaluation of language model outputs probably needs to consider the following points:

  • Whether it can theoretically handle language tasks correctly.
  • Whether reasoning is performed without deviation.

Furthermore, it seems to me that language models are not necessarily designed to convey facts, nor is it clear that they should be.

In my view, the question of what a language model should express may need to be considered independently of its ability to reproduce specific texts, since language models can serve as tools for rephrasing or reinterpreting existing knowledge in a variety of ways.

An older arXiv paper I found earlier seems to have some degree of universality.

If one overtrains the language model to tell the truth, the model will lose its flexibility and fail to catch up with the changing facts, and as a result, the model may start to tell lies.

Even if the conventional beliefs are overturned at some point, the language model will still try to tell the conventional beliefs as the truth.

This problem also affects the creative process, such as fiction, which has nothing to do with truth.

I think evaluating language models is a difficult issue, but I am glad if this can lead to some discussion, or even if not, just some consideration!

But I apologize if this reply is too off-topic.