Recursal AI development blog, September 25

New AI model release

We are proud to announce the updated QRWKV-72B [1] and 32B models.

Both models are available on Hugging Face and Featherless.ai.

The largest model to date that is not based on the transformer attention architecture.

Surpassing existing transformer models in several benchmarks, while following right behind in others.

This builds on our previous experiments in converting QRWKV6, where we converted the Qwen 2.5 32B model to RWKV, and on the previous 72B preview.

This time, we applied the same conversion to the Qwen-QwQ-32B model and the Qwen-72B model respectively.

But let's take a step back and look at what this means…


We now have a model far surpassing GPT-3.5 Turbo, without QKV attention

While slowly closing in on GPT-4o mini

With lower inference cost, smaller parameter count, and better performance.

In 2024, when we proposed scaling up RWKV to replace attention, many believed transformer attention was the only viable path to GPT-3.5-level or better intelligence. Today, that belief has been disproven.

We need no supercluster - only a single server.

Because we keep most of the feed-forward network layers unchanged, we can perform the conversion (barely) within a single server of 8 MI300 GPUs,

requiring the full 192 GB VRAM allocation on every GPU.
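As a rough back-of-the-envelope sketch of why this only barely fits (illustrative arithmetic with assumed sizes, not the actual memory breakdown of our run), the frozen bf16 weights of a 72B model alone occupy roughly 144 GB before the trainable RWKV layers, their gradients, optimizer state, and 8k-context activations are added on top:

```python
# Back-of-the-envelope VRAM estimate for the conversion setup.
# Every size below is an illustrative assumption, not a measured number.

BYTES_BF16 = 2
BYTES_FP32 = 4

frozen_params    = 72e9   # full model, weights kept frozen in bf16
trainable_params = 5e9    # assumed size of the new RWKV replacement layers

frozen_gb    = frozen_params * BYTES_BF16 / 1e9
trainable_gb = trainable_params * BYTES_BF16 / 1e9        # bf16 weights
grads_gb     = trainable_params * BYTES_BF16 / 1e9        # bf16 gradients
adam_gb      = trainable_params * BYTES_FP32 * 2 / 1e9    # fp32 Adam moments
master_gb    = trainable_params * BYTES_FP32 / 1e9        # fp32 master copy

static_gb = frozen_gb + trainable_gb + grads_gb + adam_gb + master_gb
budget_gb = 8 * 192   # eight 192 GB GPUs

print(f"frozen weights:            ~{frozen_gb:.0f} GB")
print(f"trainable layers + states: ~{static_gb - frozen_gb:.0f} GB")
print(f"static total:              ~{static_gb:.0f} GB of {budget_gb} GB")
# The remaining budget goes to activations and teacher logits at 8k context,
# which is what ultimately capped the training context length.
```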


How the conversion is done: A summary

While more details will be revealed in an upcoming paper, the core idea is similar to the previous QRWKV6 conversion, but this time applied to the Qwen-72B and QwQ-32B models.

At a high level, you take an existing transformer model

Freeze all the weights, delete the attention layers, replace them with RWKV, and train the new layers through multiple stages,

All while referencing the original model's logits as a "teacher model".
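As a minimal, hypothetical sketch of that idea (purely illustrative; the actual staged recipe will be in the paper, and names like `RWKVTimeMix`, `model.layers`, and `self_attn` are assumptions based on a Hugging Face / Qwen-style layout):

```python
import torch
import torch.nn.functional as F

def convert_and_freeze(student, make_rwkv_layer):
    """Freeze every original weight, then swap each attention module for a
    freshly initialised RWKV layer - the only trainable parameters left."""
    for p in student.parameters():
        p.requires_grad = False
    for block in student.model.layers:          # Qwen-style block list (assumed)
        block.self_attn = make_rwkv_layer(student.config.hidden_size)
    return student

def distill_step(student, teacher, batch, optimizer):
    """One training step: push the student's logits towards those of the
    frozen, unmodified original model acting as the 'teacher'."""
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits    # original attention model
    student_logits = student(**batch).logits        # same weights, RWKV attention
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()       # gradients flow only into the new RWKV layers
    optimizer.step()
    return loss.item()

# Usage sketch: only the swapped-in layers are handed to the optimizer.
# student = convert_and_freeze(copy.deepcopy(teacher), RWKVTimeMix)
# optimizer = torch.optim.AdamW(
#     [p for p in student.parameters() if p.requires_grad], lr=1e-4)
```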

More specifically, the conversion proceeds in multiple stages, which we will break down in the upcoming paper.

Unfortunately, due to VRAM limitations, our training was limited to an 8k context length. However, we view this as a resource constraint, not a method constraint.


Implication:
AI knowledge is not in attention, but in the FFN

Given the limited 200-500M tokens of training for the converted layers, we do not believe that the newly trained RWKV layers are, by themselves, sufficiently trained to hold "knowledge/intelligence" at this level.

In other words, the vast majority of an AI model's knowledge is not in the attention, but in the matrix-multiplication FFN (feed-forward network) layers.

It would be more accurate to view the attention mechanism, be it transformer-based or RWKV, as a means of guiding the model to focus on "what the model thinks" about in the FFN layers.
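One way to make that intuition concrete is simple parameter counting. The sketch below uses generic transformer-block arithmetic with assumed shapes (not figures disclosed for these models): the FFN projections dwarf the attention projections in every block.

```python
# Per-block parameter count for a generic 70B-class transformer block.
# Shapes are assumed for illustration; exact values vary by model.

d_model = 8192                    # assumed hidden size
d_ff    = int(3.5 * d_model)      # assumed gated-FFN intermediate size

attn_params = 4 * d_model * d_model   # Q, K, V and output projections
ffn_params  = 3 * d_model * d_ff      # gate, up and down projections (SwiGLU-style)

print(f"attention parameters per block: {attn_params / 1e6:.0f}M")
print(f"FFN parameters per block:       {ffn_params / 1e6:.0f}M")
print(f"FFN / attention ratio:          {ffn_params / attn_params:.1f}x")
# With grouped-query attention the gap widens further, since the K and V
# projections shrink while the FFN stays the same size.
```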


Benefits: Ideal for large-scale applications

Additionally, with the shift towards inference-time compute, linear architectures represent a dramatic reduction in both compute and VRAM requirements, allowing us to scale to hundreds or even thousands of requests per GPU.
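To put rough numbers on that (illustrative, assumed 70B-class shapes rather than measured QRWKV-72B figures): a transformer's KV cache grows linearly with context length and has to be held for every concurrent request, while an RWKV-style layer keeps only a fixed-size recurrent state.

```python
# Per-request memory: transformer KV cache vs. a fixed-size RWKV state.
# Shapes below are assumed 70B-class values, not measured QRWKV-72B figures.

layers        = 80
heads         = 64     # attention / RWKV heads
kv_heads      = 8      # GQA key/value heads for the transformer baseline
head_dim      = 128
bytes_per_val = 2      # bf16

def kv_cache_gb(context_len: int) -> float:
    """Keys and values cached for every layer and every token in context."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_val / 1e9

def rwkv_state_gb() -> float:
    """Fixed recurrent state per layer, independent of context length."""
    state_per_layer = heads * head_dim * head_dim   # rough WKV-style state size
    return layers * state_per_layer * bytes_per_val / 1e9

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: KV cache ~{kv_cache_gb(ctx):5.1f} GB per request, "
          f"RWKV state ~{rwkv_state_gb():.2f} GB regardless of length")
```

Under these assumptions the cache alone runs to tens of gigabytes per long-context request, while the recurrent state stays fixed, which is what lets a single GPU serve far more concurrent requests.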


We can now rapidly iterate new RWKV architectures at sub-100B scales

By dramatically reducing the compute required to scale and test a new RWKV attention architecture down to a small number of GPUs, we will be able to test, iterate, and validate new architecture changes faster, turning experiments that previously took weeks (or even months) into days.

Historically, the RWKV group has averaged 4 major versions across 2 years, with improvements to both model accuracy and memory capability at every step.

A trend which we plan to accelerate moving forward.

As we work on our roadmap to Personalized AI and eventually Personalized AGI, you can read more in the following article…

Featherless AI - recursive dev blog
🛣️ Our roadmap to Personalized AI and AGI
[1] This model was originally published as Qwerky-72B. However, due to confusion with another company/model with a similar name, we have been requested to avoid using the Qwerky name, so we have renamed our models to QRWKV-72B.
