Recursal AI development blog
New Linear Model Performance Breakthrough


The strongest linear model to date, beating out all previous RWKV, State Space, and Liquid AI models and setting new records on key English benchmarks and evals.

You can find this model available on both

Note: as an instruct preview, this model is not considered final.

The model was trained by converting the weights of the Qwen 32B Instruct model into a customized QRWKV6 architecture. Through a groundbreaking new conversion training process, we were able to successfully replace the existing transformer attention heads with RWKV-V6 attention heads.

This unique training process was developed by the team at Recursal AI, in collaboration with the RWKV and EleutherAI open source communities.
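To make the idea concrete, here is a minimal sketch of what such a head swap could look like, assuming a Qwen2-style layer layout with a `self_attn` attribute. The `RWKV6TimeMix` class below is a simplified illustration of a linear-attention recurrence, not Recursal's actual conversion code (which has not been released); the real RWKV-V6 block also has token shift and data-dependent decay, and a real swap must match the parent block's forward signature.

```python
import torch
import torch.nn as nn

class RWKV6TimeMix(nn.Module):
    """Illustrative stand-in for an RWKV-V6 time-mixing block: a per-head
    linear-attention recurrence with a learned per-channel decay."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.receptance = nn.Linear(hidden_size, hidden_size, bias=False)
        self.key = nn.Linear(hidden_size, hidden_size, bias=False)
        self.value = nn.Linear(hidden_size, hidden_size, bias=False)
        self.output = nn.Linear(hidden_size, hidden_size, bias=False)
        self.decay = nn.Parameter(torch.zeros(num_heads, self.head_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        H, D = self.num_heads, self.head_dim
        r = self.receptance(x).view(B, T, H, D)
        k = self.key(x).view(B, T, H, D)
        v = self.value(x).view(B, T, H, D)
        w = torch.exp(-torch.exp(self.decay))   # per-channel decay in (0, 1)
        state = x.new_zeros(B, H, D, D)         # fixed-size state: O(1) memory in T
        outs = []
        for t in range(T):                      # recurrence: cost is linear in T
            state = w[None, :, :, None] * state + k[:, t, :, :, None] * v[:, t, :, None, :]
            outs.append(torch.einsum("bhd,bhde->bhe", r[:, t], state))
        return self.output(torch.stack(outs, dim=1).reshape(B, T, C))

def swap_attention_heads(model: nn.Module, hidden_size: int, num_heads: int) -> nn.Module:
    """Replace every 'self_attn' submodule with the linear block, keeping the
    parent's embeddings, norms, and feedforward weights untouched."""
    for module in model.modules():
        if hasattr(module, "self_attn"):
            module.self_attn = RWKV6TimeMix(hidden_size, num_heads)
    return model
```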

Benchmarks

We compared QRWKV6 against existing open weights models, both transformer based and linear-architecture based.

Overall, what is most exciting is how similarly the converted QRWKV6 model performs to its original 32B parent model.

GPU Sponsor: TensorWave

The conversion process for QRWKV6 was done on 16 AMD MI300X GPUs, kindly donated by TensorWave. Each MI300X comes with a whopping 192GB of VRAM while offering compute performance comparable to an H100.

This allowed us to reduce the minimum number of nodes required for our training process and simplify our overall training and conversion pipeline.

The conversion process took about 8 hours.

The Exciting

Linear models hold the promise of substantially lower compute costs at scale, delivering over 1000x compute efficiency in inference cost, especially at large context lengths. This is a key multiplier unlock both for O1-style inference-time thinking and for making AI more accessible to the world.
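As a back-of-envelope illustration of where a multiplier like that comes from (our own arithmetic, not the post's exact accounting): per generated token, softmax attention touches every cached key and value, so its cost grows with context length T, while a recurrent linear model only updates a fixed-size state.

```python
d = 5120          # hidden size (illustrative, in the ballpark of a 32B-class model)
head_dim = 64     # per-head state width (illustrative)

def softmax_flops(T):
    # ~2*T*d per new token: dot products with T cached keys + weighted sum of T values
    return 2 * T * d

def linear_flops(T):
    # state update + readout against a fixed d x head_dim state; independent of T
    return 2 * d * head_dim

for T in (4_096, 65_536, 1_048_576):
    print(f"T={T:>9,}: ~{softmax_flops(T) / linear_flops(T):,.0f}x cheaper per token")
```

With these toy numbers the ratio passes 1000x around a 64k context and keeps growing linearly with T.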

This technique is also scalable to larger transformer-based models, which we have since started working on.

The Good

The benefit of this process is that we are able to convert any previously trained QKV-attention-based model, such as Qwen and LLaMA based models, into a variant of RWKV, without needing to retrain the model from scratch.

This allows us to quickly test and prove out the significantly more efficient RWKV linear attention mechanism at a larger scale, with a much smaller budget and without training from scratch, validating the architecture design and scalability of RWKV.

Once again proving that QKV attention is not all you need.
( Someone ping @jefrankle and @srush_nlp )
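The post defers the details of the conversion recipe to a forthcoming paper, so the following is only one plausible shape for it, assuming a layerwise distillation step: freeze the parent's attention block, feed both it and its RWKV replacement the same hidden states, and train the new block to reproduce the old one's outputs. Every name here (`align_block`, `teacher_attn`, `hidden_batches`) is hypothetical.

```python
import torch
import torch.nn.functional as F

def align_block(student, teacher_attn, hidden_batches, lr=1e-4):
    """Hypothetical layerwise alignment: train the new linear block to mimic
    the frozen attention block it replaces, reusing the parent's hidden
    states rather than retraining the whole model from scratch."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    teacher_attn.requires_grad_(False)
    for h in hidden_batches:              # h: (batch, seq, hidden) activations
        with torch.no_grad():
            target = teacher_attn(h)      # what the original head produced
        loss = F.mse_loss(student(h), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```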

The Bad

The disadvantage of this process is that the model's inherent knowledge and training data are based on its “parent” model. This means that, unlike previous RWKV models trained on 100+ languages, the QRWKV model is limited to the approximately 30 languages supported by the Qwen line of models.

Additionally, instead of RWKV-based channel-mix feedforward layers, we retain the “parent” model's feedforward network architecture. This means the model is incompatible with existing RWKV inference code.
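A sketch of the resulting hybrid layer, as we read the post: RWKV-style time mixing where the attention used to be, with the parent's feedforward kept verbatim. Norm type and pre-norm placement are assumptions (Qwen actually uses RMSNorm); the retained MLP is also why off-the-shelf RWKV inference code, which expects RWKV channel-mix layers, won't run this model.

```python
import torch.nn as nn

class QRWKVBlock(nn.Module):
    """Assumed hybrid layer layout: RWKV-style time mixing in place of
    attention, with the parent model's feedforward (e.g. Qwen's SwiGLU MLP)
    retained instead of RWKV channel mixing."""

    def __init__(self, time_mix: nn.Module, parent_mlp: nn.Module, d: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)
        self.time_mix = time_mix    # replaced: was softmax attention
        self.mlp = parent_mlp       # kept verbatim from the "parent" model

    def forward(self, x):
        x = x + self.time_mix(self.norm1(x))  # RWKV-V6 sequence mixing
        x = x + self.mlp(self.norm2(x))       # parent FFN, not RWKV channel mix
        return x
```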

Separately, due to our compute budget, we were only able to run the conversion process up to a 16k context length. While the model does exhibit stability beyond this context length, it may need additional training to accurately support longer contexts.

Future Followups

Currently, the Q-RWKV-6 72B Instruct model is being trained.

Additionally, with the finalization of the RWKV-7 architecture happening soon, we intend to repeat the process and provide a full lineup of

We intend to provide more details on the conversion process, along with our paper, after the subsequent model release.



Acknowledgements

And of course a huge thank you to the many developers around the world working hard to improve the RWKV ecosystem and provide environmentally friendly open source AI for all.
