cs.AI updates on arXiv.org, September 30, 12:07
SWAX: A Hybrid Architecture That Improves Long-Context Processing

This article introduces SWAX, a hybrid architecture that combines sliding-window attention with linear RNN layers and performs strongly on long-context tasks. The study finds that larger windows do not always improve performance: short-window attention trains the RNN's long-term memory more effectively.

arXiv:2509.24552v1 Announce Type: cross Abstract: Recent works show that hybrid architectures combining sliding-window softmax attention layers with linear recurrent neural network (RNN) layers outperform either architecture on its own. However, the impact of the window length and the interplay between the softmax attention and linear RNN layers remain under-studied. In this work, we introduce SWAX, a hybrid architecture consisting of sliding-window attention and xLSTM linear RNN layers. A counter-intuitive finding with SWAX is that larger sliding windows do not improve long-context performance. In fact, short-window attention encourages the model to better train the xLSTM's long-term memory by relying less on the softmax attention mechanism for long-context retrieval. The drawback of small sliding windows is that they are detrimental to short-context tasks, which a moderately larger window would otherwise handle well. We therefore train SWAX while stochastically varying the sliding-window size, forcing the model to leverage both a longer context window and the xLSTM memory. SWAX trained with stochastic window sizes significantly outperforms regular window attention on both short- and long-context problems.
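The abstract names two mechanisms: a block that pairs sliding-window attention with an xLSTM linear RNN layer, and a training loop that resamples the window size at each step. The PyTorch sketch below is a minimal illustration under stated assumptions, not the authors' implementation: a plain nn.LSTM stands in for the xLSTM layer, HybridBlock and WINDOW_CHOICES are hypothetical names, and the window sizes are placeholder values.

```python
# Minimal sketch of the stochastic sliding-window idea, NOT the SWAX code.
# Assumptions: PyTorch; nn.LSTM approximates the xLSTM layer; window sizes
# in WINDOW_CHOICES are illustrative, not taken from the paper.
import random
import torch
import torch.nn as nn


def sliding_window_mask(seq_len: int, window: int, device=None) -> torch.Tensor:
    """Boolean mask: position i may attend to positions [i - window + 1, i]."""
    i = torch.arange(seq_len, device=device).unsqueeze(1)
    j = torch.arange(seq_len, device=device).unsqueeze(0)
    return (j <= i) & (j > i - window)


class HybridBlock(nn.Module):
    """One hybrid block: sliding-window self-attention followed by a
    recurrent layer (a plain LSTM stands in for xLSTM in this sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.rnn = nn.LSTM(d_model, d_model, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, window: int) -> torch.Tensor:
        seq_len = x.size(1)
        # For boolean attn_mask, True marks positions that may NOT be attended.
        mask = ~sliding_window_mask(seq_len, window, device=x.device)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out
        rnn_out, _ = self.rnn(self.norm2(x))
        return x + rnn_out


# Stochastic window training: sample a window size per step, so the model
# cannot rely on the attention path alone for long-range dependencies.
WINDOW_CHOICES = [128, 512, 2048]  # hypothetical values


def training_step(block: HybridBlock, x: torch.Tensor) -> torch.Tensor:
    window = random.choice(WINDOW_CHOICES)
    return block(x, window)
```

Because the window is resampled each step, the attention path sometimes covers only a short span, and long-range dependencies can then only be captured through the recurrent state. This mirrors the abstract's claim that short windows push long-context retrieval onto the xLSTM memory while larger windows keep short-context performance intact.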


Related tags

hybrid architecture, SWAX, RNN, attention mechanism, long-context processing