cs.AI updates on arXiv.org (August 11)
Reorganizing attention-space geometry with expressive attention

The paper proposes an expressive attention (EA) mechanism based on the squared dot product. By enhancing the attention weights of parallel or antiparallel query-key pairs, it improves model performance while keeping compute cost and memory requirements unchanged, outperforming standard dot-product attention (DPA) especially on complex tasks and in multi-task settings.

arXiv:2407.18601v3 Announce Type: replace-cross Abstract: Attention regulates information transfer between tokens. For this, query and key vectors are compared, typically via a scalar product, $\mathbf{Q}^T\mathbf{K}$, followed by a softmax normalization. In geometric terms, standard dot-product attention (DPA) yields large/small attention weights for parallel/antiparallel queries and keys. Here we study expressive attention (EA), which is based on $(\mathbf{Q}^T\mathbf{K})^2$, the squared dot product. In this case, attention is enhanced when query and key are either parallel or antiparallel, and suppressed for orthogonal configurations. EA can be introduced into any attention-based code without additional compute costs or memory requirements. For a series of autoregressive prediction tasks, we find that expressive attention performs at least as well as vanilla DPA. As task complexity increases, EA is observed to outperform DPA by growing margins, which also holds for multi-task settings. For a given model size, EA achieves 100% performance for a range of complexity levels not accessible to DPA. Our results show that it is possible to reorganize the geometry of the matching condition in the space of attention heads without loss of performance.
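For illustration, here is a minimal NumPy sketch of the two matching conditions. The abstract only specifies the squared dot product $(\mathbf{Q}^T\mathbf{K})^2$; the $1/d_k$ scaling applied to the squared scores below is an assumption made for numerical comparability, not a detail taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dpa_weights(Q, K, d_k):
    # standard dot-product attention (DPA): scores ~ Q K^T
    return softmax(Q @ K.T / np.sqrt(d_k))

def ea_weights(Q, K, d_k):
    # expressive attention (EA): scores ~ (Q K^T)^2, so parallel and
    # antiparallel query/key pairs both receive large weights, while
    # orthogonal pairs are suppressed (the 1/d_k scaling is an assumption)
    return softmax((Q @ K.T) ** 2 / d_k)

# Toy example: one query against parallel, antiparallel, and orthogonal keys.
d_k = 4
q = np.array([[1.0, 0.0, 0.0, 0.0]])
K = np.array([
    [ 1.0, 0.0, 0.0, 0.0],  # parallel to q
    [-1.0, 0.0, 0.0, 0.0],  # antiparallel to q
    [ 0.0, 1.0, 0.0, 0.0],  # orthogonal to q
])
print("DPA:", dpa_weights(q, K, d_k))  # parallel > orthogonal > antiparallel
print("EA :", ea_weights(q, K, d_k))   # parallel == antiparallel > orthogonal
```

Since both variants change only how the pre-softmax scores are computed, swapping DPA for EA in an existing attention implementation leaves tensor shapes, compute cost, and memory footprint unchanged, consistent with the abstract's claim.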


Related tags

Expressive attention · Model performance · Attention mechanism · Dot-product attention · Complex tasks