arXiv:2508.06016v1 Announce Type: cross Abstract: The quadratic computational cost of the self-attention mechanism is a primary challenge in scaling Transformer models. While attention sparsity is widely studied as a technique to improve computational efficiency, it is almost universally assumed to come at the cost of model accuracy. In this paper, we report a surprising counter-example to this conventional wisdom. By introducing structured, post-hoc sparsity to the attention mechanism of a DistilBERT model during fine-tuning on the SST-2 sentiment analysis task, we find that model accuracy improves significantly. Our model with 80% attention sparsity achieves a validation accuracy of 91.59%, a 0.97 percentage point absolute improvement over the dense baseline. We hypothesize that sparsity acts as a powerful implicit regularizer, preventing overfitting by forcing the model to rely on a more constrained and robust set of features. Our work recasts attention sparsity not only as a tool for computational efficiency, but as a potential method for improving the generalization and performance of Transformer models.
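The abstract does not specify the sparsification mechanism, so the following is a minimal sketch of one common realization of post-hoc attention sparsity: top-k masking per query row, where 80% sparsity means keeping only the top 20% of key positions before the softmax. The function name `sparse_attention`, the `sparsity_ratio` parameter, and the top-k scheme are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch of top-k attention sparsification (NOT the paper's
# confirmed mechanism). Per query position, only the highest-scoring
# (1 - sparsity_ratio) fraction of key positions survive; all other
# scores are set to -inf so they receive zero weight after softmax.
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, sparsity_ratio=0.8):
    """Scaled dot-product attention with top-k masking per query row."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5        # (..., L_q, L_k)
    keep = max(1, int(scores.size(-1) * (1.0 - sparsity_ratio)))
    topk = scores.topk(keep, dim=-1).values            # (..., L_q, keep)
    threshold = topk[..., -1, None]                    # k-th largest per row
    scores = scores.masked_fill(scores < threshold, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: batch of 2, 4 heads, sequence length 16, head dim 64.
q = torch.randn(2, 4, 16, 64)
k = torch.randn(2, 4, 16, 64)
v = torch.randn(2, 4, 16, 64)
out = sparse_attention(q, k, v, sparsity_ratio=0.8)    # ~80% of keys masked
print(out.shape)  # torch.Size([2, 4, 16, 64])
```

Applied post hoc to a pretrained model such as DistilBERT, a mask like this would be wrapped around each head's score computation during fine-tuning; the regularization story in the abstract is that the surviving 20% of connections constrain which features the model can attend to.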
