Applying NSA to Video-Language Models for Improved Performance

cs.AI updates on arXiv.org, October 3

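This paper proposes VideoNSA, a method that applies Native Sparse Attention (NSA) to video-language models. It trains Qwen2.5-VL end to end on a 216K video instruction dataset and uses a hybrid attention mechanism, improving performance on long-video understanding, temporal reasoning, and spatial benchmarks.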

arXiv:2510.02295v1 Announce Type: cross Abstract: Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid approach to attention, preserving dense attention for text, while employing NSA for video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention helps induce dynamic attention sinks.
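The hybrid scheme in the abstract (dense attention for text, sparse attention for video) can be illustrated with a toy NumPy sketch. This is not the VideoNSA implementation. It assumes the three-branch layout usually associated with NSA (block compression, top-k block selection, and a sliding window, mixed by gates), substitutes mean pooling for any learned compression, and uses fixed gate values where a real model would learn them. All function names and parameters (`sparse_video_attention`, `hybrid_attention`, `block`, `top_k`, `window`, `gates`) are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attend(q, k, v):
    """Scaled dot-product attention for a single query vector q."""
    scores = k @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def sparse_video_attention(q, k, v, gates, block=4, top_k=2, window=4):
    """Toy three-branch sparse attention for one video query token.

    q: (d,) query; k, v: (n, d) keys/values of the preceding context.
    gates: three weights (learned in a real model; fixed here as an assumption).
    """
    n, d = k.shape
    n_blocks = n // block  # assumption: any trailing partial block is dropped
    k_blocks = k[: n_blocks * block].reshape(n_blocks, block, d)
    v_blocks = v[: n_blocks * block].reshape(n_blocks, block, d)

    # Branch 1 (compression): mean-pool each block into one key/value pair.
    k_cmp, v_cmp = k_blocks.mean(axis=1), v_blocks.mean(axis=1)
    out_cmp = attend(q, k_cmp, v_cmp)

    # Branch 2 (selection): keep the full tokens of the top_k scoring blocks.
    block_scores = k_cmp @ q
    chosen = np.sort(np.argsort(block_scores)[-top_k:])
    out_sel = attend(
        q, k_blocks[chosen].reshape(-1, d), v_blocks[chosen].reshape(-1, d)
    )

    # Branch 3 (sliding window): attend only to the most recent tokens.
    out_win = attend(q, k[-window:], v[-window:])

    g_cmp, g_sel, g_win = gates
    return g_cmp * out_cmp + g_sel * out_sel + g_win * out_win


def hybrid_attention(q, k, v, query_is_video, gates=(0.3, 0.4, 0.3)):
    """Dense attention for text queries, sparse attention for video queries."""
    if query_is_video:
        return sparse_video_attention(q, k, v, gates)
    return attend(q, k, v)
```

In this sketch a video query reads `n // block` pooled entries, `top_k * block` selected tokens, and `window` recent tokens instead of all `n` keys, which is where the savings on long video contexts would come from. How the budget is split between the global branches and the local window is the kind of allocation the abstract's second finding refers to.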
