SaSaSa2VA: A New Method for Improving Video Object Segmentation Performance

cs.AI updates on arXiv.org, September 23

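This paper proposes SaSaSa2VA, a new method for video object segmentation. Through efficient segmentation augmentation and test-time ensembling, it substantially improves the performance of video object segmentation based on multimodal large language models.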

arXiv:2509.16972v1 Announce Type: cross Abstract: Referring video object segmentation (RVOS) requires segmenting and tracking objects in videos conditioned on natural-language expressions, demanding fine-grained understanding of both appearance and motion. Building on Sa2VA, which couples a Multi-modal Large Language Model (MLLM) with the video segmentation model SAM2, we identify two key bottlenecks that limit segmentation performance: sparse frame sampling and reliance on a single [SEG] token for an entire video. We propose Segmentation Augmented and Selective Averaged Sa2VA (SaSaSa2VA) to address these issues. On the 7th LSVOS Challenge (RVOS track), SaSaSa2VA achieves a $J\&F$ of 67.45, ranking first and surpassing the runner-up by 2.80 points. This result and the ablation studies demonstrate that efficient segmentation augmentation and test-time ensembling substantially enhance grounded MLLMs for RVOS. The code is released in the Sa2VA repository: https://github.com/magic-research/Sa2VA.
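The abstract names test-time ensembling and the $J\&F$ score without describing either in detail. The following is a minimal sketch of the general idea only, assuming plain pixel-wise averaging of binary masks from several inference runs. It is not the authors' Selective Averaging procedure. The function names `average_masks` and `region_similarity` and the toy masks are hypothetical. $J$ is the Jaccard index (intersection over union), the region half of $J\&F$; the boundary measure $F$ is omitted here.

```python
from typing import List, Sequence

Mask = List[List[int]]  # one frame: rows of 0/1 pixels


def average_masks(masks: Sequence[Mask], threshold: float = 0.5) -> Mask:
    """Fuse binary masks by pixel-wise averaging, then thresholding.

    Assumption: plain averaging; the paper's selection rule is not modeled.
    """
    if not masks:
        raise ValueError("need at least one mask")
    height, width = len(masks[0]), len(masks[0][0])
    fused: Mask = []
    for y in range(height):
        row = []
        for x in range(width):
            mean = sum(m[y][x] for m in masks) / len(masks)
            row.append(1 if mean >= threshold else 0)
        fused.append(row)
    return fused


def region_similarity(pred: Mask, gt: Mask) -> float:
    """J: intersection over union of predicted and ground-truth masks."""
    inter = sum(p & g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p | g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    # Two empty masks agree perfectly by convention.
    return 1.0 if union == 0 else inter / union


# Toy data: three hypothetical runs on one 2x3 frame.
runs = [
    [[1, 1, 0], [0, 1, 0]],
    [[1, 0, 0], [0, 1, 1]],
    [[1, 1, 0], [0, 0, 0]],
]
gt = [[1, 1, 0], [0, 1, 1]]

fused = average_masks(runs)
print(fused)                         # [[1, 1, 0], [0, 1, 0]]
print(region_similarity(fused, gt))  # 0.75
```

In this toy case the fused mask keeps only the pixels that at least two of the three runs agree on, which is the usual motivation for averaging predictions at test time.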
