Ro-Bench: Evaluating the Robustness of MLLMs on Counterfactual Video Data

cs.AI updates on arXiv.org, October 13, 12:13

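This paper introduces Ro-Bench, the first benchmark for evaluating the robustness of multi-modal large language models (MLLMs) on dynamic out-of-distribution (OOD) counterfactual video test sets. The study finds that current models degrade substantially on counterfactual video content, and that fine-tuning on counterfactual data markedly improves their robustness.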

arXiv:2510.08936v1 Announce Type: cross

Abstract: Recently, Multi-modal Large Language Models (MLLMs) have demonstrated significant performance across various video understanding tasks. However, their robustness, particularly when faced with manipulated video content, remains largely unexplored. In this paper, we introduce Ro-Bench, the first benchmark for evaluating MLLMs on dynamic out-of-distribution (OOD) counterfactual video test sets. Ro-Bench incorporates high-quality, diverse and temporally relevant video data, by editing Style, Object, Background and their compositions. We evaluated eight recent video MLLMs and found that current models exhibit substantial performance degradation on Ro-Bench when exposed to counterfactual video content. Furthermore, we demonstrate that fine-tuning MLLMs with counterfactual data enhances robustness, achieving a 21.73% performance increase on Ro-Bench and a 12.78% improvement across 20 tasks in the MVBench dataset. These findings underscore the effectiveness of counterfactual data in enhancing the video understanding ability of MLLMs. The code and data will be released shortly.
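The abstract does not say how the performance degradation is scored. As a rough sketch only, assuming each question is graded as correct or incorrect, one could compare accuracy on the original videos with accuracy on each edit type (Style, Object, Background) and report the relative drop. The function names and the toy records below are hypothetical and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_edit_type(records):
    """Return accuracy per edit type from (edit_type, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for edit_type, is_correct in records:
        totals[edit_type] += 1
        hits[edit_type] += int(is_correct)
    return {name: hits[name] / totals[name] for name in totals}


def relative_drop(original_acc, counterfactual_acc):
    """Fraction of the original accuracy lost on counterfactual videos."""
    if original_acc == 0:
        raise ValueError("original accuracy must be non-zero")
    return (original_acc - counterfactual_acc) / original_acc


# Toy, made-up grading records: (edit type, whether the answer was correct).
records = [
    ("original", True), ("original", True),
    ("original", True), ("original", False),
    ("style", True), ("style", False),
    ("object", True), ("object", False),
    ("object", False), ("object", False),
    ("background", True), ("background", True),
    ("background", False), ("background", False),
]

acc = accuracy_by_edit_type(records)
drops = {
    name: relative_drop(acc["original"], value)
    for name, value in acc.items()
    if name != "original"
}

for name, drop in drops.items():
    print(f"{name}: accuracy {acc[name]:.2f}, relative drop {drop:.1%}")
```

Reporting the drop per edit type, rather than one pooled number, would show which kind of manipulation a model is most sensitive to.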

Related tags

Ro-Bench, MLLMs, video understanding, robustness, counterfactual data
