VGGSounder: Audio-Visual Evaluations for Foundation Models

cs.AI updates on arXiv.org, August 12


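The paper introduces VGGSounder, a re-annotated and extended test set that addresses the limitations of the VGGSound dataset and is intended to evaluate the multi-modal understanding of audio-visual models more precisely.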

arXiv:2508.08237v1 Announce Type: cross Abstract: The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.
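
The abstract does not spell out how the modality confusion metric is computed. As a rough illustration only, the sketch below assumes it is the share of test samples that a model classifies correctly from one modality alone but gets wrong once the second modality is added. The function name `modality_confusion` and the per-sample boolean inputs are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def modality_confusion(correct_single: Sequence[bool],
                       correct_both: Sequence[bool]) -> float:
    """Hypothetical sketch, not the paper's definition.

    Returns the fraction of samples that are correct with a single
    modality (e.g. audio only) but incorrect with both modalities.
    """
    if len(correct_single) != len(correct_both):
        raise ValueError("both sequences must cover the same samples")
    if len(correct_single) == 0:
        return 0.0
    # Count samples where adding the second modality broke a correct answer.
    degraded = sum(
        1 for single, both in zip(correct_single, correct_both)
        if single and not both
    )
    return degraded / len(correct_single)


# Toy data: per-sample correctness for four clips (made up for illustration).
audio_only = [True, True, False, True]
audio_visual = [True, False, False, True]
print(modality_confusion(audio_only, audio_visual))
```

In this toy run only the second clip is solved from audio alone and then lost when video is added, so one of four samples counts as confused. With a multi-label test set, "correct" would first have to be defined per sample, for instance as the predicted class appearing among the annotated labels; that choice is likewise an assumption here.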


