MUSE Benchmark: Evaluating Music Understanding in MLLMs

cs.AI updates on arXiv.org, October 23, 12:09


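This article introduces the MUSE benchmark, which evaluates the music understanding abilities of MLLMs. A comparison of SOTA models against a human baseline shows that the models have clear deficits in music perception; the article also analyzes the effect of CoT prompting.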

arXiv:2510.19055v1 Announce Type: new

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated capabilities in audio understanding, but current evaluations may obscure fundamental weaknesses in relational reasoning. We introduce the Music Understanding and Structural Evaluation (MUSE) Benchmark, an open-source resource with 10 tasks designed to probe fundamental music perception skills. We evaluate four SOTA models (Gemini Pro and Flash, Qwen2.5-Omni, and Audio-Flamingo 3) against a large human baseline (N=200). Our results reveal a wide variance in SOTA capabilities and a persistent gap with human experts. While Gemini Pro succeeds on basic perception, Qwen and Audio Flamingo 3 perform at or near chance, exposing severe perceptual deficits. Furthermore, we find Chain-of-Thought (CoT) prompting provides inconsistent, often detrimental results. Our work provides a critical tool for evaluating invariant musical representations and driving development of more robust AI systems.
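
The abstract describes some models as performing "at or near chance". One common way to make such a statement concrete is an exact one-sided binomial test of a model's accuracy against the chance rate. The sketch below is only an illustration: the function name, the two-alternative task format (chance = 0.5), and the trial counts are assumptions made here, not details or results from the paper.

```python
from math import comb


def binomial_p_value_above_chance(correct: int, total: int, chance: float) -> float:
    """One-sided exact binomial p-value: P(X >= correct) for X ~ Binomial(total, chance).

    A large p-value means the observed accuracy is compatible with guessing.
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    if not 0.0 < chance < 1.0:
        raise ValueError("chance must be strictly between 0 and 1")
    # Sum the binomial probability mass from the observed count upward.
    return sum(
        comb(total, k) * chance**k * (1.0 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


if __name__ == "__main__":
    # Hypothetical counts for a two-alternative task (assumed chance = 0.5);
    # these are NOT figures reported for any model in the benchmark.
    for label, correct, total in [("model_a", 53, 100), ("model_b", 90, 100)]:
        p = binomial_p_value_above_chance(correct, total, chance=0.5)
        print(f"{label}: accuracy={correct / total:.2f}, p={p:.4g}")
```

Tasks with more than two answer options would need a different `chance` value, and a per-task test is usually more informative than one pooled over all tasks.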


