Audio-Maestro Improves the Reasoning Performance of Audio Models

cs.AI updates on arXiv.org, October 14

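This article introduces Audio-Maestro, a tool-augmented framework that strengthens the reasoning ability of audio-language models. By calling external tools and integrating their outputs into the reasoning process, it raises average accuracy on MMAU-Test by roughly 3 to 5 percentage points across the three models evaluated.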

arXiv:2510.11454v1 (announce type: cross)

Abstract: Recent advancements in large multimodal models (LMMs) have shown strong capabilities in audio understanding. However, most systems rely solely on end-to-end reasoning, limiting interpretability and accuracy for tasks that require structured knowledge or specialized signal analysis. In this work, we present Audio-Maestro -- a tool-augmented audio reasoning framework that enables audio-language models to autonomously call external tools and integrate their timestamped outputs into the reasoning process. This design allows the model to analyze, transform, and interpret audio signals through specialized tools rather than relying solely on end-to-end inference. Experiments show that Audio-Maestro consistently improves general audio reasoning performance: Gemini-2.5-flash's average accuracy on MMAU-Test rises from 67.4% to 72.1%, DeSTA-2.5 from 58.3% to 62.8%, and GPT-4o from 60.8% to 63.9%. To our knowledge, Audio-Maestro is the first framework to integrate structured tool output into the large audio language model reasoning process.
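The abstract does not give implementation details, so the following is only a rough sketch of what "integrating timestamped tool outputs into the reasoning process" could look like. Everything here is an assumption made for illustration: the `ToolEvent` schema, the two stand-in tools, the `TOOLS` registry, and the prompt layout are hypothetical and are not taken from the paper. In particular, the list of tools to run is passed in by hand, whereas the framework described above lets the model choose tools autonomously.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToolEvent:
    """One timestamped result from an external tool (hypothetical schema)."""
    start: float  # seconds from the beginning of the clip
    end: float    # seconds from the beginning of the clip
    label: str    # human-readable description of what the tool found


# Hypothetical stand-in tools. Real tools would analyze the audio file;
# these return fixed results so the sketch runs without any audio data.
def fake_speech_recognizer(audio_path: str) -> List[ToolEvent]:
    return [ToolEvent(0.0, 2.5, "speech: 'hello there'")]


def fake_sound_event_detector(audio_path: str) -> List[ToolEvent]:
    return [ToolEvent(1.0, 1.8, "event: dog bark")]


# Hypothetical registry mapping a tool name to its callable.
TOOLS: Dict[str, Callable[[str], List[ToolEvent]]] = {
    "asr": fake_speech_recognizer,
    "sed": fake_sound_event_detector,
}


def run_tools(audio_path: str, requested: List[str]) -> List[ToolEvent]:
    """Run the requested tools and merge their results in time order."""
    events: List[ToolEvent] = []
    for name in requested:
        events.extend(TOOLS[name](audio_path))
    return sorted(events, key=lambda e: (e.start, e.end))


def build_prompt(question: str, events: List[ToolEvent]) -> str:
    """Render timestamped tool outputs as text a model could reason over."""
    lines = [f"[{e.start:.1f}s-{e.end:.1f}s] {e.label}" for e in events]
    return "Tool outputs:\n" + "\n".join(lines) + f"\nQuestion: {question}"


if __name__ == "__main__":
    events = run_tools("clip.wav", ["asr", "sed"])
    print(build_prompt("What happens while the person speaks?", events))
```

The point of the sketch is the data flow rather than the tools themselves: each tool returns structured, timestamped records, the records are merged on a common timeline, and the merged timeline is handed to the model as explicit evidence alongside the question.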
