MGM-Omni: A New Paradigm for Multimodal Understanding and Long-Horizon Speech Generation

cs.AI updates on arXiv.org, September 30, 12:07


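This paper presents MGM-Omni, a unified multimodal language model for cross-modal understanding and long-horizon speech generation. Its dual-track architecture separates multimodal reasoning from real-time speech generation, enabling efficient cross-modal interaction and low-latency speech generation.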

arXiv:2509.25131v1 Announce Type: cross

Abstract: We present MGM-Omni, a unified Omni LLM for omni-modal understanding and expressive, long-horizon speech generation. Unlike cascaded pipelines that isolate speech synthesis, MGM-Omni adopts a "brain-mouth" design with a dual-track, token-based architecture that cleanly decouples multimodal reasoning from real-time speech generation. This design enables efficient cross-modal interaction and low-latency, streaming speech generation. For understanding, a unified training strategy coupled with a dual audio encoder design enables long-form audio perception across diverse acoustic conditions. For generation, a chunk-based parallel decoding scheme narrows the text-speech token-rate gap, accelerating inference and supporting streaming zero-shot voice cloning with stable timbre over extended durations. Compared to concurrent work, MGM-Omni achieves these capabilities with markedly data-efficient training. Extensive experiments demonstrate that MGM-Omni outperforms existing open-source models in preserving timbre identity across extended sequences, producing natural and context-aware speech, and achieving superior long-form audio and omni-modal understanding. MGM-Omni establishes an efficient, end-to-end paradigm for omni-modal understanding and controllable, personalised long-horizon speech generation.
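The abstract does not spell out the chunk-based parallel decoding scheme, so the following is only a toy illustration of why such a scheme could narrow a text-speech token-rate gap. It is not the authors' implementation. The function names (`chunk_text_tokens`, `decoding_steps`) and every number used (a 25 Hz speech-token rate, 4 speech tokens per decoding step, a chunk size of 2) are hypothetical values chosen for the example.

```python
from typing import Iterable, Iterator, List


def chunk_text_tokens(tokens: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Group a stream of text tokens into fixed-size chunks.

    The final chunk may be shorter. Chunking lets a downstream speech
    decoder start working before the full text is available (streaming).
    """
    chunk: List[str] = []
    for tok in tokens:
        chunk.append(tok)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def decoding_steps(num_speech_tokens: int, tokens_per_step: int) -> int:
    """Forward passes needed if each pass emits `tokens_per_step` speech tokens."""
    return -(-num_speech_tokens // tokens_per_step)  # ceiling division


# Hypothetical figures, NOT taken from the paper:
SPEECH_TOKENS_PER_SECOND = 25   # assumed speech-token rate
AUDIO_SECONDS = 10              # assumed utterance length
total_speech_tokens = SPEECH_TOKENS_PER_SECOND * AUDIO_SECONDS

sequential = decoding_steps(total_speech_tokens, tokens_per_step=1)
parallel = decoding_steps(total_speech_tokens, tokens_per_step=4)
chunks = list(chunk_text_tokens("a b c d e".split(), chunk_size=2))
```

Under these assumed numbers, emitting several speech tokens per step cuts the number of decoder passes roughly in proportion to the tokens emitted per step, which is the kind of gap-narrowing effect the abstract describes at a high level.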


