MarkTechPost@AI · August 26
Microsoft Releases VibeVoice-1.5B, Advancing Long-Form, Multilingual Speech Synthesis

Microsoft's newly open-sourced VibeVoice-1.5B text-to-speech (TTS) model pushes the boundaries of long-form, multilingual speech synthesis. The model can generate up to 90 minutes of uninterrupted, natural-sounding audio, supports up to four distinct speakers in a single conversation, and even handles cross-lingual and singing synthesis. Its streaming architecture and a forthcoming 7B model signal major advances for AI speech in conversational audio, podcasting, and synthetic-voice research. VibeVoice-1.5B is released under the MIT license, emphasizing research, transparency, and reproducibility, and provides a powerful open-source framework for AI speech applications.

🔊 **Long-form, multi-speaker synthesis breakthrough:** VibeVoice-1.5B can synthesize up to 90 minutes of speech with up to four distinct speakers in one conversation, well beyond the one-to-two-speaker limit of conventional TTS models, enabling more complex and natural dialogue scenarios.

🎶 **Cross-lingual and singing synthesis:** Although trained primarily on English and Chinese, the model shows strong cross-lingual synthesis and can generate singing voices, a rare capability among open-source TTS models that greatly broadens its range of applications.

💡 **Efficient streaming architecture and scalability:** The model pairs low-frame-rate (7.5 Hz) acoustic and semantic tokenizers with a lightweight diffusion-based decoder, improving computational efficiency and consistency over long sequences and laying the groundwork for the larger 7B model.

⚖️ **MIT license and responsible use:** VibeVoice-1.5B is released under the MIT license, fully open source and commercially friendly, to promote research, transparency, and reproducibility. Microsoft explicitly prohibits uses such as voice impersonation, disinformation, or bypassing authentication, and requires users to comply with the law and disclose AI-generated content.

🚀 **Research and development value:** The model gives AI speech researchers and developers a powerful tool, especially for conversational AI and podcast production that demand high fidelity, emotional expressiveness, and multi-role interaction, marking a milestone for open-source speech synthesis.

Microsoft's latest open source release, VibeVoice-1.5B, redefines the boundaries of text-to-speech (TTS) technology, delivering expressive, long-form, multi-speaker audio that is MIT licensed, scalable, and highly flexible for research use. This model isn't just another TTS engine; it's a framework designed to generate up to 90 minutes of uninterrupted, natural-sounding audio, support up to four distinct speakers in a single session, and even handle cross-lingual and singing synthesis. With a streaming architecture and a larger 7B model announced for the near future, VibeVoice-1.5B positions itself as a major advance for AI-powered conversational audio, podcasting, and synthetic voice research.

Key Features

- Long-form synthesis: up to 90 minutes of uninterrupted audio in a single generation.
- Multi-speaker dialogue: up to four distinct speakers per session, with sequential turn-taking.
- Cross-lingual narration and basic singing synthesis, despite training on English and Chinese only.
- Low-frame-rate (7.5 Hz) acoustic and semantic tokenizers paired with a lightweight diffusion decoder.
- MIT licensed, available on Hugging Face: https://huggingface.co/microsoft/VibeVoice-1.5B

Architecture and Technical Deep Dive

VibeVoice’s foundation is a 1.5B-parameter LLM (Qwen2.5-1.5B) that integrates with two novel tokenizers—Acoustic and Semantic—both designed to operate at a low frame rate (7.5Hz) for computational efficiency and consistency across long sequences.
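The 7.5 Hz frame rate is the key to the 90-minute claim: the lower the tokenizer's frame rate, the fewer tokens the LLM must attend over for a given duration. A back-of-the-envelope sketch (the 50 Hz comparison figure is a typical rate for other neural audio codecs, not a VibeVoice spec):

```python
# Why a 7.5 Hz tokenizer makes 90-minute generation tractable:
# token count scales linearly with frame rate, so a low-rate
# tokenizer shrinks the LLM's context requirement dramatically.

def acoustic_tokens(duration_s: float, frame_rate_hz: float) -> int:
    """Number of acoustic frames the LLM must attend over."""
    return int(duration_s * frame_rate_hz)

ninety_min = 90 * 60  # seconds

low_rate = acoustic_tokens(ninety_min, 7.5)   # VibeVoice tokenizer rate
typical = acoustic_tokens(ninety_min, 50.0)   # a common codec rate, for comparison

print(low_rate)   # 40500 frames: within reach of a long-context LLM
print(typical)    # 270000 frames: impractically long for most models
```

At 7.5 Hz, a full 90-minute session fits in roughly 40k acoustic frames, which is why a modest 1.5B-parameter backbone can stay coherent across the whole sequence.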

Model Limitations and Responsible Use

VibeVoice-1.5B is trained on English and Chinese only, produces speech without background sounds, and does not model overlapping speakers; turn-taking is strictly sequential. The MIT license is commercially permissive, but Microsoft explicitly prohibits voice impersonation, disinformation, and authentication bypass, and asks users to comply with applicable law and disclose AI-generated content.

Conclusion

Microsoft’s VibeVoice-1.5B is a breakthrough in open TTS: scalable, expressive, and multi-speaker, with a lightweight diffusion-based architecture that unlocks long-form, conversational audio synthesis for researchers and open source developers. While use is currently research-focused and limited to English/Chinese, the model’s capabilities—and the promise of upcoming versions—signal a paradigm shift in how AI can generate and interact with synthetic speech.

For technical teams, content creators, and AI enthusiasts, VibeVoice-1.5B is a must-explore tool for the next generation of synthetic voice applications—available now on Hugging Face and GitHub, with clear documentation and an open license. As the field pivots toward more expressive, interactive, and ethically transparent TTS, Microsoft’s latest offering is a landmark for open source AI speech synthesis.


FAQs

What makes VibeVoice-1.5B different from other text-to-speech models?

VibeVoice-1.5B can generate up to 90 minutes of expressive, multi-speaker audio (up to four speakers), supports cross-lingual and singing synthesis, and is fully open source under the MIT license, pushing the boundaries of long-form conversational AI audio generation.
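To make the four-speaker constraint concrete, here is a hypothetical sketch of a speaker-labeled dialogue script of the kind a long-form multi-speaker TTS model consumes. The exact prompt format is defined by the model's own inference code; the `validate_script` helper below is an illustration of the speaker-count limit, not part of any official API:

```python
# Hypothetical multi-speaker script check: VibeVoice-1.5B supports
# at most four distinct speakers per session, so a pre-flight
# validation of the script is a reasonable pipeline step.

MAX_SPEAKERS = 4

def validate_script(lines: list[str]) -> list[str]:
    """Collect speaker labels ('Name: text' lines) and enforce the cap."""
    speakers = {line.split(":", 1)[0].strip() for line in lines if ":" in line}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(
            f"{len(speakers)} speakers found; model supports at most {MAX_SPEAKERS}"
        )
    return sorted(speakers)

script = [
    "Speaker 1: Welcome to the show.",
    "Speaker 2: Thanks for having me.",
    "Speaker 1: Let's dive right in.",
]
print(validate_script(script))  # ['Speaker 1', 'Speaker 2']
```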

What hardware is recommended for running the model locally?

Community tests show that generating a multi-speaker dialog with the 1.5B checkpoint consumes roughly 7 GB of GPU VRAM, so an 8 GB consumer card (e.g., RTX 3060) is generally sufficient for inference.
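The ~7 GB figure is consistent with a simple memory budget. The sketch below is an estimate, not a published spec: it assumes bf16 weights (2 bytes per parameter), with the remainder of the observed footprint going to the KV cache and diffusion-decoder activations.

```python
# Rough VRAM budget for the 1.5B checkpoint (estimates, not specs).
PARAMS = 1.5e9            # parameter count from the model name
BYTES_PER_PARAM = 2       # assuming bf16/fp16 weights

weights_gb = PARAMS * BYTES_PER_PARAM / 2**30
print(round(weights_gb, 1))  # ~2.8 GB for weights alone

# The gap up to the ~7 GB community measurement is plausibly the
# KV cache over long sequences plus decoder activations.
observed_gb = 7.0
overhead_gb = observed_gb - weights_gb
print(round(overhead_gb, 1))
```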

Which languages and audio styles does the model support today?

VibeVoice-1.5B is trained only on English and Chinese and can perform cross-lingual narration (e.g., English prompt → Chinese speech) as well as basic singing synthesis. It produces speech only—no background sounds—and does not model overlapping speakers; turn-taking is sequential.



The post Microsoft Released VibeVoice-1.5B: An Open-Source Text-to-Speech Model that can Synthesize up to 90 Minutes of Speech with Four Distinct Speakers appeared first on MarkTechPost.
