MarkTechPost@AI · August 26
Microsoft Releases VibeVoice-1.5B, Advancing Long-Form, Multilingual Speech Synthesis

Microsoft's newly open-sourced VibeVoice-1.5B text-to-speech (TTS) model pushes the boundaries of long-form, multilingual speech synthesis. The model can generate up to 90 minutes of uninterrupted, natural-sounding audio, supports up to four distinct speakers in a single conversation, and even handles cross-lingual and singing synthesis. Its streaming architecture and a forthcoming 7B model signal major advances for AI speech in conversational audio, podcasting, and synthetic-voice research. VibeVoice-1.5B is released under the MIT license, emphasizing research, transparency, and reproducibility, and provides a powerful open-source framework for AI speech applications.

🔊 **Long-form, multi-speaker synthesis breakthrough:** VibeVoice-1.5B can synthesize up to 90 minutes of speech with up to four distinct speakers in one conversation, well beyond the one-to-two-speaker limit of conventional TTS models, enabling more complex and natural dialogue scenarios.

🎶 **Cross-lingual and singing synthesis:** Although trained primarily on English and Chinese, the model shows strong cross-lingual synthesis and can generate singing voices, a rare capability among open-source TTS models that greatly broadens its range of applications.

💡 **Efficient streaming architecture and scalability:** The model pairs low-frame-rate (7.5 Hz) acoustic and semantic tokenizers with a lightweight diffusion-based decoder, improving computational efficiency and consistency over long sequences and laying the groundwork for the larger 7B model.

⚖️ **MIT license and responsible use:** VibeVoice-1.5B is released under the MIT license, fully open source and commercially friendly, to promote research, transparency, and reproducibility. Microsoft explicitly prohibits uses such as voice impersonation, disinformation, or bypassing authentication, and requires users to comply with the law and disclose AI-generated content.

🚀 **Research and development value:** The model gives AI speech researchers and developers a powerful tool, especially for conversational AI and podcast production that demand high fidelity, emotional expressiveness, and multi-role interaction, marking a milestone for open-source speech synthesis.

Microsoft's latest open source release, VibeVoice-1.5B, redefines the boundaries of text-to-speech (TTS) technology, delivering expressive, long-form, multi-speaker audio that is MIT licensed, scalable, and highly flexible for research use. This model isn't just another TTS engine; it's a framework designed to generate up to 90 minutes of uninterrupted, natural-sounding audio, support up to four distinct speakers in a single session, and even handle cross-lingual and singing synthesis. With a streaming architecture and a larger 7B model announced for the near future, VibeVoice-1.5B positions itself as a major advance for AI-powered conversational audio, podcasting, and synthetic voice research.

Key Features

- Long-form synthesis: up to 90 minutes of uninterrupted audio in a single generation.
- Multi-speaker dialogue: up to four distinct speakers per session, with sequential turn-taking.
- Cross-lingual narration and basic singing synthesis, despite training on English and Chinese only.
- Low-frame-rate (7.5 Hz) acoustic and semantic tokenizers paired with a lightweight diffusion decoder.
- MIT licensed, available on Hugging Face: https://huggingface.co/microsoft/VibeVoice-1.5B

Architecture and Technical Deep Dive

VibeVoice’s foundation is a 1.5B-parameter LLM (Qwen2.5-1.5B) that integrates with two novel tokenizers—Acoustic and Semantic—both designed to operate at a low frame rate (7.5Hz) for computational efficiency and consistency across long sequences.
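The 7.5 Hz frame rate is the key to the 90-minute claim: the lower the tokenizer's frame rate, the fewer tokens the LLM must attend over for a given duration. A back-of-the-envelope sketch (the 50 Hz comparison figure is a typical rate for other neural audio codecs, not a VibeVoice spec):

```python
# Why a 7.5 Hz tokenizer makes 90-minute generation tractable:
# token count scales linearly with frame rate, so a low-rate
# tokenizer shrinks the LLM's context requirement dramatically.

def acoustic_tokens(duration_s: float, frame_rate_hz: float) -> int:
    """Number of acoustic frames the LLM must attend over."""
    return int(duration_s * frame_rate_hz)

ninety_min = 90 * 60  # seconds

low_rate = acoustic_tokens(ninety_min, 7.5)   # VibeVoice tokenizer rate
typical = acoustic_tokens(ninety_min, 50.0)   # a common codec rate, for comparison

print(low_rate)   # 40500 frames: within reach of a long-context LLM
print(typical)    # 270000 frames: impractically long for most models
```

At 7.5 Hz, a full 90-minute session fits in roughly 40k acoustic frames, which is why a modest 1.5B-parameter backbone can stay coherent across the whole sequence.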

Model Limitations and Responsible Use

VibeVoice-1.5B is trained on English and Chinese only, produces speech without background sounds, and does not model overlapping speakers; turn-taking is strictly sequential. The MIT license is commercially permissive, but Microsoft explicitly prohibits voice impersonation, disinformation, and authentication bypass, and asks users to comply with applicable law and disclose AI-generated content.

Conclusion

Microsoft’s VibeVoice-1.5B is a breakthrough in open TTS: scalable, expressive, and multi-speaker, with a lightweight diffusion-based architecture that unlocks long-form, conversational audio synthesis for researchers and open source developers. While use is currently research-focused and limited to English/Chinese, the model’s capabilities—and the promise of upcoming versions—signal a paradigm shift in how AI can generate and interact with synthetic speech.

For technical teams, content creators, and AI enthusiasts, VibeVoice-1.5B is a must-explore tool for the next generation of synthetic voice applications—available now on Hugging Face and GitHub, with clear documentation and an open license. As the field pivots toward more expressive, interactive, and ethically transparent TTS, Microsoft’s latest offering is a landmark for open source AI speech synthesis.


FAQs

What makes VibeVoice-1.5B different from other text-to-speech models?

VibeVoice-1.5B can generate up to 90 minutes of expressive, multi-speaker audio (up to four speakers), supports cross-lingual and singing synthesis, and is fully open source under the MIT license, pushing the boundaries of long-form conversational AI audio generation.
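To make the four-speaker constraint concrete, here is a hypothetical sketch of a speaker-labeled dialogue script of the kind a long-form multi-speaker TTS model consumes. The exact prompt format is defined by the model's own inference code; the `validate_script` helper below is an illustration of the speaker-count limit, not part of any official API:

```python
# Hypothetical multi-speaker script check: VibeVoice-1.5B supports
# at most four distinct speakers per session, so a pre-flight
# validation of the script is a reasonable pipeline step.

MAX_SPEAKERS = 4

def validate_script(lines: list[str]) -> list[str]:
    """Collect speaker labels ('Name: text' lines) and enforce the cap."""
    speakers = {line.split(":", 1)[0].strip() for line in lines if ":" in line}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(
            f"{len(speakers)} speakers found; model supports at most {MAX_SPEAKERS}"
        )
    return sorted(speakers)

script = [
    "Speaker 1: Welcome to the show.",
    "Speaker 2: Thanks for having me.",
    "Speaker 1: Let's dive right in.",
]
print(validate_script(script))  # ['Speaker 1', 'Speaker 2']
```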

What hardware is recommended for running the model locally?

Community tests show that generating a multi-speaker dialog with the 1.5B checkpoint consumes roughly 7 GB of GPU VRAM, so an 8 GB consumer card (e.g., RTX 3060) is generally sufficient for inference.
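The ~7 GB figure is consistent with a simple memory budget. The sketch below is an estimate, not a published spec: it assumes bf16 weights (2 bytes per parameter), with the remainder of the observed footprint going to the KV cache and diffusion-decoder activations.

```python
# Rough VRAM budget for the 1.5B checkpoint (estimates, not specs).
PARAMS = 1.5e9            # parameter count from the model name
BYTES_PER_PARAM = 2       # assuming bf16/fp16 weights

weights_gb = PARAMS * BYTES_PER_PARAM / 2**30
print(round(weights_gb, 1))  # ~2.8 GB for weights alone

# The gap up to the ~7 GB community measurement is plausibly the
# KV cache over long sequences plus decoder activations.
observed_gb = 7.0
overhead_gb = observed_gb - weights_gb
print(round(overhead_gb, 1))
```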

Which languages and audio styles does the model support today?

VibeVoice-1.5B is trained only on English and Chinese and can perform cross-lingual narration (e.g., English prompt → Chinese speech) as well as basic singing synthesis. It produces speech only—no background sounds—and does not model overlapping speakers; turn-taking is sequential.



The post Microsoft Released VibeVoice-1.5B: An Open-Source Text-to-Speech Model that can Synthesize up to 90 Minutes of Speech with Four Distinct Speakers appeared first on MarkTechPost.
