MarkTechPost@AI · September 23
VoXtream: Real-Time Zero-Shot TTS That Speaks from the First Word

VoXtream is a breakthrough text-to-speech (TTS) technology aimed at the latency problem of conventional streaming TTS. Unlike most TTS systems, which only start speaking once a chunk of text has arrived, VoXtream is "full-stream": it begins generating audio as soon as the first word arrives and emits it in 80 ms frames. Through dynamic phoneme look-ahead and a fully-autoregressive model architecture, it brings first-packet latency (FPL) down to 102 ms on a modern GPU. While keeping per-frame compute low, VoXtream responds to input text in real time, giving interactive applications a smoother speech-synthesis experience and outperforming existing methods on naturalness and word error rate.

🎤 **Full-stream TTS that speaks from the first word**: VoXtream's core innovation is a "full-stream" TTS architecture that starts generating audio the moment the first word is received, breaking the usual streaming-TTS requirement of waiting for a text chunk. By emitting audio in 80 ms frames and keeping first-packet latency (FPL) at 102 ms, it greatly reduces the silence users perceive and makes interaction feel immediate.

🧠 **Dynamic phoneme look-ahead and a fully-autoregressive model**: VoXtream stabilizes prosody with a dynamic phoneme look-ahead (up to 10 phonemes) in its incremental Phoneme Transformer (PT), but crucially it never waits for a preset amount of context: generation can begin as soon as the first word enters the buffer. Combined with a fully-autoregressive cascade of Phoneme Transformer (PT), Temporal Transformer (TT), and Depth Transformer (DT), this design keeps the whole generation process fluent and low-latency.

🚀 **Performance and practical use**: In practice, VoXtream achieves marked speedups on modern GPUs; with compilation its RTF (Real-Time Factor) reaches 0.17, i.e., more than 5× faster than real time. Compared with popular streaming-TTS baselines, it shows a lower word error rate (WER) and a stronger naturalness preference in the full-stream setting, making it a strong choice for real-time voice agents and applications, even though it may trail similarity-focused systems on speaker similarity.

Real-time agents, live dubbing, and simultaneous translation die by a thousand milliseconds. Most “streaming” TTS (Text to Speech) stacks still wait for a chunk of text before they emit sound, so the human hears a beat of silence before the voice starts. VoXtream—released by KTH’s Speech, Music and Hearing group—attacks this head-on: it begins speaking after the first word, outputs audio in 80 ms frames, and reports 102 ms first-packet latency (FPL) on a modern GPU (with PyTorch compile).

What exactly is “full-stream” TTS and how is it different from “output streaming”?

Output-streaming systems decode speech in chunks but still require the entire input text upfront; the clock starts late. Full-stream systems consume text as it arrives (word-by-word from an LLM) and emit audio in lockstep. VoXtream implements the latter: it ingests a word stream and generates audio frames continuously, eliminating input-side buffering while maintaining low per-frame compute. The architecture explicitly targets first-word onset rather than only steady-state throughput.
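To make the distinction concrete, here is a minimal sketch of the full-stream consumption pattern. The `FullStreamTTS` class and its `feed_word`/`ready_frames` methods are hypothetical placeholders, not VoXtream's actual API:

```python
from typing import Iterable, Iterator

FRAME_MS = 80  # VoXtream emits audio in 80 ms frames

class FullStreamTTS:
    """Hypothetical interface standing in for a full-stream TTS engine."""

    def feed_word(self, word: str) -> None:
        """Push one incoming word into the internal phoneme buffer."""

    def ready_frames(self) -> Iterator[bytes]:
        """Yield every 80 ms frame decodable from the context seen so far."""
        return iter(())

def full_stream_synthesis(words: Iterable[str], tts: FullStreamTTS) -> Iterator[bytes]:
    # Full-stream: consume text as it arrives and emit audio in lockstep.
    for word in words:                 # e.g. word-by-word from an LLM
        tts.feed_word(word)
        yield from tts.ready_frames()  # audio can start after the FIRST word

# An output-streaming system would instead require " ".join(words) upfront,
# so its clock starts only once the full text (or a large chunk) exists.
```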

https://arxiv.org/pdf/2509.15969

How does VoXtream start speaking without waiting for future words?

The core trick is a dynamic phoneme look-ahead inside an incremental Phoneme Transformer (PT). PT may peek up to 10 phonemes to stabilize prosody, but it does not wait for that context; generation can start immediately after the first word enters the buffer. This avoids fixed look-ahead windows that add onset delay.
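A rough sketch of the buffering logic this implies; aside from the 10-phoneme cap, which comes from the paper, the class below is an illustration rather than VoXtream's code:

```python
from collections import deque

MAX_LOOKAHEAD = 10  # PT may peek at up to 10 future phonemes (from the paper)

class PhonemeWindow:
    """Dynamic look-ahead: use whatever future context exists, never block for it."""

    def __init__(self):
        self.phonemes = deque()
        self.cursor = 0  # next phoneme position to synthesize

    def push_word(self, word_phonemes):
        self.phonemes.extend(word_phonemes)

    def context(self):
        # Current phoneme plus up to MAX_LOOKAHEAD future ones. Right after the
        # first word arrives this slice is simply short; a fixed-window design
        # would stall here until the window filled, delaying speech onset.
        window = list(self.phonemes)
        return window[self.cursor : self.cursor + 1 + MAX_LOOKAHEAD]

    def advance(self):
        self.cursor += 1  # move on once the current phoneme has been spoken
```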

What’s the model stack under the hood?

VoXtream is a single, fully-autoregressive (AR) pipeline with three transformers:

- an incremental Phoneme Transformer (PT) that encodes the incoming phoneme stream with the dynamic look-ahead described above;
- a Temporal Transformer (TT) that runs frame-by-frame over PT's output, predicting Mimi semantic tokens along with duration tokens that keep the text and audio streams aligned;
- a Depth Transformer (DT) that fills in the remaining Mimi acoustic codebooks for each frame before the Mimi decoder turns them into an 80 ms audio packet.

Mimi’s streaming codec design and dual-stream tokenization are well documented; VoXtream uses its first codebook as “semantic” context and the rest for high-fidelity reconstruction.
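Schematically, one generation step walks the stack once per frame. The module interfaces below are invented for illustration; the real code differs in detail:

```python
import torch

@torch.inference_mode()
def generate_frame(pt, tt, dt, mimi_decoder, phoneme_ctx, state):
    """One pass through PT -> TT -> DT -> Mimi decode = one 80 ms packet."""
    h = pt(phoneme_ctx)                       # incremental phoneme encoding
    sem_tok, dur_tok, state = tt(h, state)    # AR in time: semantic + duration tokens
    acoustic_toks = dt(sem_tok)               # AR in depth: remaining Mimi codebooks
    codes = torch.cat([sem_tok, acoustic_toks], dim=-1)
    # dur_tok advances the phoneme cursor, keeping text and audio aligned
    return mimi_decoder(codes), dur_tok, state
```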

Is it actually fast in practice—or just “fast on paper”?

The repository includes a benchmark script that measures both FPL and real-time factor (RTF). On an A100, the research team reports 171 ms FPL / 1.00 RTF without compile and 102 ms FPL / 0.17 RTF with compile; on an RTX 3090, 205 ms / 1.19 RTF uncompiled and 123 ms / 0.19 RTF compiled.
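The measurement itself is straightforward to reproduce in spirit; the sketch below assumes a hypothetical `tts.stream(text)` generator yielding waveform tensors, not the repo's actual script:

```python
import time
import torch

def benchmark(tts, text, sample_rate=24_000):
    """Measure first-packet latency (FPL) and real-time factor (RTF)."""
    t0 = time.perf_counter()
    frames, fpl = [], None
    for frame in tts.stream(text):           # hypothetical streaming generator
        if fpl is None:
            fpl = time.perf_counter() - t0   # time until the first audio packet
        frames.append(frame)
    wall = time.perf_counter() - t0
    audio_sec = sum(f.numel() for f in frames) / sample_rate
    return fpl, wall / audio_sec             # RTF < 1.0 means faster than real time

# torch.compile is what moves the reported A100 numbers from
# 171 ms FPL / 1.00 RTF down to 102 ms / 0.17, e.g.:
# tts.model = torch.compile(tts.model)
```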

The research team evaluates both short-form output-streaming and full-stream scenarios. On LibriSpeech-long full-stream (where text arrives word-by-word), VoXtream shows a lower WER (3.24%) than CosyVoice2 (6.11%) and a significant listener preference for its naturalness (p ≤ 5e-10), while CosyVoice2 scores higher on speaker similarity, consistent with its flow-matching decoder. At runtime, VoXtream has the lowest FPL among the compared public streaming systems, and with compile it runs more than 5× faster than real time (RTF ≈ 0.17).

https://arxiv.org/pdf/2509.15969

Why does this AR design beat diffusion/flow stacks on onset?

Diffusion/flow vocoders typically generate audio in chunks, so even if the text-audio interleaving is clever, the vocoder imposes a floor on first-packet latency. VoXtream keeps every stage AR and frame-synchronous—PT→TT→DT→Mimi decoder—so the first 80 ms packet emerges after one pass through the stack rather than a multi-step sampler. The introduction surveys prior interleaved and chunked approaches and explains how NAR flow-matching decoders used in IST-LM and CosyVoice2 impede low FPL despite strong offline quality.
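A back-of-envelope illustration of that floor; the 400 ms chunk size is an assumed example, not a figure from the paper:

```python
frame_ms, chunk_ms = 80, 400  # AR packet size vs. an assumed NAR vocoder chunk

# A chunked NAR decoder cannot emit audio before its first full chunk is
# synthesized (after a multi-step sampler has run over it), so its
# first-packet latency is floored by the chunk length. A frame-synchronous
# AR stack needs only one pass over a single frame before the packet leaves.
print(f"Chunked onset floor is ~{chunk_ms // frame_ms}x one AR frame")
```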

Did they get here with huge data—or something smaller and cleaner?

VoXtream trains on a ~9k-hour mid-scale corpus: roughly 4.5k hours of Emilia and 4.5k hours of HiFiTTS-2 (22 kHz subset). The team ran speaker diarization to remove multi-speaker clips, filtered transcripts with ASR, and applied NISQA to drop low-quality audio. Everything is resampled to 24 kHz, and the dataset card spells out the preprocessing pipeline and alignment artifacts (Mimi tokens, MFA alignments, duration labels, and speaker templates).
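In code, the described filters amount to something like the sketch below; `diarize`, `asr_wer`, and `nisqa_mos` are dummy stand-ins for the real tools, and both thresholds are assumptions, not values from the paper:

```python
import torchaudio

# Dummy stand-ins so the sketch runs; swap in a real diarizer, ASR model, and NISQA.
def diarize(wav, sr): return ["spk0"]          # -> list of detected speakers
def asr_wer(wav, sr, transcript): return 0.0   # -> WER of ASR output vs. transcript
def nisqa_mos(wav, sr): return 4.0             # -> predicted speech quality (MOS)

def keep_clip(wav, sr, transcript) -> bool:
    """Apply the three described filters: diarization, ASR check, NISQA."""
    return (
        len(diarize(wav, sr)) == 1                 # drop multi-speaker clips
        and asr_wer(wav, sr, transcript) <= 0.1    # threshold assumed
        and nisqa_mos(wav, sr) >= 3.0              # threshold assumed
    )

def to_24k(wav, sr):
    return torchaudio.functional.resample(wav, sr, 24_000)  # unify at 24 kHz
```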

Are the headline quality metrics holding up outside cherry-picked clips?

Table 1 (zero-shot TTS) shows VoXtream is competitive on WER, UTMOS (MOS predictor), and speaker similarity across SEED-TTS test-en and LibriSpeech test-clean; the research team also runs an ablation: adding the CSM Depth Transformer and speaker encoder notably improves similarity without a significant WER penalty relative to a stripped baseline. The subjective study uses a MUSHRA-like protocol and a second-stage preference test tailored to full-stream generation.


Where does this land in the TTS landscape?

The research paper positions VoXtream among recent interleaved AR + NAR vocoder approaches and LM-codec stacks. The core contribution isn't a new codec or a giant model; it's a latency-focused AR arrangement plus a duration-token alignment that preserves input-side streaming. If you build live agents, the important trade-off is explicit: a small drop in speaker similarity vs. order-of-magnitude lower FPL than chunked NAR vocoders in full-stream conditions.


Check out the Paper, Model on Hugging Face, GitHub Page, and Project Page.


The post Meet VoXtream: An Open-Sourced Full-Stream Zero-Shot TTS Model for Real-Time Use that Begins Speaking from the First Word appeared first on MarkTechPost.
