MarkTechPost@AI · October 3, 14:35
NeuTTS Air: A 748M-Parameter Local Speech Model with Real-Time Voice Cloning

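Neuphonic has released NeuTTS Air, an open-source text-to-speech (TTS) model that runs inference in real time on CPUs. The model is based on the Qwen2 architecture, has 748M parameters, and ships in GGUF quantizations, so it runs without cloud dependencies. NeuTTS Air combines a lightweight Qwen backbone with the NeuCodec audio codec and can clone a voice from about 3 seconds of reference audio to generate super-realistic speech. Its design prioritizes low latency, a small footprint, and local deployment on devices such as laptops, phones, and Raspberry Pi boards, which makes it a good fit for privacy-sensitive applications and voice agents.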

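🚀 **Innovation and core technology of NeuTTS Air**: NeuTTS Air is an open-source text-to-speech (TTS) model from Neuphonic whose core capability is real-time speech generation directly on CPUs, which greatly reduces its dependence on compute resources. The model uses a 748M-parameter Qwen2 architecture and integrates Neuphonic's own NeuCodec audio codec. This combination keeps the model relatively small while delivering highly realistic speech output, and it supports instant voice cloning from a very short reference clip (about 3 seconds).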

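💻 **Local deployment and cross-platform compatibility**: A key feature of the model is its emphasis on local deployment. With GGUF quantizations (Q4/Q8), NeuTTS Air can run through tools such as llama.cpp on a range of devices, including laptops, smartphones, and embedded boards such as the Raspberry Pi. This improves availability and flexibility, especially for scenarios that must run offline or have strict data-privacy requirements, such as voice assistants and customer-service bots, with no reliance on cloud services.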

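🔒 **Privacy and responsible AI use**: NeuTTS Air was designed with user privacy and responsible AI use in mind. Because the model runs locally, audio and text do not leave the user's device unless the user explicitly authorizes it, which strengthens data security. In addition, Neuphonic embeds a Perth (Perceptual Threshold) watermark in generated audio, which helps trace where audio came from, deters misuse, and supports responsible use of the technology.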

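📈 **Balancing performance and efficiency**: Despite its modest parameter count, NeuTTS Air achieves "super-realistic" speech synthesis while keeping latency low and the model footprint compact. It balances latency, model size, and output quality by combining a 0.5B-class (Qwen2) backbone with NeuCodec (0.8 kbps / 24 kHz). This efficient design allows real-time generation on mainstream hardware and gives developers a way to build advanced voice applications in resource-constrained environments.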

Neuphonic has released NeuTTS Air, an open-source text-to-speech (TTS) speech language model designed to run locally in real time on CPUs. The Hugging Face model card lists 748M parameters (Qwen2 architecture) and ships in GGUF quantizations (Q4/Q8), enabling inference through llama.cpp/llama-cpp-python without cloud dependencies. It is licensed under Apache-2.0 and includes a runnable demo and examples.
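
To put the 748M parameter count and the Q4/Q8 quantizations in perspective, the following is a back-of-the-envelope sketch of weight storage. It assumes a nominal 4 and 8 bits per weight; real GGUF files also store per-block scales and metadata, so actual file sizes will be somewhat larger. The function name `weight_megabytes` is illustrative and not part of any Neuphonic or llama.cpp API.

```python
# Rough weight-storage estimate for a quantized model.
# Assumption: "Q4" and "Q8" are treated as a nominal 4 and 8 bits per weight.
# Real GGUF files add per-block scales and metadata, so they are larger.

PARAMS = 748_000_000  # parameter count listed on the model card


def weight_megabytes(params: int, bits_per_weight: float) -> float:
    """Return approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return params * bits_per_weight / 8 / 1_000_000


if __name__ == "__main__":
    for label, bits in [("FP16", 16), ("Q8 (nominal)", 8), ("Q4 (nominal)", 4)]:
        print(f"{label:>13}: ~{weight_megabytes(PARAMS, bits):.0f} MB")
```

Under these assumptions the weights need roughly 0.75 GB at 8 bits and under 0.4 GB at 4 bits, which is consistent with the stated target of laptops and single-board computers.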

So, what is new?

NeuTTS Air couples a 0.5B-class Qwen backbone with Neuphonic’s NeuCodec audio codec. Neuphonic positions the system as a “super-realistic, on-device” TTS LM that clones a voice from ~3 seconds of reference audio and synthesizes speech in that style, targeting voice agents and privacy-sensitive applications. The model card and repository explicitly emphasize real-time CPU generation and small-footprint deployment.

Key Features

Model architecture and runtime path

On-device performance focus

NeuTTS Air advertises “real-time generation on mid-range devices” and offers CPU-first defaults; the GGUF quantizations are intended for laptops and single-board computers. The model card publishes no throughput or real-time factor (RTF) figures, but the distribution targets local inference without a GPU and demonstrates a working flow through the provided examples and Hugging Face Space.
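
Since no RTF figures are published, readers may want to measure one on their own hardware. The sketch below shows the usual definition (synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster than real time). The `synthesize` callable is a hypothetical stand-in that returns the number of samples generated; it is not the NeuTTS Air API, and the 24 kHz default is taken from the codec rate quoted in this article.

```python
import time
from typing import Callable


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize: Callable[[str], int], text: str,
                sample_rate: int = 24_000) -> float:
    """Time one call to `synthesize`, a hypothetical function that
    returns the number of audio samples it generated for `text`."""
    start = time.perf_counter()
    num_samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, num_samples / sample_rate)


if __name__ == "__main__":
    # Stand-in for a real TTS call: pretends to produce 2 s of audio.
    def fake_synthesize(text: str) -> int:
        time.sleep(0.1)
        return 2 * 24_000

    print(f"RTF: {measure_rtf(fake_synthesize, 'Hello from a local model.'):.2f}")
```

For a meaningful number, average over several utterances of different lengths and exclude the first call, which often includes model loading.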

Voice cloning workflow

NeuTTS Air requires (1) a reference WAV and (2) the transcript text for that reference. It encodes the reference to style tokens and then synthesizes arbitrary text in the reference speaker’s timbre. The Neuphonic team recommends 3–15 s clean, mono audio and provides pre-encoded samples.
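
Before encoding a reference, it can help to confirm that the clip matches these recommendations. The sketch below uses only Python's standard `wave` module to check channel count and duration against the 3–15 s mono guidance; it cannot judge whether the audio is "clean", and it only reads PCM WAV files. The function name `check_reference` is illustrative and not part of the NeuTTS Air repository.

```python
import wave

MIN_SECONDS, MAX_SECONDS = 3.0, 15.0  # range recommended in the text above


def check_reference(path: str) -> list[str]:
    """Return a list of problems found with a reference WAV (empty if none)."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        duration = wav.getnframes() / wav.getframerate()

    problems = []
    if channels != 1:
        problems.append(f"expected mono audio, found {channels} channels")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(
            f"duration {duration:.1f} s is outside "
            f"{MIN_SECONDS:.0f}-{MAX_SECONDS:.0f} s"
        )
    return problems
```

A call such as `check_reference("reference.wav")` (the file name is a placeholder) returns an empty list when the clip is mono and within the recommended length.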

Privacy, responsibility, and watermarking

Neuphonic frames the model for on-device privacy (no audio or text leaves the machine without the user’s approval) and notes that all generated audio carries a Perth (Perceptual Threshold) watermark to support responsible use and provenance.

How does it compare?

Open, local TTS systems exist (e.g., GGUF-based pipelines), but NeuTTS Air is notable for packaging a small LM + neural codec with instant cloning, CPU-first quantizations, and watermarking under a permissive license. The “world’s first super-realistic, on-device speech LM” phrasing is the vendor’s claim; the verifiable facts are the size, formats, cloning procedure, license, and provided runtimes.

Our Comments

The focus is on system trade-offs: a 0.5B-class Qwen backbone (748M parameters on the model card) with GGUF quantization, paired with NeuCodec at 0.8 kbps/24 kHz, is a pragmatic recipe for real-time, CPU-only TTS that preserves timbre from ~3–15 s style references while keeping latency and memory predictable. The Apache-2.0 licensing and built-in watermarking are deployment-friendly, but publishing RTF/latency on commodity CPUs and cloning-quality vs. reference-length curves would enable rigorous benchmarking against existing local pipelines. Operationally, an offline path with minimal dependencies (eSpeak, llama.cpp/ONNX) lowers privacy/compliance risk for edge agents without sacrificing intelligibility.
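
As a rough illustration of what 0.8 kbps at 24 kHz means in practice, the arithmetic below compares the codec stream with uncompressed audio. The 16-bit mono PCM baseline is an assumption chosen for comparison; the article does not state the format of the decoded output.

```python
CODEC_BITS_PER_SECOND = 800  # 0.8 kbps, as quoted for NeuCodec
SAMPLE_RATE_HZ = 24_000      # 24 kHz, as quoted
PCM_BITS_PER_SAMPLE = 16     # assumption: 16-bit mono PCM as the baseline


def codec_bytes(seconds: float) -> float:
    """Bytes needed to represent `seconds` of audio at the codec bitrate."""
    return CODEC_BITS_PER_SECOND * seconds / 8


def pcm_bytes(seconds: float) -> float:
    """Bytes needed for the same duration as uncompressed mono PCM."""
    return SAMPLE_RATE_HZ * PCM_BITS_PER_SAMPLE * seconds / 8


if __name__ == "__main__":
    for seconds in (3, 15):
        print(f"{seconds:>2} s: codec ~{codec_bytes(seconds):.0f} bytes, "
              f"16-bit PCM ~{pcm_bytes(seconds):.0f} bytes")
    print(f"compression ratio: {pcm_bytes(1) / codec_bytes(1):.0f}x")
```

Under these assumptions a 3 s reference occupies about 300 bytes of codec data versus 144,000 bytes of PCM, a 480x reduction, which gives a sense of how short the sequences handled by the language model can be.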



The post Neuphonic Open-Sources NeuTTS Air: A 748M-Parameter On-Device Speech Language Model with Instant Voice Cloning appeared first on MarkTechPost.
