MarkTechPost@AI November 12, 09:37
Maya1: A New Model Powering Expressive AI Speech

Maya1 is a 3 billion parameter text-to-speech model released by Maya Research that generates expressive speech from text plus a natural language voice description. It runs in real time on a single GPU and is built to capture human emotion and support precise voice design. Users describe voice attributes such as accent, pacing and emotion in plain language and pair that description with the text to be spoken. Maya1 supports more than 20 inline emotion tags, such as <laugh>, <sigh> and <whisper>, for localized control of expressiveness. Architecturally it is a Llama style transformer that generates discrete speech codes through the SNAC neural audio codec, which are then decoded into 24 kHz audio, improving generation efficiency and scalability. The model is released under the open source Apache 2.0 license and ships in several forms, including a Hugging Face Space demo, GGUF quantized versions and ComfyUI node integration, making high quality, controllable AI speech synthesis easier to access and use.

✨ **Advanced text-to-speech capability**: Maya1 is a 3 billion parameter AI model that turns text plus a natural language voice description into expressive speech. It captures human emotion precisely and lets users customize voice attributes such as accent, pacing, timbre and emotion through a text description, giving an unprecedented level of control over speech synthesis.

🚀 **Real-time performance on a single GPU**: The model is designed to generate speech in real time on a single GPU, which is essential for applications that need immediate feedback. By pairing the SNAC neural audio codec with optimized inference scripts (such as the vLLM integration), Maya1 achieves low latency and efficient deployment, making it suitable for games, livestreaming, interactive agents and other applications.

🎭 **Rich expressive control and an open ecosystem**: Maya1 supports more than 20 inline emotion tags, such as <laugh>, <sigh> and <whisper>, letting developers fine-tune the emotional expression of generated speech. The model is open sourced under the Apache 2.0 license and is available as a Hugging Face Space demo, GGUF quantized versions and ComfyUI nodes, which greatly lowers the barrier to entry and encourages broad community and commercial adoption of AI voice technology.

Maya Research has released Maya1, a 3B parameter text to speech model that turns text plus a short description into controllable, expressive speech while running in real time on a single GPU.

What Does Maya1 Actually Do?

Maya1 is a state of the art speech model for expressive voice generation. It is built to capture real human emotion and precise voice design from text inputs.

The core interface has 2 inputs:

    A natural language voice description, for example "Female voice in her 20s with a British accent, energetic, clear diction" or "Demon character, male voice, low pitch, gravelly timbre, slow pacing"
    The text that should be spoken

The model combines both signals and generates audio that matches the content and the described style. You can also insert inline emotion tags inside the text, such as <laugh>, <sigh>, <whisper>, <angry>, <giggle>, <gasp> and <cry>, with more than 20 emotions supported in total.

Maya1 outputs 24 kHz mono audio and supports real time streaming, which makes it suitable for assistants, interactive agents, games, podcasts and live content.

The Maya Research team claims that the model outperforms top proprietary systems while remaining fully open source under the Apache 2.0 license.

Architecture and SNAC Codec

Maya1 is a 3B parameter decoder only transformer with a Llama style backbone. Instead of predicting raw waveforms, it predicts tokens from a neural audio codec named SNAC.

The generation flow is:

text → tokenize → generate SNAC codes (7 tokens per frame) → decode → 24 kHz audio

SNAC uses a multi scale hierarchical structure at about 12, 23 and 47 Hz. This keeps the autoregressive sequence compact while preserving detail. The codec is designed for real time streaming at about 0.98 kbps.

The important point is that the transformer operates on discrete codec tokens instead of raw samples. A separate SNAC decoder, for example hubertsiuzdak/snac_24khz, reconstructs the waveform. This separation makes generation more efficient and easier to scale than direct waveform prediction.
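
To make the token flow concrete, here is a rough sketch of how a flat stream of generated codec tokens could be regrouped into SNAC's three codebook levels and decoded. The 1-2-4 interleaving order inside each 7 token frame is an assumption for illustration; the official Maya1 scripts define the actual packing.

```python
# Rough sketch: regroup a flat stream of SNAC codec tokens (7 per frame) into
# the three hierarchical codebook levels and decode them to 24 kHz audio.
# The interleaving order inside each frame is an assumption for illustration.
import torch
from snac import SNAC

snac_decoder = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

def unpack_frames(flat_codes):
    # Assumed layout per 7 token frame: 1 coarse (~12 Hz) code, 2 middle
    # (~23 Hz) codes and 4 fine (~47 Hz) codes.
    l0, l1, l2 = [], [], []
    usable = len(flat_codes) - len(flat_codes) % 7
    for i in range(0, usable, 7):
        f = flat_codes[i:i + 7]
        l0.append(f[0])
        l1.extend([f[1], f[4]])
        l2.extend([f[2], f[3], f[5], f[6]])
    return [torch.tensor(level).unsqueeze(0) for level in (l0, l1, l2)]

# Dummy tokens stand in for the transformer's output here.
flat = torch.randint(0, 4096, (7 * 50,)).tolist()
codes = unpack_frames(flat)
with torch.no_grad():
    audio = snac_decoder.decode(codes)  # tensor of shape [1, 1, num_samples]
```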

Training Data And Voice Conditioning

Maya1 is pretrained on an internet scale English speech corpus to learn broad acoustic coverage and natural coarticulation. It is then fine tuned on a curated proprietary dataset of studio recordings that include human verified voice descriptions, more than 20 emotion tags per sample, multiple English accents, and character or role variations.

The documented data pipeline includes:

    24 kHz mono resampling with loudness normalized to about -23 LUFS
    Voice activity detection with silence trimming, keeping clips between 1 and 14 seconds
    Forced alignment using the Montreal Forced Aligner for phrase boundaries
    MinHash LSH text deduplication
    Chromaprint based audio deduplication
    SNAC encoding with 7 token frame packing
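
As a rough illustration of the first two steps, the sketch below resamples a clip to 24 kHz mono and normalizes it toward -23 LUFS using common open source tools (librosa, pyloudnorm, soundfile). The article does not name the exact tools used in the actual pipeline, and the file names are placeholders.

```python
# Rough sketch of the first two pipeline steps with common open source tools;
# the article does not name the exact tooling, and the paths are placeholders.
import librosa
import pyloudnorm as pyln
import soundfile as sf

TARGET_SR = 24_000      # 24 kHz mono
TARGET_LUFS = -23.0     # approximate loudness target

def preprocess_clip(in_path, out_path):
    # Load and resample to 24 kHz mono.
    audio, _ = librosa.load(in_path, sr=TARGET_SR, mono=True)
    # Measure integrated loudness and normalize toward -23 LUFS.
    meter = pyln.Meter(TARGET_SR)
    loudness = meter.integrated_loudness(audio)
    audio = pyln.normalize.loudness(audio, loudness, TARGET_LUFS)
    sf.write(out_path, audio, TARGET_SR)

preprocess_clip("raw_clip.wav", "clean_clip.wav")
```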

The Maya Research team evaluated several ways to condition the model on a voice description. Simple colon formats and key value tag formats either caused the model to speak the description or did not generalize well. The best performing format uses an XML style attribute wrapper that encodes the description and text in a natural way while remaining robust.

In practice, this means developers can describe voices in free form text, close to how they would brief a voice actor, instead of learning a custom parameter schema.
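
As an illustration, a prompt for such a format might be assembled like the hypothetical helper below. The exact wrapper string Maya1 expects is defined on the model card, so treat this as an assumption rather than the canonical format.

```python
# Hypothetical prompt builder reflecting the attribute style wrapper described
# above; the exact wrapper string Maya1 expects is defined on the model card.
def build_prompt(voice_description: str, text: str) -> str:
    return f'<description="{voice_description}"> {text}'

prompt = build_prompt(
    "Female voice in her 20s with a British accent, energetic, clear diction",
    "Welcome back everyone <laugh> let's get started.",
)
```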

https://huggingface.co/maya-research/maya1

Inference And Deployment On A Single GPU

The reference Python script on Hugging Face loads the model with AutoModelForCausalLM.from_pretrained("maya-research/maya1", torch_dtype=torch.bfloat16, device_map="auto") and uses the SNAC decoder from SNAC.from_pretrained("hubertsiuzdak/snac_24khz").
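
The following sketch mirrors those loading calls and shows one plausible end to end flow. The prompt wrapper, sampling settings and the mapping from generated token ids back to codec ids are assumptions; the official script on Hugging Face contains the exact logic.

```python
# Sketch mirroring the reference loading calls named above. The prompt wrapper,
# sampling settings and the mapping from generated ids back to SNAC codec ids
# are assumptions; the official script on Hugging Face has the exact logic.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC

model = AutoModelForCausalLM.from_pretrained(
    "maya-research/maya1", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("maya-research/maya1")
snac_decoder = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

prompt = '<description="Male voice, low pitch, slow pacing"> Good evening <whisper> stay close.'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs, max_new_tokens=2048, do_sample=True, temperature=0.8
    )

# Tokens generated after the prompt encode SNAC frames (7 codes per frame);
# after mapping them to codec ids they can be regrouped and passed to
# snac_decoder.decode() to obtain the 24 kHz waveform, as sketched earlier.
codec_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
```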

The Maya Research team recommends a single GPU with 16 GB or more of VRAM, for example A100, H100 or a consumer RTX 4090 class card.

For production, they provide a vllm_streaming_inference.py script that integrates with vLLM. It supports Automatic Prefix Caching for repeated voice descriptions, a WebAudio ring buffer, multi GPU scaling and sub 100 millisecond latency targets for real time use.
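
A minimal offline sketch of the same idea with vLLM's Python API is shown below. It enables Automatic Prefix Caching so that repeated voice descriptions reuse their KV cache, but it is not the official streaming script and omits the WebAudio buffering and streaming loop.

```python
# Minimal offline sketch of serving Maya1 with vLLM and Automatic Prefix
# Caching, so repeated voice descriptions reuse their KV cache. This is not
# the official vllm_streaming_inference.py script and omits streaming.
from vllm import LLM, SamplingParams

llm = LLM(
    model="maya-research/maya1",
    dtype="bfloat16",
    enable_prefix_caching=True,  # reuse cached prefixes across requests
)

voice = '<description="Female voice in her 20s with a British accent, energetic">'
prompts = [
    f"{voice} Line one of the dialogue.",
    f"{voice} Line two of the dialogue.",
]

params = SamplingParams(temperature=0.8, max_tokens=2048)
for request in llm.generate(prompts, params):
    codec_token_ids = request.outputs[0].token_ids  # SNAC codes to decode downstream
```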

Beyond the core repository, they have released a Hugging Face Space demo, GGUF quantized variants and a ComfyUI node integration.

These projects reuse the official model weights and interface, so they stay consistent with the main implementation.

Key Takeaways

    Maya1 is a 3B parameter, decoder only, Llama style text to speech model that predicts SNAC neural codec tokens instead of raw waveforms, and outputs 24 kHz mono audio with streaming support.
    The model takes 2 inputs, a natural language voice description and the target text, and supports more than 20 inline emotion tags such as <laugh>, <cry>, <whisper> and <gasp> for local control of expressiveness.
    Maya1 is trained with a pipeline that combines large scale English pretraining and studio quality fine tuning with loudness normalization, voice activity detection, forced alignment, text deduplication, audio deduplication and SNAC encoding.
    The reference implementation runs on a single 16 GB plus GPU using torch_dtype=torch.bfloat16, integrates with a SNAC decoder, and has a vLLM based streaming server with Automatic Prefix Caching for low latency deployment.
    Maya1 is released under the Apache 2.0 license, with official weights, a Hugging Face Space demo, GGUF quantized variants and ComfyUI integration, which makes expressive, emotion rich, controllable text to speech accessible for commercial and local use.

Editorial Comments

Maya1 pushes open source text to speech into territory that was previously dominated by proprietary APIs. A 3B parameter Llama style decoder that predicts SNAC codec tokens, runs on a single 16 GB GPU with vLLM streaming and Automatic Prefix Caching, and exposes more than 20 inline emotions with natural language voice design, is a practical building block for real time agents, games and tools. Overall, Maya1 shows that expressive, controllable TTS can be both open and production ready.


Check out the Model Weights and Demo.

