MarkTechPost@AI, August 30
Microsoft Releases Two In-House AI Models, Accelerating Voice and Language Capabilities

Microsoft's AI lab has officially introduced two in-house models: MAI-Voice-1 and MAI-1-preview. MAI-Voice-1 is an efficient speech synthesis model that generates natural, fluent speech quickly on a single GPU, suited to scenarios such as voice assistants and podcasts. MAI-1-preview is Microsoft's first end-to-end, fully in-house foundation language model, focused on instruction following and everyday conversational tasks. The release of these two models marks a new phase in Microsoft's AI research and development, demonstrates its investment in in-house AI infrastructure and talent, and aims to bring AI technology more broadly into real products and user experiences.

🎙️ **MAI-Voice-1: Efficient, Natural Speech Synthesis** MAI-Voice-1 is an advanced speech generation model that can produce a minute of high-quality, natural-sounding audio in under a second on a single GPU. It uses a transformer-based architecture trained on a diverse multilingual speech dataset, supports single-speaker and multi-speaker scenarios, and outputs expressive, context-appropriate speech. The model is already integrated into Microsoft products such as Copilot and is available for user testing in Copilot Labs, where it can be used to create audio stories or guided narration.

🧠 **MAI-1-Preview: Microsoft's First In-House Foundation Language Model** MAI-1-preview is Microsoft's first end-to-end foundation language model developed in house, trained entirely on Microsoft's own infrastructure using a mixture-of-experts architecture and roughly 15,000 NVIDIA H100 GPUs. The model focuses on instruction following and everyday conversational tasks and is optimized for user experience, making it particularly well suited to consumer-facing applications. Microsoft has begun gradually rolling out access to the model in select text-based scenarios within Copilot.

🚀 **Strong Infrastructure and Talent Support** Development of both models was supported by Microsoft's next-generation GB200 GPU cluster, a custom infrastructure optimized for training large generative models. Microsoft has also invested heavily in talent, assembling a team with deep expertise in generative AI, speech synthesis, and large-scale systems engineering. This combined hardware-and-software approach ensures the models are not only theoretically advanced but also reliable and useful in real applications.

Microsoft's AI lab has officially launched MAI-Voice-1 and MAI-1-preview, marking a new phase in the company's artificial intelligence research and development efforts. The announcement underscores that Microsoft AI is now pursuing core model development without third-party involvement. The two models serve distinct but complementary roles: speech synthesis and general-purpose language understanding.

MAI-Voice-1: Technical Details and Capabilities

MAI-Voice-1 is a speech generation model that produces high-fidelity audio. It generates one minute of natural-sounding audio in under one second on a single GPU, supporting applications such as interactive assistants and podcast narration with low latency and modest hardware requirements.

The model uses a transformer-based architecture trained on a diverse multilingual speech dataset. It handles single-speaker and multi-speaker scenarios, providing expressive and context-appropriate voice outputs.

MAI-Voice-1 is integrated into Microsoft products like Copilot Daily for voice updates and news summaries. It is available for testing in Copilot Labs, where users can create audio stories or guided narratives from text prompts.

Technically, the model focuses on quality, versatility, and speed. Its single-GPU operation sets it apart from systems that require multiple GPUs, enabling integration into consumer devices and cloud applications beyond research settings.
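As a back-of-the-envelope check on the reported figures (one minute of audio generated in under one second of compute), the throughput can be expressed as a real-time factor. This is a generic calculation, not anything published by Microsoft:

```python
# Back-of-the-envelope real-time factor (RTF) for speech generation.
# Figures from the announcement: ~60 s of audio in under 1 s on one GPU.
def real_time_factor(audio_seconds: float, wall_clock_seconds: float) -> float:
    """Seconds of audio produced per second of compute; >1 is faster than real time."""
    return audio_seconds / wall_clock_seconds

rtf = real_time_factor(60.0, 1.0)
print(f"RTF: {rtf:.0f}x")  # prints "RTF: 60x"
```

An RTF of roughly 60x means the model could, in principle, keep dozens of concurrent real-time audio streams fed from a single GPU, which is why the single-GPU claim matters for consumer deployment.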

MAI-1-Preview: Foundation Model Architecture and Performance

MAI-1-preview is Microsoft’s first end-to-end, in-house foundation language model. Unlike previous models that Microsoft integrated or licensed from outside, MAI-1-preview was trained entirely on Microsoft’s own infrastructure, using a mixture-of-experts architecture and approximately 15,000 NVIDIA H100 GPUs.
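Microsoft has not published the internals of MAI-1-preview, but the mixture-of-experts technique it names is well documented: a router sends each token to a small subset of expert subnetworks, so only a fraction of total parameters is active per token. A minimal illustrative sketch, with made-up dimensions and expert counts that do not reflect the actual model:

```python
import numpy as np

# Toy mixture-of-experts layer: top-k routing over linear "experts".
# All sizes here are illustrative; MAI-1-preview's configuration is not public.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 4, 2

gate_w = rng.standard_normal((d_model, n_experts))             # router weights
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ gate_w                                        # (tokens, experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]              # chosen experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        picked = logits[t, top[t]]
        weights = np.exp(picked - picked.max())
        weights /= weights.sum()                               # softmax over top-k only
        for w, e in zip(weights, top[t]):
            out[t] += w * (x[t] @ experts[e])                  # weighted expert mix
    return out

tokens = rng.standard_normal((3, d_model))
y = moe_forward(tokens)
print(y.shape)  # prints "(3, 16)"
```

The practical appeal is that parameter count can grow with the number of experts while per-token compute stays roughly constant, which is one reason the architecture is popular for large-scale training runs like the one described here.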

Microsoft AI has made MAI-1-preview available on the LMArena platform, where it can be compared directly against several other models. MAI-1-preview is optimized for instruction following and everyday conversational tasks, making it suitable for consumer-focused applications rather than enterprise or highly specialized use cases. Microsoft has begun rolling out access to the model for select text-based scenarios within Copilot, with gradual expansion planned as feedback is collected and the system is refined.

Model Development and Training Infrastructure

The development of MAI-Voice-1 and MAI-1-preview was supported by Microsoft’s next-generation GB200 GPU cluster, a custom-built infrastructure specifically optimized for training large generative models. In addition to hardware, Microsoft has invested heavily in talent, assembling a team with deep expertise in generative AI, speech synthesis, and large-scale systems engineering. The company’s approach to model development emphasizes a balance between fundamental research and practical deployment, aiming to create systems that are not just theoretically impressive but also reliable and useful in everyday scenarios.

Applications

MAI-Voice-1 can be used for real-time voice assistance, audio content creation in media and education, or accessibility features. Its ability to simulate multiple speakers supports use in interactive scenarios such as storytelling, language learning, or simulated conversations. The model’s efficiency also allows for deployment on consumer hardware.

MAI-1-preview is focused on general language understanding and generation, assisting with tasks like drafting emails, answering questions, summarizing text, or helping students understand and work through school assignments in a conversational format.

Conclusion

Microsoft’s release of MAI-Voice-1 and MAI-1-preview shows the company can now develop core generative AI models internally, backed by substantial investment in training infrastructure and technical talent. Both models are intended for practical, real-world use and are being refined with user feedback. This development adds to the diversity of model architectures and training methods in the field, with a focus on systems that are efficient, reliable, and suitable for integration into everyday applications. Microsoft’s approach—using large-scale resources, gradual deployment, and direct engagement with users—offers one example of how organizations can progress AI capabilities while emphasizing practical, incremental improvement.



