MarkTechPost@AI · October 2, 01:11
Liquid AI Releases LFM2-Audio-1.5B, Delivering End-to-End Audio Processing with Low Latency

Liquid AI has introduced LFM2-Audio-1.5B, a compact audio–language foundation model that can understand and generate both speech and text through a single end-to-end architecture. Built for low-latency, real-time assistants on resource-constrained devices, the model extends the LFM2 family into audio while keeping a small footprint. Its core innovation is a unified backbone with disentangled audio input and output: inputs are continuous embeddings projected directly from raw waveform chunks, while outputs are discrete audio codes. The model supports two generation modes, an interleaved mode for real-time speech-to-speech and a sequential mode for automatic speech recognition and text-to-speech. In practice, LFM2-Audio-1.5B achieves end-to-end latency below 100 milliseconds and performs strongly on VoiceBench and ASR benchmarks.

🎙️ **Unified end-to-end audio–language model**: LFM2-Audio-1.5B uses a single end-to-end architecture that handles both audio and text, understanding and generating speech as well as written language. This unified design simplifies the model stack and lays the groundwork for low-latency, real-time applications.

💡 **Disentangled audio input and output**: The model represents audio inputs as continuous embeddings projected directly from raw waveform chunks, while outputs are discrete audio codes. This avoids discretization artifacts on the input path while keeping autoregressive training and generation across output modalities, improving efficiency and quality.

⚡ **Ultra-low-latency real-time interaction**: LFM2-Audio-1.5B is optimized for real-time use, with reported end-to-end latency below 100 ms. This gives it a clear advantage in voice assistants and other scenarios that demand fast responses, enabling a smoother user experience.

🏆 **Strong multimodal benchmark results**: On VoiceBench and other audio-assistant evaluation suites, LFM2-Audio-1.5B posts remarkable scores, and on automatic speech recognition (ASR) it matches or even exceeds leading models, demonstrating its generalization across a range of speech-processing tasks.

Liquid AI has released LFM2-Audio-1.5B, a compact audio–language foundation model that both understands and generates speech and text through a single end-to-end stack. It positions itself for low-latency, real-time assistants on resource-constrained devices, extending the LFM2 family into audio while retaining a small footprint.

https://www.liquid.ai/blog/lfm2-audio-an-end-to-end-audio-foundation-model

But what’s actually new? A unified backbone with disentangled audio I/O

LFM2-Audio extends the 1.2B-parameter LFM2 language backbone to treat audio and text as first-class sequence tokens. Crucially, the model disentangles audio representations: inputs are continuous embeddings projected directly from raw waveform chunks (~80 ms), while outputs are discrete audio codes. This avoids discretization artifacts on the input path while keeping training and generation autoregressive for both modalities on the output path.
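
To make the input path concrete, here is a minimal sketch of the chunk-and-project idea, assuming 16 kHz mono audio; the projection layer and embedding width are illustrative assumptions, not the released model's internals:

```python
import torch

# Slice a raw waveform into ~80 ms chunks and project each chunk to a
# continuous embedding, so no codebook quantization happens on the input path.
SAMPLE_RATE = 16_000                         # assumed sampling rate
CHUNK_LEN = SAMPLE_RATE * 80 // 1000         # 80 ms -> 1280 samples per chunk

proj = torch.nn.Linear(CHUNK_LEN, 2048)      # assumed embedding width

def waveform_to_embeddings(wave: torch.Tensor) -> torch.Tensor:
    """Split a mono waveform into 80 ms chunks and project each to a vector."""
    n_chunks = wave.shape[-1] // CHUNK_LEN
    chunks = wave[: n_chunks * CHUNK_LEN].reshape(n_chunks, CHUNK_LEN)
    return proj(chunks)                      # (n_chunks, 2048), continuous

wave = torch.randn(4 * SAMPLE_RATE)          # 4 s of dummy audio
print(waveform_to_embeddings(wave).shape)    # torch.Size([50, 2048])
```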

On the implementation side, the components of the released checkpoint are detailed in the Liquid AI blog post linked above.

Two generation modes for real-time agents

LFM2-Audio supports two decoding behaviors from the same backbone: interleaved generation, which alternates text and audio tokens for live speech-to-speech interaction, and sequential generation for turn-based tasks such as ASR and TTS. Liquid AI provides a Python package (liquid-audio) and a Gradio demo to reproduce these behaviors.
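
As a conceptual illustration of the two schedules (this is not the liquid-audio API), the sketch below uses a hypothetical `decode_step` as a stand-in for one autoregressive step:

```python
# Interleaved mode alternates text and audio tokens so audio can start
# streaming before the text reply is complete; sequential mode finishes one
# modality before starting the other, as in ASR or TTS.

def decode_step(modality: str, i: int) -> str:
    """Hypothetical stand-in for one autoregressive decode step."""
    return f"<{modality}:{i}>"

def generate_interleaved(n_steps: int) -> list[str]:
    """Speech-to-speech: alternate text/audio so audio is emitted early."""
    return [decode_step("text" if i % 2 == 0 else "audio", i)
            for i in range(n_steps)]

def generate_sequential(n_steps: int, first: str = "text") -> list[str]:
    """ASR/TTS-style: emit all tokens of one modality, then the other."""
    second = "audio" if first == "text" else "text"
    half = n_steps // 2
    return ([decode_step(first, i) for i in range(half)] +
            [decode_step(second, i) for i in range(half, n_steps)])

print(generate_interleaved(4))  # ['<text:0>', '<audio:1>', '<text:2>', '<audio:3>']
print(generate_sequential(4))   # ['<text:0>', '<text:1>', '<audio:2>', '<audio:3>']
```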

Latency: <100 ms to first audio

The Liquid AI team reports end-to-end latency below 100 ms from a 4-second audio query to the first audible response (a proxy for perceived responsiveness in interactive use), stating that under their setup it is faster than models smaller than 1.5B parameters.
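
As a rough sketch of how such a number can be measured, the snippet below times the gap between submitting a finished query and receiving the first audio chunk; `stream_response` here is a hypothetical stand-in for any streaming speech-to-speech endpoint:

```python
import time

def time_to_first_audio(stream_response, query):
    """Milliseconds from submitting a query until the first audio chunk."""
    start = time.perf_counter()
    for chunk in stream_response(query):
        if chunk["modality"] == "audio":     # first audible output
            return (time.perf_counter() - start) * 1000
    return None

def fake_stream(query):
    """Dummy endpoint: one text token, then the first audio chunk."""
    yield {"modality": "text", "data": "..."}
    yield {"modality": "audio", "data": b"..."}

print(f"{time_to_first_audio(fake_stream, 'hello'):.2f} ms")
```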

Benchmarks: VoiceBench and ASR results

On VoiceBench, a suite of nine audio-assistant evaluations, Liquid reports an overall score of 56.78 for LFM2-Audio-1.5B, with per-task numbers disclosed in the blog’s chart (e.g., AlpacaEval 3.71, CommonEval 3.49, WildVoice 3.17). The Liquid AI team contrasts this result with larger models like Qwen2.5-Omni-3B and Moshi-7B in the same table. (VoiceBench is an external benchmark introduced in late 2024 for LLM-based voice assistants.)

The model card on Hugging Face provides an additional VoiceBench table (with closely related—but not identical—per-task values) and includes classic ASR WERs where LFM2-Audio matches or improves on Whisper-large-v3-turbo for some datasets despite being a generalist speech–text model. For example (lower is better): AMI 15.36 vs. 16.13 (Whisper-large-v3-turbo), LibriSpeech-clean 2.03 vs. 2.10.
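
For context on how such WER figures are computed (lower is better), here is a small example using the jiwer library on toy transcripts, not the actual AMI or LibriSpeech data:

```python
import jiwer  # pip install jiwer

# WER = (substitutions + insertions + deletions) / reference word count.
reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

print(f"WER: {jiwer.wer(reference, hypothesis):.4f}")
# Two substitutions over nine reference words -> 0.2222
```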

https://huggingface.co/LiquidAI/LFM2-Audio-1.5B

Most “omni” stacks couple ASR → LLM → TTS, which adds latency and brittle interfaces. LFM2-Audio’s single-backbone design with continuous input embeddings and discrete output codes reduces glue logic and allows interleaved decoding for early audio emission. For developers, this translates to simpler pipelines and faster perceived response times, while still supporting ASR, TTS, classification, and conversational agents from one model. Liquid AI provides code, demo entry points, and distribution via Hugging Face.
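
To see why the cascade adds latency, consider the toy comparison below; the sleeps are hypothetical stand-ins for per-stage compute, and the point is when the first audio appears, not the absolute numbers:

```python
import time

def cascaded(audio_in):
    """ASR -> LLM -> TTS: each stage must finish before the next starts."""
    time.sleep(0.05)                  # ASR transcribes the whole query
    time.sleep(0.05)                  # LLM writes the full text reply
    time.sleep(0.05)                  # only then does TTS begin
    yield b"audio"

def unified(audio_in):
    """Single backbone with interleaved decoding: audio is emitted early."""
    time.sleep(0.04)                  # prefill on the shared backbone
    yield b"audio"                    # first audio token, reply still ongoing
    time.sleep(0.08)                  # remaining tokens stream afterwards
    yield b"audio"

for name, fn in [("cascaded", cascaded), ("unified", unified)]:
    t0 = time.perf_counter()
    next(fn(b"query"))                # wait for the first audio chunk
    print(f"{name}: first audio after {(time.perf_counter() - t0) * 1000:.0f} ms")
```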



