MarkTechPost@AI · September 18
Building an Advanced Voice AI Agent with Hugging Face

This tutorial demonstrates how to build an end-to-end voice AI agent using freely available Hugging Face models. By combining Whisper for speech recognition, FLAN-T5 for natural language reasoning, and Bark for speech synthesis, the entire pipeline is kept simple enough to run smoothly on Google Colab. The approach avoids heavy dependencies, API keys, and complicated setup, focusing instead on turning voice input into meaningful conversation and returning natural-sounding spoken responses in real time. The tutorial walks through the full implementation: model loading, dialog formatting, the core functions (speech-to-text, reply generation, speech synthesis), and a Gradio user interface.

💡 **End-to-end voice AI from integrated Hugging Face models**: The core of the tutorial is wiring together three Hugging Face models, Whisper (speech recognition), FLAN-T5 (natural language reasoning), and Bark (speech synthesis), into an agent that understands spoken input and replies with speech. This integration keeps development simple and lets the whole system run in environments such as Google Colab without complex API calls or setup.

⚙️ **Simplified pipeline and deployment**: The build process is deliberately minimal, making it easy to deploy and run in cloud environments such as Google Colab. Loading models directly through Hugging Face's `pipeline` API avoids tedious environment configuration and model management, so developers can focus on the agent's core behavior (see the minimal sketch after these highlights).

🗣️ **Multimodal interaction**: The agent supports both voice and text input. Users can record speech through a microphone, which the agent transcribes, interprets, and answers; they can also type text, and the agent generates a reply and speaks it aloud. The tutorial additionally provides a feature for exporting the full conversation, improving the user experience and data management.

🚀 **Extensibility and next steps**: After demonstrating the basic functionality, the tutorial points out how the framework can grow: swapping in larger or more capable models, adding multilingual support, or attaching custom logic for specific needs, laying a foundation for more advanced voice AI applications.
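Below is a minimal sketch (our addition, not from the original tutorial) of the idea behind these highlights: a single `pipeline` call is enough to load a model and run inference. The checkpoint is the one the tutorial uses; `sample.wav` is a placeholder path for any local audio file.

```python
from transformers import pipeline

# One call loads the Whisper checkpoint and wraps all pre/post-processing;
# "sample.wav" is a placeholder for any local audio file.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small.en")
print(asr("sample.wav")["text"])
```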

In this tutorial, we build an advanced voice AI agent using Hugging Face’s freely available models, and we keep the entire pipeline simple enough to run smoothly on Google Colab. We combine Whisper for speech recognition, FLAN-T5 for natural language reasoning, and Bark for speech synthesis, all connected through transformers pipelines. By doing this, we avoid heavy dependencies, API keys, or complicated setups, and we focus on showing how we can turn voice input into meaningful conversation and get back natural-sounding voice responses in real time. Check out the FULL CODES here.

```python
!pip -q install "transformers>=4.42.0" accelerate torchaudio sentencepiece gradio soundfile

import os, torch, tempfile, numpy as np
import gradio as gr
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

DEVICE = 0 if torch.cuda.is_available() else -1

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small.en",
    device=DEVICE,
    chunk_length_s=30,
    return_timestamps=False,
)

LLM_MODEL = "google/flan-t5-base"
tok = AutoTokenizer.from_pretrained(LLM_MODEL)
llm = AutoModelForSeq2SeqLM.from_pretrained(LLM_MODEL, device_map="auto")

tts = pipeline("text-to-speech", model="suno/bark-small")
```

We install the necessary libraries and load three Hugging Face pipelines: Whisper for speech-to-text, FLAN-T5 for generating responses, and Bark for text-to-speech. We set the device automatically so that we can use GPU if available. Check out the FULL CODES here.
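Before wiring anything together, it can help to sanity-check both ends of the pipeline. The round trip below is our addition, not part of the original post: we synthesize a short phrase with Bark, write it to disk with soundfile (installed above), and transcribe it back with Whisper.

```python
import numpy as np
import soundfile as sf

# Synthesize -> save -> transcribe: if this prints something close to the
# input phrase, both TTS and ASR are working.
speech = tts("Testing one two three.")
wav = np.asarray(speech["audio"], dtype=np.float32).squeeze()  # 1-D waveform
sf.write("roundtrip.wav", wav, speech["sampling_rate"])
print(asr("roundtrip.wav")["text"])
```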

```python
SYSTEM_PROMPT = (
    "You are a helpful, concise voice assistant. "
    "Prefer direct, structured answers. "
    "If the user asks for steps or code, use short bullet points."
)

def format_dialog(history, user_text):
    turns = []
    for u, a in history:
        if u: turns.append(f"User: {u}")
        if a: turns.append(f"Assistant: {a}")
    turns.append(f"User: {user_text}")
    prompt = (
        "Instruction:\n"
        f"{SYSTEM_PROMPT}\n\n"
        "Dialog so far:\n" + "\n".join(turns) + "\n\n"
        "Assistant:"
    )
    return prompt
```

We define a system prompt that guides our agent to stay concise and structured, and we implement a format_dialog function that takes past conversation history along with the user input and builds a prompt string for the model to generate the assistant’s reply. Check out the FULL CODES here.
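To make the prompt shape concrete, here is a tiny illustration (our addition) of what format_dialog produces for a one-turn history plus a new question:

```python
history = [("What is Whisper?", "Whisper is an open speech recognition model.")]
print(format_dialog(history, "Can it run on a CPU?"))
# Instruction:
# You are a helpful, concise voice assistant. ... (system prompt continues)
#
# Dialog so far:
# User: What is Whisper?
# Assistant: Whisper is an open speech recognition model.
# User: Can it run on a CPU?
#
# Assistant:
```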

```python
def transcribe(filepath):
    out = asr(filepath)
    text = out["text"].strip()
    return text

def generate_reply(history, user_text, max_new_tokens=256):
    prompt = format_dialog(history, user_text)
    inputs = tok(prompt, return_tensors="pt", truncation=True).to(llm.device)
    with torch.no_grad():
        ids = llm.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            repetition_penalty=1.05,
        )
    reply = tok.decode(ids[0], skip_special_tokens=True).strip()
    return reply

def synthesize_speech(text):
    out = tts(text)
    audio = out["audio"]
    sr = out["sampling_rate"]
    # Bark may return a (1, n) batch; squeeze to a 1-D float32 waveform so
    # Gradio's Audio component interprets it as mono samples.
    audio = np.asarray(audio, dtype=np.float32).squeeze()
    return (sr, audio)
```

We create three core functions for our voice agent: transcribe converts recorded audio into text using Whisper, generate_reply builds a context-aware response from FLAN-T5, and synthesize_speech turns that response back into spoken audio with Bark. Check out the FULL CODES here.
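These three functions already form a complete text-in, speech-out loop even without a UI. The snippet below is our addition: it asks one question, prints the reply, and saves the spoken answer with soundfile.

```python
import soundfile as sf

history = []
reply = generate_reply(history, "Give me two tips for writing clean Python.")
print("Assistant:", reply)

sr, wav = synthesize_speech(reply)
sf.write("assistant_reply.wav", wav, sr)  # play this file to hear the answer
```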

```python
def clear_history():
    return [], []

def voice_to_voice(mic_file, history):
    history = history or []
    if not mic_file:
        return history, None, "Please record something!"
    try:
        user_text = transcribe(mic_file)
    except Exception as e:
        return history, None, f"ASR error: {e}"
    if not user_text:
        return history, None, "Didn't catch that. Try again?"
    try:
        reply = generate_reply(history, user_text)
    except Exception as e:
        return history, None, f"LLM error: {e}"
    try:
        sr, wav = synthesize_speech(reply)
    except Exception as e:
        return history + [(user_text, reply)], None, f"TTS error: {e}"
    return history + [(user_text, reply)], (sr, wav), f"User: {user_text}\nAssistant: {reply}"

def text_to_voice(user_text, history):
    history = history or []
    user_text = (user_text or "").strip()
    if not user_text:
        return history, None, "Type a message first."
    try:
        reply = generate_reply(history, user_text)
        sr, wav = synthesize_speech(reply)
    except Exception as e:
        return history, None, f"Error: {e}"
    return history + [(user_text, reply)], (sr, wav), f"User: {user_text}\nAssistant: {reply}"

def export_chat(history):
    lines = []
    for u, a in history or []:
        lines += [f"User: {u}", f"Assistant: {a}", ""]
    text = "\n".join(lines).strip() or "No conversation yet."
    with tempfile.NamedTemporaryFile(delete=False, suffix=".txt", mode="w") as f:
        f.write(text)
        path = f.name
    return path
```

We add interactive functions for our agent: clear_history resets the conversation, voice_to_voice handles speech input and returns a spoken reply, text_to_voice processes typed input and speaks back, and export_chat saves the entire dialog into a downloadable text file. Check out the FULL CODES here.
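Since each callback returns the same (history, audio, status) triple that Gradio expects, we can also exercise them directly from a notebook cell. This quick check is our addition, not part of the tutorial:

```python
state = []
state, audio, status = text_to_voice("Hello there!", state)
print(status)               # "User: Hello there!\nAssistant: ..."

path = export_chat(state)   # writes a .txt transcript to a temp file
print("Transcript saved to:", path)
```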

```python
with gr.Blocks(title="Advanced Voice AI Agent (HF Pipelines)") as demo:
    gr.Markdown(
        "## Advanced Voice AI Agent (Hugging Face Pipelines Only)\n"
        "- **ASR**: openai/whisper-small.en\n"
        "- **LLM**: google/flan-t5-base\n"
        "- **TTS**: suno/bark-small\n"
        "Speak or type; the agent replies with voice + text."
    )
    with gr.Row():
        with gr.Column(scale=1):
            mic = gr.Audio(sources=["microphone"], type="filepath", label="Record")
            say_btn = gr.Button("Speak")
            text_in = gr.Textbox(label="Or type instead", placeholder="Ask me anything…")
            text_btn = gr.Button("Send")
            export_btn = gr.Button("Export Chat (.txt)")
            reset_btn = gr.Button("Reset")
        with gr.Column(scale=1):
            audio_out = gr.Audio(label="Assistant Voice", autoplay=True)
            transcript = gr.Textbox(label="Transcript", lines=6)
            chat = gr.Chatbot(height=360)

    state = gr.State([])

    def update_chat(history):
        return [(u, a) for u, a in (history or [])]

    say_btn.click(voice_to_voice, [mic, state], [state, audio_out, transcript]).then(
        update_chat, inputs=state, outputs=chat
    )
    text_btn.click(text_to_voice, [text_in, state], [state, audio_out, transcript]).then(
        update_chat, inputs=state, outputs=chat
    )
    reset_btn.click(clear_history, None, [chat, state])
    export_btn.click(export_chat, state, gr.File(label="Download chat.txt"))

demo.launch(debug=False)
```

We build a clean Gradio UI that lets us speak or type and then hear the agent’s response. We wire buttons to our callbacks, maintain chat state, and stream results into a chatbot, transcript, and audio player, all launched in one Colab app.
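One practical note from us (not in the original post): if Colab's inline player or microphone widget gives trouble, Gradio can serve the app through a temporary public URL instead.

```python
# share=True asks Gradio for a temporary public link, which is often more
# reliable than the inline Colab widget for microphone input and playback.
demo.launch(share=True, debug=False)
```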

In conclusion, we see how seamlessly Hugging Face pipelines enable us to create a voice-driven conversational agent that listens, thinks, and responds. We now have a working demo that captures audio, transcribes it, generates intelligent responses, and returns speech output, all inside Colab. With this foundation, we can experiment with larger models, add multilingual support, or even extend the system with custom logic. Still, the core idea remains the same: we can bring together ASR, LLM, and TTS into one smooth workflow for an interactive voice AI experience.
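As a hedged sketch of those extensions, swapping checkpoints is mostly a matter of changing model IDs. Both IDs below are real Hugging Face checkpoints, but resource requirements grow; flan-t5-large in particular benefits from a GPU runtime.

```python
# Multilingual ASR: drop the ".en" suffix to get the multilingual checkpoint.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    device=DEVICE,
    chunk_length_s=30,
)

# Larger LLM for better reasoning, at the cost of memory and latency.
LLM_MODEL = "google/flan-t5-large"
tok = AutoTokenizer.from_pretrained(LLM_MODEL)
llm = AutoModelForSeq2SeqLM.from_pretrained(LLM_MODEL, device_map="auto")
```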



