Sam Patterson's Blog, September 30
First Look at the Zonos TTS Model: Installation, Testing, and Initial Impressions

 

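This post documents the author's installation and testing of the newly released Zonos TTS model on Ubuntu 22.04. The author walks through the installation steps, including installing dependencies, cloning the code, configuring the environment, and running a test. During testing the author hit a port conflict and describes the fix. Compared with another TTS model, Kokoro, the author found Zonos unimpressive at its default settings, with a length limit and quality problems in the generated audio, but its voice cloning feature shows some promise, and the author remains cautiously optimistic.

📦 **Installation and environment setup**: The post details the steps for installing the Zonos TTS model on Linux (Ubuntu 22.04): installing `espeak-ng`, cloning the GitHub repository, syncing the environment with the `uv` package manager, and installing the required libraries. It also gives the command for running the test script `sample.py`, which downloads the model file automatically.

⚠️ **Gradio interface testing and troubleshooting**: The author launched the Zonos Gradio user interface through `gradio_interface.py` but hit a port conflict (`OSError: Cannot find empty port`). Editing `gradio_interface.py` to change the default port from `7860` to `7861` resolved the problem.

📉 **Initial assessment of TTS quality and performance**: At the default settings, Zonos TTS did not meet the author's expectations for quality or speed. The generated audio appears to be limited to roughly 30 seconds, and audio quality in early tests was unstable, at one point distorting badly. Compared with Kokoro, Zonos was weaker in both speed and overall audio quality.

🌟 **Voice cloning highlight**: Although TTS generation quality still needs work, the voice cloning feature left a strong impression on the author. The author tested it with a 20-second recording, judged the result "not bad," sees potential for future use, and is cautiously optimistic about the feature.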

I noticed a few folks mention the new Zonos TTS release today, so I wanted to try it out locally. You can read more about it in the beta release announcement.

I’m on Ubuntu 22.04 and I’ve got a 4090, so I need to test new models when they release in order to justify my purchase.

Main Takeaway

Using this through Gradio with the default settings isn’t very impressive. When I have more time I’ll fiddle more. The voice cloning is neat, but out of the box right now, I much prefer Kokoro. If you’ve played with it and gotten it to work well, please share what you did.

Installation

If you use Linux and have a 4090 you probably don’t need a guide to help you get Zonos working. Too bad, here it is.

You need espeak-ng installed:

sudo apt install -y espeak-ng

Clone the git repo:

git clone https://github.com/Zyphra/Zonos.git

Move into the new repo:

cd Zonos

Their repo instructions recommend using uv as a package manager, I guess because it’s faster. I’ve never used it, but I can’t refuse a recommended tag, so I installed it and synced the project:

uv sync

This creates a new virtual environment and installs Torch and all the NVIDIA stuff, so it’ll take a few minutes.

Once it has installed all the packages, you then run:

uv sync --extra compile

To test you can then run:

uv run sample.py

This automatically downloaded the model.safetensors file for me, which was 3.25G, but downloaded ridiculously fast (there’s no amount of nostalgia that makes me yearn for the 56k days again).

If everything goes well, you should have a sample.wav file in your directory. It’ll say “hello world”, or at least it’s supposed to. It’ll really say “hello worl,” because it cuts off the end of everything, unless they’ve fixed that since I wrote this.

A two-second, cut-off clip is exciting and all, but I decided to launch the Gradio interface they provided to test it properly:

uv run gradio_interface.py

That’s when I ran into an issue.

OSError: Cannot find empty port in range: 7860-7860. You can specify a different port by setting the GRADIO_SERVER_PORT environment variable or passing the `server_port` parameter to `launch()`.

Oops, I’m already using that port. Without checking, it’s probably Kokoro, since I set it up to be my TTS for OpenWebUI.

I opened up the gradio_interface.py file and changed the port:

if __name__ == "__main__":
    demo = build_interface()
    demo.launch(server_name="0.0.0.0", server_port=7861, share=True)

Then it launched just fine.
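
Hard-coding 7861 works until something else grabs that port too. As a minimal sketch (not part of the Zonos repo), the helper below probes ports starting at 7860 and returns the first one it can bind. The names `port_is_free` and `find_free_port` are hypothetical, and the approach assumes nothing else claims the port between the check and the launch.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if we can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use".
            return False
        return True


def find_free_port(start: int = 7860, attempts: int = 20, host: str = "0.0.0.0") -> int:
    """Return the first bindable port in [start, start + attempts)."""
    for port in range(start, min(start + attempts, 65536)):
        if port_is_free(port, host):
            return port
    raise OSError(f"no free port found starting at {start}")


# Hypothetical usage inside gradio_interface.py:
#     demo.launch(server_name="0.0.0.0", server_port=find_free_port(), share=True)
```

The returned value could be passed as `server_port` to `demo.launch()` in place of the hard-coded number.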

Results

Ok, now what? Well, I really wanted to test the voice cloning, because the pranking potential is so high, but first I dutifully tested the straightforward TTS quality.

(Actually, I spent about two hours setting up a screenshot > jsDelivr pipeline so that I could include screenshots in these blog posts easily. But I’ll write about that tomorrow.)

My first test was the introductory paragraph from Winnie-the-Pooh.

It was very… meh, until 30 seconds in, when it got exciting, and by exciting, I mean it burst my eardrums.

You don’t need to be an audio engineer to know that a waveform probably shouldn’t look like that.
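
For a rough number to go with the eyeball test, one option is to count how many samples sit at full scale. This is a sketch under assumptions: it only handles 16-bit PCM WAV files, which the standard-library `wave` module can read, and the Zonos output may well be in another format (for example 32-bit float), in which case it would need converting first. The function name `clipped_fraction` is made up for illustration.

```python
import array
import sys
import wave


def clipped_fraction(path: str, threshold: int = 32767) -> float:
    """Fraction of samples whose magnitude reaches `threshold` (16-bit PCM only)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    if not samples:
        return 0.0
    clipped = sum(1 for sample in samples if abs(sample) >= threshold)
    return clipped / len(samples)
```

A healthy speech clip should score at or near zero; a value of more than a few percent suggests the kind of distortion described above.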

So I tried again, curious to see if 30 seconds was the cutoff.

First impression: It’s not all that fast. The claim is it’s 2X realtime with a 4090. I’ve got a 4090, and… maybe? Most recently I’ve used Kokoro and that’s way, way faster than this, not even close.
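
To put a number on "maybe," the usual reading of "2X realtime" is that each second of compute produces two seconds of audio. The sketch below shows that arithmetic; `realtime_factor` and `timed` are hypothetical helpers, and whatever function actually generates the audio is left to the caller.

```python
import time


def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, a 30-second clip that takes 15 seconds to generate gives `realtime_factor(30.0, 15.0) == 2.0`, while anything below 1.0 is slower than playback.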

Second impression: My first impression might be wrong, because it’s now been over 300 seconds generating a dinosaur joke I asked phi4 to make. It’s probably borked somehow… yeah, errors abound in the terminal. It works now, and it’s fairly fast too.

There’s definitely a 30-second cutoff here. And the quality is weird.

Ok, I’m wondering if there’s more of an issue with the Gradio default settings, or me doing something wrong, because this isn’t anywhere near as good as Kokoro. I just opened up the Kokoro Gradio interface and tested the same input - Kokoro is much faster, sounds better, and doesn’t choke on anything longer than 30 seconds.
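
One workaround that might be worth trying for the 30-second cutoff is to split long text into sentence-sized chunks, generate each one separately, and join the audio afterwards. This is untested against Zonos and is not something from its documentation; the sketch below only covers the splitting step, and both the `chunk_text` name and the 300-character budget are arbitrary assumptions.

```python
import re


def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Group whole sentences into chunks of at most max_chars where possible.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be fed to the model on its own, with the budget tuned so that no chunk runs past the cutoff.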

Voice cloning

At this point I’m sure I need to understand how to tune the controls to make this better, but before I spend the time, I wanted to test the voice cloning. I recorded a 20 second .wav of myself, dropped that into the section in Gradio, and then popped in the text I read.

The result was… not bad! Not great, but considering it was only 20 seconds and I haven’t really gotten the hang of using this model yet, I can see why people are excited about this feature.

I’ll keep a cautiously optimistic eye on this one.
