Stability AI Research, September 19
Stable Audio: Generating Custom-Length Audio with AI

Stable Audio is an audio generation technique based on a latent diffusion model that produces audio conditioned on a text description, an audio duration, and a start time. Unlike earlier models that could only generate fixed-length output, Stable Audio can generate audio of a specified length, up to the size of its training window. By operating on a low-dimensional latent representation of audio, it achieves much faster generation, rendering 95 seconds of 44.1 kHz stereo audio in under one second on a single NVIDIA A100 GPU. The architecture combines a variational autoencoder (VAE), a text encoder, and a U-Net diffusion model, using a CLAP model for text conditioning together with embeddings of the audio's duration and start time, which gives precise control over both the content and the length of the generated audio.

🎶 **A new approach to audio generation**: Stable Audio introduces a generation method based on a latent diffusion model, overcoming the fixed-output-length limitation of earlier audio generation models. Conditioned on text metadata, audio duration, and start time, it generates audio of a specified length, up to the training window size.

⚡ **Fast generation**: By operating on a low-dimensional latent representation of audio, Stable Audio greatly improves generation efficiency. On an NVIDIA A100 GPU it renders up to 95 seconds of 44.1 kHz stereo audio in under one second, which matters for scenarios that require large volumes of audio to be generated quickly.

🧠 **Fine-grained control**: The model conditions generation on a text prompt (text metadata), audio duration, and start time. Technically, it uses a variational autoencoder (VAE) to compress audio, the text encoder of a CLAP model to interpret the prompt, and discrete learned embeddings of the chunk's starting second and the file's total seconds, together giving the user fine-grained control over the content and length of the generated audio.

Visit stableaudio.com

Introduction

The introduction of diffusion-based generative models has revolutionized the field of generative AI over the last few years, leading to rapid improvements in the quality and controllability of generated images, video, and audio. Diffusion models working in the latent encoding space of a pre-trained autoencoder, termed “latent diffusion models”, provide significant speed improvements to the training and inference of diffusion models. 

One of the main issues with generating audio using diffusion models is that diffusion models are usually trained to generate a fixed-size output. For example, an audio diffusion model might be trained on 30-second audio clips, and will only be able to generate audio in 30-second chunks. This is an issue when training on and trying to generate audio of greatly varying lengths, as is the case when generating full songs.

Audio diffusion models tend to be trained on randomly cropped chunks of audio from longer audio files, cropped or padded to fit the diffusion model’s training length. In the case of music, this causes the model to tend to generate arbitrary sections of a song, which may start or end in the middle of a musical phrase.
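To make this preprocessing concrete, here is a minimal sketch (in PyTorch, with hypothetical names; not Stability AI's code) of cropping a random fixed-length chunk from a longer waveform, or zero-padding a shorter one:

```python
import torch
import torch.nn.functional as F

def crop_or_pad(waveform: torch.Tensor, target_len: int) -> tuple[torch.Tensor, int]:
    """Take a random chunk of `target_len` samples from a [channels, samples]
    waveform, or zero-pad the tail if the file is shorter than the window.
    Returns the chunk and the sample offset at which it starts."""
    _, num_samples = waveform.shape
    if num_samples >= target_len:
        start = torch.randint(0, num_samples - target_len + 1, (1,)).item()
        return waveform[:, start:start + target_len], start
    return F.pad(waveform, (0, target_len - num_samples)), 0

# e.g. a 30-second training window cut from an 80-second stereo file at 44.1 kHz
chunk, start_sample = crop_or_pad(torch.randn(2, 80 * 44100), 30 * 44100)
```

Because the chunk can start anywhere in the file, the model sees arbitrary slices of songs unless that positional information is supplied some other way, which motivates the timing conditioning described next.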

We introduce Stable Audio, a latent diffusion model architecture for audio conditioned on text metadata as well as audio file duration and start time, allowing for control over the content and length of the generated audio. This additional timing conditioning allows us to generate audio of a specified length up to the training window size.
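In other words, at inference the user supplies the desired timing values alongside the text prompt. A minimal sketch of what such a conditioning payload might look like (the keys and the `generate` call are hypothetical illustrations, not a documented API):

```python
# Hypothetical conditioning payload: a text prompt plus the timing values.
# Requesting fewer seconds than the training window yields a shorter output.
conditioning = {
    "prompt": "calm piano with soft strings, 120 BPM",
    "seconds_start": 0,     # start generating from the beginning of the piece
    "seconds_total": 30,    # request a 30-second result
}
# audio = model.generate(conditioning, sample_rate=44100)  # illustrative only
```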

Working with a heavily downsampled latent representation of audio allows for much faster inference times compared to raw audio. Using the latest advancements in diffusion sampling techniques, our flagship Stable Audio model is able to render 95 seconds of stereo audio at a 44.1 kHz sample rate in less than one second on an NVIDIA A100 GPU.

Audio Samples

Music

Instruments

Sound Effects


Technical details

The Stable Audio models are latent diffusion models consisting of a few different parts, similar to Stable Diffusion: a variational autoencoder (VAE), a text encoder, and a U-Net-based conditioned diffusion model.

The VAE compresses stereo audio into a data-compressed, noise-resistant, and invertible lossy latent encoding that allows for faster generation and training than working with the raw audio samples themselves. We use a fully-convolutional architecture based on the Descript Audio Codec encoder and decoder architectures to allow for arbitrary-length audio encoding and decoding, and high-fidelity outputs.
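To illustrate the role the VAE plays, the sketch below pairs a strided 1-D convolutional encoder with a transposed-convolution decoder; because the layers are fully convolutional, any input length is accepted and the latent sequence grows with the audio duration. The sizes and downsampling factor are placeholders, not the actual Descript-Audio-Codec-based architecture:

```python
import torch
import torch.nn as nn

class ToyAudioAutoencoder(nn.Module):
    """Schematic fully-convolutional encoder/decoder over stereo waveforms.
    A strided convolution downsamples time by `stride`; the transposed
    convolution maps the latent sequence back to audio samples."""
    def __init__(self, latent_dim: int = 64, stride: int = 1024):
        super().__init__()
        self.encoder = nn.Conv1d(2, latent_dim, kernel_size=2 * stride,
                                 stride=stride, padding=stride // 2)
        self.decoder = nn.ConvTranspose1d(latent_dim, 2, kernel_size=2 * stride,
                                          stride=stride, padding=stride // 2)

    def encode(self, audio: torch.Tensor) -> torch.Tensor:
        return self.encoder(audio)    # [B, 2, T] -> [B, latent_dim, ~T/stride]

    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        return self.decoder(latents)  # approximate inverse of encode()

vae = ToyAudioAutoencoder()
latents = vae.encode(torch.randn(1, 2, 95 * 44100))  # 95 s of stereo audio
```

The diffusion model then operates entirely on this much shorter latent sequence rather than on the raw samples.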

To condition the model on text prompts, we use the frozen text encoder of a CLAP model trained from scratch on our dataset. The use of a CLAP model allows the text features to contain some information about the relationships between words and sounds. We use the text features from the penultimate layer of the CLAP text encoder to obtain an informative representation of the tokenized input text. These text features are provided to the diffusion U-Net through cross-attention layers.
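A rough sketch of that conditioning path, with a generic transformer stack standing in for the CLAP text encoder (layer count and dimensions are made up):

```python
import torch
import torch.nn as nn

# Stand-in for the CLAP text encoder: a stack of transformer layers whose
# penultimate hidden states serve as conditioning tokens.
embed = nn.Embedding(30000, 512)
layers = nn.ModuleList([nn.TransformerEncoderLayer(512, 8, batch_first=True)
                        for _ in range(4)])

def text_features(token_ids: torch.Tensor) -> torch.Tensor:
    h = embed(token_ids)
    for layer in layers[:-1]:      # stop before the last layer -> penultimate output
        h = layer(h)
    return h                       # [batch, tokens, 512], fed to cross-attention

# Cross-attention inside the U-Net: latent positions attend to the text tokens.
cross_attn = nn.MultiheadAttention(512, 8, batch_first=True)
text = text_features(torch.randint(0, 30000, (1, 16)))
latent_seq = torch.randn(1, 256, 512)                  # a toy latent sequence
out, _ = cross_attn(query=latent_seq, key=text, value=text)
```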

For the timing embeddings, we calculate two properties during training time when gathering a chunk of audio from our training data: the second from which the chunk starts (termed “seconds_start”) and the overall number of seconds in the original audio file (termed “seconds_total”). For example, if we take a 30-second chunk from an 80-second audio file, with the chunk starting at 0:14, then “seconds_start” is 14, and “seconds_total” is 80. These second values are translated into per-second discrete learned embeddings and concatenated with the prompt tokens before being passed into the U-Net’s cross-attention layers. During inference, these same values are provided to the model as conditioning, allowing the user to specify the overall length of the output audio. 
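A minimal sketch of how such discrete per-second embeddings could be built and concatenated with the prompt tokens (the vocabulary size and dimensions are illustrative, not the released implementation):

```python
import torch
import torch.nn as nn

MAX_SECONDS = 512   # illustrative cap on representable second values
DIM = 512           # must match the text-token feature dimension

# One learned vector per discrete second value, for each timing property.
start_embed = nn.Embedding(MAX_SECONDS, DIM)
total_embed = nn.Embedding(MAX_SECONDS, DIM)

def timing_tokens(seconds_start: int, seconds_total: int) -> torch.Tensor:
    """Look up the two timing embeddings as extra conditioning tokens."""
    start = start_embed(torch.tensor([seconds_start]))   # [1, DIM]
    total = total_embed(torch.tensor([seconds_total]))   # [1, DIM]
    return torch.cat([start, total], dim=0).unsqueeze(0) # [1, 2, DIM]

# Example from the text: a 30-second chunk starting at 0:14 of an 80-second file.
prompt_tokens = torch.randn(1, 16, DIM)                  # stand-in CLAP features
conditioning = torch.cat([prompt_tokens, timing_tokens(14, 80)], dim=1)
# `conditioning` is what the U-Net's cross-attention layers would attend to.
```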

The diffusion model for Stable Audio is a 907M parameter U-Net based on the model used in Moûsai. It uses a combination of residual layers, self-attention layers, and cross-attention layers to denoise the input conditioned on text and timing embeddings. Memory-efficient implementations of attention were added to the U-Net to allow the model to scale more efficiently to longer sequence lengths.
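The sketch below shows the shape of one such conditioned stage, combining a residual 1-D convolution with self-attention over the latent sequence and cross-attention to the text-plus-timing tokens; it is schematic only, not the Moûsai-derived architecture:

```python
import torch
import torch.nn as nn

class ConditionedBlock(nn.Module):
    """One schematic U-Net stage: residual 1-D convolution over the latent
    sequence, self-attention within it, then cross-attention to the
    text-plus-timing conditioning tokens."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.self_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: [batch, length, dim] latent sequence; cond: [batch, tokens, dim]
        x = x + self.conv(x.transpose(1, 2)).transpose(1, 2)    # residual conv
        x = x + self.self_attn(x, x, x, need_weights=False)[0]  # self-attention
        x = x + self.cross_attn(x, cond, cond, need_weights=False)[0]
        return x

block = ConditionedBlock()
denoised = block(torch.randn(1, 256, 512), torch.randn(1, 18, 512))
```

In a memory-efficient variant, optimized attention kernels would stand in for the plain `nn.MultiheadAttention` used here, keeping long latent sequences tractable.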

Dataset

To train our flagship Stable Audio model, we used a dataset consisting of over 800,000 audio files containing music, sound effects, and single-instrument stems, as well as corresponding text metadata, provided through a deal with stock music provider AudioSparx. This dataset adds up to over 19,500 hours of audio.


Future work and open models

Stable Audio represents the cutting edge of audio generation research from Stability AI's generative audio research lab, Harmonai. We continue to refine our model architectures, datasets, and training procedures to improve output quality, controllability, inference speed, and output length.

Keep an eye out for upcoming releases from Harmonai, including open-source models based on Stable Audio and training code that lets you train your own audio generation models.
