Foley Control: A Lightweight Method for Video-Guided Sound Effect Synthesis

Stability AI Research, October 28, 02:13

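This article introduces Foley Control, a lightweight method for video-guided sound effect synthesis. It learns the dependency between audio and video to achieve temporal synchronization without retraining the audio model. On video-audio benchmarks, Foley Control shows competitive temporal and semantic alignment with few trainable parameters, and its modular design makes individual components easy to upgrade.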

Foley Control is a lightweight approach to video-guided Foley that keeps pretrained single-modality models frozen and learns only a small cross-attention bridge between them. We connect V-JEPA2 video embeddings to a frozen Stable Audio Open DiT text-to-audio (T2A) model by inserting compact video cross-attention after the model's existing text cross-attention, so prompts set global semantics while video refines timing and local dynamics. The frozen backbones retain strong marginals (video; audio given text), and the bridge learns the audio-video dependency needed for synchronization without retraining the audio prior. To cut memory and stabilize training, we pool video tokens before conditioning. On curated video-audio benchmarks, Foley Control delivers competitive temporal and semantic alignment with far fewer trainable parameters than recent multi-modal systems, while preserving prompt-driven controllability and production-friendly modularity (swap or upgrade encoders or the T2A backbone without end-to-end retraining). Although we focus on Video-to-Foley, the same bridge design can potentially extend to other audio modalities (e.g., speech).
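The ordering described in the abstract (existing text cross-attention first, then a video cross-attention over pooled video tokens) can be illustrated with a toy, dependency-free sketch. This is not the authors' implementation: the function names, the single-head attention, the average pooling with a fixed `window`, and all dimensions are assumptions made for illustration, and the self-attention and MLP sublayers of a real DiT block are omitted.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def pool_tokens(tokens, window):
    """Average-pool consecutive tokens in non-overlapping windows.

    Assumption: the abstract only says video tokens are pooled; mean
    pooling over a fixed window is one simple way to do that.
    """
    pooled = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled


def cross_attention(queries, context, w_q, w_k, w_v, w_o):
    """Single-head cross-attention: queries attend to context tokens."""
    q = matmul(queries, w_q)
    k = matmul(context, w_k)
    v = matmul(context, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(matmul(weights, v), w_o)


def add_residual(hidden, update):
    """Element-wise residual connection."""
    return [[h + u for h, u in zip(hr, ur)] for hr, ur in zip(hidden, update)]


def conditioned_block(audio_hidden, text_tokens, video_tokens,
                      text_params, video_params, window=4):
    """Text cross-attention (frozen in the paper's setup), then the video bridge."""
    # Existing text cross-attention: the prompt sets global semantics.
    h = add_residual(audio_hidden,
                     cross_attention(audio_hidden, text_tokens, *text_params))
    # Added video cross-attention over pooled tokens: refines timing.
    pooled = pool_tokens(video_tokens, window)
    return add_residual(h, cross_attention(h, pooled, *video_params))


def random_matrix(rows, cols, rng, scale=0.1):
    """Hypothetical helper that builds small random weights or tokens."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]
```

In this sketch only `video_params` would be trained, while `text_params` and everything else in the audio model stay fixed. Because the video branch is added through a residual connection, a zero output projection `w_o` leaves the text-conditioned hidden states unchanged, which is one common way to start such a bridge without disturbing the pretrained audio prior (an assumption here, not something the abstract states).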

Related tags

Foley Control, video-guided sound effect synthesis, lightweight model, audio-video synchronization