SyncLipMAE: A Synchronization-Aware Pretraining Framework for Talking-Face Video

cs.AI updates on arXiv.org, October 14, 12:08

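This paper presents SyncLipMAE, a self-supervised pretraining framework that learns synchronization-aware, transferable facial dynamics from unlabeled audio-visual streams. The framework combines masked visual modeling with cross-modal contrastive alignment and uses three per-frame prompt tokens to explicitly encode the essential factors of a talking-face frame: identity, vocal motion, and ambient motion. The contrastive objective treats time-aligned vocal-motion and audio tokens as positives and misaligned pairs as negatives, driving both modalities into a shared embedding space and achieving audio-visual stream synchronization. After pretraining, the aligned audio tokens and the visual prompt tokens (identity, vocal motion, ambient motion) form a unified interface for four different downstream settings: audio-visual stream synchronization, facial emotion and head/face action recognition, visual speech recognition, and visual dubbing.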

arXiv:2510.10069v1 Announce Type: new

Abstract: We introduce SyncLipMAE, a self-supervised pretraining framework for talking-face video that learns synchronization-aware and transferable facial dynamics from unlabeled audio-visual streams. Our approach couples masked visual modeling with cross-modal contrastive alignment and employs three per-frame prompt tokens that explicitly encode the essential factors of a talking-face frame - identity, vocal motion (speech-synchronized facial dynamics), and ambient motion (audio-agnostic movements such as blinks and head pose). The contrastive objective uses time-aligned vocal-motion and audio tokens as positives and misaligned pairs as negatives, driving both modalities into a shared embedding space and yielding token-level audio-visual stream synchronization. After pretraining, the aligned audio tokens together with the visual prompt tokens (identity, vocal motion, ambient motion) form a unified interface for four disparate downstream settings: (i) audio-visual stream synchronization; (ii) facial emotion and head/face action recognition; (iii) visual speech recognition; and (iv) visual dubbing, for which we enable indistinguishable audio- or video-driven control within a single model. Across four task families that require distinct capabilities, SyncLipMAE achieves state-of-the-art results, underscoring the effectiveness of synchronization-aware, factorized self-supervised pretraining.
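
The contrastive objective described above can be illustrated with a small InfoNCE-style sketch. This is not the paper's implementation: the abstract does not give the exact loss, so the symmetric form, the cosine-similarity logits, the temperature of 0.07, and the names `info_nce`, `visual`, and `audio` are all assumptions made for illustration. The pair at the same time step is the positive, and every other pairing in the clip acts as a misaligned negative.

```python
import math

def info_nce(visual, audio, temperature=0.07):
    """Symmetric InfoNCE over T time steps (illustrative, not the paper's code).

    visual[t] and audio[t] are embedding vectors for time step t. The pair
    (t, t) is the positive; every (t, s != t) pair is a misaligned negative.
    The temperature value is an assumed default.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    v = [normalize(x) for x in visual]
    a = [normalize(x) for x in audio]
    steps = len(v)

    # Cosine-similarity logits: rows index visual steps, columns audio steps.
    logits = [[sum(p * q for p, q in zip(v[i], a[j])) / temperature
               for j in range(steps)] for i in range(steps)]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the target of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    # Average the visual-to-audio and audio-to-visual directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(columns))

# Toy data: three time steps with 2-D embeddings (made up for the example).
aligned = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
shifted = aligned[1:] + aligned[:1]  # audio offset by one step

loss_aligned = info_nce(aligned, aligned)
loss_shifted = info_nce(aligned, shifted)
```

With these toy vectors, the loss for the time-aligned streams is lower than the loss for the shifted audio, which is the property a synchronization-aware objective of this kind relies on.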


