cs.AI updates on arXiv.org, October 14, 12:08
SyncLipMAE: A Synchronization-Aware Pretraining Framework for Talking-Face Video

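This paper proposes SyncLipMAE, a self-supervised pretraining framework that learns synchronization-aware, transferable facial dynamics from unlabeled audio-visual streams. The framework combines masked visual modeling with cross-modal contrastive alignment and uses three per-frame prompt tokens to explicitly encode the essential factors of a talking-face frame: identity, vocal motion, and ambient motion. The contrastive objective treats time-aligned vocal-motion and audio tokens as positives and misaligned pairs as negatives, driving both modalities into a shared embedding space and achieving audio-visual stream synchronization. After pretraining, the aligned audio tokens and the visual prompt tokens (identity, vocal motion, ambient motion) form a unified interface for four distinct downstream settings: audio-visual stream synchronization, facial emotion and head/face action recognition, visual speech recognition, and visual dubbing.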

arXiv:2510.10069v1 Announce Type: new Abstract: We introduce SyncLipMAE, a self-supervised pretraining framework for talking-face video that learns synchronization-aware and transferable facial dynamics from unlabeled audio-visual streams. Our approach couples masked visual modeling with cross-modal contrastive alignment and employs three per-frame prompt tokens that explicitly encode the essential factors of a talking-face frame - identity, vocal motion (speech-synchronized facial dynamics), and ambient motion (audio-agnostic movements such as blinks and head pose). The contrastive objective uses time-aligned vocal-motion and audio tokens as positives and misaligned pairs as negatives, driving both modalities into a shared embedding space and yielding token-level audio-visual stream synchronization. After pretraining, the aligned audio tokens together with the visual prompt tokens (identity, vocal motion, ambient motion) form a unified interface for four disparate downstream settings: (i) audio-visual stream synchronization; (ii) facial emotion and head/face action recognition; (iii) visual speech recognition; and (iv) visual dubbing, for which we enable indistinguishable audio- or video-driven control within a single model. Across four task families that require distinct capabilities, SyncLipMAE achieves state-of-the-art results, underscoring the effectiveness of synchronization-aware, factorized self-supervised pretraining.
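The contrastive objective described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact loss, so the symmetric InfoNCE form, the `temperature` value, and the names `info_nce`, `vocal_tokens`, and `audio_tokens` below are all assumptions made for illustration, not the authors' implementation. The sketch treats the vocal-motion token and audio token at the same time step as the positive pair and every other pairing in the clip as a misaligned negative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(vocal_tokens, audio_tokens, temperature=0.07):
    """Symmetric InfoNCE over T time steps (assumed loss form, hypothetical names).

    vocal_tokens[t] and audio_tokens[t] form the time-aligned positive pair;
    every pairing with a different time index is a misaligned negative.
    """
    v = [l2_normalize(x) for x in vocal_tokens]
    a = [l2_normalize(x) for x in audio_tokens]
    steps = len(v)
    # sim[t][s]: temperature-scaled cosine similarity between
    # the vocal-motion token at step t and the audio token at step s.
    sim = [
        [sum(p * q for p, q in zip(v[t], a[s])) / temperature for s in range(steps)]
        for t in range(steps)
    ]

    def diagonal_cross_entropy(rows):
        """Mean cross-entropy where the correct class of row t is column t."""
        total = 0.0
        for t, row in enumerate(rows):
            peak = max(row)  # subtract the max for numerical stability
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[t]
        return total / len(rows)

    sim_transposed = [list(col) for col in zip(*sim)]
    # Average the visual-to-audio and audio-to-visual directions.
    return 0.5 * (diagonal_cross_entropy(sim) + diagonal_cross_entropy(sim_transposed))


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    print("aligned loss:   ", info_nce(aligned, aligned))
    print("misaligned loss:", info_nce(aligned, aligned[::-1]))
```

In this toy case the loss is close to zero when the two streams are in sync and grows large when the audio tokens are reversed in time, which is the pressure that pulls time-aligned tokens together in a shared embedding space.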

Related tags

SyncLipMAE, pretraining framework, talking-face video, synchronization-aware, self-supervised learning