Source: https://simonwillison.net/atom/everything (September 30)
Video Models: A New Chapter for General-Purpose Vision Foundation Models

Recent research from Google DeepMind suggests that generative video models like Veo 3 are poised to play the same role in machine vision that large language models (LLMs) play for text. LLMs turned next-token prediction into general-purpose foundation models for all manner of tasks; generative video models may likewise become general-purpose foundation models for vision and image-reasoning tasks, solving many problems through prompting alone rather than being limited to a single specialized task. By analyzing a large number of generated videos, the study demonstrates Veo 3's potential for zero-shot learning and "chain-of-frames" (CoF) visual reasoning, marking rapid progress in video model capabilities and pointing to steep cost declines ahead.

🎥 **Video models as general-purpose vision foundation models:** Google DeepMind's research proposes that generative video models such as Veo 3 could become general-purpose foundation models for vision, analogous to the role LLMs play in natural language processing (NLP). They can perform diverse visual tasks through simple prompting, rather than relying on specially trained task-specific models.

💡 **Zero-shot learning and "chain-of-frames" reasoning:** Without any task-specific training or adaptation, Veo 3 can solve a broad range of visual tasks, demonstrating zero-shot learning. The study also introduces the concept of "chain-of-frames" (CoF) to describe how a video model, by generating frame after frame, performs visual reasoning analogous to chain-of-thought: it can perceive, model, and manipulate the visual world.

🚀 **Rapid capability growth and cost trends:** Although video models are still expensive to run today, the research shows a substantial and consistent performance improvement from Veo 2 to Veo 3, signaling rapid advances in video model capabilities. They are likely to follow the price-decline trajectory of LLMs and become far more affordable, driving broad adoption.

Video models are zero-shot learners and reasoners. Fascinating new paper from Google DeepMind which makes a very convincing case that their Veo 3 model - and generative video models in general - serve a similar role in the machine learning visual ecosystem as LLMs do for text.

LLMs took the ability to predict the next token and turned it into general-purpose foundation models for all manner of tasks that used to be handled by dedicated models - summarization, translation, part-of-speech tagging and so on can now all be handled by single huge models, which are getting both more powerful and cheaper as time progresses.

Generative video models like Veo 3 may well serve the same role for vision and image reasoning tasks.

From the paper:

We believe that video models will become unifying, general-purpose foundation models for machine vision just like large language models (LLMs) have become foundation models for natural language processing (NLP). [...]

Machine vision today in many ways resembles the state of NLP a few years ago: There are excellent task-specific models like “Segment Anything” for segmentation or YOLO variants for object detection. While attempts to unify some vision tasks exist, no existing model can solve any problem just by prompting. However, the exact same primitives that enabled zero-shot learning in NLP also apply to today’s generative video models—large-scale training with a generative objective (text/video continuation) on web-scale data. [...]

Analyzing 18,384 generated videos across 62 qualitative and 7 quantitative tasks, we report that Veo 3 can solve a wide range of tasks that it was neither trained nor adapted for. Based on its ability to perceive, model, and manipulate the visual world, Veo 3 shows early forms of “chain-of-frames (CoF)” visual reasoning like maze and symmetry solving. While task-specific bespoke models still outperform a zero-shot video model, we observe a substantial and consistent performance improvement from Veo 2 to Veo 3, indicating a rapid advancement in the capabilities of video models.

I particularly enjoyed the way they coined the new term chain-of-frames to reflect chain-of-thought in LLMs. A chain-of-frames is how a video generation model can "reason" about the visual world:

Perception, modeling, and manipulation all integrate to tackle visual reasoning. While language models manipulate human-invented symbols, video models can apply changes across the dimensions of the real world: time and space. Since these changes are applied frame-by-frame in a generated video, this parallels chain-of-thought in LLMs and could therefore be called chain-of-frames, or CoF for short. In the language domain, chain-of-thought enabled models to tackle reasoning problems. Similarly, chain-of-frames (a.k.a. video generation) might enable video models to solve challenging visual problems that require step-by-step reasoning across time and space.

They note that, while video models remain expensive to run today, it's likely they will follow a similar pricing trajectory as LLMs. I've been tracking this for a few years now and it really is a huge difference - a 1,200x drop in price between GPT-3 in 2022 ($60/million tokens) and GPT-5-Nano today ($0.05/million tokens).
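That 1,200x figure follows directly from the two published prices; a quick sanity check:

```python
# Price drop from GPT-3 (2022) to GPT-5-Nano (today), per the figures above.
gpt3 = 60.00      # $ per million tokens, GPT-3 in 2022
gpt5_nano = 0.05  # $ per million tokens, GPT-5-Nano today

print(f"{gpt3 / gpt5_nano:,.0f}x cheaper")  # -> 1,200x cheaper
```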

The PDF is 45 pages long but the main paper is just the first 9.5 pages - the rest is mostly appendices. Reading those first 10 pages will give you the full details of their argument.

The accompanying website has dozens of video demos which are worth spending some time with to get a feel for the different applications of the Veo 3 model.

It's worth skimming through the appendixes in the paper as well to see examples of some of the prompts they used. They compare some of the exercises against equivalent attempts using Google's Nano Banana image generation model.

For edge detection, for example:

Veo: All edges in this image become more salient by transforming into black outlines. Then, all objects fade away, with just the edges remaining on a white background. Static camera perspective, no zoom or pan.

Nano Banana: Outline all edges in the image in black, make everything else white.
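If you want to experiment with prompts like these yourself, here's a minimal sketch using Google's google-genai Python SDK. The `generate_videos` call and the model id are assumptions based on Google's public API documentation, not anything from the paper, and the paper's edge-detection task also conditions on an input image, which this text-only sketch omits.

```python
# Minimal sketch: sending the Veo edge-detection prompt above through the
# google-genai SDK (pip install google-genai). Method and model names are
# assumptions based on Google's public API docs and may differ.
import time

from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

prompt = (
    "All edges in this image become more salient by transforming into "
    "black outlines. Then, all objects fade away, with just the edges "
    "remaining on a white background. Static camera perspective, no zoom or pan."
)

operation = client.models.generate_videos(
    model="veo-3.0-generate-preview",  # assumed model id
    prompt=prompt,
)

# Video generation is asynchronous: poll the long-running operation.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("edges.mp4")
```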

