Video Models Enter the Product Era


Last year, it seemed like every week a new frontier video model would one-up its peers and push our expectations. We learned to simply expect relentless improvement on measurable benchmarks: longer video output, sustained stories, better physics, character consistency, movement cohesion, and more. 

But this year, if you care a lot about the benchmarks, you might be getting a little restless. We pretty much expect all video models to generate 10- to 15-second clips with synced audio and speech, and an all-around sense of realism. That's a great achievement, but not the runaway model progress to which we'd been conditioned. Instead, we're entering a more exciting time: the product era of video models, where "abundance" isn't about benchmark performance – it's about variety.

  • We’re beginning to see models specialize across specific dimensions: there is no “God Model” that’s great at everything.
  • Startups are finding new opportunities across two main dimensions: video models that excel at one key thing (Physics! Anime! Multiple shots!) and products that abstract away arduous workflows. 
  • Hot take: even if video model progress stopped entirely, founders would be playing years of catch-up building products around current model capabilities.  
  • This is great for startups: it creates space for verticals and wedges that can become their own massive companies. 

A brief history of models

For the past few years, frontier diffusion model labs have consistently released models that outperform prior generations on major benchmarks. This progress inculcated an assumption among researchers, creators, and (yes) VCs like myself: a "god-tier" model that is great at everything would eventually emerge and become the default provider for a multitude of different video use-cases. But this year, that assumption was challenged: Sora 2 was released last month and appeared below Veo 3 on benchmarks like LMArena. There's a sense that progress might be slowing at the diffusion model level, and the concept of "SOTA", at least when it comes to video models, may not actually exist.

Of course, this assumption (and subsequent confrontation with reality) is not unique to diffusion models. Frontier lab LLMs also followed a pattern of step-function improvements from 2023-2025, and then saw performance begin to level off across a variety of benchmarks. When this happened with LLMs, we saw major foundation labs begin to specialize in specific verticals, and the overall AI product layer take off; in fact, we wrote about this trend earlier this year.  

Upon reflection, it does make sense that progress on video models is beginning to plateau, at least when it comes to public benchmarks. Models have come an extremely long way in terms of realism and aesthetic quality, and now generate outputs so realistic that they might be hitting an asymptote. Once you convincingly look like real life, how much more "real" can you get? A good analogy here might be still-life painting in the 17th and 18th centuries: at a certain point, the old masters got so good at painting realistic portraits and pastoral scenes that preferring one artist over another came down to a simple matter of aesthetic taste, rather than any objective measure of what looked more "real".

So what happens when realism stops being the differentiator for video models? We get more variety and specialization. Welcome to the abundance era. 

The Models are Specializing 

Before we explore the increased specialization we're seeing at the model layer, it's worth taking a quick stroll (or more appropriately, scroll) down memory lane. Back in early 2024, I wrote a piece taking stock of the state of video model outputs. At the time, it was a struggle to generate videos longer than 3-4 seconds. Physics was a joke: people might randomly melt into the ground; basketballs might rebound off the backboard, ricochet toward the ground, then miraculously end up in the hoop in the final frame. I remarked that it would be a while before we saw AI generate a Pixar-level short film. Google hadn't yet released Veo, or any public model for that matter; they would just tease the timeline with research papers.

How quickly things change. Google now dominates LMArena and other leaderboards with their Veo models. OpenAI is producing a $30 million animated feature with a team of 30 people on a timeline of nine months (admittedly they’re not one-shotting the feature from a single prompt, but this is still staggering progress!). As stated in the introduction, we can now expect longer video outputs, and higher degrees of physics realism: the basketball rebounds off the backboard and hits the ground now.

But even as models collectively are getting better, we're beginning to see them specialize. Why? Because no one model is good at everything. Some teams are optimizing for price and speed of outputs. Others are going all-in on post-training, with apparent focuses on specific verticals. Here are a few examples of what I'm seeing:

  • Veo 3: the strongest handle on physics, complex movement, and audio and speech synced with video.
  • Sora 2: will "direct" for you from a short prompt, writing the script (it's often funny!) and creating multiple shots.
  • Wan: a solid open-source player with an ecosystem of LoRAs for specific styles, motions, or effects.
  • Grok: fast, inexpensive, and particularly great at anime and animation.
  • Hedra: great at long-form clips of talking characters.

Comparing outputs from Sora 2 and Veo 3 is a great illustration of how models are beginning to specialize. Both are exceptional video models, but I’d argue they’re good at very different things. Sora excels at inserting yourself (or friends) into a scene, and crafting a story (or meme, or joke) from a short prompt, like the Stephen Hawking sports clips or Pikachu in famous movies. It’s great for consumers and meme makers.

However, it’s not as good as Veo 3 at syncing video with audio. You’ll often get multiple fast clips where the audio lags behind or the wrong person is talking. And in my opinion, it’s weaker at understanding physics or more complex motion. Veo 3, meanwhile, has less of a sense of humor and needs more guidance, but I’d argue that it’s a more powerful and controllable model for prosumer and professional creatives.

This trend of model specialization is also positive for players up and down the stack. AI video cloud providers like Fal and Replicate now host dozens of models for users seeking to access these various vertical use-cases. Editing suites like Krea give users a central hub that allows them to interact with multiple models and build entire workflows around them.    

Still, I expect to see the extremely well-capitalized labs trying to achieve “god mode” when it comes to video, and we hope and expect to see continued performance improvements! But there’s also a ton of opportunity in being the best for a certain use case or in a certain vertical.

Products for the Masses

If you follow me on X, you’ve probably picked up on the fact that I love experimenting with multiple video and image models to create extremely bespoke outputs. This involves a ton of cross-pollination between different model providers and editing tools. For example, creating a custom furniture staging video involves leveraging Ideogram, nano-banana, and Veo 3, and adding product swag to an existing video requires nano-banana, Hedra, and editing suites like Krea and Kapwing. Not everyone has access to all of these tools (or the level of masochism required) to generate their desired outputs. We need better end-to-end products.
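To give a sense of what that cross-pollination looks like under the hood, here's a minimal sketch of the staging workflow, assuming an image-editing model and an image-to-video model exposed through a hosted API such as Replicate's Python client. The model slugs and input fields are placeholders rather than any provider's actual schema.

```python
# Minimal sketch of a two-step staging workflow: edit furniture into a room photo,
# then animate the result. Model slugs and input fields below are assumptions.
import replicate

def stage_and_animate(room_photo: str, prompt: str):
    # Step 1: use an image-editing model to place the furniture in the room.
    staged_image = replicate.run(
        "some-lab/image-edit-model",          # placeholder slug
        input={"image": open(room_photo, "rb"),
               "prompt": f"add {prompt} to this room"},
    )

    # Step 2: hand the edited still to an image-to-video model for a short pan.
    video = replicate.run(
        "some-lab/image-to-video-model",      # placeholder slug
        input={"image": staged_image,
               "prompt": "slow cinematic camera pan across the staged room"},
    )
    return video  # URL or file reference, depending on the model

# Example: stage_and_animate("living_room.jpg", "a mid-century walnut sofa")
```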

There’s still a ton of “room to run” when it comes to catching up to model progress on the product side. Creators are hacking together complex workflows across multiple products to do things that could feasibly be done by the models – e.g. getting consistent characters across generations, extending a scene by taking the end frame of the last clip, controlling camera movement via start and end frames using an image edit model, or collaging storyboards.
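As one example, the scene-extension hack can be sketched in a few lines: pull the final frame of the previous clip with ffmpeg, then feed it back in as the start frame of the next generation. The ffmpeg step is standard; the image-to-video model slug and its inputs below are assumptions, not a specific model's real schema.

```python
# Sketch of the "extend a scene from the last frame" workflow creators hack together today.
# The ffmpeg invocation is standard; the model slug and inputs are illustrative assumptions.
import subprocess
import replicate

def last_frame(video_path: str, frame_path: str = "last_frame.png") -> str:
    # Seek to just before the end of the clip and export the final frame.
    subprocess.run(
        ["ffmpeg", "-y", "-sseof", "-0.25", "-i", video_path,
         "-frames:v", "1", frame_path],
        check=True,
    )
    return frame_path

def extend_scene(prev_clip: str, prompt: str):
    start_image = last_frame(prev_clip)
    # Use the extracted frame as the start frame of the next generation,
    # so the new clip picks up where the old one left off.
    return replicate.run(
        "some-lab/image-to-video-model",      # placeholder slug
        input={"image": open(start_image, "rb"), "prompt": prompt},
    )

# Example: extend_scene("shot_01.mp4", "the camera keeps tracking the character down the hall")
```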

The good news is that some labs are starting to address this product gap. Runway released a suite of apps that enable creators to edit camera angles, do next-shot video generation, perform style transfers from one clip to the next, change the weather, and add or remove items from clips. Sora Storyboard enables users to specify precise moment-to-moment action in a video sequence. And Veo 3.1, which was released earlier this month, consists almost entirely of product updates around audio and visual control, as opposed to model-level improvements. 

Some people like to say that if LLM progress stalled tomorrow, entrepreneurs would still have years of catch-up to play building useful products around the models’ capabilities. The same is true of video models: we’re only just beginning to see end-to-end products built around these models, with a lot of opportunities that haven’t been explored yet. In the future, I’m excited to see smaller, more specialized models. There will likely be products that guide these models towards the best generations for specific use cases or industries – like home staging, marketing, or animation. And finally, we still need creative suites to bring all the modalities together and make it easier to generate and edit in one place: video, dubbed audio, music, and more.

If you’re working on new verticals in video models, or end-to-end products for AI video orchestration, reach out! I’d love to hear from you. You can find me on X at @venturetwins, or email me at jmoore@a16z.com.

