ByteByteGo · September 25
Meta AI's Animation Feature: An Implementation Analysis

Meta AI's "animate" feature turns static images into dynamic video by optimizing diffusion models and global infrastructure. Model optimizations such as bfloat16 precision, a more efficient temporal attention mechanism, and DPM-Solver, combined with capacity modeling, regional routing, and traffic management, enable near-real-time video synthesis at global scale. The feature shows how AI research can serve users at massive scale through engineering practice.

🔍 Meta AI uses bfloat16 precision to cut memory requirements and speed up GPU processing while keeping model output quality high, improving the computational efficiency of its diffusion models.

⏱️ By improving the temporal attention mechanism, Meta moved the replication of identical context information from a per-frame operation to after the linear projection, significantly reducing memory usage.

🚀 The DPM-Solver algorithm cuts the sampling steps needed to generate video with a diffusion model from dozens or even hundreds down to about 15, dramatically shortening processing time while preserving output quality.

💡 Step distillation combined with classifier-free guidance distillation compresses a 32-step teacher model into an 8-step student model, a 3x efficiency gain.

📈 Component-level deployment with PyTorch 2.0 and torch.compile enables advanced parallelism techniques such as context parallelism and sequence parallelism, as well as multi-GPU inference, further accelerating large-scale workloads.

Developer-first security for your entire codebase (Sponsored)

Ready to embrace secure coding without slowing down? SonarQube Advanced Security helps you find and fix quality and security issues across your entire codebase—from first-party and AI-generated code to your open source dependencies.

SonarQube addresses security challenges head on, offering a unified solution across your codebase.

Learn more


Disclaimer: The details in this post have been derived from the official documentation shared online by the Meta Engineering Team. All credit for the technical details goes to the Meta Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Meta AI’s “animate” feature turns a generated image into a short, dynamic video using AI.

It wasn’t a matter of simply introducing another creative effect, but an attempt at something far more complex: delivering video synthesis at a global scale and in near real time.

Billions of users might click “animate” on any given day, and each request comes with critical constraints. The system must produce results in just a few seconds, maintain a consistently high success rate, and operate within tight GPU budgets that leave little room for waste. What looks like a playful feature in the product UI is a carefully engineered balancing act between cutting-edge AI research and large-scale distributed systems.

The challenge is twofold:

- Accelerating the diffusion model itself, so each video can be generated in seconds at the model and inference level.
- Engineering planet-scale traffic management, so billions of requests can be routed and served reliably.

These two domains (model/inference-level acceleration and planet-scale traffic engineering) form the backbone of Meta’s solution. In this article, we will look at how Meta implemented this feature and the challenges they faced.


Help Us Make the ByteByteGo Newsletter Better

TL;DR: Take this 2-minute survey so I can learn more about who you are, what you do, and how I can improve ByteByteGo.

Take the ByteByteGo Survey


Model and Inference Optimizations

The “animate” feature is powered by diffusion models, which normally take a lot of steps and computation to produce good results.

For reference, a diffusion model is a type of AI that creates images (or videos) by starting with pure random noise and then gradually “cleaning” it step by step until a clear picture appears. It learns how to do this by studying lots of real images and figuring out patterns. Think of it like an artist who begins with a messy canvas and slowly refines it.

To make animation practical for billions of users, Meta had to speed up the models without lowering quality. They achieved this through several optimizations:

1 - Half-precision with bfloat16

Most AI models run in 32-bit floating point (FP32), which uses a lot of memory and processing power. FP32 has a wide range and fine-grained precision, which helps when weights or gradients are very large or very small.

Meta switched to a lighter format called bfloat16 (a type of 16-bit precision) for both training and inference. Unlike FP16, it keeps the same range as FP32 but sacrifices some precision. This means it can represent very large and very small numbers safely. It cuts the memory requirement in half and allows GPUs to process data faster. The model’s accuracy remains high, but performance improves immediately.
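To make this concrete, here is a minimal PyTorch sketch of running a denoising network with bfloat16 weights and activations. The TinyUNet module is a placeholder stand-in, not Meta's actual model:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a diffusion denoiser; not Meta's actual model.
class TinyUNet(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(channels, 4, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
# Cast weights to bfloat16: half the memory of FP32, same exponent range.
model = TinyUNet().to(device=device, dtype=torch.bfloat16)

latents = torch.randn(1, 4, 64, 64, device=device, dtype=torch.bfloat16)
with torch.no_grad():
    out = model(latents)  # activations stay in bf16, halving memory traffic
print(out.dtype)  # torch.bfloat16
```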

2 - More Efficient Temporal Attention

When creating a video from images, the model needs to understand how frames relate to each other. Normally, this involves copying the same context information across all frames before running attention layers, which wastes memory and compute.

Meta changed the process so that the copying happens after the linear projection step in cross-attention. Since the context is identical, delaying expansion avoids unnecessary work and lowers the memory footprint.
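The idea can be sketched in a few lines of PyTorch. This is a simplified illustration, not Meta's implementation: the shared context is projected once, and only the resulting keys and values are broadcast across frames as a zero-copy view:

```python
import torch

def cross_attention_shared_context(frames_q, context, w_q, w_k, w_v):
    # frames_q: (F, Tq, D) per-frame queries; context: (1, Tkv, D) shared context
    q = frames_q @ w_q                                     # project per-frame queries
    k = (context @ w_k).expand(frames_q.shape[0], -1, -1)  # project ONCE, expand after
    v = (context @ w_v).expand(frames_q.shape[0], -1, -1)  # expand() is a view: no copy
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

num_frames, tq, tkv, dim = 16, 256, 77, 64
out = cross_attention_shared_context(
    torch.randn(num_frames, tq, dim), torch.randn(1, tkv, dim),
    torch.randn(dim, dim), torch.randn(dim, dim), torch.randn(dim, dim),
)
print(out.shape)  # torch.Size([16, 256, 64])
```

Because the context tensor is identical for every frame, expanding it before the projections would have meant projecting the same values F times; expanding after the projection does that work once.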

3 - Fewer Sampling Steps with DPM-Solver

Diffusion models usually need to run through many tiny steps to turn random noise into a clear picture or video.

The more steps we use, the better the result, but it takes longer. Meta used a faster method called DPM-Solver, which can get almost the same quality in far fewer steps (about 15 instead of dozens or even hundreds).

This makes the process much quicker while still keeping the output sharp.
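Meta's internal stack is not public, but the same idea is easy to demonstrate with the open-source Hugging Face diffusers library, where DPM-Solver++ is available as a drop-in scheduler (the model ID below is just a public example):

```python
import torch
from diffusers import DPMSolverMultistepScheduler, StableDiffusionPipeline

# Open-source illustration; Meta's internal pipeline differs.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.bfloat16
).to("cuda")

# Swap in DPM-Solver++ so far fewer denoising steps are needed.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# ~15 solver steps instead of the 50+ common with simpler samplers.
image = pipe("a cat surfing a wave", num_inference_steps=15).images[0]
image.save("cat.png")
```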

4 - Combining Step Distillation and Guidance Distillation

Meta combined two powerful techniques into one process:

- Step distillation: a student model is trained to match several denoising steps of the teacher in a single step, cutting the number of solver steps.
- Classifier-free guidance distillation: the student also absorbs the effect of guidance, so each step needs a single U-Net forward pass instead of separate conditional and unconditional passes.


By merging these, Meta reduced the pipeline to 8 solver steps, with just one U-Net forward pass per step. In effect, they compressed 32 teacher steps into 8 student steps, while also removing the overhead of multiple passes per step. This was a 3X reduction.
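The sketch below shows the shape of such a combined distillation loss, using toy modules and a made-up update rule purely to illustrate the bookkeeping: the teacher spends two forward passes per step (for classifier-free guidance) across several steps, while the student is trained to land on the same result in one pass of one step:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins so the sketch runs end to end; the real models are large video U-Nets.
class ToyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Linear(16, 16)

    def forward(self, x, cond=None):
        h = self.body(x)
        return h + cond if cond is not None else h

teacher, student = ToyUNet(), ToyUNet()

def teacher_cfg(x, cond, scale=7.5):
    # Classifier-free guidance costs TWO teacher passes per step:
    eps_uncond = teacher(x, cond=None)
    eps_cond = teacher(x, cond=cond)
    return eps_uncond + scale * (eps_cond - eps_uncond)

def distill_loss(x, cond, k=4):
    # Step distillation: the teacher takes k small guided steps...
    with torch.no_grad():
        target = x
        for _ in range(k):
            target = target - 0.1 * teacher_cfg(target, cond)  # toy update rule
    # ...and the student must reach the same point in ONE pass of ONE step,
    # with the guidance behavior baked in (guidance distillation).
    student_step = x - 0.1 * student(x, cond=cond)
    return F.mse_loss(student_step, target)

loss = distill_loss(torch.randn(2, 16), torch.randn(2, 16))
loss.backward()
print(float(loss))
```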

5 - Improvements with PyTorch Deployment

At first, the system used TorchScript with freezing, which improved performance through graph simplification, constant folding, and fusing operations. Later, Meta migrated to PyTorch 2.0 with torch.compile at the component level.

This provided better tracing, faster iteration for new features, and support for advanced techniques like context parallelism and sequence parallelism, as well as multi-GPU inference for larger workloads.
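A minimal sketch of component-level compilation; the Denoiser module is a stand-in for one pipeline component, not Meta's code:

```python
import torch
import torch.nn as nn

# Stand-in for one component of the generation pipeline (e.g., the denoiser).
class Denoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))

    def forward(self, x):
        return self.net(x)

denoiser = Denoiser()
# Compile only the hot component; the rest of the pipeline stays eager,
# so new features can be iterated on without re-tracing everything.
denoiser = torch.compile(denoiser, mode="max-autotune")
print(denoiser(torch.randn(8, 64)).shape)
```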

Running at Planet Scale

Speeding up the model itself was only half the challenge.

Meta also had to make sure the system could handle billions of requests every day across the globe without slowing down or failing. That required careful planning and new traffic management systems.

1 - Capacity Modeling and Load Tests

Meta started by estimating how many people would use the feature.

They looked at data from earlier AI launches, calculated how many requests per second that would mean, and then mapped those numbers to the GPU power required. Using these projections, they ran large-scale load tests, identified bottlenecks, and removed them step by step until the system could handle the expected traffic.
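Meta's actual figures are not public, but the arithmetic looks roughly like this back-of-the-envelope sketch, in which every number is invented; the method (requests per second, to GPU-seconds, to fleet size) is what matters:

```python
# Back-of-the-envelope capacity model. Every number here is hypothetical.
peak_rps = 5_000            # projected peak requests per second
seconds_per_request = 4.0   # GPU time to animate one image
requests_per_gpu_per_sec = 1 / seconds_per_request  # one request per GPU at a time
headroom = 1.3              # buffer for spikes, failures, and maintenance

gpus_needed = peak_rps / requests_per_gpu_per_sec * headroom
print(f"~{gpus_needed:,.0f} GPUs at peak")  # ~26,000 in this toy example
```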

2 - Fixing Latency from Cross-Region Routing

Even after model optimizations, end-to-end latency was higher than expected.

The problem was that requests were often routed to faraway regions, adding seconds of network delay.

To fix this, Meta built a traffic management system that creates a routing table. This ensures requests stay within their own region whenever possible, and are only redirected elsewhere if the local region is close to running out of capacity. This adjustment brought latency back down to target levels.
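A simplified sketch of that routing preference; the regions, utilization numbers, and threshold are all hypothetical:

```python
def pick_region(home_region: str, utilization: dict, threshold: float = 0.85) -> str:
    """Prefer the requester's home region; spill over only when it is
    close to capacity. A simplified sketch of the routing-table idea."""
    if utilization[home_region] < threshold:
        return home_region
    # Otherwise fall back to the least-loaded region.
    return min(utilization, key=utilization.get)

util = {"us-east": 0.91, "us-west": 0.60, "eu-west": 0.72}
print(pick_region("us-east", util))  # -> us-west
```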

3 - How the Traffic Manager Makes Decisions

The system works in tiers: it first tries to keep every request in its home region, and only when a region runs hot does the traffic manager offload the excess to the closest neighboring regions.


If overloads are severe, the system gradually expands the search radius outward, creating “rings” of nearby regions. Once it finds a stable balance, it locks in a routing table, which is then used for live requests.
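Here is a toy sketch of that ring expansion when computing an offload plan. The region names and loads are invented, and real "rings" group several regions at a similar network distance rather than one neighbor per ring:

```python
# Toy sketch of ring expansion when computing a routing table.
# Loads are a fraction of each region's capacity; all values are invented.
LOAD = {"us-east": 1.10, "us-west": 0.70, "eu-west": 0.95, "ap-south": 0.60}
NEIGHBORS = {"us-east": ["us-west", "eu-west", "ap-south"]}  # sorted by distance

def offload_plan(region: str, capacity: float = 1.0) -> dict:
    """Expand outward ring by ring until the overload is absorbed."""
    overload = LOAD[region] - capacity
    plan = {}
    for ring, neighbor in enumerate(NEIGHBORS[region], start=1):
        if overload <= 0:
            break  # stable balance found; lock in the routing table
        spare = capacity - LOAD[neighbor]
        if spare > 0:
            moved = min(spare, overload)
            plan[neighbor] = (ring, round(moved, 2))
            overload -= moved
    return plan

print(offload_plan("us-east"))  # {'us-west': (1, 0.1)}
```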

4 - No Queue Rule and Success Rate Dips

Each GPU can process only one request at a time.

To keep latency low, Meta does not allow queuing. If a GPU is busy, the request is rejected outright. This keeps performance consistent but causes more failures when the system runs near capacity. Instead of building a large, slow global queue, Meta relied on quick retries to help requests find available GPUs, an approach that worked well before traffic was regionalized.
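A toy sketch of this reject-and-retry pattern; the 70% availability figure and the helper names are made up for illustration:

```python
import random

def try_animate(payload) -> bool:
    """Pretend GPU worker: a busy GPU rejects instead of queueing.
    The 70% availability figure is invented."""
    return random.random() < 0.7

def animate_with_retries(payload, attempts: int = 3) -> bool:
    # No queue: rejections come back immediately, and the client quickly
    # retries, hoping to land on a free GPU.
    for _ in range(attempts):
        if try_animate(payload):
            return True
    return False

print(animate_with_retries({"image_id": 42}))
```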

5 - Retry Storms

Once traffic was pinned to local regions, retrying became a problem. With fewer machines to send retries to, spikes of retries caused “retry storms,” where too many requests flooded the system at once.

To solve this, Meta made two key changes to how retries are handled. These adjustments smoothed out traffic, prevented cascades, and restored high success rates.

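The post does not spell out Meta's exact changes, but a standard mitigation for retry storms is jittered exponential backoff, sketched here purely for illustration:

```python
import random
import time

def retry_with_backoff(call, max_attempts: int = 4, base: float = 0.25) -> bool:
    """Jittered exponential backoff: each failed attempt waits a randomized,
    exponentially growing interval so clients stop retrying in lockstep.
    (Illustrative only; not confirmed as Meta's approach.)"""
    for attempt in range(max_attempts):
        if call():
            return True
        time.sleep(random.uniform(0, base * 2 ** attempt))
    return False

print(retry_with_backoff(lambda: random.random() < 0.5))
```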

Conclusion

Meta’s “animate” feature demonstrates the kind of engineering discipline required to bring cutting-edge AI research into the hands of billions of people.

What appears to users as a simple transformation from a static picture to a moving image is the outcome of dozens of technical breakthroughs across both model design and infrastructure.

On the modeling side, the team reduced the cost of diffusion-based generation through precision optimizations, more efficient attention mechanisms, faster solvers, and aggressive distillation. Together, these measures compressed what once required dozens of heavy steps into just a handful of efficient operations, all while preserving output quality.

Yet scaling the feature was as much a systems problem as it was a research one. Running diffusion pipelines across a global user base required careful demand forecasting, new approaches to region-aware routing, and strict rules around GPU utilization to keep latency predictable.

The key learning is that AI isn’t just about new models, but also about the infrastructure, software tooling, and operational strategies that allow those models to perform at scale.

References:


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing sponsorship@bytebytego.com.
