Microsoft Research Blog - Microsoft Research, September 12
RenderFormer: AI Reshapes the 3D Rendering Paradigm

Microsoft researchers have introduced RenderFormer, a new machine-learning-based 3D rendering model that dispenses entirely with traditional ray tracing and rasterization. RenderFormer handles arbitrary 3D scenes with global illumination: it converts 3D models, materials, and lighting into triangle tokens and produces the final image with a dual-transformer architecture, one branch for view-independent effects and one for view-dependent effects. Trained on the Objaverse dataset, the model accurately reproduces shadows, diffuse shading, and specular highlights, and can also generate continuous video. RenderFormer marks a major breakthrough for neural networks in graphics rendering and opens a new path for the convergence of visual computing and artificial intelligence.

✨ **AI-driven general-purpose 3D rendering**: RenderFormer is the first general-purpose 3D rendering model based entirely on machine learning, requiring no traditional ray tracing or rasterization. It handles arbitrary 3D scenes, supports global illumination, and folds the entire rendering pipeline into a neural network.

📐 **Token-based scene representation**: The model represents a 3D scene as triangle tokens, each encoding spatial position, surface normal, and physical material properties such as diffuse color, specular color, and roughness. Lighting is also modeled as tokens carrying emission values.

💡 **Dual-transformer architecture**: RenderFormer uses two transformers. One handles view-independent features such as shadows and diffuse light transport; the other handles view-dependent effects such as visibility, reflections, and specular highlights via cross-attention.

📊 **Strong generalization and training**: Trained on the Objaverse dataset, RenderFormer generalizes well to complex real-world scenes. It accurately reproduces shadows, diffuse shading, and specular highlights, and can generate continuous video sequences.

🚀 **Future potential**: RenderFormer lays a foundation for combining graphics rendering with AI, with the potential to advance video generation, image synthesis, robotics, and embodied AI, opening new possibilities for visual computing.

3D rendering—the process of converting three-dimensional models into two-dimensional images—is a foundational technology in computer graphics, widely used across gaming, film, virtual reality, and architectural visualization. Traditionally, this process has depended on physics-based techniques like ray tracing and rasterization, which simulate light behavior through mathematical formulas and expert-designed models.

Now, thanks to advances in AI, especially neural networks, researchers are beginning to replace these conventional approaches with machine learning (ML). This shift is giving rise to a new field known as neural rendering.

Neural rendering combines deep learning with traditional graphics techniques, allowing models to simulate complex light transport without explicitly modeling physical optics. This approach offers significant advantages: it eliminates the need for handcrafted rules, supports end-to-end training, and can be optimized for specific tasks. Yet, most current neural rendering methods rely on 2D image inputs, lack support for raw 3D geometry and material data, and often require retraining for each new scene—limiting their generalizability.

RenderFormer: Toward a general-purpose neural rendering model

To overcome these limitations, researchers at Microsoft Research have developed RenderFormer, a new neural architecture designed to support full-featured 3D rendering using only ML—no traditional graphics computation required. RenderFormer is the first model to demonstrate that a neural network can learn a complete graphics rendering pipeline, including support for arbitrary 3D scenes and global illumination, without relying on ray tracing or rasterization. This work has been accepted at SIGGRAPH 2025 and is open-sourced on GitHub.

Architecture overview

As shown in Figure 1, RenderFormer represents the entire 3D scene using triangle tokens—each one encoding spatial position, surface normal, and physical material properties such as diffuse color, specular color, and roughness. Lighting is also modeled as triangle tokens, with emission values indicating intensity.

Figure 1. Architecture of RenderFormer
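
To make the triangle-token representation concrete, the following is a minimal sketch of how one triangle's geometry, material, and emission might be packed into a feature vector and projected to the transformer's token width. The field layout, dimensions, and MLP encoder are illustrative assumptions, not the released RenderFormer code.

```python
# Minimal sketch of a triangle-token encoding (illustrative field layout,
# not the released RenderFormer implementation).
import torch
import torch.nn as nn

def triangle_token_features(v0, v1, v2, normal, diffuse, specular, roughness, emission):
    """Pack one triangle's geometry, material, and emission into a flat feature vector.

    v0, v1, v2 : (3,) vertex positions
    normal     : (3,) surface normal
    diffuse    : (3,) diffuse RGB,  specular : (3,) specular RGB
    roughness  : (1,) scalar,       emission : (3,) RGB emission (non-zero for light sources)
    """
    return torch.cat([v0, v1, v2, normal, diffuse, specular, roughness, emission])  # (22,)

class TriangleTokenizer(nn.Module):
    """Project raw per-triangle features into the transformer's token dimension."""
    def __init__(self, feat_dim: int = 22, d_model: int = 768):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, d_model), nn.GELU(),
                                  nn.Linear(d_model, d_model))

    def forward(self, tri_feats: torch.Tensor) -> torch.Tensor:
        # tri_feats: (num_triangles, feat_dim) -> (num_triangles, d_model) tokens
        return self.proj(tri_feats)
```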

To describe the viewing direction, the model uses ray bundle tokens derived from a ray map—each pixel in the output image corresponds to one of these rays. To improve computational efficiency, pixels are grouped into rectangular blocks, with all rays in a block processed together.

The model outputs a set of tokens that are decoded into image pixels, completing the rendering process entirely within the neural network.
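
As a rough illustration of this ray-bundle scheme, the sketch below groups a per-pixel ray map into rectangular blocks, embeds each block as one token, and decodes tokens back into blocks of pixels. The block size, token dimension, and linear layers are assumptions for illustration, not the published implementation.

```python
# Sketch of ray-bundle tokenization and pixel decoding (block size and
# projection layers are illustrative assumptions).
import torch
import torch.nn as nn

class RayBundleTokenizer(nn.Module):
    """Group per-pixel rays into rectangular blocks and embed each block as one token."""
    def __init__(self, block: int = 8, d_model: int = 768):
        super().__init__()
        self.block = block
        self.embed = nn.Linear(block * block * 3, d_model)   # 3 = (dx, dy, dz) per ray
        self.decode = nn.Linear(d_model, block * block * 3)  # back to RGB per pixel

    def tokenize(self, ray_map: torch.Tensor) -> torch.Tensor:
        # ray_map: (H, W, 3) unit ray directions, one per output pixel
        H, W, _ = ray_map.shape
        b = self.block
        blocks = ray_map.reshape(H // b, b, W // b, b, 3).permute(0, 2, 1, 3, 4)
        return self.embed(blocks.reshape(-1, b * b * 3))      # (num_blocks, d_model)

    def to_image(self, tokens: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # Decode each ray-bundle token back into its rectangular block of RGB pixels.
        b = self.block
        pix = self.decode(tokens).reshape(H // b, W // b, b, b, 3)
        return pix.permute(0, 2, 1, 3, 4).reshape(H, W, 3)
```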

Dual-branch design for view-independent and view-dependent effects

The RenderFormer architecture is built around two transformers: one models view-independent effects such as shadows and diffuse light transport among the triangle tokens, while the other captures view-dependent effects such as visibility, reflections, and specular highlights through cross-attention between ray bundle tokens and triangle tokens.

Additional image-space effects, such as anti-aliasing and screen-space reflections, are handled via self-attention among ray bundle tokens.
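
A minimal sketch of this two-branch wiring, built from standard PyTorch attention modules, is shown below. The layer counts, dimensions, and exact attention layout are assumptions for illustration rather than the actual RenderFormer architecture.

```python
# Sketch of the dual-branch wiring: a view-independent transformer over triangle
# tokens, then a view-dependent branch where ray-bundle tokens cross-attend to the
# scene and self-attend among themselves (all sizes are illustrative).
import torch
import torch.nn as nn

class DualBranchRenderer(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12, n_layers: int = 6):
        super().__init__()
        # View-independent branch: triangle tokens attend to each other
        # (shadows, diffuse light transport).
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.view_independent = nn.TransformerEncoder(enc_layer, n_layers)
        # View-dependent branch: ray-bundle tokens query the scene via cross-attention
        # (visibility, reflections, specular highlights), then refine via self-attention
        # (image-space effects such as anti-aliasing).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, triangle_tokens: torch.Tensor, ray_tokens: torch.Tensor) -> torch.Tensor:
        # triangle_tokens: (B, T, d_model), ray_tokens: (B, R, d_model)
        scene = self.view_independent(triangle_tokens)
        x, _ = self.cross_attn(ray_tokens, scene, scene)  # rays gather light from the scene
        x, _ = self.self_attn(x, x, x)                    # image-space interactions
        return x                                          # decoded to pixels downstream
```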

To validate the architecture, the team conducted ablation studies and visual analyses, confirming the importance of each component in the rendering pipeline.

Table 1. Ablation study analyzing the impact of different components and attention mechanisms on the final performance of the trained network.

To test the capabilities of the view-independent transformer, researchers trained a decoder to produce diffuse-only renderings. The results, shown in Figure 2, demonstrate that the model can accurately simulate shadows and other indirect lighting effects.

Figure 2. View-independent rendering effects decoded directly from the view-independent transformer, including diffuse lighting and coarse shadow effects.

The view-dependent transformer was evaluated through attention visualizations. For example, in Figure 3, the attention map reveals a pixel on a teapot attending to its surface triangle and to a nearby wall—capturing the effect of specular reflection. These visualizations also show how material changes influence the sharpness and intensity of reflections.

Figure 3. Visualization of attention outputs
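
For readers who want to reproduce this kind of analysis, the snippet below shows one way to pull per-query attention weights out of a standard cross-attention module and find which triangles a given ray-bundle token attends to most. The module, shapes, and query index are illustrative placeholders (random tensors run but carry no meaning); this is not the authors' visualization code.

```python
# Sketch of inspecting cross-attention weights for one ray-bundle (pixel block)
# token over the triangle tokens; all shapes and indices are illustrative.
import torch
import torch.nn as nn

d_model, n_heads = 768, 12
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

ray_tokens = torch.randn(1, 1024, d_model)       # (batch, ray-bundle tokens, dim)
triangle_tokens = torch.randn(1, 1536, d_model)  # (batch, triangle tokens, dim)

# average_attn_weights=True returns head-averaged weights: (batch, queries, keys)
_, attn = cross_attn(ray_tokens, triangle_tokens, triangle_tokens,
                     need_weights=True, average_attn_weights=True)

query_idx = 200                   # e.g. a pixel block lying on the teapot
weights = attn[0, query_idx]      # attention over every triangle for that block
top = torch.topk(weights, k=5)    # the triangles it "looks at" most strongly
print(top.indices.tolist(), top.values.tolist())
```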

Training methodology and dataset design

RenderFormer was trained using the Objaverse dataset, a collection of more than 800,000 annotated 3D objects that is designed to advance research in 3D modeling, computer vision, and related fields. The researchers designed four scene templates, populating each with 1–3 randomly selected objects and materials. Scenes were rendered in high dynamic range (HDR) using Blender’s Cycles renderer, under varied lighting conditions and camera angles.

The base model, with 205 million parameters, was trained in two phases using the AdamW optimizer.
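
As a rough sketch of what a two-phase AdamW setup might look like, the snippet below wires up the optimizer with a cosine schedule and a simple image loss. The learning rates, step counts, scheduler, and loss are placeholders, not the published training recipe.

```python
# Minimal sketch of a two-phase AdamW training setup; all hyperparameters and
# the image loss are placeholders, not the published RenderFormer recipe.
import torch
import torch.nn.functional as F

def train_phase(model, dataloader, lr, steps, device="cuda"):
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=steps)
    model.train()
    for step, (triangle_tokens, ray_tokens, target_hdr) in enumerate(dataloader):
        if step >= steps:
            break
        pred = model(triangle_tokens.to(device), ray_tokens.to(device))
        loss = F.l1_loss(pred, target_hdr.to(device))  # placeholder image loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()

# Illustrative usage only: phase 1 at lower resolution, phase 2 fine-tuning at
# higher resolution (actual resolutions, step counts, and rates are not given here).
# train_phase(model, low_res_loader, lr=1e-4, steps=200_000)
# train_phase(model, high_res_loader, lr=1e-5, steps=50_000)
```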

The model supports arbitrary triangle-based input and generalizes well to complex real-world scenes. As shown in Figure 4, it accurately reproduces shadows, diffuse shading, and specular highlights.

Figure 4. Rendered results of different 3D scenes generated by RenderFormer

RenderFormer can also generate continuous video by rendering individual frames, thanks to its ability to model viewpoint changes and scene dynamics.

3D animation sequence rendered by RenderFormer
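
As a small illustration of this frame-by-frame approach, the sketch below renders one image per interpolated camera pose and stacks the results into a video tensor. The linear pose interpolation and the make_ray_map helper are assumptions introduced for illustration, not part of the released code.

```python
# Sketch of rendering a video one frame at a time along a camera path.
# Pose interpolation and the make_ray_map helper are illustrative assumptions.
import torch

def render_video(model, triangle_tokens, pose_start, pose_end, make_ray_map, num_frames=120):
    """Render one frame per interpolated camera pose and stack them into a video tensor."""
    frames = []
    for t in torch.linspace(0.0, 1.0, num_frames):
        pose = (1 - t) * pose_start + t * pose_end          # naive linear pose interpolation
        ray_map = make_ray_map(pose)                        # (H, W, 3) per-pixel ray directions
        with torch.no_grad():
            frames.append(model(triangle_tokens, ray_map))  # (H, W, 3) image for this frame
    return torch.stack(frames)                              # (num_frames, H, W, 3)
```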

Looking ahead: Opportunities and challenges

RenderFormer represents a significant step forward for neural rendering. It demonstrates that deep learning can replicate and potentially replace the traditional rendering pipeline, supporting arbitrary 3D inputs and realistic global illumination—all without any hand-coded graphics computations.

However, key challenges remain. Scaling to larger and more complex scenes with intricate geometry, advanced materials, and diverse lighting conditions will require further research. Still, the transformer-based architecture provides a solid foundation for future integration with broader AI systems, including video generation, image synthesis, robotics, and embodied AI. 

Researchers hope that RenderFormer will serve as a building block for future breakthroughs in both graphics and AI, opening new possibilities for visual computing and intelligent environments.
