MarkTechPost@AI · September 18
MapAnything: A New Transformer Model Unifying 3D Reconstruction Tasks

Researchers from Meta Reality Labs and Carnegie Mellon University have released MapAnything, an end-to-end transformer architecture that directly regresses factored, metric 3D scene geometry from images and optional sensor inputs. The model supports more than 12 distinct 3D vision tasks in a single feed-forward pass. MapAnything overcomes the limitations of traditional piecewise 3D reconstruction pipelines: it can process up to 2,000 input images, flexibly uses auxiliary data, and produces direct metric 3D reconstructions without cumbersome optimization. Its core components, a multi-view alternating-attention transformer and a factored scene representation, deliver leading performance across a broad range of 3D vision tasks.

🗺️ **Unified end-to-end transformer model**: MapAnything is a novel transformer architecture that consolidates multiple 3D reconstruction tasks, such as monocular depth estimation, multi-view stereo (MVS), structure-from-motion (SfM), and camera calibration, into a single feed-forward model. This breaks with the traditional pattern of designing task-specific pipelines and markedly simplifies the process of obtaining 3D scene geometry.

📐 **Factored scene representation**: The model's core strength is its factored scene representation, which decomposes 3D scene information into independent components: per-view ray directions (camera intrinsics), depth along each ray (predicted up to scale), camera poses relative to a reference view, and a single global metric scale factor. This decomposition avoids redundancy and greatly improves the model's generality and modularity, letting it adapt flexibly to different inputs and tasks.

🚀 **Strong performance and scalability**: MapAnything achieves state-of-the-art (SoTA) results on multiple 3D vision benchmarks, standing out in multi-view dense reconstruction, two-view reconstruction, and depth estimation. The model can handle up to 2,000 input images, and performance improves further when auxiliary data (camera intrinsics, poses, and depth maps) is provided, demonstrating strong scalability and robustness.

💡 **Open resources and contributions**: Beyond the model itself, the team has open-sourced the full codebase, including data processing, training scripts, benchmarks, and pretrained weights, under the Apache 2.0 license. This greatly facilitates research and applications in 3D vision and lays a solid foundation for a general-purpose 3D reconstruction backbone.

A team of researchers from Meta Reality Labs and Carnegie Mellon University has introduced MapAnything, an end-to-end transformer architecture that directly regresses factored metric 3D scene geometry from images and optional sensor inputs. Released under Apache 2.0 with full training and benchmarking code, MapAnything advances beyond specialist pipelines by supporting over 12 distinct 3D vision tasks in a single feed-forward pass.

https://map-anything.github.io/assets/MapAnything.pdf

Why a Universal Model for 3D Reconstruction?

Image-based 3D reconstruction has historically relied on fragmented pipelines: feature detection, two-view pose estimation, bundle adjustment, multi-view stereo, or monocular depth inference. While effective, these modular solutions require task-specific tuning, optimization, and heavy post-processing.

Recent transformer-based feed-forward models such as DUSt3R, MASt3R, and VGGT simplified parts of this pipeline but remained limited: fixed numbers of views, rigid camera assumptions, or reliance on coupled representations that needed expensive optimization.

MapAnything overcomes these constraints by accepting a flexible number of views (up to 2,000 input images), optionally using auxiliary inputs such as intrinsics, poses, and depth, and directly regressing metric 3D geometry in a single feed-forward pass without expensive post-optimization.

The model’s factored scene representation—composed of ray maps, depth, poses, and a global scale factor—provides modularity and generality unmatched by prior approaches.

Architecture and Representation

At its core, MapAnything employs a multi-view alternating-attention transformer. Each input image is encoded with DINOv2 ViT-L features, while optional inputs (rays, depth, poses) are encoded into the same latent space via shallow CNNs or MLPs. A learnable scale token enables metric normalization across views.
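To make the alternating-attention idea concrete, here is a minimal sketch in PyTorch of a block that alternates within-view self-attention with cross-view attention over all tokens. The module names, dimensions, and layer choices are illustrative assumptions, not the released MapAnything code.

```python
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """One transformer block alternating within-view and cross-view attention.

    Conceptual sketch only: names, dimensions, and layer choices are
    illustrative, not the released MapAnything implementation.
    """
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.within_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (views, patches, dim), e.g. DINOv2 patch features per image,
        # already fused with any optional ray/depth/pose encodings and a scale token.
        v, p, d = tokens.shape

        # 1) Within-view attention: each view attends over its own patch tokens.
        x = self.norm1(tokens)
        tokens = tokens + self.within_attn(x, x, x, need_weights=False)[0]

        # 2) Cross-view attention: flatten all views so every token can attend
        #    to tokens from every other view.
        x = self.norm2(tokens).reshape(1, v * p, d)
        tokens = tokens + self.cross_attn(x, x, x, need_weights=False)[0].reshape(v, p, d)

        # 3) Position-wise feed-forward.
        return tokens + self.mlp(self.norm3(tokens))
```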

The network outputs a factored representation: per-view ray directions (encoding the camera intrinsics), depth along each ray (predicted up to scale), camera poses relative to a reference view, and a single global metric scale factor.

This explicit factorization avoids redundancy, allowing the same model to handle monocular depth estimation, multi-view stereo, structure-from-motion (SfM), or depth completion without specialized heads.
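As an illustration of how such a factored output can be recombined into world-frame metric pointmaps, the sketch below multiplies rays by depth, applies the metric scale, and transforms by each view's pose. The tensor shapes and conventions are assumptions made for clarity, not the official API.

```python
import torch

def compose_metric_points(ray_dirs, depth, cam_to_world, metric_scale):
    """Recombine factored outputs into world-frame metric 3D points.

    ray_dirs:      (V, H, W, 3) unit ray directions per view (encode intrinsics)
    depth:         (V, H, W)    depth along each ray, predicted up to scale
    cam_to_world:  (V, 4, 4)    camera poses relative to the reference view
    metric_scale:  scalar       single global metric scale factor

    Shapes and conventions are illustrative assumptions, not the official code.
    """
    # Points in each camera frame: ray direction * depth, lifted to metric units.
    pts_cam = ray_dirs * depth.unsqueeze(-1) * metric_scale   # (V, H, W, 3)

    # Rigid transform into the reference (world) frame, per view.
    R = cam_to_world[:, :3, :3]                               # (V, 3, 3)
    t = cam_to_world[:, :3, 3]                                # (V, 3)
    return torch.einsum('vij,vhwj->vhwi', R, pts_cam) + t[:, None, None, :]
```

Because scale is kept as its own factor, skipping the multiplication by `metric_scale` yields the same geometry up to scale, which is one reason a single network can serve both metric and up-to-scale settings.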

Training Strategy

MapAnything was trained across 13 diverse datasets spanning indoor, outdoor, and synthetic domains, including BlendedMVS, Mapillary Planet-Scale Depth, ScanNet++, and TartanAirV2. Two model variants are released.

Key training strategies include mixed-precision training, gradient checkpointing, and curriculum scheduling that scales from 4 to 24 input views; training was performed on 64 H200 GPUs.
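As a rough sketch of what a view-count curriculum could look like, the helper below grows the maximum number of sampled views over training. The linear schedule and function name are assumptions for illustration; the article only states that training scaled from 4 to 24 input views.

```python
import random

def views_for_epoch(epoch: int, total_epochs: int,
                    min_views: int = 4, max_views: int = 24) -> int:
    """Linearly raise the upper bound on sampled views as training progresses.

    Illustrative curriculum only: the article states that training scaled
    from 4 to 24 input views, not how the schedule was shaped.
    """
    frac = min(1.0, epoch / max(1, total_epochs - 1))
    upper = min_views + round(frac * (max_views - min_views))
    return random.randint(min_views, upper)

# Early epochs stay near 4 views; late epochs sample up to 24.
print([views_for_epoch(e, total_epochs=100) for e in (0, 50, 99)])
```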

Benchmarking Results

Multi-View Dense Reconstruction

On ETH3D, ScanNet++ v2, and TartanAirV2-WB, MapAnything achieves state-of-the-art (SoTA) performance across pointmaps, depth, pose, and ray estimation. It surpasses baselines like VGGT and Pow3R even when limited to images only, and improves further with calibration or pose priors.

Two-View Reconstruction

Against DUSt3R, MASt3R, and Pow3R, MapAnything consistently outperforms across scale, depth, and pose accuracy. Notably, with additional priors, it achieves >92% inlier ratios on two-view tasks, significantly beyond prior feed-forward models.
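An inlier ratio of this kind is usually computed as the fraction of predictions whose error falls under a threshold. The sketch below uses a relative-depth definition with an assumed 3% threshold; the benchmarks' exact inlier criteria may differ.

```python
import torch

def depth_inlier_ratio(pred_depth: torch.Tensor,
                       gt_depth: torch.Tensor,
                       rel_threshold: float = 0.03) -> float:
    """Fraction of valid pixels whose relative depth error is below a threshold.

    The 3% threshold and the depth-based definition are assumptions for
    illustration; the benchmarks may define inliers differently.
    """
    valid = gt_depth > 0
    rel_err = (pred_depth[valid] - gt_depth[valid]).abs() / gt_depth[valid]
    return (rel_err < rel_threshold).float().mean().item()
```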

Single-View Calibration

Despite not being trained specifically for single-image calibration, MapAnything achieves an average angular error of 1.18°, outperforming AnyCalib (2.01°) and MoGe-2 (1.95°).
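The angular error for calibration can be read as the mean per-pixel angle between predicted and ground-truth ray directions. A minimal sketch of that metric follows; the benchmark's exact evaluation protocol may differ.

```python
import torch

def mean_angular_error_deg(pred_rays: torch.Tensor, gt_rays: torch.Tensor) -> float:
    """Mean per-pixel angle (in degrees) between predicted and GT ray directions.

    pred_rays, gt_rays: (..., 3) unit vectors. This is a common definition of
    calibration error; the benchmark's exact protocol may differ.
    """
    cos = (pred_rays * gt_rays).sum(dim=-1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos)).mean().item()
```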

Depth Estimation

On the Robust-MVD benchmark, the same pattern holds. Overall, the benchmarks confirm a 2× improvement over prior SoTA methods on many tasks, validating the benefits of unified training.

Key Contributions

The research team highlights four major contributions:

1. Unified Feed-Forward Model capable of handling more than 12 problem settings, from monocular depth to SfM and stereo.
2. Factored Scene Representation enabling explicit separation of rays, depth, pose, and metric scale.
3. State-of-the-Art Performance across diverse benchmarks with fewer redundancies and higher scalability.
4. Open-Source Release including data processing, training scripts, benchmarks, and pretrained weights under Apache 2.0.

Conclusion

MapAnything establishes a new benchmark in 3D vision by unifying multiple reconstruction tasks—SfM, stereo, depth estimation, and calibration—under a single transformer model with a factored scene representation. It not only outperforms specialist methods across benchmarks but also adapts seamlessly to heterogeneous inputs, including intrinsics, poses, and depth. With open-source code, pretrained models, and support for over 12 tasks, MapAnything lays the groundwork for a truly general-purpose 3D reconstruction backbone.


Check out the Paper, Code, and Project Page.

The post Meta AI Researchers Release MapAnything: An End-to-End Transformer Architecture that Directly Regresses Factored, Metric 3D Scene Geometry appeared first on MarkTechPost.

