Microsoft Research Blog - Microsoft Research, September 12
MindJourney: A new framework for AI to explore 3D space

MindJourney is a new research framework designed to address a key limitation of vision-language models (VLMs): while they excel at understanding and describing static images, they struggle to interpret the interactive 3D world behind those 2D images. The framework solves this by letting an AI agent perform "mental exploration" of a virtual space, emulating human spatial reasoning. MindJourney uses a "world model", a video generation system trained on video footage, to predict how a scene looks from different viewpoints, and pairs it with a VLM that filters and selects the most useful views. The agent can thus explore 3D space efficiently, improving its spatial understanding and raising VLM accuracy on the SAT benchmark by 8%.

💡 **MindJourney transforms AI spatial understanding**: By simulating an AI agent's "mental exploration" of a virtual three-dimensional space, the framework addresses the difficulty current vision-language models (VLMs) have in understanding the complex, interactive 3D world behind 2D images. It lets AI solve spatial reasoning tasks the way people do, by imagining and combining information from different viewpoints, significantly improving its grasp of object positions and its own movement.

🗺️ **A "world model" and a VLM explore together**: At the core of MindJourney is its "world model", a video generation system trained on a large collection of single-viewpoint, moving-camera videos that can predict how a scene would appear from new perspectives. At inference time, the model generates images of the possible views reachable from the agent's current position, while the VLM acts as a filter, picking the views most likely to answer the user's spatial question. Through this iterative cycle of generation, evaluation, and integration, the AI can explore 3D space effectively without any additional training.

🚀 **Efficient spatial search with broad potential applications**: MindJourney uses a "spatial beam search" algorithm that prioritizes the most promising paths, ensuring enough evidence is gathered within a fixed number of steps for efficient, methodical 3D spatial reasoning. The technique not only delivers a significant VLM performance gain on the Spatial Aptitude Training (SAT) benchmark but also opens new application prospects in autonomous robotics, smart homes, and assistive tools for people with visual impairments, enabling AI to better understand and adapt to real-world environments.

🔗 **Connecting computer vision with planning**: MindJourney turns systems that merely describe images into active agents that continually evaluate where to look next, effectively linking computer vision with planning. Because exploration happens entirely in the model's latent space (its internal representation of the scene), a robot could test multiple viewpoints before acting in the real world, potentially reducing wear, energy use, and collision risk for safer, more efficient operation.

A new research framework helps AI agents explore three-dimensional spaces they can't directly observe. Called MindJourney, the approach addresses a key limitation in vision-language models (VLMs), which give AI agents their ability to interpret and describe visual scenes.

While VLMs are strong at identifying objects in static images, they struggle to interpret the interactive 3D world behind 2D images. This gap shows up in spatial questions like “If I sit on the couch that is on my right and face the chairs, will the kitchen be to my right or left?”—tasks that require an agent to interpret its position and movement through space. 

People overcome this challenge by mentally exploring a space, imagining moving through it and combining those mental snapshots to work out where objects are. MindJourney applies the same process to AI agents, letting them roam a virtual space before answering spatial questions. 

How MindJourney navigates 3D space

To perform this type of spatial navigation, MindJourney uses a world model—in this case, a video generation system trained on a large collection of videos captured from a single moving viewpoint, showing actions such as going forward and turning left or right, much like a 3D cinematographer. From this, it learns to predict how a new scene would appear from different perspectives.

At inference time, the model can generate photo-realistic images of a scene based on possible movements from the agent’s current position. It generates multiple possible views of a scene while the VLM acts as a filter, selecting the constructed perspectives that are most likely to answer the user’s question.

These are kept and expanded in the next iteration, while less promising paths are discarded. This process, shown in Figure 1, avoids the need to generate and evaluate thousands of possible movement sequences by focusing only on the most informative perspectives.

Figure 1. Given a spatial reasoning query, MindJourney searches through the imagined 3D space using a world model and improves the VLM’s spatial interpretation through generated observations when encountering new challenges. 
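To make this generate-and-filter step concrete, here is a minimal Python sketch of one expansion. The `world_model.render` and `vlm.score` interfaces, the `Beam` container, and the three-action vocabulary are all illustrative assumptions, not the published MindJourney API:

```python
from dataclasses import dataclass, field

# Hypothetical action vocabulary matching the camera motions the
# world model was trained on: moving forward and turning left or right.
ACTIONS = ["forward", "turn_left", "turn_right"]

@dataclass
class Beam:
    """One imagined trajectory: actions taken, views generated, VLM score."""
    actions: list = field(default_factory=list)
    views: list = field(default_factory=list)  # generated images
    score: float = 0.0                         # VLM's relevance estimate

def expand(beam, world_model, vlm, question):
    """Imagine each possible next move and let the VLM rate the result."""
    candidates = []
    for action in ACTIONS:
        # The world model predicts how the scene would look after this move.
        view = world_model.render(beam.views[-1], action)
        # The VLM estimates how useful the imagined view is for the question.
        score = vlm.score(view, question)
        candidates.append(
            Beam(beam.actions + [action], beam.views + [view], score)
        )
    return candidates
```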

 

To make its search through a simulated space both effective and efficient, MindJourney uses a spatial beam search—an algorithm that prioritizes the most promising paths. It works within a fixed number of steps, each representing a movement. By balancing breadth with depth, spatial beam search enables MindJourney to gather strong supporting evidence. This process is illustrated in Figure 2.

Figure 2. The MindJourney workflow starts with a spatial beam search for a set number of steps before answering the query. The world model interactively generates new observations, while a VLM interprets the generated images, guiding the search throughout the process.
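Reusing the hypothetical `Beam` and `expand` helpers sketched above, the overall loop might look like the following; the beam width, step budget, and `vlm.answer` call are assumptions for illustration rather than the published implementation:

```python
def spatial_beam_search(image, question, world_model, vlm,
                        beam_width=3, num_steps=4):
    """Search imagined 3D space for views that best support an answer.

    Only the `beam_width` highest-scoring trajectories survive each step,
    keeping the search tractable within a budget of `num_steps` movements.
    """
    beams = [Beam(views=[image])]  # start from the real observation
    for _ in range(num_steps):
        # Expand every surviving trajectory by one imagined movement.
        candidates = [c for b in beams
                      for c in expand(b, world_model, vlm, question)]
        # Prune: keep the most promising paths, discard the rest.
        beams = sorted(candidates, key=lambda b: b.score,
                       reverse=True)[:beam_width]
    # Answer using the original image plus the best imagined views.
    evidence = [image] + [v for b in beams for v in b.views[1:]]
    return vlm.answer(evidence, question)
```

Because pruning happens at every step, the number of generated images grows linearly with the step budget rather than exponentially with the full action tree.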

By iterating through simulation, evaluation, and integration, MindJourney can reason about spatial relationships far beyond what any single 2D image can convey, all without the need for additional training. On the Spatial Aptitude Training (SAT) benchmark, it improved the accuracy of VLMs by 8% over their baseline performance.


Building smarter agents  

MindJourney showed strong performance on multiple 3D spatial-reasoning benchmarks, and even advanced VLMs improved when paired with its imagination loop. This suggests that the spatial patterns that world models learn from raw images, combined with the symbolic capabilities of VLMs, create a more complete spatial capability for agents. Together, they enable agents to infer what lies beyond the visible frame and interpret the physical world more accurately. 

It also demonstrates that pretrained VLMs and trainable world models can work together in 3D without retraining either one—pointing toward general-purpose agents capable of interpreting and acting in real-world environments. This opens the way to possible applications in autonomous robotics, smart home technologies, and accessibility tools for people with visual impairments. 

By converting systems that simply describe static images into active agents that continually evaluate where to look next, MindJourney connects computer vision with planning. Because exploration occurs entirely within the model’s latent space—its internal representation of the scene—robots would be able to test multiple viewpoints before determining their next move, potentially reducing wear, energy use, and collision risk. 

Looking ahead, we plan to extend the framework to use world models that not only predict new viewpoints but also forecast how the scene might change over time. We envision MindJourney working alongside VLMs that interpret those predictions and use them to plan what to do next. This enhancement could enable agents to interpret spatial relationships and physical dynamics more accurately, helping them operate effectively in changing environments.
