Multi-Perspective Textual Descriptions Improve VLN Navigation Performance

cs.AI updates on arXiv.org, September 30

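This paper proposes incorporating multi-perspective textual descriptions into a Vision-and-Language Navigation (VLN) agent. Text-based analogical reasoning across these descriptions strengthens the agent's global scene understanding and spatial reasoning, and the authors report significant improvements in navigation performance on the R2R dataset.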

arXiv:2509.25139v1 Announce Type: new Abstract: Integrating large language models (LLMs) into embodied AI models is becoming increasingly prevalent. However, existing zero-shot LLM-based Vision-and-Language Navigation (VLN) agents either encode images as textual scene descriptions, potentially oversimplifying visual details, or process raw image inputs, which can fail to capture abstract semantics required for high-level reasoning. In this paper, we improve the navigation agent's contextual understanding by incorporating textual descriptions from multiple perspectives that facilitate analogical reasoning across images. By leveraging text-based analogical reasoning, the agent enhances its global scene understanding and spatial reasoning, leading to more accurate action decisions. We evaluate our approach on the R2R dataset, where our experiments demonstrate significant improvements in navigation performance.
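The abstract does not say which perspectives are described or how the descriptions are presented to the LLM. As a rough illustration of the general idea only, the sketch below groups hypothetical per-view descriptions by perspective so that candidate views can be compared side by side in a single prompt. The class `CandidateView`, the function `build_comparison_prompt`, the perspective names, and the prompt wording are all assumptions made for this example, not the authors' method.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateView:
    """One navigable direction with text descriptions from several perspectives."""
    heading_deg: int
    # Maps a perspective name to its description. The names are hypothetical.
    descriptions: dict[str, str] = field(default_factory=dict)


def build_comparison_prompt(instruction: str, views: list[CandidateView]) -> str:
    """Group descriptions by perspective so the views can be compared directly."""
    # Collect every perspective that appears in at least one view.
    perspectives = sorted({p for v in views for p in v.descriptions})
    lines = [f"Instruction: {instruction}", ""]
    for p in perspectives:
        lines.append(f"[{p}]")
        for i, v in enumerate(views):
            # Views lacking this perspective get an explicit placeholder.
            text = v.descriptions.get(p, "(no description)")
            lines.append(f"  View {i} (heading {v.heading_deg} deg): {text}")
        lines.append("")
    lines.append("Compare the views across perspectives, then answer with one view index.")
    return "\n".join(lines)


if __name__ == "__main__":
    demo_views = [
        CandidateView(0, {"objects": "a sofa and a lamp"}),
        CandidateView(90, {"objects": "a staircase", "layout": "narrow hallway"}),
    ]
    print(build_comparison_prompt("Go up the stairs.", demo_views))
```

Grouping by perspective rather than by view puts comparable descriptions next to each other, which is one plausible way to encourage comparison across images. Whether the paper uses a layout like this is not stated in the abstract.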

Related tags

VLN, textual descriptions, analogical reasoning, navigation performance, R2R dataset
