NVIDIA Developer · 10 hours ago
Advancing Robot Manipulation: Fusing Perception and Planning

 

This article looks at perception-driven task and motion planning (TAMP) and GPU-accelerated TAMP for long-horizon robot manipulation. By translating visual and language information into executable subgoals and constraints, robots can update their plans and adapt to dynamic environments. The article introduces the OWL-TAMP, VLM-TAMP, and NOD-TAMP frameworks, which focus respectively on executing natural-language instructions, planning multi-step tasks in visually rich environments, and generalizing across objects through learned object representations. In addition, cuTAMP uses GPU parallelization to dramatically accelerate planning, while Fail2Progress improves manipulation skills by learning from failures, strengthening a robot's ability to handle complex and uncertain scenarios.

🤖 **Perception-driven TAMP:** Traditional TAMP systems built on static models perform poorly in new environments. Coupling perception with TAMP lets robots update plans mid-execution and adapt to dynamic scenes, using vision and language to turn pixels into subgoals, affordances, and differentiable constraints for more flexible manipulation.

💡 **Advanced frameworks for long-horizon manipulation:** OWL-TAMP fuses vision-language models (VLMs) with TAMP so robots can carry out complex, long-horizon manipulation described in natural language. VLM-TAMP plans multi-step tasks in visually rich environments, combining visual and language context to handle ambiguous information and markedly improving performance on complex manipulation. NOD-TAMP uses neural object descriptors (NODs) to generalize across object types, overcoming the limited generalization of traditional TAMP.

🚀 **GPU acceleration and learning from failure:** cuTAMP uses GPU parallelization to dramatically speed up planning, cutting computations that once took minutes or even hours down to seconds and making long-horizon problems tractable in practice. The Fail2Progress framework improves manipulation skills by learning from the robot's own failures, using Stein variational inference to generate targeted synthetic datasets that reduce repeated failures and improve robustness on long-horizon tasks.

Traditional task and motion planning (TAMP) systems for robot manipulation use cases operate on static models that often fail in new environments. Integrating perception with manipulation is a solution to this challenge, enabling robots to update plans mid-execution and adapt to dynamic scenarios.

In this edition of the NVIDIA Robotics Research and Development Digest (R²D²), we explore the use of perception-based TAMP and GPU-accelerated TAMP for long-horizon manipulation. We’ll also learn about a framework for improving robot manipulation skills. And we’ll show how vision and language can be used to translate pixels into subgoals, affordances, and differentiable constraints.

    Subgoals are smaller intermediate objectives that guide the robot step-by-step toward the final goal.

    Affordances describe the actions that an object or environment allows a robot to perform, based on its properties and context. For instance, a handle affords “grasping,” a button affords “pressing,” and a cup affords “pouring.”

    Differentiable constraints in robot-motion planning ensure that the robot’s movements satisfy physical limits (like joint angles, collision avoidance, or end-effector positions) while still being adjustable via learning. Because they’re differentiable, GPUs can compute and refine them efficiently during training or real-time planning.
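As a concrete illustration, here is a minimal sketch, not code from any of the systems discussed below, of what differentiable constraint costs can look like in PyTorch: joint-limit and goal terms are written as smooth penalties so a GPU can batch-evaluate and refine many candidate configurations by gradient descent. The forward kinematics here is a placeholder.

```python
# Minimal sketch (illustrative only): differentiable constraint costs in PyTorch.
import torch

def joint_limit_cost(q, q_min, q_max):
    """Penalty that grows smoothly as joints leave their allowed range."""
    below = torch.clamp(q_min - q, min=0.0)
    above = torch.clamp(q - q_max, min=0.0)
    return (below**2 + above**2).sum(dim=-1)

def goal_cost(ee_pos, target_pos):
    """Squared distance between the end effector and a subgoal position."""
    return ((ee_pos - target_pos) ** 2).sum(dim=-1)

# Refine a batch of candidate joint configurations toward a target.
q = torch.randn(1024, 7, requires_grad=True)        # 1,024 candidates for a 7-DoF arm
q_min, q_max = -2.9 * torch.ones(7), 2.9 * torch.ones(7)
target = torch.tensor([0.4, 0.1, 0.5])

opt = torch.optim.Adam([q], lr=0.05)
for _ in range(100):
    ee = q[:, :3]                                    # placeholder for forward kinematics
    loss = (joint_limit_cost(q, q_min, q_max) + goal_cost(ee, target)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because every term is differentiable, all 1,024 candidates are refined in a single batched backward pass, which is exactly the property that makes these constraints GPU-friendly.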

How task and motion planning transforms vision and language into robot action

TAMP involves deciding what a robot should do and how it should move to do it. Doing this requires combining high-level task-planning (what task to do) and low-level motion-planning (how to move to perform the task). 
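To make that split concrete, here is an illustrative sketch of the two-level loop. The function and class names are hypothetical, not an actual planner API: a task planner proposes ordered action sequences (plan skeletons), and a motion planner tries to realize each action, backtracking when an action cannot be grounded in feasible motion.

```python
# Hypothetical sketch of the two levels of TAMP (names are illustrative).
def plan(task_goal, world_state, task_planner, motion_planner):
    # High level: "what to do" -- candidate sequences of symbolic actions.
    for skeleton in task_planner.candidate_skeletons(task_goal, world_state):
        trajectories = []
        feasible = True
        # Low level: "how to move" -- solve continuous motion for each action.
        for action in skeleton:
            traj = motion_planner.solve(action, world_state)
            if traj is None:          # infeasible: backtrack to the next skeleton
                feasible = False
                break
            trajectories.append(traj)
            world_state = action.apply(world_state)
        if feasible:
            return skeleton, trajectories
    return None  # no skeleton could be grounded into collision-free motion
```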

Modern robots can use both vision and language (like pictures and instructions) to break down complex tasks into smaller steps, called subgoals. These subgoals help the robot understand what needs to happen next, what objects to interact with, and how to move safely.

This process uses advanced models to turn images and written instructions into clear plans the robot can follow in real-world situations. Long-horizon manipulation requires structured intentions that can be satisfied by the planner. Let’s see how OWL-TAMP, VLM-TAMP, and NOD-TAMP help address this:

    OWL-TAMP: This workflow enables robots to execute complex, long-horizon manipulation tasks described in natural language, such as “put the orange on the table.” OWL-TAMP is a hybrid workflow that integrates vision-language models (VLMs) with TAMP, where the VLM generates constraints that describe how to ground open-world language (OWL) instructions in robot action space. These constraints are incorporated into the TAMP system, which ensures physical feasibility and correctness through simulation feedback. (A rough sketch of this grounding idea appears after this list.)

    VLM-TAMP: This is a workflow for planning multi-step tasks for robots in visually rich environments. VLM-TAMP combines VLMs with traditional TAMP to generate and refine action plans in real-world scenes. It uses a VLM to interpret images and task descriptions (like “make chicken soup”) and generate high-level plans for the robot. These plans are then iteratively refined through simulation and motion planning to check feasibility. This hybrid model outperforms both the VLM-only and TAMP-only baselines on long-horizon kitchen tasks that require 30 to 50 sequential actions and involve up to 21 different objects. This workflow enables robots to handle ambiguous information by using both visual and language context, resulting in improved performance in complex manipulation tasks.

Figure 1. VLM-TAMP overcomes the pitfalls of using TAMP alone or a VLM for task planning followed by motion planning when solving long-horizon robot manipulation problems.
    NOD-TAMP: Traditional TAMP frameworks often struggle to generalize on long-horizon manipulation tasks because they rely on explicit geometric models and object representations. NOD-TAMP overcomes this by using neural object descriptors (NODs) to help generalize object types. NODs are learned representations derived from 3D point clouds that encode spatial and relational properties of objects. This enables robots to interact with new objects and helps the planner adapt actions dynamically.
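To illustrate the kind of grounding OWL-TAMP describes, here is a rough, hypothetical sketch; it is not the OWL-TAMP interface. The VLM's output for “put the orange on the table” is assumed to be a target region in world coordinates, which is then expressed as a differentiable goal constraint that a TAMP solver could satisfy alongside its usual kinematic and collision terms.

```python
# Hypothetical sketch: turning an assumed VLM-grounded region into a differentiable constraint.
import torch

def region_constraint(obj_pos, region_center, region_half_extent):
    """Zero inside an axis-aligned target region, positive (and differentiable) outside."""
    overshoot = torch.clamp((obj_pos - region_center).abs() - region_half_extent, min=0.0)
    return (overshoot**2).sum(dim=-1)

# Assumed VLM output for "put the orange on the table":
# the table-top region the orange should end up in, in world coordinates.
table_center = torch.tensor([0.6, 0.0, 0.75])
table_half_extent = torch.tensor([0.4, 0.3, 0.02])

orange_final_pos = torch.tensor([0.55, 0.05, 0.76])
print(region_constraint(orange_final_pos, table_center, table_half_extent))  # ~0 => satisfied
```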

How cuTAMP accelerates robot planning with GPU parallelization

Classical TAMP first determines the outline of actions for a task (called a plan skeleton) and then solves for the continuous variables. This second step is usually the bottleneck in manipulation systems, and it is the step that cuTAMP vastly accelerates. For a given skeleton, cuTAMP samples thousands of seeds (particles) and then runs differentiable batch optimization on the GPU to satisfy the various constraints (like inverse kinematics, collisions, stability, and goal costs).
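The following is a rough sketch of that idea, not cuTAMP's actual code: sample thousands of particles for a fixed skeleton, refine them all in one batched, differentiable optimization on the GPU, and keep the lowest-cost particle. The cost function here is a stand-in for the real constraint terms.

```python
# Rough sketch of vectorized constraint satisfaction (illustrative, not cuTAMP code).
import torch

def total_cost(particles):
    """Placeholder for the summed differentiable constraint costs (IK, collisions, goals)."""
    return (particles**2).sum(dim=-1)   # stand-in cost; real costs come from the skeleton

device = "cuda" if torch.cuda.is_available() else "cpu"
particles = torch.randn(4096, 12, device=device, requires_grad=True)  # 4,096 sampled seeds

opt = torch.optim.Adam([particles], lr=0.1)
for _ in range(200):
    cost = total_cost(particles)
    opt.zero_grad()
    cost.sum().backward()               # gradients for all particles in one backward pass
    opt.step()

with torch.no_grad():
    best = particles[total_cost(particles).argmin()]  # lowest-cost particle grounds the skeleton
```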

If a skeleton is not feasible, the algorithm backtracks to try another. If it is, the algorithm returns a plan, often within a matter of seconds for constrained packing and stacking tasks. This means that robots can find solutions for packing, stacking, or manipulating many objects in seconds instead of minutes or hours.

This “vectorized satisfaction” is the essence of making long-horizon problem-solving feasible in real-world applications.

How robots learn from failures using Stein variational inference

Long-horizon manipulation models can fail in novel conditions not seen during training. Fail2Progress is a framework for improving manipulation by enabling robots to learn from their own failures. This framework integrates failures into skill models through data-driven correction and simulation-based refinement. Fail2Progress uses Stein variational inference to generate targeted synthetic datasets similar to observed failures.

These generated datasets can then be used to fine-tune and re-deploy a skill-effect model, enabling fewer repeats of the same failure on long-horizon tasks.
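For intuition, here is a minimal, generic Stein variational gradient descent (SVGD) update in Python; it is not the Fail2Progress implementation, and the target distribution is a stand-in. Particles are pushed toward high-probability regions of a target centered near an observed failure mode, while a kernel repulsion term keeps them spread out, which is the mechanism that yields diverse yet targeted synthetic samples.

```python
# Generic SVGD sketch (illustrative only): particles approximate a target distribution.
import numpy as np

def rbf_kernel(x, h=1.0):
    diff = x[:, None, :] - x[None, :, :]          # (n, n, d) pairwise differences x_j - x_i
    sq = (diff**2).sum(-1)                        # (n, n) squared distances
    k = np.exp(-sq / (2 * h**2))                  # kernel matrix k(x_j, x_i)
    grad_k = -diff / (h**2) * k[..., None]        # gradient of k(x_j, x_i) w.r.t. x_j
    return k, grad_k

def grad_log_p(x, mu=np.zeros(2)):
    """Score of a stand-in target: a unit Gaussian centered near an observed failure mode."""
    return -(x - mu)

def svgd_step(x, step=0.1):
    k, grad_k = rbf_kernel(x)
    # phi(x_i) = mean_j [ k(x_j, x_i) * grad_log_p(x_j) + grad_{x_j} k(x_j, x_i) ]
    phi = (k[..., None] * grad_log_p(x)[:, None, :] + grad_k).mean(axis=0)
    return x + step * phi

particles = np.random.randn(64, 2) * 3.0          # initial synthetic candidates
for _ in range(200):
    particles = svgd_step(particles)              # attraction to the target + mutual repulsion
```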

Getting started

In this blog, we talked about perception-based TAMP, GPU-accelerated TAMP, and a simulation-based refinement framework for robot manipulation. We saw common challenges in traditional TAMP and how these research efforts aim to solve them.

Check out the following resources to learn more:

This post is part of our NVIDIA Robotics Research and Development Digest (R2D2) to give developers deeper insight into the latest breakthroughs from NVIDIA Research across physical AI and robotics applications.

Stay up to date by subscribing to the newsletter and following NVIDIA Robotics on YouTube, Discord, and developer forums. To start your robotics journey, enroll in free NVIDIA Robotics Fundamentals courses.

Acknowledgments

For their contributions to the research mentioned in this post, thanks to Ankit Goyal, Caelan Garrett, Tucker Hermans, Yixuan Huang, Leslie Pack Kaelbling, Nishanth Kumar, Tomas Lozano-Perez, Ajay Mandlekar, Fabio Ramos, Shuo Cheng, Mohanraj Devendran Shanthi, William Shen, Danfei Xu, Zhutian Yang, Novella Alvina, Dieter Fox, and Xiaohan Zhang.
