NVIDIA Developer, September 26
NVIDIA Introduces Three Robot Learning Innovations

This article covers NVIDIA's latest advances in robot learning, focusing on three innovations presented at CoRL 2025: NeRD (Neural Robot Dynamics), Dexplore, and VT-Refine. NeRD improves the accuracy and generalization of robot simulation through learned dynamics models and supports real-world fine-tuning. Dexplore achieves human-level dexterous manipulation by treating motion-capture demonstrations as adaptive guidance. VT-Refine combines visual and tactile sensing in a novel real-to-sim-to-real training framework to tackle precise bimanual assembly. Together, these techniques give robotics researchers powerful tools and workflows, accelerating the journey of robots from the lab to the real world.

✨ **NeRD (Neural Robot Dynamics) improves simulation accuracy and generalization**: By learning robot dynamics models, NeRD markedly improves the fidelity and accuracy of simulation, accurately predicting the future states of robotic systems under contact constraints. The model generalizes well across tasks and can be fine-tuned with real-world data, effectively narrowing the sim-to-real gap and providing a more reliable foundation for robot training.

🤲 **Dexplore achieves human-level dexterous manipulation**: Dexplore treats human motion-capture demonstrations as "soft guidance" rather than strict ground truth. This lets robots learn control policies directly from demonstration data while autonomously exploring and optimizing motions suited to their own embodiment, yielding near-human dexterity when handling complex objects and performing fine manipulation.

🤝 **VT-Refine tackles precise bimanual assembly**: VT-Refine proposes an innovative real-to-sim-to-real training framework that combines visual and tactile sensing, specifically targeting the challenge of coordinated two-arm precision assembly. The framework pretrains on a small number of real-world demonstrations, fine-tunes with reinforcement learning in a highly parallelized simulation environment, and finally deploys to the real robot, significantly improving task success rates and robustness.

While today’s robots excel in controlled settings, they still struggle with the unpredictability, dexterity, and nuanced interactions required for real-world tasks—from assembling delicate components to manipulating everyday objects with human-like precision.

Robot learning has emerged as the key to bridging this gap between laboratory demonstrations and real-world deployment. Yet traditional approaches face fundamental limitations:

    Classical simulators can’t capture the full complexity of modern robotic systems.
    Human demonstrations are difficult to translate across different robot embodiments.
    The intricate coordination of vision and touch that humans take for granted remains elusive for machines.

This edition of NVIDIA Robotics Research and Development Digest (R²D²) explores three groundbreaking neural innovations from NVIDIA Research that are transforming how robots learn and adapt, featured at CoRL 2025:

    NeRD (Neural Robot Dynamics): Enhances simulation with learned dynamics models that generalize across tasks while enabling real-world fine-tuning.
    Dexplore: Unlocks human-level dexterity by treating motion-captured demonstrations as adaptive guidance.
    VT-Refine: Combines vision and tactile sensing to master precise bimanual assembly tasks through novel real-to-sim-to-real training.

Together, these advances provide developers with techniques, libraries, and workflows to advance research. 

Teaching robots through neural simulation

Simulation plays a key role in the robotics development workflow. Robots can learn to perform tasks robustly in simulation, because parameters and properties like mass and friction can be randomized during training, as sketched below. However, traditional simulators struggle to capture the complexity of modern robots, which often have high degrees of freedom and intricate mechanisms. Neural models can help with this challenge, as they can efficiently predict complex dynamics and adapt to real-world data.
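
As a concrete illustration, here is a minimal sketch of the kind of per-episode randomization mentioned above. The `sim` object and its setters are hypothetical placeholders, not a specific simulator's API:

```python
import random

def randomize_physics(sim, num_bodies):
    """Perturb mass and friction each episode so a policy trained in
    simulation does not overfit to one set of physical parameters."""
    for body in range(num_bodies):
        # +/-20% mass perturbation around the nominal value (assumed ranges)
        sim.set_mass(body, sim.nominal_mass(body) * random.uniform(0.8, 1.2))
        sim.set_friction(body, random.uniform(0.5, 1.5))

# Called at the start of every training episode, e.g.:
# randomize_physics(sim, num_bodies=12)
```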

NeRD, for example, is a learned dynamics model used for predicting future states of a specific robot (or articulated rigid-body system) under contact constraints. It can replace low-level dynamics and contact solvers in an analytical simulator, enabling a hybrid simulation prediction framework. 
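
A minimal sketch of what such a hybrid step might look like, assuming a learned `neural_dynamics` model that conditions on a short history of states and actions (all names here are illustrative, not the NeRD API):

```python
def hybrid_step(sim, neural_dynamics, state, action, history, max_context=16):
    """Advance one timestep: the analytical simulator still owns kinematics
    and scene queries, while the learned model stands in for the low-level
    dynamics and contact solver."""
    history.append((state, action))
    history[:] = history[-max_context:]            # bounded context window
    next_state = neural_dynamics.predict(history)  # learned dynamics + contact
    sim.set_state(next_state)                      # keep the simulator in sync
    return next_state
```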

Figure 1. NeRD can efficiently predict complex dynamics and adapt to real-world data

NeRD uses a robot-centric state representation that enforces spatial invariance, which boosts training and data efficiency and greatly improves generalization. NeRD integrates easily into existing articulated rigid-body simulation frameworks: it has been validated through integration with NVIDIA Warp, and it is planned to serve as one of the many solvers in the Newton Physics Engine.
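
For intuition, a spatially invariant, robot-centric representation might express quantities in the robot's base frame rather than the world frame, so the model never sees absolute world coordinates. This is a sketch of the idea, not the paper's exact formulation:

```python
import numpy as np

def to_robot_frame(world_point, base_pos, base_rot):
    """Express a world-frame point in the robot's base frame.
    base_rot is the 3x3 base-to-world rotation matrix."""
    return base_rot.T @ (np.asarray(world_point) - np.asarray(base_pos))
```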

To train a NeRD model for a given robot, 100K random trajectories of 100 timesteps each are collected as training data. NeRD is modeled using a lightweight implementation of the GPT-2 Transformer, and models were trained for six different robotic systems.
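
Using the figures quoted above, data collection might look like the following sketch; the simulator interface and random-action source are assumptions:

```python
NUM_TRAJECTORIES = 100_000   # from the figures quoted above
HORIZON = 100                # timesteps per trajectory

def collect_trajectories(sim, sample_random_action):
    """Roll out random actions and record (state, action, next_state)
    tuples, the supervision a learned dynamics model needs."""
    data = []
    for _ in range(NUM_TRAJECTORIES):
        state = sim.reset()
        for _ in range(HORIZON):
            action = sample_random_action(state)
            next_state = sim.step(action)
            data.append((state, action, next_state))
            state = next_state
    return data
```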

NeRD models are stable and accurate over thousands of timesteps, achieving less than 0.1% error in accumulated reward over a 1,000-step policy evaluation for an ANYmal quadruped robot. The approach has also shown zero-shot sim-to-real transfer: a Franka reach policy learned in the NeRD-integrated simulator transferred directly to the real robot. NeRD can also be fine-tuned on real-world data to further close the sim-to-real gap.

Neural models like NeRD will speed up robotics research, enabling developers to accurately simulate complex full-body behaviors alongside classical simulation techniques.

Figure 2. Learned robot policies execute nearly identically in the NeRD-integrated simulator and the classical simulator

Learning dexterous skills from human motion

Teaching robot hands human-level dexterity has historically been a difficult problem. Human hands possess an unparalleled combination of kinematic complexity, compliance, and rich tactile sensing. Robotic hands, by contrast, have fewer degrees of freedom, limited actuation, and coarser sensing and control. This makes it difficult for robots to learn dexterous manipulation from humans.

Hand-object motion-capture (MoCap) repositories provide abundant contact-rich human demonstrations, but they cannot be used directly for robot policy learning. Existing workflows chain three major components: retargeting, tracking, and residual correction, and errors compound across the stages.

This research introduces Reference-Scoped Exploration (RSE), a unified, single-loop optimization. It integrates retargeting and tracking to train a scalable robot control policy directly from MoCap data. Demonstrations are not treated as “strict” ground truth but are instead viewed as soft guidance. 

This preserves the intent of the demonstration and enables the robot to autonomously discover motions compatible with its own embodiment. 
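
One way to picture "soft guidance" is a reward term that decays smoothly with distance from the MoCap reference instead of hard-constraining the robot to it. The exact objective in the paper may differ, so treat this as a sketch of the idea only:

```python
import numpy as np

def soft_guidance_reward(robot_state, reference_state, task_reward, sigma=0.1):
    """Reward is shaped toward the demonstration but never dictated by it:
    near the reference the guidance term approaches 1, far away it fades,
    leaving the policy free to find embodiment-compatible motions."""
    deviation = np.linalg.norm(np.asarray(robot_state) - np.asarray(reference_state))
    guidance = np.exp(-(deviation / sigma) ** 2)
    return task_reward + guidance
```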

Figure 3. Dexterous manipulation from human demonstrations learned by first training a state-based imitation control policy with RSE to explore robot-specific manipulation strategies

The second part of the workflow is a vision-based generative control policy, distilled from the state-based imitation control policy. This enables the robotic hand to manipulate an object from partial observations, a single-view depth image, together with sparse, user-defined goals.

During training, the policy's objective is to have the robot hand follow the given trajectory, enabling diverse object manipulation skills like grabbing a banana, cellphone, cup, or binoculars. The model comprises an encoder, a prior network, and a decoder policy. At inference time, the encoder is omitted and the latent embedding is sampled directly from the learned prior, producing a generative control policy capable of effective goal-conditioned dexterous manipulation from only partial observations.
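
A minimal PyTorch-style sketch of that encoder/prior/decoder structure; layer sizes and the exact conditioning inputs are assumptions:

```python
import torch
import torch.nn as nn

class GenerativePolicy(nn.Module):
    """Encoder (training only), learned prior, and decoder policy."""
    def __init__(self, obs_dim, priv_dim, latent_dim, act_dim):
        super().__init__()
        self.encoder = nn.Linear(obs_dim + priv_dim, 2 * latent_dim)  # training only
        self.prior = nn.Linear(obs_dim, 2 * latent_dim)
        self.decoder = nn.Linear(obs_dim + latent_dim, act_dim)

    def act(self, obs):
        """Inference path: the encoder is omitted; the latent is sampled
        directly from the learned prior."""
        mu, log_std = self.prior(obs).chunk(2, dim=-1)
        z = mu + log_std.exp() * torch.randn_like(mu)
        return self.decoder(torch.cat([obs, z], dim=-1))
```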

This approach achieves almost 20% higher success rates with the Inspire hand and consistently outperforms each baseline method on both the Inspire and Allegro robot hands. The state-based policy is evaluated on its ability to imitate human demonstrations and generalize to unseen scenarios, while the vision-based policy is evaluated on manipulation in simulation and successful transfer to the real world.

Combining vision and touch for precise bimanual assembly

Humans are good at manipulation and bimanual assembly tasks because they rely on both visual and tactile feedback. Envision performing a plug-and-socket assembly with both hands. First, you visually identify and grasp the components. Then, while mating the parts, tactile feedback becomes essential, since occlusions make visual feedback alone insufficient for completing the task.

Behavioral cloning with diffusion policies is useful here, but it suffers from the scarcity of real-world demonstrations and from data collection interfaces that capture little or no tactile feedback.

To address this data problem, VT-Refine develops a novel real-to-sim-to-real framework that combines simulation, vision, and touch for bimanual assembly tasks (Figure 4). The high-level steps are as follows (see the sketch after Figure 4):

    Collecting a small number of real-world demonstrations (30 episodes, for example) to pretrain a bimanual visuo-tactile diffusion policy.
    Fine-tuning this policy in its digital twin, a parallelized simulation environment, using reinforcement learning (RL).
    Deploying the policy back to the real world.
Figure 4. VT-Refine is a novel visuo-tactile policy learning framework for precise, contact-rich bimanual assembly tasks
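
In code form, the three stages reduce to a simple pipeline. Every function below is a stub standing in for a full training run, and the parallel-environment count is illustrative:

```python
def collect_real_demos(num_episodes=30):
    """Stage 1: a small set of teleoperated real-world demonstrations."""
    return [f"episode_{i}" for i in range(num_episodes)]  # placeholder data

def pretrain_diffusion_policy(demos):
    """Pretrain a bimanual visuo-tactile diffusion policy on the demos."""
    return {"pretrained_on": len(demos)}                  # placeholder policy

def finetune_with_rl(policy, num_parallel_envs=4096):
    """Stage 2: RL fine-tuning in the parallelized digital twin."""
    policy["rl_finetuned"] = True
    return policy

def deploy_to_real_robot(policy):
    """Stage 3: run the fine-tuned policy back on hardware."""
    print("deploying:", policy)

deploy_to_real_robot(finetune_with_rl(pretrain_diffusion_policy(collect_real_demos())))
```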

Tactile sensory input is simulated with TacSL, a GPU-based tactile simulation library integrated with Isaac Lab. TacSL efficiently approximates the softness of tactile sensors in GPU-accelerated simulation, which enables scalable training and better sim-to-real transferability. The observations used for training include the following (bundled in the sketch after the list):

    Point cloud captured by an ego-centric camera
    Point cloud representation of the tactile sensor feedback
    Joint positions from the arms and grippers
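
Bundled together, those observation streams might look like this; the shapes and sensor API are assumptions:

```python
import numpy as np

def build_observation(camera, tactile_pads, robot):
    """Collect the three observation streams used for policy training."""
    return {
        "scene_points": camera.point_cloud(),                      # ego-centric camera
        "tactile_points": np.concatenate(
            [pad.point_cloud() for pad in tactile_pads], axis=0),  # touch as points
        "joint_pos": robot.joint_positions(),                      # arms + grippers
    }
```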

The collected data is then used to pretrain a diffusion policy. For scaled training in simulation, a digital twin of the scene is set up with the vision and tactile sensors. Pretraining on human demonstrations provides a strong prior that guides RL exploration without the need for complex reward engineering, as illustrated below.
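
One simple way such a prior can guide exploration is to perturb the pretrained policy's actions rather than explore from scratch; the noise model here is an assumption, not the paper's method:

```python
import numpy as np

def explore_action(bc_policy, obs, noise_scale=0.05, rng=np.random.default_rng(0)):
    """RL exploration anchored to the behavior-cloned prior: small noise
    around an already-sensible action keeps exploration productive
    without hand-crafted rewards to steer it."""
    base_action = np.asarray(bc_policy(obs))
    return base_action + noise_scale * rng.standard_normal(base_action.shape)
```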

Figure 5. Robot setups with four tactile sensing pads and an ego-centric camera

The RL fine-tuned policy significantly boosts performance on high-precision assembly tasks by introducing the necessary exploration. It improves real-world success rates by approximately 20% for the vision-only variant and 40% for the visuo-tactile variant. There is a slight sim-to-real drop of around 5-10%, which is negligible compared to the more than 30% improvement in success rate gained from RL fine-tuning in simulation.

This work is one of the first successful demonstrations of RL sim-to-real transfer for bimanual visuo-tactile policies using large-scale simulation.

Summary

Advances in robot learning are transforming how robots acquire and transfer complex skills from simulation to the real world. NeRD enables more accurate dynamics prediction, RSE streamlines learning dexterous manipulation from human demonstrations, and VT-Refine combines vision and touch for robust bimanual assembly. Together, these approaches show how scalable, data-driven learning is narrowing the gap between robotic and human capabilities.

This post is part of our NVIDIA Robotics Research and Development Digest (R²D²) to give developers deeper insight into the latest breakthroughs from NVIDIA Research across physical AI and robotics applications.

Learn more about the research being showcased at CoRL and Humanoids, happening September 27–October 2 in Seoul, Korea.

Also, join the 2025 BEHAVIOR Challenge, a robotics benchmark for testing reasoning, locomotion, and manipulation, featuring 50 household tasks and 10,000 tele-operated demonstrations. 

Stay up to date by subscribing to the newsletter and following NVIDIA Robotics on YouTube, Discord, and NVIDIA Developer Forums. To start your robotics journey, enroll in free NVIDIA Robotics Fundamentals courses.

Acknowledgments

For their contributions to the research mentioned in this post, we’d like to thank Arsalan Mousavian, Balakumar Sundaralingam, Binghao Huang, Dieter Fox, Eric Heiden, Iretiayo Akinola, Jie Xu, Liang-Yan Gui, Liuyu Bian, Miles Macklin, Rowland O’Flaherty, Sirui Xu, Wei Yang, Xiaolong Wang, Yashraj Narang, Yunzhu Li, Yu-Wei Chao, Yu-Xiong Wang.
