MarkTechPost@AI, August 12
Genie Envisioner: A Unified Video-Generative Platform for Scalable, Instruction-Driven Robotic Manipulation

 

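Genie Envisioner is a unified platform for robotic manipulation that integrates policy learning, simulation, and evaluation within a video-generative framework. Its core, GE-Base, is a large-scale, instruction-driven video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world tasks. GE-Act maps these representations to precise action trajectories, while GE-Sim provides fast, action-conditioned video simulation. The EWMBench benchmark evaluates visual realism, physical accuracy, and instruction-action alignment. Trained on over one million episodes, the platform generalizes across robots and tasks, laying a foundation for scalable, memory-aware, physically grounded embodied intelligence research.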

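✨ Genie Envisioner is a unified platform that addresses the fragmentation of data collection, training, and evaluation in robotic manipulation. It combines policy learning, simulation, and evaluation in a single video-generative framework, improving the scalability and reproducibility of research.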

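🚀 The platform’s core, GE-Base, is a large-scale, multi-view, instruction-driven video diffusion model. It learns the temporal, spatial, and semantic dynamics of real-world robotic manipulation tasks and provides the basis for downstream action generation.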

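💡 GE-Act converts the video representations learned by GE-Base into precise action trajectories for fast, accurate motor control, and it adapts to robot types that do not appear in the training data.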

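🎮 GE-Sim turns GE-Base’s generative capability into a fast, action-conditioned video simulator. It supports closed-loop, video-based policy testing at speeds far beyond real hardware, which makes large-scale policy training and evaluation more efficient.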

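📊 The EWMBench benchmark suite evaluates the whole system on video realism, physical consistency, and alignment between instructions and actions, and its results agree with human quality judgments.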

Embodied AI agents that can perceive, think, and act in the real world mark a key step toward the future of robotics. A central challenge is building scalable, reliable robotic manipulation, the skill of deliberately interacting with and controlling objects through selective contact. While progress spans analytic methods, model-based approaches, and large-scale data-driven learning, most systems still operate in disjoint stages of data collection, training, and evaluation. These stages often require custom setups, manual curation, and task-specific tweaks, creating friction that slows progress, hides failure patterns, and hampers reproducibility. This highlights the need for a unified framework to streamline learning and assessment. 

Robotic manipulation research has progressed from analytical models to neural world models that learn dynamics directly from sensory inputs, in both pixel and latent spaces. Large-scale video generation models can produce realistic visuals but often lack the action conditioning, long-term temporal consistency, and multi-view reasoning needed for control. Vision-language-action models follow instructions but rely on imitation learning, which limits error recovery and planning. Policy evaluation remains challenging because physics simulators require heavy tuning and real-world testing is resource-intensive. Existing evaluation metrics often emphasize visual quality over task success, highlighting the need for benchmarks that better capture real-world manipulation performance.

Genie Envisioner (GE), developed by researchers from the AgiBot Genie Team, NUS LV-Lab, and BUAA, is a unified platform for robotic manipulation that combines policy learning, simulation, and evaluation in a video-generative framework. Its core, GE-Base, is a large-scale, instruction-driven video diffusion model capturing spatial, temporal, and semantic dynamics of real-world tasks. GE-Act maps these representations to precise action trajectories, while GE-Sim offers fast, action-conditioned video-based simulation. The EWMBench benchmark evaluates visual realism, physical accuracy, and instruction-action alignment. Trained on over a million episodes, GE generalizes across robots and tasks, enabling scalable, memory-aware, and physically grounded embodied intelligence research.

GE’s design unfolds in three key parts. GE-Base is a multi-view, instruction-conditioned video diffusion model trained on over 1 million robotic manipulation episodes. It learns latent trajectories that capture how scenes evolve under given commands. Building on that, GE-Act translates these latent video representations into real action signals via a lightweight, flow-matching decoder, offering quick, precise motor control even on robots not in the training data. GE-Sim repurposes GE-Base’s generative power into an action-conditioned neural simulator, enabling closed-loop, video-based rollout at speeds far beyond real hardware. The EWMBench suite then evaluates the system holistically across video realism, physical consistency, and alignment between instructions and resulting actions.
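To make the idea of a flow-matching decoder more concrete, here is a minimal sketch of the sampling side only: start from noise and integrate a velocity field with Euler steps until an action chunk emerges. It is not GE-Act’s implementation. The functions `toy_target`, `toy_velocity`, and `decode_actions`, the scalar `latent`, and the step count are all hypothetical; a real decoder would replace `toy_velocity` with a trained network conditioned on GE-Base’s latent video features. The horizon of 54 matches the trajectory length reported in the article’s evaluation.

```python
import math
import random

HORIZON = 54  # trajectory length reported in the article; all else is illustrative


def toy_target(latent):
    """Hypothetical stand-in for the action chunk a trained decoder would produce."""
    return [math.sin(latent + 0.1 * i) for i in range(HORIZON)]


def toy_velocity(x, t, target):
    """Velocity of a straight-line flow toward a single known target.

    On the path x_t = (1 - t) * x0 + t * x1 the velocity is x1 - x0,
    which equals (x1 - x_t) / (1 - t) when written at the current point.
    A learned network would approximate this field instead (assumption).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def decode_actions(latent, num_steps=8, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(HORIZON)]  # noise sample at t = 0
    target = toy_target(latent)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so 1 - t never reaches zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


actions = decode_actions(latent=0.3)
```

Because the toy field follows straight lines, a handful of Euler steps is enough to land on the target. That is the intuition behind calling such a decoder lightweight, although the number of steps a trained model needs is an empirical question this sketch does not answer.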

In evaluations, Genie Envisioner showed strong real-world and simulated performance across varied robotic manipulation tasks. GE-Act achieved rapid control generation (54-step trajectories in 200 ms) and consistently outperformed leading vision-language-action baselines in both step-wise and end-to-end success rates. It adapted to new robot types, like Agilex Cobot Magic and Dual Franka, with only an hour of task-specific data, excelling in complex deformable object tasks. GE-Sim delivered high-fidelity, action-conditioned video simulations for scalable, closed-loop policy testing. The EWMBench benchmark confirmed GE-Base’s superior temporal alignment, motion consistency, and scene stability over state-of-the-art video models, aligning closely with human quality judgments. 
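The closed-loop policy testing mentioned here can be pictured as a loop in which a policy proposes a chunk of actions, an action-conditioned simulator predicts what happens, and the policy then re-plans from the predicted observation. The sketch below illustrates that loop on a one-dimensional toy problem. `toy_simulator_step`, `toy_policy`, and `closed_loop_rollout` are hypothetical names, and the scalar state stands in for the video frames a neural simulator such as GE-Sim would generate.

```python
def toy_simulator_step(state, action):
    """Hypothetical action-conditioned predictor: next state given the action.

    In a video-based simulator the state would be predicted frames (assumption);
    here it is a single gripper coordinate.
    """
    return state + action


def toy_policy(observation, goal, chunk_size=4, max_step=0.05):
    """Return a chunk of bounded steps that move toward the goal."""
    actions = []
    predicted = observation
    for _ in range(chunk_size):
        step = max(-max_step, min(max_step, goal - predicted))
        actions.append(step)
        predicted += step
    return actions


def closed_loop_rollout(start, goal, tolerance=1e-3, max_policy_calls=20):
    """Alternate policy calls and simulated steps until the goal is reached."""
    state = start
    history = [state]
    for _ in range(max_policy_calls):
        if abs(goal - state) <= tolerance:
            return True, history
        # The policy only ever sees simulated observations, never real hardware.
        for action in toy_policy(state, goal):
            state = toy_simulator_step(state, action)
            history.append(state)
    return abs(goal - state) <= tolerance, history


success, history = closed_loop_rollout(start=0.0, goal=0.5)
```

Since every observation comes from the simulator, many such rollouts can run without occupying a robot. The value of the result still depends on how faithful the simulator’s predictions are.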

In conclusion, Genie Envisioner is a unified, scalable platform for dual-arm robotic manipulation that merges policy learning, simulation, and evaluation into one video-generative framework. Its core, GE-Base, is an instruction-guided video diffusion model capturing the spatial, temporal, and semantic patterns of real-world robot interactions. GE-Act builds on this by converting these representations into precise, adaptable action plans, even on new robot types with minimal retraining. GE-Sim offers high-fidelity, action-conditioned simulation for closed-loop policy refinement, while EWMBench provides rigorous evaluation of realism, alignment, and consistency. Extensive real-world tests highlight the system’s superior performance, making it a strong foundation for general-purpose, instruction-driven embodied intelligence. 



The post Genie Envisioner: A Unified Video-Generative Platform for Scalable, Instruction-Driven Robotic Manipulation appeared first on MarkTechPost.
