MarkTechPost@AI Nov 06, 10:16
GEN-θ: Learning Physical Skills Directly from Real Robot Data

Generalist AI has introduced GEN-θ, a family of embodied foundation models trained directly on real, high-fidelity physical interaction data rather than on simulation or internet video. The system aims to establish scaling laws for robotics analogous to those of large language models, but grounded in continuous sensor-action streams from real robots operating in homes, warehouses and workplaces. At the core of GEN-θ is "Harmonic Reasoning," which lets the model think and act in real time over asynchronous, continuous time streams. The model shows cross-embodiment capability across different robot types and, once it reaches a certain scale (around 7 billion parameters), exhibits a "phase transition" in intelligence that greatly improves learning efficiency and downstream task adaptability. The research also reveals a power-law relationship between pre-training data volume and downstream performance, as well as the importance of the data mixture for model behavior.

🤖 GEN-θ is a new kind of embodied foundation model that abandons reliance on simulation and learns directly from high-fidelity physical interaction data collected by real-world robots. This approach lets it understand and adapt to the messy physics of the real world, building generalist AI capable of complex physical tasks.

🧠 The core innovation, "Harmonic Reasoning," lets the model sense and act simultaneously over asynchronous, continuous time streams. This is essential for robots that must respond and decide in real time: the physical world never stops evolving, so the model must react immediately in dynamic environments without a complex System 1-System 2 architecture.

📈 GEN-θ shows a striking "phase transition" in intelligence as model scale grows. At roughly 7 billion parameters, learning ability takes a qualitative leap, letting the model absorb vast amounts of physical interaction data, while smaller models tend to "ossify" and stop absorbing new information. Larger (7B+) models need only minimal fine-tuning after pre-training to adapt to downstream tasks.

📊 The research identifies "scaling laws" for robotics: a power-law relationship between the amount of pre-training data and downstream task performance. This lets researchers predict how much pre-training data is needed to reach a target performance level, or trade additional pre-training data against downstream labeled-data requirements, guiding data collection and training strategy.

🌍 GEN-θ is built on a massive dataset: more than 270,000 hours of real-world manipulation trajectories processed so far, growing by about 10,000 hours per week. To support training at this scale, the Generalist AI team built custom hardware, data loaders and network infrastructure capable of absorbing 6.85 years of real-world manipulation experience per day of training.

How do you build a single model that can learn physical skills from chaotic real world robot data without relying on simulation? Generalist AI has unveiled GEN-θ, a family of embodied foundation models trained directly on high fidelity raw physical interaction data instead of internet video or simulation. The system is built to establish scaling laws for robotics in the same way that large language models did for text, but now grounded in continuous sensorimotor streams from real robots operating in homes, warehouses and workplaces.

Harmonic Reasoning, thinking and acting in real time

GEN-θ is introduced as an embodied foundation model architecture that builds on the strengths of vision and language models, and extends them with native support for human level reflexes and physical commonsense. The core feature is Harmonic Reasoning, where the model is trained to think and act at the same time over asynchronous, continuous time streams of sensing and acting tokens.

This design targets a robotics-specific constraint. Language models can simply spend more time thinking before replying, but robots must act while physics continues to evolve. Harmonic Reasoning creates a harmonic interplay between sensing and acting streams so that GEN-θ can scale to very large model sizes without depending on System 1-System 2 architectures or heavy inference-time guidance controllers.
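The interleaving of sensing and acting streams can be sketched in plain Python. This is an illustrative toy, not Generalist AI's implementation: the `asyncio` framing, queue, and names are our assumptions. It shows the key property the article describes, that acting always proceeds on the freshest observation instead of blocking on a completed "think" step.

```python
import asyncio

# Toy sketch of asynchronous sensing/acting streams (our framing, not GEN-theta's).
# The sensor keeps producing observations while the actor emits an action per
# step based on whatever observation is freshest at that moment.

async def sensor_stream(queue, readings):
    for obs in readings:                 # stands in for camera/proprioception tokens
        await queue.put(obs)
        await asyncio.sleep(0)           # yield control: physics keeps evolving

async def act_loop(queue, n_steps, log):
    latest = None
    for _ in range(n_steps):
        while not queue.empty():         # drain to the most recent observation
            latest = queue.get_nowait()
        log.append(f"act_on:{latest}")   # emit an action token immediately
        await asyncio.sleep(0)

async def main():
    queue, log = asyncio.Queue(), []
    await asyncio.gather(
        sensor_stream(queue, ["obs0", "obs1", "obs2"]),
        act_loop(queue, 3, log),
    )
    return log

log = asyncio.run(main())
```

The actor never waits for the sensor to finish: each step consumes whatever has arrived so far, which is the behavioral contrast with a blocking System 1-System 2 pipeline.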

GEN-θ is explicitly cross-embodiment. The same architecture runs on different robots and has been tested on 6-DoF, 7-DoF and 16+-DoF semi-humanoid systems, which lets a single pre-training run serve heterogeneous fleets.
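One common way a single model head can serve embodiments with different action dimensions is to pad every robot's action vector into a shared maximum-DoF space with a validity mask. The DoF counts below come from the article; the padding scheme itself is an assumption for illustration, not a description of GEN-θ's internals.

```python
# Hypothetical shared action space for heterogeneous fleets.
# 6-DoF, 7-DoF and 16-DoF embodiments all map into one MAX_DOF vector;
# the mask tells the policy head which slots are real for this robot.

MAX_DOF = 16

def to_shared_action_space(action):
    """Pad an embodiment-specific action list to the shared MAX_DOF vector."""
    pad = MAX_DOF - len(action)
    padded = list(action) + [0.0] * pad
    mask = [True] * len(action) + [False] * pad
    return padded, mask

arm6, mask6 = to_shared_action_space([0.1] * 6)        # 6-DoF arm
humanoid16, mask16 = to_shared_action_space([0.1] * 16)  # 16-DoF semi-humanoid
```

The inverse mapping (dropping masked slots) recovers each robot's native command, so one pre-trained checkpoint can drive every embodiment in the fleet.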

Surpassing the intelligence threshold in robotics

The Generalist AI team reports a phase transition in capability as GEN-θ scales in a high-data regime. Their scaling experiments also show that the models must be large enough to absorb vast amounts of physical interaction data.

The scaling behavior is shown in a figure from the announcement (source: https://generalistai.com/blog/nov-04-2025-GEN-0).

The figure plots next-action validation prediction error on a completely withheld long-horizon downstream task across model sizes and pre-training compute. 1B models plateau early while 6B and 7B models continue to improve as pre-training increases. The research team connects this phase transition to Moravec's Paradox, arguing that physical commonsense and dexterity appear to require higher compute thresholds than abstract language reasoning, and that GEN-θ is operating beyond that activation point.

The Generalist AI team states that GEN-θ has been scaled to 10B+ model sizes, and that larger variants adapt to new tasks with increasingly less post-training.

Scaling laws for robotics

Another focus of this research is scaling laws that relate pre-training data and compute to downstream post-training performance. The research team samples checkpoints from GEN-θ training runs on different subsets of the pre-training dataset, then post-trains those checkpoints on multi-task, language-conditioned data. This supervised fine-tuning stage spans 16 task sets, covering dexterity tasks such as building Lego, industry workflows such as fast-food packing, and generalization tasks that include "anything-style" instructions.

Across various tasks, more pre-training improves validation loss and next-action prediction error during post-training. At sufficient model scale, the relationship between pre-training dataset size and downstream validation error is well described by a power law of the form:

L(D) = (D_c / D)^{α_D}

where D is the number of action trajectories in pre-training and L(D) is validation error on a downstream task. This formula lets robotics teams estimate how much pre-training data is needed to reach a target next-action prediction error, or how much downstream labeled data can be traded for additional pre-training.
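The power law above is easy to use in both directions: predicting error from a data budget, or inverting it to estimate the data budget for a target error. A minimal sketch, where the constants `D_c` and `alpha_D` are made-up illustrative values, not GEN-θ's fitted coefficients:

```python
# Power law L(D) = (D_c / D) ** alpha_D from the article.
# D_c and alpha_D below are hypothetical demo values, not published fits.

def predicted_error(D, D_c, alpha_D):
    """Downstream validation error predicted from D pre-training trajectories."""
    return (D_c / D) ** alpha_D

def required_data(L_target, D_c, alpha_D):
    """Invert the power law: trajectories needed to reach a target error."""
    return D_c / (L_target ** (1.0 / alpha_D))

D_c, alpha_D = 1e6, 0.25               # hypothetical constants
L_at_10M = predicted_error(1e7, D_c, alpha_D)
D_needed = required_data(L_at_10M, D_c, alpha_D)   # round-trips to 1e7
```

In practice, `D_c` and `alpha_D` would be fitted from checkpoint evaluations like those described above; once fitted, the inverse form answers "how much pre-training data for this error target" directly.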

Data engine and infrastructure at robotics scale

GEN-θ is trained on an in-house dataset of 270,000 hours of real-world manipulation trajectories collected in thousands of homes, warehouses and workplaces worldwide. The data operation currently adds more than 10,000 new hours per week. The Generalist AI team claims that GEN-θ is trained on orders of magnitude more real-world manipulation data than prior large robotics datasets.

To sustain this regime, the research team has built custom hardware, data loaders and network infrastructure, including dedicated internet lines to handle uplink bandwidth from distributed sites. The pipeline uses multi-cloud contracts, custom upload machines and on the order of 10,000 compute cores for continual multimodal processing. The research team reports compressing dozens of petabytes of data and borrowing data-loading techniques from frontier video foundation models, yielding a system capable of absorbing 6.85 years of real-world manipulation experience per day of training.
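A quick back-of-envelope check puts the article's throughput figures in perspective; the numbers come from the text above, the arithmetic is ours:

```python
# Sanity arithmetic on the article's data-engine figures.

HOURS_PER_YEAR = 365 * 24                            # 8,760 hours

# 6.85 years of manipulation experience absorbed per 24h of training
absorbed_per_training_day = 6.85 * HOURS_PER_YEAR    # ~60,006 hours
speedup_over_real_time = absorbed_per_training_day / 24   # ~2,500x real time

# 270,000-hour dataset growing by 10,000 hours/week
dataset_hours, weekly_growth = 270_000, 10_000
weeks_to_double = dataset_hours / weekly_growth      # ~27 weeks at current rate
```

So one training day ingests roughly 2,500 robot-days of experience, and the collection fleet would double the current corpus in about half a year at today's rate.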

How you pre-train GEN-θ matters as much as how big it is

The Generalist AI team runs large ablations over 8 pre-training datasets and 10 long-horizon task sets. They find that different data mixtures, not just more data, produce models with different behaviors across three groups of tasks: dexterity, real-world applications and generalization. Performance is measured using validation mean squared error on next actions and reverse Kullback-Leibler divergence between the model policy and a Gaussian around ground-truth actions.

Low MSE and low reverse KL models are better candidates for supervised fine-tuning. Models with higher MSE but low reverse KL are more multimodal in their action distributions and can be better starting points for reinforcement learning.
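The two metrics and the resulting checkpoint triage can be sketched concretely. The 1-D Gaussian closed form for reverse KL is standard; treating the policy as Gaussian per action dimension, and the selection thresholds, are our simplifications for illustration, not the paper's procedure.

```python
import math

# Checkpoint scoring sketch: next-action MSE plus reverse KL(policy || Gaussian
# around the ground-truth action). Gaussian-policy assumption and thresholds
# are illustrative, not from the GEN-theta ablations.

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def reverse_kl_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for 1-D Gaussians; q = policy, p = ground-truth Gaussian."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)

def candidate_role(model_mse, model_kl, mse_thr=0.1, kl_thr=0.05):
    """Triage rule from the text: low MSE + low reverse KL -> SFT candidate;
    high MSE but low reverse KL (multimodal actions) -> RL candidate."""
    if model_kl <= kl_thr:
        return "SFT" if model_mse <= mse_thr else "RL"
    return "discard"
```

For example, a checkpoint whose policy matches the ground-truth Gaussian exactly has zero reverse KL, and `candidate_role` sorts it toward supervised fine-tuning; a low-KL but high-MSE checkpoint is flagged as a reinforcement-learning starting point instead.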

Key Takeaways

- GEN-θ is an embodied foundation model trained on high-fidelity raw physical interaction data, not simulation or internet video, and it uses Harmonic Reasoning to think and act simultaneously under real-world physics.
- Scaling experiments show an intelligence threshold around 7B parameters, where smaller models ossify under high data load and larger models keep improving with more pre-training.
- GEN-θ exhibits clear scaling laws: downstream post-training performance follows a power law in the amount of pre-training data, which lets teams predict how much data and compute are needed for target error levels.
- The system is trained on more than 270,000 hours of real-world manipulation data, growing by about 10,000 hours per week, supported by custom multi-cloud infrastructure that can absorb 6.85 years of experience per training day.
- Large-scale ablations over 8 pre-training datasets and 10 long-horizon task sets show that data quality and mixture design, measured with validation MSE and reverse KL, are as important as scale, since different mixtures yield models better suited for supervised fine-tuning or reinforcement learning.

Editorial Comments

GEN-θ positions embodied foundation models as a serious attempt to bring scaling laws to robotics, using Harmonic Reasoning, large scale multimodal pre-training and explicit analysis of data mixtures. The research shows that 7B+ models, trained on 270,000 hours of real world manipulation data with 10,000 hours added weekly, can cross an intelligence threshold where more physical interaction data predictably improves downstream performance across dexterity, applications and generalization tasks.



The post Generalist AI Introduces GEN-θ: A New Class of Embodied Foundation Models Built for Multimodal Training Directly on High-Fidelity Raw Physical Interaction appeared first on MarkTechPost.
