MarkTechPost@AI, October 24, 00:44
UltraCUA: A Hybrid Model Bridging General-Purpose GUI Agents and API-Based Agents

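Apple researchers propose UltraCUA, a foundation model that builds a hybrid action space in which a computer-use agent (CUA) can alternate between low-level GUI actions and high-level programmatic tool calls. The approach treats tools as first-class actions, wraps multi-step operations into single functions, and falls back to clicks and key presses when no programmatic path is available. UltraCUA builds its tool library with an automated pipeline and uses a synthetic data engine to generate a large set of verifiable tasks for training. The model improves success rates and reduces step counts on the OSWorld benchmark, and it generalizes to WindowsAgentArena without Windows-specific training.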

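🎯 **Hybrid action space for agent interaction**: UltraCUA introduces a hybrid action space that lets a computer-use agent (CUA) switch between low-level GUI actions, such as clicking and typing, and programmatic tools that encapsulate multi-step operations. The design targets the errors that accumulate over long GUI action chains and cuts the total number of steps per task, which improves the agent's robustness and efficiency.

🛠️ **Automated construction and expansion of the tool library**: An automated pipeline builds the tool library by extracting keyboard shortcuts and commands from software documentation, integrating open-source implementations, and using coding agents to synthesize new tools. The research team reports 881 tools across 10 desktop domains, including VS Code (135 tools) and LibreOffice Writer (123 tools).

📈 **Training and verification**: UltraCUA uses a dual synthetic engine to generate verifiable tasks and trajectories, which keeps the training data grounded and the rewards stable. The engine produces 17,864 verifiable tasks across 10 domains, including Chrome, LibreOffice, and VS Code. Through two training stages, supervised fine-tuning followed by online reinforcement learning, the model learns when to call a tool and when to act in the GUI.

🚀 **Performance gains and cross-platform generalization**: On OSWorld, UltraCUA improves success rates at both the 7B and 32B scales (22% average relative improvement) while using fewer steps (11% fewer on average). UltraCUA-7B also reaches a 21.7% success rate on WindowsAgentArena without any Windows-specific training, which supports the claim that its hybrid action strategy generalizes across platforms.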

Computer-use agents have been limited to primitives. They click, they type, they scroll. Long action chains amplify grounding errors and waste steps. Apple researchers introduce UltraCUA, a foundation model that builds a hybrid action space in which an agent interleaves low-level GUI actions with high-level programmatic tool calls. The model chooses the cheaper and more reliable move at each step. The approach improves success rates and reduces step counts on OSWorld, and it transfers to WindowsAgentArena without Windows-specific training.

https://arxiv.org/pdf/2510.17790

What does hybrid action change?

Hybrid action treats tools as first-class actions. A tool call encapsulates a multi-step operation as a single function with a clear signature and a docstring. Clicks and key presses remain available when no programmatic path exists. The agent learns to alternate between both modes. The goal is to reduce cascading errors and to cut step counts. The research team positions this as a bridge between GUI-only CUAs and tool-centric agent frameworks.
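
To make the idea concrete, here is a minimal sketch of what a hybrid action space could look like in code. It is not UltraCUA's implementation: the class names `GuiAction` and `ToolCall`, the `execute` dispatcher, and the `set_font_size` tool are all hypothetical and exist only to illustrate one step type covering both modes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Union


@dataclass
class GuiAction:
    """A low-level primitive such as a click or a key press."""
    kind: str                      # e.g. "click", "type", "key"
    payload: dict = field(default_factory=dict)


@dataclass
class ToolCall:
    """A high-level call that stands in for a multi-step GUI sequence."""
    name: str
    kwargs: dict = field(default_factory=dict)


# One step of the agent is either kind of action.
Action = Union[GuiAction, ToolCall]


def set_font_size(size: int) -> str:
    """Set the font size of the current selection (hypothetical tool)."""
    return f"font size set to {size}"


# Hypothetical registry mapping tool names to callables.
TOOLS: Dict[str, Callable[..., str]] = {"set_font_size": set_font_size}


def execute(action: Action) -> str:
    """Run a tool if the step is a tool call, otherwise emit the GUI primitive."""
    if isinstance(action, ToolCall):
        if action.name not in TOOLS:
            raise KeyError(f"unknown tool: {action.name}")
        return TOOLS[action.name](**action.kwargs)
    return f"gui:{action.kind}:{action.payload}"
```

A policy that emits `ToolCall("set_font_size", {"size": 14})` completes in one step what would otherwise take several clicks, while `GuiAction("click", {"x": 10, "y": 20})` remains the fallback.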

Scaled tool acquisition

UltraCUA builds its tool library with an automated pipeline. The system extracts keyboard shortcuts and commands from software documentation. It integrates open-source implementations from agent toolkits. It also uses coding agents to synthesize new tools. Each tool is a callable interface that hides a long GUI sequence. The research team reports coverage across 10 desktop domains with 881 tools. The largest buckets include VS Code with 135 tools and LibreOffice Writer with 123 tools. Thunderbird and GIMP also have deep coverage.
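
As an illustration of a tool being a callable interface that hides a key sequence, the sketch below turns a documented shortcut into a named function and renders its signature and docstring for the agent. The helpers `make_shortcut_tool` and `describe` are hypothetical, and the key sequence is only an example value, not an entry from UltraCUA's library.

```python
import inspect
from typing import Callable, List


def make_shortcut_tool(name: str, keys: List[str], doc: str) -> Callable[[], List[str]]:
    """Wrap a key sequence taken from documentation as a named, documented tool."""

    def tool() -> List[str]:
        # Expand the tool into the primitive key presses it replaces.
        return [f"press:{k}" for k in keys]

    tool.__name__ = name
    tool.__doc__ = doc
    return tool


def describe(tool: Callable) -> str:
    """Render the signature and docstring that would be shown to the agent."""
    return f"{tool.__name__}{inspect.signature(tool)}: {inspect.getdoc(tool)}"


# Example key sequence, used here for illustration only.
open_shortcuts = make_shortcut_tool(
    "open_keyboard_shortcuts",
    ["ctrl+k", "ctrl+s"],
    "Open the Keyboard Shortcuts editor.",
)
print(describe(open_shortcuts))
print(open_shortcuts())
```

Keeping the description short matters in practice, since every exposed tool consumes context in the prompt.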

Verifiable synthetic tasks and trajectories

Training requires grounded supervision and stable rewards. UltraCUA uses a dual synthetic engine. An evaluator-first pipeline composes atomic verifiers for browsers, files, images, and system state, then generates tasks that satisfy those checks. An instruction-first pipeline explores the OS and proposes context-aligned tasks, which are then verified. The result is 17,864 verifiable tasks across 10 domains such as Chrome, LibreOffice, GIMP, VS Code, system, Thunderbird, VLC, and multi-app workflows. Chrome has 2,826 tasks. The LibreOffice suite sums to 5,885 tasks. Multi-app tasks reach 2,113.
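
The evaluator-first idea of composing atomic verifiers can be sketched as follows. This is an assumption-laden illustration: the state is modeled as a plain dictionary snapshot, and `file_exists`, `url_is`, and `all_of` are hypothetical names, not functions from the paper.

```python
from typing import Any, Callable, Dict

# A snapshot of the environment after the agent finishes (assumed shape).
State = Dict[str, Any]
Verifier = Callable[[State], bool]


def file_exists(path: str) -> Verifier:
    """Atomic check on file state."""
    return lambda state: path in state.get("files", [])


def url_is(url: str) -> Verifier:
    """Atomic check on browser state."""
    return lambda state: state.get("browser_url") == url


def all_of(*checks: Verifier) -> Verifier:
    """Compose atomic checks into one task-level verifier."""
    return lambda state: all(check(state) for check in checks)


# A task is then written to match the composed check, not the other way round.
verify_task = all_of(
    file_exists("/home/user/report.pdf"),
    url_is("https://example.com"),
)

final_state = {
    "files": ["/home/user/report.pdf"],
    "browser_url": "https://example.com",
}
print(verify_task(final_state))
```

Because the verifier exists before the task text does, every generated task comes with a programmatic success signal that can later serve as a reward.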

A multi-agent rollout produces successful hybrid trajectories. The planner uses OpenAI o3 for decision making. The grounder uses GTA1-7B for accurate visual localization. The rollout yields about 26.8K successful trajectories that show when to use a tool and when to act in the GUI. These trajectories are the core of the supervised phase.

Training Approach

Training has two stages. Stage 1 is supervised fine-tuning. The models train for 3 epochs at a learning rate of 2e-5 on the successful trajectories. Loss is applied turn-wise to avoid over-weighting early steps. Stage 2 is online reinforcement learning. The models train for 150 steps at a learning rate of 1e-6 on verified tasks that are sampled by difficulty. The policy optimization follows a GRPO variant with clip-higher, and it removes KL regularization and format rewards. The reward combines a sparse task outcome with a tool-use term. Experiments use NVIDIA H100 GPUs. The context is kept near 32K by controlling the number of exposed tools.
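
The reinforcement learning stage can be approximated with a short sketch of a group-relative, asymmetrically clipped objective. Everything numeric here is an assumption: the additive form of the reward, the `tool_bonus` weight, and the clip bounds `eps_low` and `eps_high` are illustrative values, not settings reported in the article. In line with the description above, the sketch has no KL term and no format reward.

```python
from statistics import mean, pstdev
from typing import List


def reward(success: bool, used_tool: bool, tool_bonus: float = 0.3) -> float:
    """Sparse task outcome plus a tool-use term (form and weight are assumed)."""
    return float(success) + tool_bonus * float(used_tool)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float,
                 eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """Per-sample surrogate with a wider upper clip bound ("clip higher")."""
    clipped = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped * advantage)


# Four rollouts of one task: two succeed, one of them with a tool call.
rewards = [reward(True, True), reward(True, False),
           reward(False, True), reward(False, False)]
advantages = group_advantages(rewards)

# Hypothetical probability ratios between the new and old policy.
ratios = [1.5, 1.1, 0.9, 0.5]
objective = mean(clipped_term(r, a) for r, a in zip(ratios, advantages))
print(rewards, objective)
```

The wider upper bound lets the policy raise the probability of rare but successful actions, such as a well-chosen tool call, faster than a symmetric clip would allow.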

Results on OSWorld

UltraCUA improves success at both 7B and 32B scales. Under a 15-step budget, UltraCUA-32B reaches 41.0 percent success. OpenCUA-32B reaches 29.7 percent. The absolute gain is 11.3 points. UltraCUA-7B reaches 28.9 percent. UI-TARS-1.5-7B reaches 23.4 percent, a gap of 5.5 points. Gains persist under a 50-step budget. A per-domain breakdown shows consistent lifts across Chrome, Writer, VS Code, and cross-application tasks. Average steps decrease against baselines. These shifts indicate better action selection rather than only more attempts.
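
The gaps quoted above can be checked with a few lines of arithmetic. The helper name `gains` is hypothetical, and the relative figures it prints are computed only from the pairs in this section, so they need not match averaged relative improvements that are measured against a different reference.

```python
from typing import Tuple


def gains(model: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = model - baseline
    relative = absolute / baseline * 100
    return round(absolute, 1), round(relative, 1)


# Success rates under the 15-step budget, as quoted in the text.
print(gains(41.0, 29.7))  # UltraCUA-32B vs OpenCUA-32B → (11.3, 38.0)
print(gains(28.9, 23.4))  # UltraCUA-7B vs UI-TARS-1.5-7B → (5.5, 23.5)
```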

Cross-platform transfer on WindowsAgentArena

UltraCUA trains only on Ubuntu-based OSWorld data. The model is then evaluated on WindowsAgentArena. UltraCUA-7B reaches 21.7 percent success. This exceeds UI-TARS-1.5-7B at 18.1 percent and a Qwen2 baseline trained with Windows data at 13.5 percent. The result suggests that hybrid action strategies learned on one platform transfer to other platforms. The paper highlights this as zero-shot platform generalization.

Key Takeaways

Editorial Comments

UltraCUA moves computer-use agents from brittle primitive action chains to a hybrid action policy that integrates GUI primitives with programmatic tool calls, which reduces error propagation and step counts. It scales tools via an automated pipeline and pairs them with a synthetic data engine that yields more than 17,000 verifiable tasks, enabling supervised fine-tuning and online reinforcement learning on grounded signals. Reported results include a 22 percent relative improvement on OSWorld with 11 percent fewer steps, and 21.7 percent success on WindowsAgentArena without Windows-specific training, which indicates cross-platform transfer of the policy.


The post UltraCUA: A Foundation Computer-Use Agents Model that Bridges the Gap between General-Purpose GUI Agents and Specialized API-based Agents appeared first on MarkTechPost.
