MarkTechPost@AI, October 10, 16:09
Computer-Use Agents: From Web to OS

 

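This article examines the technical details of computer-use agents (also called GUI agents). These agents are built on vision-language models (VLMs) and can carry out tasks in unmodified software and operating systems the way a human user would. The article explains how they work, including the control loop of screenshot capture, UI element grounding, and bounded action execution. It also compares how current leading computer-use agents perform on benchmarks such as OSWorld and Online-Mind2Web, and reviews recent progress from Anthropic, Google DeepMind, and OpenAI. It then looks ahead to where the technology is going: more robust OS-level behavior, shorter action latency, and stronger safety policies. Finally, it offers practical build advice for developers and an introduction to open-source tooling.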

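💻 **Core capabilities of computer-use agents (GUI agents)**: Built on vision-language models (VLMs), these agents observe the screen, identify and interpret UI elements, and then execute a bounded set of UI actions (click, type, scroll, and so on) to complete user-specified tasks in unmodified applications and browsers. They are a key technology for simulating how human users interact with software.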

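📊 **Benchmarks and current performance**: On the OSWorld benchmark, human users score 72.36%, while early models reached only 12.24%. Anthropic's Claude Sonnet 4.5 has raised this to 61.4%, a significant improvement. Google's Gemini 2.5 Computer Use leads on web benchmarks such as Online-Mind2Web (69.0%) and WebVoyager (88.9%), but it is currently optimized for browsers and not yet for OS-level control.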

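🚀 **Future directions**: Trends include stronger OS-level robustness, covering complex cases such as file dialogs and multi-window interaction; much lower action latency, to meet the real-time demands of human-computer interaction; and stronger safety policies that guard against malicious actions and data leaks, for example through finer-grained action constraints and user-confirmation mechanisms.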

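🛠️ **Open research and practice**: Hugging Face's Smol2Operator project provides an open-source post-training pipeline that upgrades a small VLM into an agent that can understand GUIs and act on them. It gives researchers and startups a reproducible training recipe and helps advance the standardization and adoption of GUI agent technology.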

TL;DR: Computer-use agents are VLM-driven UI agents that act like users on unmodified software. Baselines on OSWorld started at 12.24% (human 72.36%); Claude Sonnet 4.5 now reports 61.4%. Gemini 2.5 Computer Use leads several web benchmarks (Online-Mind2Web 69.0%, WebVoyager 88.9%) but is not yet OS-optimized. Next steps center on OS-level robustness, sub-second action loops, and hardened safety policies, with transparent training/evaluation recipes emerging from the open community.
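
To put the OSWorld figures above in proportion, the short sketch below computes the remaining gap to the human score and the share of the baseline-to-human gap that has been closed. The function names are illustrative, and the comparison assumes the three numbers were measured under comparable settings, which the reported figures alone do not guarantee.

```python
# OSWorld success rates (percent) as quoted in the summary above.
HUMAN = 72.36
EARLY_BASELINE = 12.24
SONNET_4_5 = 61.4


def gap_to_human(score: float, human: float = HUMAN) -> float:
    """Absolute gap to the human score, in percentage points."""
    return round(human - score, 2)


def fraction_of_human(score: float, human: float = HUMAN) -> float:
    """Score expressed as a fraction of the human score."""
    return score / human


def gap_closed(score: float, baseline: float = EARLY_BASELINE,
               human: float = HUMAN) -> float:
    """Share of the baseline-to-human gap that the score has closed."""
    return (score - baseline) / (human - baseline)


if __name__ == "__main__":
    print(f"gap to human: {gap_to_human(SONNET_4_5)} points")
    print(f"fraction of human score: {fraction_of_human(SONNET_4_5):.3f}")
    print(f"share of gap closed: {gap_closed(SONNET_4_5):.3f}")
```

Read this way, the reported score sits about 11 points below the human reference, which is one reason OS-level robustness remains on the list of next steps.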

Definition

Computer-use agents (a.k.a. GUI agents) are vision-language models that observe the screen, ground UI elements, and execute bounded UI actions (click, type, scroll, key-combos) to complete tasks in unmodified applications and browsers. Public implementations include Anthropic’s Computer Use, Google’s Gemini 2.5 Computer Use, and OpenAI’s Computer-Using Agent powering Operator.
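
A bounded action set can be pictured as a small, closed schema that every proposed action must pass before it reaches the screen. The sketch below is a minimal illustration under that assumption; `UIAction`, `ALLOWED_KINDS`, and `validate` are hypothetical names and do not correspond to any vendor's actual API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical closed set of action kinds, mirroring the examples in the text.
ALLOWED_KINDS = {"click", "type", "scroll", "key_combo"}


@dataclass(frozen=True)
class UIAction:
    kind: str
    x: Optional[int] = None        # pixel coordinates for click/scroll
    y: Optional[int] = None
    text: Optional[str] = None     # payload for "type"
    keys: Tuple[str, ...] = ()     # e.g. ("ctrl", "s") for "key_combo"


def validate(action: UIAction, screen_w: int, screen_h: int) -> List[str]:
    """Return a list of problems; an empty list means the action is in bounds."""
    if action.kind not in ALLOWED_KINDS:
        return [f"unknown action kind: {action.kind!r}"]
    problems: List[str] = []
    if action.kind in ("click", "scroll"):
        if action.x is None or action.y is None:
            problems.append("click/scroll needs x and y")
        elif not (0 <= action.x < screen_w and 0 <= action.y < screen_h):
            problems.append("coordinates outside the screen")
    if action.kind == "type" and not action.text:
        problems.append("type needs non-empty text")
    if action.kind == "key_combo" and not action.keys:
        problems.append("key_combo needs at least one key")
    return problems


if __name__ == "__main__":
    print(validate(UIAction("click", x=10, y=20), 1280, 800))
    print(validate(UIAction("drag"), 1280, 800))
```

Rejecting anything outside the schema before execution is what keeps the agent's behavior bounded, regardless of what the model proposes.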

Control Loop

Typical runtime loop: (1) capture screenshot + state, (2) plan next action with spatial/semantic grounding, (3) act via a constrained action schema, (4) verify and retry on failure. Vendors document standardized action sets and guardrails; audited harnesses normalize comparisons.
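
The four steps of the loop can be sketched against a toy environment. Everything here is an assumption made for illustration: `FakeDesktop` stands in for a real screen, `plan` is a hard-coded stub where a real agent would query a VLM, and `verify` simply checks whether the observed state changed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Observation:
    screenshot: bytes       # raw pixels in a real system; a placeholder here
    state: Dict[str, str]   # extra state, e.g. field contents


@dataclass
class FakeDesktop:
    """Toy stand-in for a screen with one text field and a submit button."""
    field_text: str = ""
    submitted: bool = False
    flaky_clicks: int = 1   # the first N clicks are dropped, to exercise retry

    def capture(self) -> Observation:
        return Observation(b"", {"field": self.field_text,
                                 "submitted": str(self.submitted)})

    def act(self, action: Dict[str, str]) -> None:
        if action["kind"] == "type":
            self.field_text = action["text"]
        elif action["kind"] == "click" and action["target"] == "submit":
            if self.flaky_clicks > 0:
                self.flaky_clicks -= 1
            else:
                self.submitted = True


def plan(obs: Observation, goal: str) -> Optional[Dict[str, str]]:
    """Stub planner (a VLM call in a real agent). Returns None when done."""
    if obs.state["field"] != goal:
        return {"kind": "type", "text": goal}
    if obs.state["submitted"] != "True":
        return {"kind": "click", "target": "submit"}
    return None


def verify(before: Observation, after: Observation) -> bool:
    """Crude check: the action counts as successful if the state changed."""
    return before.state != after.state


def run(env: FakeDesktop, goal: str, max_steps: int = 10,
        max_retries: int = 2) -> List[dict]:
    log: List[dict] = []
    for _ in range(max_steps):
        obs = env.capture()                      # (1) capture screenshot + state
        action = plan(obs, goal)                 # (2) plan the next action
        if action is None:
            break
        for attempt in range(max_retries + 1):
            env.act(action)                      # (3) act via the constrained schema
            ok = verify(obs, env.capture())      # (4) verify, retry on failure
            log.append({**action, "attempt": attempt, "ok": ok})
            if ok:
                break
    return log


if __name__ == "__main__":
    for entry in run(FakeDesktop(), "hello"):
        print(entry)
```

The step and retry caps are a deliberate design choice in this sketch: they keep a misbehaving planner or an unresponsive UI from looping indefinitely.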

Benchmark Landscape

Architecture Components

Enterprise Snapshot

Where They’re Headed: Web → OS

Practical Build Notes

Open Research & Tooling

Hugging Face’s Smol2Operator provides an open post-training recipe that upgrades a small VLM into a GUI-grounded operator—useful for labs/startups prioritizing reproducible training over leaderboard records.

Key Takeaways


References:

Benchmarks (OSWorld & Online-Mind2Web)

Anthropic (Computer Use & Sonnet 4.5)

Google DeepMind (Gemini 2.5 Computer Use)

OpenAI (Operator / CUA)

Open-source: Hugging Face Smol2Operator

The post What are ‘Computer-Use Agents’? From Web to OS—A Technical Explainer appeared first on MarkTechPost.
