Computer-Use Agents: From the Web to the Operating System

TL;DR: Computer-use agents are VLM-driven UI agents that act like users on unmodified software. Baselines on OSWorld started at 12.24% (human 72.36%); Claude Sonnet 4.5 now reports 61.4%. Gemini 2.5 Computer Use leads several web benchmarks (Online-Mind2Web 69.0%, WebVoyager 88.9%) but is not yet OS-optimized. Next steps center on OS-level robustness, sub-second action loops, and hardened safety policies, with transparent training/evaluation recipes emerging from the open community.

Definition

Computer-use agents (a.k.a. GUI agents) are vision-language models that observe the screen, ground UI elements, and execute bounded UI actions (click, type, scroll, key-combos) to complete tasks in unmodified applications and browsers. Public implementations include Anthropic’s Computer Use, Google’s Gemini 2.5 Computer Use, and OpenAI’s Computer-Using Agent powering Operator.
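To make "grounding" concrete, here is a minimal sketch of the last step of grounding: turning a model-emitted point into a pixel the OS can click. It assumes the model reports coordinates on a normalized 1000 x 1000 grid, which is an illustrative assumption and not a documented property of any vendor named above. `Screen` and `to_pixels` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Screen:
    width: int   # screen width in pixels
    height: int  # screen height in pixels


def to_pixels(norm_x: int, norm_y: int, screen: Screen, grid: int = 1000) -> tuple[int, int]:
    """Map a point on an assumed grid x grid normalized canvas to screen pixels."""
    if not (0 <= norm_x < grid and 0 <= norm_y < grid):
        raise ValueError(f"normalized point ({norm_x}, {norm_y}) outside 0..{grid - 1}")
    # Scale to the real resolution, then clamp so rounding never leaves the screen.
    px = min(screen.width - 1, round(norm_x * screen.width / grid))
    py = min(screen.height - 1, round(norm_y * screen.height / grid))
    return px, py


if __name__ == "__main__":
    # Centre of the normalized canvas on a 1920x1080 display.
    print(to_pixels(500, 500, Screen(1920, 1080)))
```

Keeping the model's coordinate space independent of the display resolution is one common design choice; an agent that emits raw pixels would skip this step entirely.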

Control Loop

Typical runtime loop: (1) capture screenshot + state, (2) plan next action with spatial/semantic grounding, (3) act via a constrained action schema, (4) verify and retry on failure. Vendors document standardized action sets and guardrails; audited harnesses normalize comparisons.
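The four-stage loop above can be sketched as a single function. This is a minimal sketch under stated assumptions, not any vendor's implementation: `capture`, `plan`, `act`, and `verify` are hypothetical callables supplied by the caller, and `Action`, `StepResult`, and `run_step` are illustrative names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    name: str   # e.g. "click_at" or "type"
    args: dict  # action-specific arguments


@dataclass
class StepResult:
    action: Action
    attempts: int
    ok: bool


def run_step(
    capture: Callable[[], bytes],
    plan: Callable[[bytes], Action],
    act: Callable[[Action], None],
    verify: Callable[[bytes, bytes, Action], bool],
    max_retries: int = 2,
) -> StepResult:
    """Run capture -> plan -> act -> verify once, retrying when verification fails."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    for attempt in range(1, max_retries + 2):
        before = capture()             # (1) observe the screen
        action = plan(before)          # (2) choose the next action
        act(action)                    # (3) execute it through the action schema
        after = capture()
        if verify(before, after, action):  # (4) check the effect
            return StepResult(action, attempt, True)
    # Every attempt failed verification; report the last action tried.
    return StepResult(action, max_retries + 1, False)
```

Re-planning from a fresh screenshot on each retry, instead of replaying the same action, lets the agent react to whatever state the failed attempt left behind.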

Benchmark Landscape

OSWorld (HKU, Apr 2024): human 72.36%; best model 12.24%.

State of play (2025): Claude Sonnet 4.5 reports 61.4% on OSWorld.

Live-web benchmarks: Gemini 2.5 Computer Use reports 69.0% on Online-Mind2Web (official leaderboard), 88.9% on WebVoyager, and 69.7% on AndroidWorld; it is browser-optimized and not yet optimized for OS-level control.
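The OSWorld figures quoted in this article (human 72.36%, launch-time best model 12.24%, Claude Sonnet 4.5 at 61.4%) can be put in perspective with simple arithmetic. The sketch below uses only those three numbers; `gap_summary` and its field names are illustrative.

```python
HUMAN = 72.36     # OSWorld human success rate (%), as quoted in this article
BASELINE = 12.24  # best model at OSWorld launch (%)
SONNET_45 = 61.4  # Claude Sonnet 4.5 reported score (%)


def gap_summary(model: float, human: float = HUMAN, baseline: float = BASELINE) -> dict[str, float]:
    """Express a model score relative to the human rate and the launch baseline."""
    return {
        # Absolute distance to the human rate, in percentage points.
        "points_below_human": human - model,
        # Model score as a fraction of the human rate.
        "fraction_of_human": model / human,
        # Share of the launch-time human/model gap that has been closed.
        "share_of_gap_closed": (model - baseline) / (human - baseline),
    }


if __name__ == "__main__":
    for key, value in gap_summary(SONNET_45).items():
        print(f"{key}: {value:.3f}")
```

With these inputs the model sits about 11 points below the human rate, at roughly 85% of it, having closed roughly 82% of the launch-time gap. These ratios compare self-reported scores and say nothing about harness differences between runs.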

Online-Mind2Web spec:

Architecture Components

Perception & Grounding:

Planning:

Action Schema: click_at, type, key_combo, open_app
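One way to keep actions "bounded" is to validate every model-proposed action against a fixed schema before executing it. The sketch below uses the four action names listed above; the argument names and types (`x`, `y`, `text`, `keys`, `name`) are assumptions made for illustration and are not taken from any vendor's specification.

```python
from typing import Any

# Required argument names and types per action. Argument names are hypothetical.
ACTION_SCHEMA: dict[str, dict[str, type]] = {
    "click_at": {"x": int, "y": int},
    "type": {"text": str},
    "key_combo": {"keys": list},
    "open_app": {"name": str},
}


def validate_action(name: str, args: dict[str, Any]) -> None:
    """Raise ValueError unless (name, args) matches the schema exactly."""
    if name not in ACTION_SCHEMA:
        raise ValueError(f"unknown action: {name!r}")
    spec = ACTION_SCHEMA[name]
    if set(args) != set(spec):
        raise ValueError(f"{name}: expected args {sorted(spec)}, got {sorted(args)}")
    for key, expected in spec.items():
        value = args[key]
        # bool is a subclass of int in Python, so it is rejected explicitly;
        # no field in this schema is boolean.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(
                f"{name}.{key}: expected {expected.__name__}, got {type(value).__name__}"
            )


if __name__ == "__main__":
    validate_action("click_at", {"x": 120, "y": 340})  # passes silently
    try:
        validate_action("drag", {"x": 1, "y": 2})       # not in the schema
    except ValueError as err:
        print(err)
```

Rejecting unknown actions and extra arguments outright, instead of ignoring them, keeps the executor's behavior predictable when the model emits something outside the schema.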

Evaluation Harness:

Enterprise Snapshot

Anthropic: Claude Sonnet 4.5, 61.4% on OSWorld.

Google DeepMind: Gemini 2.5 Computer Use, with Online-Mind2Web 69.0%, WebVoyager 88.9%, and AndroidWorld 69.7%.

OpenAI: Computer-Using Agent (CUA), the model powering Operator.

Where They’re Headed: Web → OS

Few-/one-shot workflow cloning:

Latency budgets for collaboration: 0.1–1 s
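A simple way to track the 0.1–1 s budget is to time each action-loop step and bucket the result. This is a minimal sketch: the two thresholds come from the range above, while the bucket labels and the function names `timed_step` and `classify_latency` are illustrative.

```python
import time
from typing import Callable

BUDGET_MIN_S = 0.1  # lower end of the range quoted above
BUDGET_MAX_S = 1.0  # upper end of the range quoted above


def timed_step(step: Callable[[], None]) -> float:
    """Run one step and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    step()
    return time.perf_counter() - start


def classify_latency(seconds: float) -> str:
    """Bucket a step duration relative to the 0.1-1 s budget."""
    if seconds < 0:
        raise ValueError("duration cannot be negative")
    if seconds < BUDGET_MIN_S:
        return "faster_than_budget"
    if seconds <= BUDGET_MAX_S:
        return "within_budget"
    return "over_budget"


if __name__ == "__main__":
    # Stand-in for a real capture/plan/act/verify step.
    duration = timed_step(lambda: time.sleep(0.2))
    print(f"{duration:.3f}s -> {classify_latency(duration)}")
```

In practice the duration would cover the whole step, including the model call, since that is the delay a collaborating user actually experiences.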

OS-level breadth:

Safety:

Practical Build Notes

browser-first

recoverability

metrics

Open Research & Tooling

Hugging Face’s Smol2Operator provides an open post-training recipe that upgrades a small VLM into a GUI-grounded operator—useful for labs/startups prioritizing reproducible training over leaderboard records.

Key Takeaways

Computer-use (GUI) agents are VLM-driven systems that perceive screens and emit bounded UI actions (click/type/scroll) to operate unmodified apps; current public implementations include Anthropic Computer Use, Google Gemini 2.5 Computer Use, and OpenAI’s Computer-Using Agent.

OSWorld (HKU) benchmarks 369 real desktop/web tasks with execution-based evaluation; at launch humans achieved 72.36% while the best model reached 12.24%, highlighting grounding and procedural gaps.

Anthropic Claude Sonnet 4.5 reports 61.4% on OSWorld: sub-human, but a large jump from prior Sonnet 4 results.

Gemini 2.5 Computer Use leads several live-web benchmarks (Online-Mind2Web 69.0%, WebVoyager 88.9%, AndroidWorld 69.7%) and is explicitly optimized for browsers, not yet for OS-level control.

OpenAI Operator is a research preview powered by the Computer-Using Agent (CUA) model that uses screenshots to interact with GUIs; availability remains limited.

Open-source trajectory: Hugging Face’s Smol2Operator provides a reproducible post-training pipeline that turns a small VLM into a GUI-grounded operator, standardizing action schemas and datasets.