VentureBeat, October 8
Google launches a new version of Gemini 2.5 Pro that strengthens AI agents' ability to operate on the web

 

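Google's DeepMind AI lab has released a fine-tuned, custom-trained new version of Gemini 2.5 Pro called "Gemini 2.5 Pro Computer Use." The model can use a virtual browser to browse the web on a user's behalf, retrieve information, fill out forms, and take actions on websites, all from a single text prompt. Although the model is not offered directly to consumers, a partnership with Browserbase lets developers and users demo it and compare it with rival models. The new model performs strongly on interface-control benchmarks and is already used in several projects inside Google and at partner companies, with the aim of building more general-purpose AI agents.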

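🤖 **The evolution of AI agents:** Gemini 2.5 Pro Computer Use marks a step in the evolution of large language models toward "agents," letting them go beyond simple multimodal interaction and actively carry out tasks on the web on a user's behalf. This includes browsing websites, filling out forms, and extracting information, which greatly expands the range of AI applications.

🌐 **Virtual browser capability:** The model's core capability is using a virtual browser to interact with websites. By analyzing interface screenshots and the history of past actions, Gemini 2.5 Pro Computer Use can mimic a human user's browser operations, such as clicking, typing, and scrolling, to automate tasks.

🚀 **Performance and applications:** Across several interface-control benchmarks, Gemini 2.5 Pro Computer Use shows leading performance, ahead of Anthropic's Claude Sonnet and OpenAI's agent models. Google's internal teams and third-party partners already use it in areas such as improving testing efficiency, data parsing, and AI assistants.

🔒 **Safety and control:** Google emphasizes multi-layered safety measures, including a safety check before each step, system-level instruction definitions, and built-in safeguards, so that the model asks for user confirmation when performing sensitive actions (such as those involving payments or CAPTCHAs) and avoids security risks.

💰 **Pricing and availability:** Gemini 2.5 Pro Computer Use is currently available only through the paid tier, with pricing similar to the standard Gemini 2.5 Pro model and billed per token. Unlike Gemini 2.5 Pro, it has no free tier, and its output is not used to improve Google products.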

Some of the largest providers of large language models (LLMs) have sought to move beyond multimodal chatbots — extending their models out into "agents" that can actually take more actions on behalf of the user across websites. Recall OpenAI's ChatGPT Agent (formerly known as "Operator") and Anthropic's Computer Use, both released over the last two years.

Now, Google is getting into that same game as well. Today, the search giant's DeepMind AI lab subsidiary unveiled a new, fine-tuned and custom-trained version of its powerful Gemini 2.5 Pro LLM known as "Gemini 2.5 Pro Computer Use," which can use a virtual browser to surf the web on your behalf, retrieve information, fill out forms, and even take actions on websites — all from a user's single text prompt.

"These are early days, but the model’s ability to interact with the web – like scrolling, filling forms + navigating dropdowns – is an important next step in building general-purpose agents," said Google CEO Sundar Pichai, as part of a longer statement on the social network, X.

The model is not available for consumers directly from Google, though.

Instead, Google partnered with another company, Browserbase, founded by former Twilio engineer Paul Klein in early 2024, which offers a virtual "headless" web browser built specifically for use by AI agents and applications. (A "headless" browser is one that doesn't require a graphical user interface, or GUI, to navigate the web, though in this case and others, Browserbase does show a graphical representation for the user.)

Users can demo the new Gemini 2.5 Computer Use model directly on Browserbase and even compare it side-by-side with the older, rival offerings from OpenAI and Anthropic in a new "Browser Arena" launched by the startup (though only one additional model can be selected alongside Gemini at a time).

For AI builders and developers, it's being made available as a raw, albeit proprietary, LLM through the Gemini API in Google AI Studio for rapid prototyping, and through Google Cloud's Vertex AI model selector and application-building platform.

The new offering builds on the capabilities of Gemini 2.5 Pro, which was released in March 2025 and has been updated significantly several times since, with a specific focus on enabling AI agents to interact directly with user interfaces, including browsers and mobile applications.

Overall, it appears Gemini 2.5 Computer Use is designed to let developers create agents that can complete interface-driven tasks autonomously — such as clicking, typing, scrolling, filling out forms, and navigating behind login screens.

Rather than relying solely on APIs or structured inputs, this model allows AI systems to interact with software visually and functionally, much like a human would.

Brief User Hands-On Tests

In my brief, unscientific initial hands-on tests on the Browserbase website, Gemini 2.5 Computer Use successfully navigated to Taylor Swift's official website as instructed and provided me a summary of what was being sold or promoted at the top: a special edition of her newest album, "The Life of a Showgirl."

In another test, I asked Gemini 2.5 Computer Use to search Amazon for highly rated and well-reviewed solar lights I could stake into my back yard, and I was delighted to watch as it successfully completed a Google Search CAPTCHA designed to weed out non-human users ("Select all the boxes with a motorcycle"). It did so in a matter of seconds.

However, once it got through there, it stalled and was unable to complete the task, despite serving up a "task completed" message.

I should also note here that while the ChatGPT agent from OpenAI and Anthropic's Claude can create and edit local files — such as PowerPoint presentations, spreadsheets, or text documents — on the user’s behalf, Gemini 2.5 Computer Use does not currently offer direct file system access or native file creation capabilities.

Instead, it is designed to control and navigate web and mobile user interfaces through actions like clicking, typing, and scrolling. Its output is limited to suggested UI actions or chatbot-style text responses; any structured output like a document or file must be handled separately by the developer, often through custom code or third-party integrations.

Performance Benchmarks

Google says Gemini 2.5 Computer Use has demonstrated leading results in multiple interface control benchmarks, particularly when compared to other major AI systems including Claude Sonnet and OpenAI’s agent-based models.

Evaluations were conducted via Browserbase and Google’s own testing.

Some highlights include:

In addition to strong accuracy, Google reports that the model operates at lower latency than other browser control solutions — a key factor in production use cases like UI automation and testing.

How It Works

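Agents powered by the Computer Use model operate within an interaction loop. At each step they receive the user's task request, a screenshot of the current interface, and a history of recent actions.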

The model analyzes this input and produces a recommended UI action, such as clicking a button or typing into a field.

If needed, it can request confirmation from the end user for riskier tasks, such as making a purchase.

Once the action is executed, the interface state is updated and a new screenshot is sent back to the model. The loop continues until the task is completed or halted due to an error or a safety decision.

The model uses a specialized tool called computer_use, and it can be integrated into custom environments using tools like Playwright or via the Browserbase demo sandbox.
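
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Google's implementation: `Action`, `FakeBrowser`, `run_agent`, and `scripted_model` are hypothetical names, the browser is a stand-in for a real driver such as Playwright, and a fixed script plays the role of the model so the example runs without any API access.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """A UI action proposed by the model (hypothetical shape)."""
    kind: str          # e.g. "click", "type", "scroll", or "done"
    target: str = ""   # element description; a real system would use coordinates
    text: str = ""     # text to type, if any


class FakeBrowser:
    """Stand-in for a real browser driver; it only records what it was asked to do."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def screenshot(self) -> bytes:
        # A real driver would return image bytes of the current page.
        return f"screen after {len(self.log)} actions".encode()

    def execute(self, action: Action) -> None:
        self.log.append(f"{action.kind}:{action.target}:{action.text}")


def run_agent(
    task: str,
    browser: FakeBrowser,
    propose_action: Callable[[str, bytes, List[Action]], Action],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat screenshot -> proposed action -> execution until done or out of steps."""
    history: List[Action] = []
    for _ in range(max_steps):
        action = propose_action(task, browser.screenshot(), history)
        if action.kind == "done":
            break
        browser.execute(action)
        history.append(action)
    return history


def scripted_model(task: str, screenshot: bytes, history: List[Action]) -> Action:
    """Toy policy standing in for the model: it follows a fixed script."""
    script = [
        Action("click", "search box"),
        Action("type", "search box", "solar lights"),
        Action("done"),
    ]
    return script[len(history)]


browser = FakeBrowser()
steps = run_agent("find solar lights", browser, scripted_model)
print([a.kind for a in steps])
```

The `max_steps` cap is a design choice of this sketch: it keeps a stalled agent from looping forever, which matters given that a run can end without actually finishing the task.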

Use Cases and Adoption

According to Google, teams internally and externally have already started using the model across several domains, including improving testing efficiency, data parsing, and AI assistants.

The model is also being used in Google’s own product development efforts, including in Project Mariner, the Firebase Testing Agent, and AI Mode in Search.

Safety Measures

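Because this model directly controls software interfaces, Google emphasizes a multi-layered approach to safety: a safety check before each step, system-level instructions, and built-in safeguards that ask for user confirmation on sensitive actions such as payments.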

For example, if the model encounters a CAPTCHA, it will generate an action to click the checkbox but flag it as requiring user confirmation, ensuring the system does not proceed without human oversight.
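
One way a developer might honor that kind of flag is a small gate in front of the code that executes actions. The sketch below is an assumption-laden illustration: the `requires_confirmation` key, the dictionary shape of an action, and the `apply_action` helper are all hypothetical and do not reflect the actual response format of the Gemini API.

```python
from typing import Callable, Dict, List

Action = Dict[str, object]


def apply_action(
    action: Action,
    execute: Callable[[Action], None],
    ask_user: Callable[[Action], bool],
) -> str:
    """Execute an action, pausing for approval when it is flagged as sensitive."""
    # "requires_confirmation" is a hypothetical field name used for illustration.
    if action.get("requires_confirmation", False) and not ask_user(action):
        return "skipped"
    execute(action)
    return "executed"


executed: List[Action] = []
captcha_click: Action = {
    "kind": "click",
    "target": "CAPTCHA checkbox",
    "requires_confirmation": True,
}
scroll: Action = {"kind": "scroll", "target": "page"}


def deny(action: Action) -> bool:
    """Simulates a user who declines every confirmation request."""
    return False


print(apply_action(captcha_click, executed.append, deny))  # flagged and declined
print(apply_action(scroll, executed.append, deny))         # not flagged, runs directly
```

In this run the flagged CAPTCHA click is skipped because the simulated user declines, while the unflagged scroll goes through without a prompt.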

Technical Capabilities

The model supports a wide array of built-in UI actions such as clicking, typing, and scrolling.

It accepts image and text input and outputs text responses or function calls to perform tasks. The recommended screen resolution for optimal results is 1440x900, though it can work with other sizes.

API Pricing Remains Almost Identical to Gemini 2.5 Pro

The pricing for Gemini 2.5 Computer Use aligns closely with the standard Gemini 2.5 Pro model. Both follow the same per-token billing structure: input tokens are priced at $1.25 per one million tokens for prompts under 200,000 tokens, and $2.50 per million tokens for prompts longer than that.

Output tokens follow a similar split, priced at $10.00 per million for smaller responses and $15.00 per million for larger ones.
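
To make the rates concrete, here is a rough per-request cost estimate in Python. It rests on two assumptions the article does not spell out: that the prompt length selects the tier for both input and output, and that a prompt of exactly 200,000 tokens falls in the lower tier. The function and constant names are illustrative only.

```python
# Rates quoted in the article, in US dollars per one million tokens.
LONG_PROMPT_THRESHOLD = 200_000
INPUT_RATE = {"short": 1.25, "long": 2.50}
OUTPUT_RATE = {"short": 10.00, "long": 15.00}


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in US dollars.

    Assumption: the prompt length picks the tier for both input and output.
    """
    tier = "long" if input_tokens > LONG_PROMPT_THRESHOLD else "short"
    input_cost = input_tokens / 1_000_000 * INPUT_RATE[tier]
    output_cost = output_tokens / 1_000_000 * OUTPUT_RATE[tier]
    return input_cost + output_cost


print(f"${estimate_cost(100_000, 2_000):.3f}")  # lower tier
print(f"${estimate_cost(300_000, 5_000):.3f}")  # higher tier
```

Under these assumptions, a 100,000-token prompt with a 2,000-token response costs about $0.145, and a 300,000-token prompt with a 5,000-token response costs about $0.825.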

Where the models diverge is in availability and additional features.

Gemini 2.5 Pro includes a free tier that allows developers to use the model at no cost, with no explicit token cap published, though usage may be subject to rate limits or quota constraints depending on the platform (e.g. Google AI Studio).

This free access includes both input and output tokens. Once developers exceed their allotted quota or switch to the paid tier, standard per-token pricing applies.

In contrast, Gemini 2.5 Computer Use is available exclusively through the paid tier. There is no free access currently offered for this model, and all usage incurs token-based charges from the outset.

Feature-wise, Gemini 2.5 Pro supports optional capabilities like context caching (starting at $0.31 per million tokens) and grounding with Google Search (free for up to 1,500 requests per day, then $35 per 1,000 additional requests). These are not available for Computer Use at this time.
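
The grounding fee quoted above for Gemini 2.5 Pro can be worked out the same way. This sketch assumes the charge is prorated per request beyond the free daily allowance, which the article does not confirm; billing may in practice be rounded differently. The names are illustrative.

```python
FREE_REQUESTS_PER_DAY = 1_500
PRICE_PER_1000_EXTRA = 35.00  # US dollars, as quoted in the article


def grounding_cost_per_day(requests: int) -> float:
    """Daily Google Search grounding cost, assuming per-request proration."""
    billable = max(0, requests - FREE_REQUESTS_PER_DAY)
    return billable / 1_000 * PRICE_PER_1000_EXTRA


for n in (1_500, 1_600, 2_500):
    print(n, f"${grounding_cost_per_day(n):.2f}")
```

With these figures, 1,500 requests in a day cost nothing, 1,600 cost $3.50, and 2,500 cost $35.00.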

Another distinction is in data handling: output from the Computer Use model is not used to improve Google products in the paid tier, while free-tier usage of Gemini 2.5 Pro contributes to model improvement unless explicitly opted out.

Overall, developers can expect similar token-based costs across both models, but they should consider tier access, included capabilities, and data use policies when deciding which model fits their needs.
