VentureBeat · October 28, 00:05
Watch & Learn: Automatically extracting training data from videos to power computer use agents

Researchers at Google Cloud and DeepMind have developed a new framework called Watch & Learn (W&L) to address the scarcity of training data for computer use agents (CUAs). The framework automatically extracts high-quality training examples from raw videos, with no human annotation required. Experiments show that data generated by W&L can be used to train and fine-tune existing computer use models, and can also serve as in-context learning (ICL) examples, helping enterprises build CUAs for bespoke internal tasks without costly model training. By framing the problem as an "inverse dynamics objective" that predicts the action between two consecutive observations, the approach simplifies data extraction and yields more reliable training data that better aligns with human intent.

✨ **Automated training data generation**: The Watch & Learn (W&L) framework automatically extracts training data for computer use agents (CUAs) from raw videos, addressing the difficulty of obtaining high-quality training examples at scale. Its core innovation is recasting data extraction as an "inverse dynamics objective": predicting the intermediate action from two consecutive observed states, which is more efficient and generalizes better than relying on human annotation or complex heuristics.

🚀 **Powering model training and fine-tuning**: Data generated by W&L can be used to train or fine-tune existing computer use models and foundation models, significantly improving their performance on computer-use tasks. The researchers show that 53,125 high-quality trajectories generated with W&L lift open-source models such as UI-TARS-1.5 and Qwen 2.5-VL by up to 11 points on the OSWorld benchmark.

💡 **Enabling efficient in-context learning**: Beyond training data, the trajectories extracted by W&L can serve as in-context learning (ICL) examples that boost general-purpose multimodal models on specific tasks. By adding reasoning annotations to these examples and inserting them into the CUA's prompt, agents can be improved at inference time without retraining, giving enterprises a flexible, low-cost way to customize CUAs.

🌐 **No human annotation, lower cost**: One of W&L's biggest advantages is that it is fully automated, eliminating the need for human annotation. This greatly reduces the cost and time of training CUAs, avoids the low-precision, faulty examples produced by earlier automated labeling approaches, and offers a more practical path toward deploying CUAs in real-world applications.

A new framework developed by researchers at Google Cloud and DeepMind aims to address one of the key challenges of developing computer use agents (CUAs): Gathering high-quality training examples at scale.

The framework, dubbed Watch & Learn (W&L), addresses the problem of training data generation in a way that doesn’t require human annotation and can automatically extract demonstrations from raw videos.

Their experiments show that data generated with W&L can be used to train or fine-tune existing computer use and foundation models to improve their performance on computer-use tasks. But equally important, the same approach can be used to create in-context learning (ICL) examples for computer use agents, enabling companies to create CUAs for bespoke internal tasks without the need for costly training of specialized models.

The data bottleneck of CUAs

The web is rich with video tutorials and screencasts that describe complex workflows for using applications. These videos are a gold mine that can provide computer use agents with domain knowledge and instructions for accomplishing different tasks through user interface interactions.

However, before they can be used to train CUAs, these videos need to be transformed into annotated trajectories (that is, a set of task descriptions, screenshots and actions), a process that is prohibitively expensive and time-consuming when done manually.

Existing approaches to this data bottleneck rely on annotating videos with multimodal language models, which usually results in low precision and faulty examples. A different approach uses self-play agents that autonomously explore user interfaces to collect trajectories. However, techniques based on this approach usually create simple examples that are not useful in unpredictable real-world situations.

As the researchers note in their paper, “Overall, these approaches either rely on brittle heuristics, are costly as they rely on explorations in real environments or generate low-complexity demonstrations misaligned with human intent.”

Watch & Learn

The Watch & Learn framework tries to address the challenges of creating CUA demonstrations by rethinking the problem formulation.

Instead of directly generating trajectories or depending on complex multi-stage pipelines, the researchers frame the problem as an “inverse dynamics objective”: Given two consecutive observations, predict the intermediate action that produced the transition.

According to the researchers, this formulation is “easier to learn, avoids hand-crafted heuristics and generalizes robustly across applications.”
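To make the formulation concrete, here is a minimal sketch of the inverse dynamics objective as a standard supervised learning problem in PyTorch. The encoder, dimensions and discrete action vocabulary are illustrative assumptions, not the architecture described in the paper.

```python
# Minimal sketch of the inverse dynamics objective: given two consecutive
# screen observations, predict the action that produced the transition.
# The encoder and discrete action vocabulary are illustrative placeholders,
# not the model described in the paper.
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    def __init__(self, obs_encoder: nn.Module, hidden_dim: int, num_actions: int):
        super().__init__()
        self.obs_encoder = obs_encoder                      # e.g. a small vision transformer
        self.action_head = nn.Linear(2 * hidden_dim, num_actions)

    def forward(self, obs_t: torch.Tensor, obs_t1: torch.Tensor) -> torch.Tensor:
        # Encode both observations, concatenate, and classify the action.
        z_t = self.obs_encoder(obs_t)
        z_t1 = self.obs_encoder(obs_t1)
        return self.action_head(torch.cat([z_t, z_t1], dim=-1))

def idm_loss(model: InverseDynamicsModel, obs_t, obs_t1, action_labels):
    # Cross-entropy between predicted and ground-truth transition actions.
    logits = model(obs_t, obs_t1)
    return nn.functional.cross_entropy(logits, action_labels)
```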

The W&L framework can be broken down into three key stages: Training an inverse dynamics model (IDM), retrieving raw videos, and training CUA agents.

In the first phase, the researchers used agents to interact with live web pages to create a large corpus of 500,000 state transitions (two consecutive observations and the action that resulted in the transition). They then used this data (along with 132,000 human-annotated transitions from existing open datasets) to train an inverse dynamics model (IDM) that takes in two consecutive observations and predicts the transition action. Their trained IDM, which is a small transformer model, outperformed off-the-shelf foundation models in predicting transition actions.
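A rough sketch of how such a transition corpus might be represented is shown below; the field names and the `build_idm_corpus` helper are assumptions for illustration, not the paper's actual data schema.

```python
# Illustrative record format for the state-transition corpus used to train
# the IDM: two consecutive observations plus the action between them.
# Field names are assumptions, not the dataset schema from the paper.
from dataclasses import dataclass, field

@dataclass
class StateTransition:
    obs_before: bytes        # screenshot captured before the action (e.g. PNG bytes)
    obs_after: bytes         # screenshot captured after the action
    action_type: str         # e.g. "click", "scroll", "type"
    action_args: dict = field(default_factory=dict)   # e.g. {"x": 412, "y": 88}
    source: str = "agent_collected"                    # or "human_annotated"

def build_idm_corpus(agent_transitions, human_annotated_transitions):
    # Combine agent-collected transitions (~500,000 in the paper) with
    # human-annotated ones from open datasets (~132,000) into one training set.
    return list(agent_transitions) + list(human_annotated_transitions)
```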

The researchers then designed a pipeline that retrieves videos from platforms such as YouTube and runs them through the IDM to generate high-quality trajectories. The IDM takes in consecutive video frames and determines the actions (scroll, click) that caused the changes in the environment, which are then packaged into annotated trajectories. Using this method, they generated 53,125 trajectories with high-accuracy action labels.
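A simplified sketch of that retrieval-and-labeling step is shown below; the `idm.predict` interface and the trajectory dictionary layout are hypothetical, used only to illustrate the flow from frames to an annotated trajectory.

```python
# Sketch of the video-to-trajectory step: run a trained IDM over pairs of
# consecutive frames and collect the predicted actions into a trajectory.
# The `idm.predict` interface and dictionary layout are hypothetical.
def video_to_trajectory(frames, idm, task_description: str) -> dict:
    trajectory = {"task": task_description, "steps": []}
    for frame_t, frame_t1 in zip(frames, frames[1:]):
        # Predict the UI action (e.g. scroll, click) that explains the
        # change between the two frames.
        action = idm.predict(frame_t, frame_t1)
        trajectory["steps"].append({"observation": frame_t, "action": action})
    return trajectory
```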

These examples can be used to train effective computer use models for specific tasks. But the researchers also found that trajectories extracted through IDM can serve as in-context learning examples to improve the performance of CUAs on bespoke tasks at inference time. For ICL, they use Gemini 2.5 Flash to add additional reasoning annotations to the observation/action examples in the trajectories, which can then be inserted into the CUA agent’s prompt (usually 3-5 examples) during inference.
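A minimal sketch of how those examples could be assembled into a prompt follows; `annotate_with_reasoning` stands in for the annotation call (the article mentions Gemini 2.5 Flash being used for this), and the prompt wording is illustrative.

```python
# Sketch of using IDM-extracted trajectories as in-context examples:
# annotate each step with a short rationale, then prepend a handful of
# worked examples (the article mentions 3-5) to the agent's prompt.
# `annotate_with_reasoning` is a stand-in for the annotation step.
def build_icl_prompt(task: str, trajectories, annotate_with_reasoning, k: int = 3) -> str:
    examples = []
    for traj in trajectories[:k]:
        annotated = annotate_with_reasoning(traj)   # adds a "reasoning" field per step
        lines = [f"Task: {annotated['task']}"]
        for step in annotated["steps"]:
            lines.append(f"Reasoning: {step['reasoning']}")
            lines.append(f"Action: {step['action']}")
        examples.append("\n".join(lines))
    return (
        "You are a computer-use agent. Study the worked examples, then "
        "complete the new task.\n\n"
        + "\n\n".join(examples)
        + f"\n\nNew task: {task}"
    )
```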

“This dual role (training and in-context guidance) enables flexible integration with both open-source models and general-purpose agents,” the researchers write.

W&L in action

To test the usefulness of W&L, the researchers ran a series of experiments with closed and open source models on the OSWorld benchmark, which evaluates agents in real desktop and operating system environments across different tasks, including productivity, programming and design.

For fine-tuning, they used their corpus of 53,000 trajectories to train two open source models: UI-TARS-1.5, a strong, open source vision-language-action model designed specifically for computer use, and Qwen 2.5-VL, an open-weight multimodal LLM. 

For in-context learning tests, they applied W&L examples to general-purpose multimodal models such as Gemini 2.5 Flash, OpenAI o3 and Claude Sonnet 4. 

W&L resulted in improvements on OSWorld in all model categories, including up to 3 points for ICL on general-purpose models and up to 11 points for fine-tuned open-source models.

More importantly, these benefits were achieved without any manual annotation, “demonstrating that web-scale human workflows can serve as a practical and scalable foundation for advancing CUAs towards real-world deployment,” the researchers write.

This could have important implications for real-world applications, enabling enterprises to turn their existing corpora of videos and conference recordings into training data for CUAs. It also makes it easier to generate new training trajectories: all you need to do is record videos of different tasks being performed and have them annotated by an IDM. And with frontier models constantly improving and becoming cheaper, you can expect to get more from your existing data as the field continues to progress.

