VentureBeat · October 28, 00:05
Watch & Learn: Automatically extracting training data from videos to power computer use agents

Researchers at Google Cloud and DeepMind have developed a new framework called Watch & Learn (W&L) to address the scarcity of training data for computer use agents (CUAs). The framework automatically extracts high-quality training examples from raw videos, with no human annotation required. Experiments show that data generated by W&L can be used to train and fine-tune existing computer use models, and can also serve as in-context learning (ICL) examples, helping enterprises build CUAs for bespoke internal tasks without costly model training. By framing the problem as an "inverse dynamics objective" that predicts the action between two consecutive observations, the approach simplifies data extraction and yields more reliable training data that better aligns with human intent.

✨ **Automated training data generation**: The Watch & Learn (W&L) framework automatically extracts training data for computer use agents (CUAs) from raw videos, addressing the difficulty of obtaining high-quality training examples at scale. Its core innovation is recasting data extraction as an "inverse dynamics objective": predicting the intermediate action from two consecutive observed states, which is more efficient and generalizes better than relying on human annotation or complex heuristics.

🚀 **Powering model training and fine-tuning**: Data generated by W&L can be used to train or fine-tune existing computer use models and foundation models, significantly improving their performance on computer-use tasks. The researchers show that 53,125 high-quality trajectories generated with W&L lift open-source models such as UI-TARS-1.5 and Qwen 2.5-VL by up to 11 points on the OSWorld benchmark.

💡 **Enabling efficient in-context learning**: Beyond training data, the trajectories extracted by W&L can serve as in-context learning (ICL) examples that boost general-purpose multimodal models on specific tasks. By adding reasoning annotations to these examples and inserting them into the CUA's prompt, agents can be improved at inference time without retraining, giving enterprises a flexible, low-cost way to customize CUAs.

🌐 **No human annotation, lower cost**: One of W&L's biggest advantages is that it is fully automated, eliminating the need for human annotation. This greatly reduces the cost and time of training CUAs, avoids the low-precision, faulty examples produced by earlier automated labeling approaches, and offers a more practical path toward deploying CUAs in real-world applications.

A new framework developed by researchers at Google Cloud and DeepMind aims to address one of the key challenges of developing computer use agents (CUAs): Gathering high-quality training examples at scale.

The framework, dubbed Watch & Learn (W&L), addresses the problem of training data generation in a way that doesn’t require human annotation and can automatically extract demonstrations from raw videos.

Their experiments show that data generated with W&L can be used to train or fine-tune existing computer use and foundation models to improve their performance on computer-use tasks. But equally important, the same approach can be used to create in-context learning (ICL) examples for computer use agents, enabling companies to create CUAs for bespoke internal tasks without the need for costly training of specialized models.

The data bottleneck of CUAs

The web is rich with video tutorials and screencasts that describe complex workflows for using applications. These videos are a gold mine that can provide computer use agents with domain knowledge and instructions for accomplishing different tasks through user interface interactions.

However, before they can be used to train CUAs, these videos need to be transformed into annotated trajectories (that is, a set of task descriptions, screenshots and actions), a process that is prohibitively expensive and time-consuming when done manually.

Existing approaches to this data bottleneck rely on annotating videos with multimodal language models, which usually results in low precision and faulty examples. A different approach uses self-play agents that autonomously explore user interfaces to collect trajectories. However, techniques based on this approach usually create simple examples that are not useful in unpredictable real-world situations.

As the researchers note in their paper, “Overall, these approaches either rely on brittle heuristics, are costly as they rely on explorations in real environments or generate low-complexity demonstrations misaligned with human intent.”

Watch & Learn

The Watch & Learn framework tries to address the challenges of creating CUA demonstrations by rethinking the problem formulation.

Instead of directly generating trajectories or depending on complex multi-stage pipelines, the researchers frame the problem as an “inverse dynamics objective”: Given two consecutive observations, predict the intermediate action that produced the transition.

According to the researchers, this formulation is “easier to learn, avoids hand-crafted heuristics and generalizes robustly across applications.”
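To make the formulation concrete, here is a minimal sketch of the inverse dynamics objective as a standard supervised learning problem in PyTorch. The encoder, dimensions and discrete action vocabulary are illustrative assumptions, not the architecture described in the paper.

```python
# Minimal sketch of the inverse dynamics objective: given two consecutive
# screen observations, predict the action that produced the transition.
# The encoder and discrete action vocabulary are illustrative placeholders,
# not the model described in the paper.
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    def __init__(self, obs_encoder: nn.Module, hidden_dim: int, num_actions: int):
        super().__init__()
        self.obs_encoder = obs_encoder                      # e.g. a small vision transformer
        self.action_head = nn.Linear(2 * hidden_dim, num_actions)

    def forward(self, obs_t: torch.Tensor, obs_t1: torch.Tensor) -> torch.Tensor:
        # Encode both observations, concatenate, and classify the action.
        z_t = self.obs_encoder(obs_t)
        z_t1 = self.obs_encoder(obs_t1)
        return self.action_head(torch.cat([z_t, z_t1], dim=-1))

def idm_loss(model: InverseDynamicsModel, obs_t, obs_t1, action_labels):
    # Cross-entropy between predicted and ground-truth transition actions.
    logits = model(obs_t, obs_t1)
    return nn.functional.cross_entropy(logits, action_labels)
```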

The W&L framework can be broken down into three key stages: Training an inverse dynamics model (IDM), retrieving raw videos, and training CUA agents.

In the first phase, the researchers used agents to interact with live web pages to create a large corpus of 500,000 state transitions (two consecutive observations and the action that resulted in the transition). They then used this data (along with 132,000 human-annotated transitions from existing open datasets) to train an inverse dynamics model (IDM) that takes in two consecutive observations and predicts the transition action. Their trained IDM, which is a small transformer model, outperformed off-the-shelf foundation models in predicting transition actions.
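A rough sketch of how such a transition corpus might be represented is shown below; the field names and the `build_idm_corpus` helper are assumptions for illustration, not the paper's actual data schema.

```python
# Illustrative record format for the state-transition corpus used to train
# the IDM: two consecutive observations plus the action between them.
# Field names are assumptions, not the dataset schema from the paper.
from dataclasses import dataclass, field

@dataclass
class StateTransition:
    obs_before: bytes        # screenshot captured before the action (e.g. PNG bytes)
    obs_after: bytes         # screenshot captured after the action
    action_type: str         # e.g. "click", "scroll", "type"
    action_args: dict = field(default_factory=dict)   # e.g. {"x": 412, "y": 88}
    source: str = "agent_collected"                    # or "human_annotated"

def build_idm_corpus(agent_transitions, human_annotated_transitions):
    # Combine agent-collected transitions (~500,000 in the paper) with
    # human-annotated ones from open datasets (~132,000) into one training set.
    return list(agent_transitions) + list(human_annotated_transitions)
```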

The researchers then designed a pipeline that retrieves videos from platforms such as YouTube and runs them through the IDM to generate high-quality trajectories. The IDM takes in consecutive video frames and determines the actions (scroll, click) that caused the changes in the environment, which are then packaged into annotated trajectories. Using this method, they generated 53,125 trajectories with high-accuracy action labels.
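A simplified sketch of that retrieval-and-labeling step is shown below; the `idm.predict` interface and the trajectory dictionary layout are hypothetical, used only to illustrate the flow from frames to an annotated trajectory.

```python
# Sketch of the video-to-trajectory step: run a trained IDM over pairs of
# consecutive frames and collect the predicted actions into a trajectory.
# The `idm.predict` interface and dictionary layout are hypothetical.
def video_to_trajectory(frames, idm, task_description: str) -> dict:
    trajectory = {"task": task_description, "steps": []}
    for frame_t, frame_t1 in zip(frames, frames[1:]):
        # Predict the UI action (e.g. scroll, click) that explains the
        # change between the two frames.
        action = idm.predict(frame_t, frame_t1)
        trajectory["steps"].append({"observation": frame_t, "action": action})
    return trajectory
```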

These examples can be used to train effective computer use models for specific tasks. But the researchers also found that trajectories extracted through IDM can serve as in-context learning examples to improve the performance of CUAs on bespoke tasks at inference time. For ICL, they use Gemini 2.5 Flash to add additional reasoning annotations to the observation/action examples in the trajectories, which can then be inserted into the CUA agent’s prompt (usually 3-5 examples) during inference.
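A minimal sketch of how those examples could be assembled into a prompt follows; `annotate_with_reasoning` stands in for the annotation call (the article mentions Gemini 2.5 Flash being used for this), and the prompt wording is illustrative.

```python
# Sketch of using IDM-extracted trajectories as in-context examples:
# annotate each step with a short rationale, then prepend a handful of
# worked examples (the article mentions 3-5) to the agent's prompt.
# `annotate_with_reasoning` is a stand-in for the annotation step.
def build_icl_prompt(task: str, trajectories, annotate_with_reasoning, k: int = 3) -> str:
    examples = []
    for traj in trajectories[:k]:
        annotated = annotate_with_reasoning(traj)   # adds a "reasoning" field per step
        lines = [f"Task: {annotated['task']}"]
        for step in annotated["steps"]:
            lines.append(f"Reasoning: {step['reasoning']}")
            lines.append(f"Action: {step['action']}")
        examples.append("\n".join(lines))
    return (
        "You are a computer-use agent. Study the worked examples, then "
        "complete the new task.\n\n"
        + "\n\n".join(examples)
        + f"\n\nNew task: {task}"
    )
```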

“This dual role (training and in-context guidance) enables flexible integration with both open-source models and general-purpose agents,” the researchers write.

W&L in action

To test the usefulness of W&L, the researchers ran a series of experiments with closed and open source models on the OSWorld benchmark, which evaluates agents in real desktop and operating system environments across different tasks, including productivity, programming and design.

For fine-tuning, they used their corpus of 53,000 trajectories to train two open source models: UI-TARS-1.5, a strong, open source vision-language-action model designed specifically for computer use, and Qwen 2.5-VL, an open-weight multimodal LLM. 

For in-context learning tests, they applied W&L examples to general-purpose multimodal models such as Gemini 2.5 Flash, OpenAI o3 and Claude Sonnet 4. 

W&L resulted in improvements on OSWorld in all model categories, including up to 3 points for ICL on general-purpose models and up to 11 points for fine-tuned open-source models.

More importantly, these benefits were achieved without any manual annotation, “demonstrating that web-scale human workflows can serve as a practical and scalable foundation for advancing CUAs towards real-world deployment,” the researchers write.

This could have important implications for real-world applications, enabling enterprises to turn their existing corpora of videos and conference recordings into training data for CUAs. It also makes it easier to generate new training trajectories: all you need to do is record videos of different tasks being performed and have them annotated by an IDM. And with frontier models constantly improving and becoming cheaper, you can expect to get more from your existing data as the field continues to progress.

