Tree-Guided Preference Optimization: Improving the Efficiency of Reinforcement Learning for Web Agents

cs.AI updates on arXiv.org, September 18

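This paper proposes Tree-Guided Preference Optimization (TGPO), an offline reinforcement learning framework that addresses credit assignment, high annotation cost, and reward sparsity in reinforcement learning for Web agents. By merging semantically identical states, automatically generating fine-grained rewards, and applying a dynamic weighting mechanism, TGPO significantly outperforms existing methods on the Online-Mind2Web and C-WebShop datasets.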

arXiv:2509.14172v1 Announce Type: cross Abstract: With the rapid advancement of large language models and vision-language models, employing large models as Web Agents has become essential for automated web interaction. However, training Web Agents with reinforcement learning faces critical challenges, including credit assignment misallocation, prohibitively high annotation costs, and reward sparsity. To address these issues, we propose Tree-Guided Preference Optimization (TGPO), an offline reinforcement learning framework built on a tree-structured trajectory representation that merges semantically identical states across trajectories to eliminate label conflicts. Our framework incorporates a Process Reward Model that automatically generates fine-grained rewards through subgoal progress, redundancy detection, and action verification. Additionally, a dynamic weighting mechanism prioritizes high-impact decision points during training. Experiments on Online-Mind2Web and our self-constructed C-WebShop datasets demonstrate that TGPO significantly outperforms existing methods, achieving higher success rates with fewer redundant steps.
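To make the tree-merging idea concrete, here is a minimal Python sketch. It is not the authors' implementation, and the abstract does not specify any of these details. Everything in it is an assumption for illustration: the `normalize` function stands in for semantic state matching with simple case and whitespace folding, an action's value is taken to be the best final outcome of any trajectory passing through it, and the value gap between two actions at the same merged state serves as a rough proxy for the dynamic weight. The Process Reward Model is not modeled at all.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Step = Tuple[str, str]                 # (state description, action taken)
Trajectory = Tuple[List[Step], float]  # (steps, final outcome: 1.0 success, 0.0 failure)


@dataclass
class Node:
    """One merged state in the trajectory tree."""
    state_key: str
    children: Dict[str, "Node"] = field(default_factory=dict)    # next state key -> node
    action_value: Dict[str, float] = field(default_factory=dict)  # action -> best outcome


def normalize(state: str) -> str:
    """Hypothetical stand-in for semantic matching: fold case and whitespace."""
    return " ".join(state.lower().split())


def build_tree(trajectories: List[Trajectory]) -> Node:
    """Merge trajectories that share a prefix of (normalized) states."""
    root = Node(state_key="<root>")
    for steps, outcome in trajectories:
        node = root
        for state, action in steps:
            key = normalize(state)
            node = node.children.setdefault(key, Node(state_key=key))
            # Assumption: an action is as good as the best trajectory through it,
            # so a shared early step is not penalized for a later mistake.
            node.action_value[action] = max(node.action_value.get(action, 0.0), outcome)
    return root


def preference_pairs(root: Node) -> List[Tuple[str, str, str, float]]:
    """Return (state, preferred action, rejected action, weight) at each branch point."""
    pairs = []
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children.values())
        if len(node.action_value) < 2:
            continue  # no competing actions were observed at this state
        best_action, best_value = max(node.action_value.items(), key=lambda kv: kv[1])
        for action, value in node.action_value.items():
            if value < best_value:
                # Assumption: the value gap is used as the training weight.
                pairs.append((node.state_key, best_action, action, best_value - value))
    return pairs


demo = [
    ([("Search page", "type 'red shoes'"),
      ("Results page", "click item 1"),
      ("Item page", "buy")], 1.0),
    ([("search  PAGE", "type 'red shoes'"),
      ("results page", "click ad")], 0.0),
]
tree = build_tree(demo)
pairs = preference_pairs(tree)
print(pairs)
```

In this toy run, both trajectories share the search step, so that step receives a single label instead of one positive and one negative label. The only preference pair comes from the results page, where the two trajectories actually diverge: `click item 1` is preferred over `click ad` with weight 1.0.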
