Sandwiched Policy Gradient Improves dLLM Performance

cs.AI updates on arXiv.org, October 13, 12:14

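This paper proposes the Sandwiched Policy Gradient (SPG) method to address the challenges of applying reinforcement learning to diffusion large language models (dLLMs). By using both an upper and a lower bound of the true log-likelihood, SPG significantly improves the accuracy of dLLMs.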

arXiv:2510.09541v1 (Announce Type: cross)

Abstract: Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in GSM8K, 2.6% in MATH500, 18.4% in Countdown and 27.0% in Sudoku.
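To make the "sandwich" idea concrete, here is a minimal sketch of one way a lower and an upper bound on the log-likelihood could be combined into a surrogate objective. The abstract does not spell out the combination rule, so the rule below is an assumption: use the lower bound for samples with non-negative advantage and the upper bound for samples with negative advantage. Under that rule every term under-estimates `advantage * log_prob`, so the surrogate never exceeds the true policy-gradient objective. The function name `sandwiched_surrogate` and its parameters are hypothetical, and the bound values are taken as given rather than computed from a model.

```python
from typing import Sequence


def sandwiched_surrogate(
    advantages: Sequence[float],
    lower_bounds: Sequence[float],
    upper_bounds: Sequence[float],
) -> float:
    """Return a lower bound on mean(A_i * log p_i).

    Hypothetical helper (not taken from the paper): for A_i >= 0 it uses
    the lower bound on log p_i, and for A_i < 0 it uses the upper bound,
    so each term is at most A_i * log p_i.
    """
    if not (len(advantages) == len(lower_bounds) == len(upper_bounds)):
        raise ValueError("all inputs must have the same length")
    if len(advantages) == 0:
        raise ValueError("inputs must be non-empty")

    total = 0.0
    for adv, lo, hi in zip(advantages, lower_bounds, upper_bounds):
        if lo > hi:
            raise ValueError("lower bound exceeds upper bound")
        # Pick whichever bound makes adv * bound an under-estimate.
        total += adv * (lo if adv >= 0 else hi)
    return total / len(advantages)


if __name__ == "__main__":
    # Toy numbers, chosen only for illustration.
    advantages = [1.0, -1.0]
    lower = [-3.0, -4.0]   # e.g. ELBO-style lower bounds on log p
    upper = [-1.0, -2.0]   # upper bounds on log p
    print(sandwiched_surrogate(advantages, lower, upper))
```

By contrast, plugging only the lower bound into both terms would give `(1.0 * -3.0 + -1.0 * -4.0) / 2`, which is no longer guaranteed to under-estimate the true objective for the negative-advantage sample. That is the kind of one-sided bias the abstract attributes to ELBO-only surrogates.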

Related tags

Sandwiched Policy Gradient, dLLMs, reinforcement learning, performance improvement, language models