arXiv:2510.23148v1 Announce Type: cross
Abstract: Deep reinforcement learning agents often struggle when tasks require understanding both vision and language. Conventional architectures typically isolate perception (for example, CNN-based visual encoders) from decision-making (policy networks). This separation can be inefficient, since the policy's failures do not directly help the perception module learn what is important. To address this, we implement the Perception-Decision Interleaving Transformer (PDiT) architecture introduced by Mao et al. (2023), a model that alternates between perception and decision layers within a single transformer. This interleaving allows feedback from decision-making to refine perceptual features dynamically. In addition, we integrate a contrastive loss inspired by CLIP to align textual mission embeddings with visual scene features. We evaluate the PDiT encoders on the BabyAI GoToLocal environment and find that the approach achieves more stable rewards and stronger alignment compared to a standard PPO baseline. The results suggest that interleaved transformer encoders are a promising direction for developing more integrated autonomous agents.
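
The abstract describes two mechanisms: a transformer backbone that alternates perception and decision layers, and a CLIP-style contrastive loss aligning mission-text embeddings with visual scene embeddings. The sketch below is a minimal PyTorch illustration of those two ideas only; it is not the authors' implementation, and all module names, dimensions, tokenization choices, and the pooling scheme are assumptions made for illustration. The key design point it captures is that, because perception and decision layers sit in one backbone, gradients from the policy and value heads flow back through the decision layers into the perception layers.

```python
# Minimal sketch (not the paper's code) of (1) interleaved perception/decision
# transformer layers and (2) a CLIP-style contrastive alignment loss.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InterleavedPDiT(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_blocks=3, n_actions=7):
        super().__init__()

        def layer():
            return nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
            )

        # Each block = one perception layer followed by one decision layer.
        self.perception_layers = nn.ModuleList([layer() for _ in range(n_blocks)])
        self.decision_layers = nn.ModuleList([layer() for _ in range(n_blocks)])
        self.policy_head = nn.Linear(d_model, n_actions)
        self.value_head = nn.Linear(d_model, 1)

    def forward(self, obs_tokens, text_tokens):
        # obs_tokens: (B, N_obs, d_model) visual grid/patch embeddings
        # text_tokens: (B, N_txt, d_model) mission instruction embeddings
        x = torch.cat([obs_tokens, text_tokens], dim=1)
        for perceive, decide in zip(self.perception_layers, self.decision_layers):
            x = perceive(x)  # refine perceptual features
            x = decide(x)    # decision layer; its gradients also shape perception
        pooled = x.mean(dim=1)
        return self.policy_head(pooled), self.value_head(pooled)


def clip_style_alignment_loss(scene_emb, mission_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch: matching (scene, mission) pairs are positives."""
    scene = F.normalize(scene_emb, dim=-1)
    mission = F.normalize(mission_emb, dim=-1)
    logits = scene @ mission.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(scene.size(0), device=scene.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In a training loop, this alignment loss would be added to the usual PPO objective with some weighting coefficient; the abstract does not specify that coefficient or how the scene and mission embeddings are pooled, so those choices are left open here.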
