Artificial Fintelligence · September 25
Trends in Artificial Intelligence and Reinforcement Learning

 

This article examines the current state and future direction of reinforcement learning (RL) in artificial intelligence. It notes that, although RL has achieved striking results in games, real-world applications still face challenges such as reward-signal quality and the exploration/exploitation balance. Large language models (LLMs) open new possibilities for RL but also make multimodal integration harder. The article argues that future research should focus on improving reward prediction and exploration efficiency so that RL can be applied across a broader range of domains.

🔬 Reinforcement learning has made breakthrough progress in games (e.g., AlphaGo, AlphaZero), but practical applications still suffer from low-quality reward signals and inefficient exploration.

💡 Large language models (LLMs) give reinforcement learning a new data foundation and stronger generalization, but multimodal integration degrades text-only performance, adding technical difficulty.

🚀 Future research should focus on improving reward prediction, using pretrained models and verifiable reward mechanisms to optimize RL policies for more efficient exploration and exploitation.

🔄 Balancing exploration and exploitation is the core challenge of reinforcement learning; techniques such as Monte Carlo tree search (MCTS) will be needed to improve exploration efficiency and avoid getting stuck in local optima.

🤝 Research labs should increase investment in fundamental reinforcement learning research, especially breakthroughs on the exploration problem, to drive practical RL applications across broader domains.

A disclaimer: nothing that I say here represents any organization other than Artificial Fintelligence. These are my views, and mine alone, although I hope that you share them after reading.

Frontier labs are spending, in the aggregate, hundreds of millions of dollars annually on data acquisition, which has spawned a number of startups selling data to them (Mercor, Scale, Surge, etc.). That novel data, combined with reinforcement learning (RL) techniques, represents the clearest avenue to improvement, and to AGI. I am firmly convinced that scaling up RL will lead to excellent products and, eventually, AGI. A primary source of improvement over the last decade has been scale, as the industry has discovered one method after another for converting money into intelligence. First, bigger models. Then, more data (thereby making Alexandr Wang very rich). And now, RL.


RL is the subfield of machine learning that studies algorithms which discover new knowledge. Reinforcement learning agents take actions in environments to systematically discover the optimal strategy (called a policy). An example environment is Atari: the agent can take actions in the game (moving in different directions, pressing the "fire" button) and receives a scalar reward signal that it wants to maximize (the score). Without being given any data on how to play, RL algorithms are able to discover policies which achieve optimal scores in most Atari games.
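To make that loop concrete, here is a minimal sketch of the agent/environment interaction cycle. It assumes the gymnasium package is installed and uses CartPole with a random policy purely as a stand-in for an Atari game and a learned policy.

```python
import gymnasium as gym  # assumed dependency; any environment with the same API works

env = gym.make("CartPole-v1")           # the environment (stand-in for an Atari game)
obs, info = env.reset(seed=0)

episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()  # placeholder policy: act randomly
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward            # scalar reward signal the agent wants to maximize
    done = terminated or truncated

print(f"episode return: {episode_return}")
env.close()
```

An RL algorithm's job is to replace that random `action` with one chosen by a policy that it improves from the reward stream alone.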

The key problem in RL is the exploration/exploitation tradeoff. At each point where the agent is asked to choose an action, it has to decide between taking the action it currently thinks is best ("exploiting") or trying a new action which might be better ("exploring"). This is an extremely difficult decision to get right. Consider a complicated game like Starcraft, or Dota. For any individual situation that the agent is in, how can we know what the optimal action is? It is only after making an entire game's worth of decisions that we can know whether our strategy is sound, and only after playing many games that we can conclude how good we are in comparison to other players.
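As a toy illustration of the tradeoff (not anything a frontier lab actually runs), here is an epsilon-greedy bandit: with probability epsilon the agent explores a random arm, otherwise it exploits its current best estimate. The arm means and epsilon value are invented for the example.

```python
import random

def epsilon_greedy_bandit(true_means, steps=10_000, epsilon=0.1, seed=0):
    """Estimate each arm's value online; explore with probability epsilon, else exploit."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms          # pulls per arm
    estimates = [0.0] * n_arms     # running mean reward per arm
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:                       # explore: try a random arm
            arm = rng.randrange(n_arms)
        else:                                            # exploit: current best estimate
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], 1.0)         # noisy reward
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    return estimates, total_reward

estimates, total = epsilon_greedy_bandit([0.2, 0.5, 0.9])
print(estimates)   # should approach [0.2, 0.5, 0.9]
```

Even in this trivial setting, the choice of epsilon matters: too little exploration and the agent can lock onto a mediocre arm; too much and it wastes pulls on arms it already knows are bad. Games like Starcraft compound this with delayed, noisy feedback.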

Large language models help significantly here, as they are much, much more sample efficient because they have incredibly strong priors. By encoding a significant fraction of human knowledge, the models are able to behave well in a variety of environments before they’ve actually received any training data.

When it comes to language modelling, most use of RL to date has been for RLHF, which is primarily a tool for behaviour modification. As there is (typically) no live data involved, RLHF isn't "real" RL and does not face the exploration/exploitation tradeoff, nor does it allow for the discovery of new knowledge.

Knowledge discovery is the main unsolved problem in modern machine learning. While we've become proficient at supervised learning, we haven't yet cracked the code on how to systematically discover new knowledge, especially superhuman knowledge. For AlphaStar, for instance, DeepMind spent an enormous amount of compute discovering new policies, as it is an extraordinarily hard problem to discover good strategies in Starcraft without prior knowledge.

Therein lies the rub: RL is simultaneously the most promising and most challenging approach we have. DeepMind invested billions of dollars in RL research with little commercial success to show for it (the Nobel prize, for instance, was for AlphaFold, which didn't use RL). While RL is often the only solution for certain hard problems, it is notoriously difficult to implement effectively. Consider a game with discrete turns, like Chess or Go. In Go, you have on average 250 different choices at each turn, and the game lasts for 150 moves. Consequently, the game tree has approximately 250^150 nodes, or ~10^360. When searching randomly (which is how many RL algorithms explore), it is exceedingly difficult to find a reasonable trajectory in the game, which is why AlphaZero-style self-play is needed, or an AlphaGo-style supervised learning phase. When we consider the LLM setting, in which typical vocabulary sizes are in the tens to hundreds of thousands of tokens and sequence lengths can be in the tens to hundreds of thousands, the problem is made much worse. The result is a situation where RL is both necessary and yet should be considered a last resort.
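The arithmetic behind those numbers is easy to check, and it makes clear why the LLM setting is so much worse. The vocabulary and sequence-length figures below are just the rough orders of magnitude from the paragraph above, not measurements.

```python
import math

# Go: ~250 legal moves per turn over ~150 turns
go_log10 = 150 * math.log10(250)
print(f"Go game tree: ~10^{go_log10:.0f} nodes")               # ~10^360

# LLM "game": ~100k-token vocabulary, ~10k-token generations (rough orders of magnitude)
llm_log10 = 10_000 * math.log10(100_000)
print(f"LLM trajectory space: ~10^{llm_log10:.0f} sequences")  # ~10^50000
```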

Put differently, one way to think of deep learning is that it's all about learning a good, generalizable function approximation. In deep RL, we are approximating a value function, i.e. a function that tells us exactly how good or how bad a given state of the world would be. To improve the accuracy of the value function, we need to receive rewards that actually vary across trajectories. If every trajectory receives the same reward (and it's a really bad one), we can't learn anything. Consider a coding assistant, like Cursor's newly released background agent. One way to train the agent would be to give it a reward of 1 if it returns code which is merged into a pull request, and 0 otherwise. If you took a randomly initialized network, it would output gibberish and would thus always receive a signal of 0. Once you get a model that is actually good enough to sometimes be useful to users, you can start getting meaningful signal and rapidly improve.
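A small simulation makes the point: with a sparse merged-or-not reward, a randomly initialized policy sees nothing but zeros, so any reward-weighted update (e.g. a REINFORCE-style gradient) is exactly zero. The success probabilities below are invented purely for illustration.

```python
import random

def rollout_rewards(success_prob: float, n: int, seed: int = 0) -> list:
    """Simulate n attempts by the agent; reward is 1 only if the code gets merged."""
    rng = random.Random(seed)
    return [1.0 if rng.random() < success_prob else 0.0 for _ in range(n)]

# A randomly initialized policy essentially never produces mergeable code...
random_init = rollout_rewards(success_prob=1e-9, n=10_000)
print(sum(random_init))                        # 0.0 -> every reward is identical, no signal

# ...whereas a model that is already sometimes useful gets rewards that vary.
decent_model = rollout_rewards(success_prob=0.05, n=10_000)
print(sum(decent_model) / len(decent_model))   # ~0.05 -> reward-weighted updates are non-zero
```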

As an illustrative example, I have a friend who works at a large video game publisher doing RL research for games (think: EA, Sony, Microsoft, etc.). He consults with teams at the publisher's studios that want to use RL. Despite being an experienced practitioner with more than two decades of RL experience, his first response is usually to ask whether they've tried everything else, because it's so difficult to get RL to work in practical settings.

The great question with reinforcement learning and language models is whether or not we'll see results transfer to other domains, as we have seen with next-token prediction. The great boon of autoregressive language models has been that they generalize well: you can train a model to predict the next token and it learns to generate text that is useful in a number of other situations. It is absolutely not clear whether that will be the case with models trained largely with RL, as RL policies tend to be overly specialized to the exact problem they were trained on. AlphaZero notoriously had problems with catastrophic forgetting; a paper that I wrote while at DeepMind showed that simple exploits existed which could consistently beat AlphaZero. This has been replicated consistently in a number of other papers. To get around this, many RL algorithms require repeatedly revisiting the training data via replay buffers, which is awkward and unwieldy.
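For reference, the replay-buffer workaround amounts to something like the following generic sketch (not AlphaZero's actual implementation): keep a window of past transitions and mix them back into every gradient step so the policy keeps seeing old data it would otherwise forget.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal FIFO replay buffer: old transitions are resampled during training
    so the policy does not drift away from (and forget) earlier experience."""

    def __init__(self, capacity: int = 100_000, seed: int = 0):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off the end
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniformly resample past transitions; each training batch mixes old and new data.
        return self.rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```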

With LLMs, this is a major problem. Setting aside RL, in the open research space we see a lot of VLMs that are trained separately from their LLM equivalents. DeepSeek-VL2 is a separate family of models from V3, which is text-only, despite all the major closed-source models accepting multimodal inputs. The main reason for the separation is that, in the published literature, adding multimodal capabilities to LLMs sacrifices pure text performance. When we add RL, we should expect the problem to become much worse, and more research to be dedicated to improving the inherent tradeoffs here.

In my experience as a practitioner, RL lives or dies based on the quality of the reward signal. One of the most able RL practitioners that I know, Adam White, begins all of his RL projects by first learning to predict the reward signal, and only then tries to optimize it (first predict, then control). Systems that are optimizing complex, overfit reward models will struggle. Systems like the Allen Institute's Tulu 3, which used verifiable rewards to do RL, seem like the answer, and provide motivation for the hundreds of millions of dollars that the frontier labs are spending on acquiring data.
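In the spirit of the verifiable rewards used by Tulu 3 (this is a sketch of the general idea, not their implementation), a verifiable reward is simply one you can check with a program instead of a learned reward model, for example an exact match on a math answer:

```python
def verifiable_math_reward(model_answer: str, reference_answer: str) -> float:
    """Reward computed by a program (exact match on the final answer) rather than
    scored by a learned, and potentially overfit, reward model."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

print(verifiable_math_reward(" 42 ", "42"))   # 1.0
print(verifiable_math_reward("41", "42"))     # 0.0
```

Because the check is deterministic, the policy cannot improve its score by exploiting quirks of a reward model; it has to actually get the answer right.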

The development of AlphaGo illustrates this paradox perfectly:

    It was bootstrapped with supervised learning on human expert games, which could only ever approach human-level play

    Surpassing humans required discovering new knowledge through self-play RL, at enormous compute cost

    AlphaGo Zero eventually dropped the human data entirely and relied on RL alone

We're now facing a similar situation with language models:

    We've largely exhausted the easily accessible training data

    We need to discover new knowledge to progress further

    For superhuman knowledge in particular, we can't rely on human supervision by definition

    RL appears to be the only framework general enough to handle this challenge

In short, this is a call for research labs to start investing in fundamental RL research again, and in particular, on finally making progress on the exploration problem.


[1] I actually can't think of any successful applications of MCTS to solve real world problems. Other than the AlphaGo/AlphaZero/MuZero line of work, it doesn't seem to have led to anything, which 2017 Finbarr would have found extremely surprising.
