How AI agents understand and help principals who are uncertain of their own needs

This article explores a core question in AI safety research: how can an agent help a principal without fully knowing the principal's intentions? The research direction centers on inverse reinforcement learning (IRL), in which the agent learns the principal's reward function (their goals or preferences) by observing the principal's behavior. The article introduces the basic framework of the Markov decision process and distinguishes ordinary reinforcement learning from inverse reinforcement learning. It particularly emphasizes cooperative inverse reinforcement learning (CIRL), in which the agent not only learns the reward function but actively helps the principal achieve their goals. It also discusses the Boltzmann rationality model of boundedly rational behavior, as well as concepts such as assistance games. Finally, the author presents the current research focus, equivocal principals: principals who are themselves uncertain or mistaken about their goals. The article asks how an agent can help effectively in that situation, including learning over multiple candidate reward functions and coping with a principal's permanent uncertainty about the reward function.

🤖 **The challenge of understanding the principal's intent:** How an AI agent can help effectively without fully knowing what its principal wants is a key question in AI safety research. This is typically approached through inverse reinforcement learning (IRL), in which the agent infers the principal's underlying reward function, i.e. the principal's goals and preferences, by observing their behavior.

🤝 **Why cooperative inverse reinforcement learning (CIRL) matters:** CIRL further assumes that the agent not only learns the reward function but actively cooperates with the principal in pursuit of a shared goal. This matters especially when the principal's goal is complex or hard to articulate, as when a baker and their assistant work together to make good pies. This cooperative model supports a more effective working relationship between agent and principal.

🤔 **Uncertainty and bounded rationality:** The article introduces Boltzmann rationality to describe the limited, error-prone decision-making that agents (human and AI alike) may exhibit. The model accepts that agents do not always choose optimally; instead they pick suboptimal actions with probability that falls off exponentially in the regret of those actions, which gives a more realistic lens for understanding and predicting behavior.

💡 **The open challenge: equivocal principals:** The current research focus is on equivocal principals, i.e. principals who are themselves uncertain or mistaken about their reward function (their goals). This contrasts with classical CIRL, which assumes the principal knows their goal exactly. The research asks how to help such principals, and how their uncertainty affects the agent's learning and the reward ultimately obtained.

🚀 **Strategies for coping with uncertainty:** For equivocal principals, the article suggests several candidate strategies: avoid teaching the agent to optimize too sharply; practice epistemic humility while teaching; try to teach the agent a probability distribution over reward functions ("distributional learning") rather than a single point estimate; and try to teach a compromise reward function that incorporates several best guesses.

Published on October 28, 2025 4:11 AM GMT

(With thanks to @Lucie Philippon and @johnswentworth for encouragement and feedback, and @DusanDNesic and @tom4everitt for the meat of the topic, among many. Yes, this is about my research direction at PIBBSS for Summer 2025; it's a little incomplete, because I wrote it before I got my main result.)

You may rightly wonder what else I’m up to, if it’s not just blogging and the occasional adventure. The short answer is that I spend most of my time and mental energy taking part in a summer research fellowship in AI safety called PIBBSS, where my research direction has to do with questions like “how does an agent help a principal, even if the agent doesn’t know what the principal wants?” and “how much harder is it for the principal to teach the agent what they want, and how much worse are the outcomes, if the principal doesn’t know for sure what they want?”.

The long(ish) answer is this post. I’ll avoid jargon, mathematical notation, and recommendations to read other long papers, but some amount of that will be unavoidable.

Here’s the basic setup - called a “Markov decision process”, sometimes with adjectives like “decentralized” or “partially observable” - which forms the ground assumption for the entire subfield my research direction lies in, called “inverse reinforcement learning”: we have a principal and an agent, each of which can pick actions from some fixed and connotationally discrete set, plus some fixed reward function. All of these live in some world, which we represent as some (usually simple) state, along with a transition rule that takes in the state of the world and both players’ actions and returns a new state, possibly involving some randomness. The reward function I mentioned earlier takes in the same state of the world before the players act, the actions of the players, and the outcome state that the transition function gave us, and outputs a real number, also potentially involving some randomness; we often break the reward function down into a generic reward function and a learnable reward parameter that supplies the detail for how to modify the generic reward function, or what additional value to plug in, to get the “real” reward function.

By “inverse reinforcement learning” (as opposed to just plain “reinforcement learning”), we mean that our problem somehow involves the agent learning the principal’s reward function - or equivalently, their reward parameter - from a starting state of relative ignorance, and doing it purely through observation of the presumed-competent principal, or maybe by asking questions of the principal, rather than having the principal tell them. This is particularly valuable when we don’t have a good way of writing down, let alone communicating, what the reward function should be - which is often the case in both AI safety and ordinary life. (I often say that if anything in agent foundations - the approach to AI safety that starts by asking “what will any mind do here - not just a system of the most modern machine learning paradigm, but any reasonably capable mind at all?” - is to work, then it must work just as well for two humans as for a human and an AI.)

As a simple example of inverse reinforcement learning, we might have a principal navigating a room full of obstacles and a goal location, and an agent watching them who can’t see the goal location and might not even know that the reward function involves approaching a goal - early on, they might have guesses like “jump over as many red blocks as possible” or “never turn left” - but who by the end will hopefully have a strong understanding of why the principal is taking the actions they do.
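To make that gridworld example concrete, here is a minimal Python sketch - everything in it (the grid size, the noise model, the candidate goals) is invented for illustration, not taken from any particular paper - of an agent doing inverse reinforcement learning by Bayesian updating: it watches the principal move and reweights each candidate goal by how well that goal explains the observed actions.

```python
# A tiny 5x5 gridworld. The "reward parameter" is just which cell is the goal;
# the generic reward function is "stand on the goal cell".
GRID = 5
CANDIDATE_GOALS = [(x, y) for x in range(GRID) for y in range(GRID)]
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    """Transition rule: move one cell, clipped to the grid (deterministic here)."""
    dx, dy = ACTIONS[action]
    x, y = state
    return (min(GRID - 1, max(0, x + dx)), min(GRID - 1, max(0, y + dy)))

def distance(a, b):
    """Manhattan distance, used both by the principal and in the agent's model of them."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def principal_policy(state, goal):
    """A competent principal: pick an action that moves them closer to their goal."""
    return min(ACTIONS, key=lambda a: distance(step(state, a), goal))

def likelihood(state, action, goal, eps=0.1):
    """The agent's model: how probable is this action if `goal` were the true parameter?
    Mostly moves that reduce distance to the goal, with a little noise for everything else."""
    closer = [a for a in ACTIONS if distance(step(state, a), goal) < distance(state, goal)]
    if action in closer:
        return (1 - eps) / len(closer)
    return eps / len(ACTIONS)

# The agent starts in relative ignorance: a uniform prior over every candidate goal.
posterior = {g: 1.0 / len(CANDIDATE_GOALS) for g in CANDIDATE_GOALS}

true_goal = (4, 2)   # the principal's reward parameter, hidden from the agent
state = (0, 0)
for _ in range(8):
    action = principal_policy(state, true_goal)
    # Bayesian update: reweight each candidate goal by how well it explains the action.
    posterior = {g: p * likelihood(state, action, g) for g, p in posterior.items()}
    total = sum(posterior.values())
    posterior = {g: p / total for g, p in posterior.items()}
    state = step(state, action)

best_guess = max(posterior, key=posterior.get)
print("agent's best guess at the goal:", best_guess, "with posterior", round(posterior[best_guess], 2))
```

The same skeleton works for fancier reward parameters (“jump over red blocks”, “never turn left”); only the likelihood model changes.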

For an important and barely-tangential concept, we also talk about Boltzmann rationality as a model of bounded irrationality, like what a fallible human or fault-prone AI system might display. The concept is simply this: a Boltzmann-rational agent picks each action with probability proportional to the exponential of some multiple of its reward, so they will pick an incorrect action - the one with less reward, that sets them up incorrectly, whatever - with probability that falls off exponentially in the regret from picking it; the multiple is a parameter that says how close to optimal or rational the agent is, with larger numbers being better. For example, if a Boltzmann-rational agent could pick action A for 10 reward or action B for 0 reward, they will pick action B with probability 1/(1 + e^(10k)): if k = 0.1, we have a very irrational agent who will pick the wrong choice ~27% of the time, but if k = 0.5, they will pick the wrong action only ~0.7% of the time - the parameter makes a big difference! We care about models of irrationality because of the fallibility of any bounded agent, humans very much included, and because we care about the incentives that a principal’s irrationality sets up for an agent - even one that cares about the principal getting what they want! - to manipulate what the principal sees, knows, can do, or even wants. The Off-Switch Game is a toy model of this dynamic, but is out of scope for this post.
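As a sanity check on those numbers, here is a small Python sketch of Boltzmann-rational choice; the rationality parameter is called `beta` here, and the two-action example is just the one from the paragraph above.

```python
import math

def boltzmann_probabilities(rewards, beta):
    """Boltzmann-rational choice: each action is picked with probability
    proportional to exp(beta * reward), so higher beta means closer to optimal."""
    weights = [math.exp(beta * r) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

# Action A is worth 10 reward, action B is worth 0 (regret of 10 for picking B).
rewards = [10.0, 0.0]
for beta in (0.1, 0.5, 2.0):
    p_wrong = boltzmann_probabilities(rewards, beta)[1]
    print(f"beta = {beta}: picks the wrong action {p_wrong:.1%} of the time")
# beta = 0.1 gives roughly 27%, beta = 0.5 roughly 0.7%, beta = 2.0 essentially never.
```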

My research direction goes a little further into so-named cooperative inverse reinforcement learning, which additionally assumes that the agent actively wants to help the principal and that both principal and agent share a reward signal - both of them reap the outcome of their joint action. For example, the principal might be a baker who hasn’t tried many types of fruit, and the agent might be their sous-chef with a good understanding of flavor who the principal sends to market to pick up ingredients for pies; both of them will benefit equally from selling good pies. In one approach to this subject, called “assistance games”, we model the principal as part of the environment that the agent can ask questions of; the nice thing about doing so and removing the artificial boundary between training and deployment in the process is that we can prove that the agent has an incentive to ask its questions only as they become relevant, and to make helpful plans that try to preserve optionality in anticipation of future feedback. If you want to look more into the formal foundations of my work, go look up “partially observable Markov decision process”, but if you don’t know what those are, don’t worry about it for now - you just need to know that the way that the principal and agent make decisions without perfect knowledge of the consequences of their actions involves both using an observation function that looks at the actions that they both took and the state of the world afterwards and returns an observation that they can update on, and also introducing a discount parameter that represents the value of getting right answers - and metaphorically getting paid - as quickly as possible.
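Here is a toy sketch of the baker-and-sous-chef story under those assumptions - the fruit list, the rationality parameter, and the one-shot reward for buying the right fruit are all invented for illustration, and real assistance games are solved with far more sophisticated machinery than this. The principal demonstrates Boltzmann-rationally, the agent updates a belief over the reward parameter, and then the agent acts on the shared (and discounted) reward.

```python
import math
import random

# Toy CIRL setup: the reward parameter theta says which fruit makes the best pie.
FRUITS = ["apple", "cherry", "rhubarb"]
TRUE_THETA = "cherry"          # known to the principal (the baker), not the agent
BETA = 2.0                     # how Boltzmann-rational the principal's demonstrations are
DISCOUNT = 0.95                # shared reward is worth less the longer the pair takes

def shared_reward(fruit, theta):
    """Both principal and agent reap this: 1 for the right fruit, 0 otherwise."""
    return 1.0 if fruit == theta else 0.0

def principal_demonstration(theta, beta):
    """The baker picks a fruit to bake with, Boltzmann-rationally in the shared reward."""
    weights = [math.exp(beta * shared_reward(f, theta)) for f in FRUITS]
    total = sum(weights)
    return random.choices(FRUITS, weights=[w / total for w in weights])[0]

def update_belief(belief, observed_fruit, beta):
    """The sous-chef's Bayesian update: how likely was that demonstration under each theta?"""
    new_belief = {}
    for theta in FRUITS:
        weights = [math.exp(beta * shared_reward(f, theta)) for f in FRUITS]
        likelihood = math.exp(beta * shared_reward(observed_fruit, theta)) / sum(weights)
        new_belief[theta] = belief[theta] * likelihood
    total = sum(new_belief.values())
    return {theta: p / total for theta, p in new_belief.items()}

random.seed(0)
belief = {theta: 1.0 / len(FRUITS) for theta in FRUITS}   # uniform prior

# The agent watches a few demonstrations before being sent to market.
for _ in range(3):
    demo = principal_demonstration(TRUE_THETA, BETA)
    belief = update_belief(belief, demo, BETA)

# At market, the agent buys the fruit that maximizes expected shared reward under its belief.
purchase = max(FRUITS, key=lambda f: sum(belief[theta] * shared_reward(f, theta) for theta in FRUITS))
payoff = (DISCOUNT ** 3) * shared_reward(purchase, TRUE_THETA)   # discounted for the 3 turns spent watching
print("belief:", {k: round(v, 2) for k, v in belief.items()}, "-> buys", purchase, "for discounted reward", round(payoff, 3))
```

In a full assistance game the agent would also weigh whether watching (or asking) another question is worth the discounting it costs, which is where the incentive to ask questions only as they become relevant comes from.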

Now, cooperative inverse reinforcement learning - what I privately call helpfulness theory - is great, but it makes a few assumptions about the nature of the world that are much too strong. It assumes that principals’ reward functions are fixed, never changing over time with the passing of seasons or random environmental shocks, nor being influenced by their agents; it assumes that the state of the world is perfectly visible to both principal and agent; it assumes that the principal and agent always get perfectly accurate reward signal; and it assumes that the principal already knows the reward function for certain. The last of these is my current research direction: helpfulness theory with equivocal principals; that is, cooperative inverse reinforcement learning with principals who start off not knowing the reward function, though connotationally with better priors than the agent has.

We can break down equivocal principals into two major classes: inexperienced ones, who only need to put in sufficient time and effort to learn the true reward parameter to within any bound on degree of certainty or size of error, and irresolute ones, who are stuck with some meaningfully imperfect understanding of the reward parameter - be it because they get stuck in uncertainty, because they cannot approach closely enough to the ground truth, or because they simply don’t get enough time or effort to allocate to figuring the reward parameter out. The former case is much easier to tackle than the latter: all we need to instruct the principal to do is to start off by learning the reward function themself, and then - as appeals to any mathematician - we have reduced the problem to a previously solved case, that of classical CIRL, which has an effectively computable solution.

The latter case is much more difficult, and is the topic of my active research; the questions I find most relevant are “how much worse will the reward payoffs be, if the principal has bounded error in their guess at the reward parameter?” and “what methods can the principal use to cope with a permanent lack of certain knowledge of the reward parameter while educating an agent?”. My current hope is to get good regret bounds on reward in terms of the error in the principal’s understanding of the reward parameter, expressed as a distance between the ground truth reward parameter value and either the principal’s best guess or the credence-weighted expected value of their guesses; a paper about a pseudometric on reward space called “STARC”, which has some nice properties, has given me hope in that direction. My best guesses at some interestingly useful tactics for an irresolute principal to try include avoiding teaching their agent to optimize overly sharply, practicing epistemic humility in how strongly they teach and correct, trying to teach them a distribution - “distributional learning” - rather than a point estimate, and trying to teach some compromise reward function incorporating best guesses.
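To give a flavor of the kind of question I mean - this is not a result, just an invented toy with a linear reward model and a naive Euclidean distance, nothing like the STARC pseudometric - here is a sketch of measuring how much reward is lost as the reward parameter the principal teaches drifts away from the ground truth:

```python
import random

# Toy question: if the principal teaches the agent a slightly wrong reward parameter,
# how much reward do they lose? Here the reward parameter is a weight vector over
# features, the agent simply picks the option that looks best under the taught weights,
# and "regret" is measured against what the true weights would have earned.
# Everything below (features, options, noise model) is invented for illustration.

N_FEATURES = 3
N_OPTIONS = 5

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def regret_from_error(true_theta, taught_theta, options):
    """Reward lost because the agent optimizes the taught parameter, not the true one."""
    best_for_true = max(options, key=lambda o: dot(true_theta, o))
    chosen = max(options, key=lambda o: dot(taught_theta, o))
    return dot(true_theta, best_for_true) - dot(true_theta, chosen)

random.seed(1)
true_theta = [random.uniform(-1, 1) for _ in range(N_FEATURES)]
options = [[random.uniform(-1, 1) for _ in range(N_FEATURES)] for _ in range(N_OPTIONS)]

# Vary how far the principal's (permanently uncertain) estimate sits from the ground truth.
for error_scale in (0.0, 0.1, 0.5, 1.0):
    taught = [t + random.gauss(0, error_scale) for t in true_theta]
    error = sum((a - b) ** 2 for a, b in zip(taught, true_theta)) ** 0.5
    print(f"error ~ {error:.2f}: regret = {regret_from_error(true_theta, taught, options):.3f}")
```

A regret bound of the kind I am hoping for would control exactly this quantity as a function of that error term, ideally under a better-behaved distance on reward space than the naive one used here.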

All of this is still in flux; it’s terribly early stages for any of my thinking here, and I’ve only just got started on sketching out proofs. I just wanted to write something accessible to point people to when they ask me about what I’m working on. Hopefully it helps them understand my research direction, or at least puts them in a position to ask some good questions!


