How AI agents understand and help principals who are uncertain of their own needs

This article explores a core question in AI safety research: how can an agent help a principal without fully knowing the principal's intentions? The research direction centers on inverse reinforcement learning (IRL), in which the agent learns the principal's reward function (their goals or preferences) by observing the principal's behavior. The article introduces the basic framework of the Markov decision process and distinguishes ordinary reinforcement learning from inverse reinforcement learning. It particularly emphasizes cooperative inverse reinforcement learning (CIRL), in which the agent not only learns the reward function but actively helps the principal achieve their goals. It also discusses the Boltzmann rationality model of boundedly rational behavior, as well as concepts such as assistance games. Finally, the author presents the current research focus, equivocal principals: principals who are themselves uncertain or mistaken about their goals. The article asks how an agent can help effectively in that situation, including learning over multiple candidate reward functions and coping with a principal's permanent uncertainty about the reward function.

🤖 **The challenge of understanding the principal's intent:** How an AI agent can help effectively without fully knowing what its principal wants is a key question in AI safety research. This is typically approached through inverse reinforcement learning (IRL), in which the agent infers the principal's underlying reward function, i.e. the principal's goals and preferences, by observing their behavior.

🤝 **Why cooperative inverse reinforcement learning (CIRL) matters:** CIRL further assumes that the agent not only learns the reward function but actively cooperates with the principal in pursuit of a shared goal. This matters especially when the principal's goal is complex or hard to articulate, as when a baker and their assistant work together to make good pies. This cooperative model supports a more effective working relationship between agent and principal.

🤔 **Uncertainty and bounded rationality:** The article introduces Boltzmann rationality to describe the limited, error-prone decision-making that agents (human and AI alike) may exhibit. The model accepts that agents do not always choose optimally; instead they pick suboptimal actions with probability that falls off exponentially in the regret of those actions, which gives a more realistic lens for understanding and predicting behavior.

💡 **The open challenge: equivocal principals:** The current research focus is on equivocal principals, i.e. principals who are themselves uncertain or mistaken about their reward function (their goals). This contrasts with classical CIRL, which assumes the principal knows their goal exactly. The research asks how to help such principals, and how their uncertainty affects the agent's learning and the reward ultimately obtained.

🚀 **Strategies for coping with uncertainty:** For equivocal principals, the article suggests several candidate strategies: avoid teaching the agent to optimize too sharply; practice epistemic humility while teaching; try to teach the agent a probability distribution over reward functions ("distributional learning") rather than a single point estimate; and try to teach a compromise reward function that incorporates several best guesses.

Published on October 28, 2025 4:11 AM GMT

(With thanks to @Lucie Philippon and @johnswentworth for encouragement and feedback, and @DusanDNesic and @tom4everitt for the meat of the topic, among many. Yes, this is about my research direction at PIBBSS for Summer 2025; it's a little incomplete, because I wrote it before I got my main result.)

You may rightly wonder what else I’m up to, if it’s not just blogging and the occasional adventure. The short answer is that I spend most of my time and mental energy taking part in a summer research fellowship in AI safety called PIBBSS, where my research direction has to do with questions like “how does an agent help a principal, even if the agent doesn’t know what the principal wants?” and “how much harder is it for the principal to teach the agent what they want, and how much worse are the outcomes, if the principal doesn’t know for sure what they want?”.

The long(ish) answer is this post. I’ll avoid jargon, mathematical notation, and recommendations to read other long papers, but some amount of that will be unavoidable.

Here’s the basic setup - called a “Markov decision process”, sometimes with adjectives like “decentralized” or “partially observable” - which forms the ground assumption for the entire subfield my research direction lies in, called “inverse reinforcement learning”: we have a principal and an agent, each of which can pick actions from some fixed and connotationally discrete set, plus some fixed reward function. All of these live in some world, which we represent as some (usually simple) state, along with a transition rule that takes in the state of the world and both players’ actions and returns a new state, possibly involving some randomness. The reward function I mentioned earlier takes in the same state of the world before the players act, the actions of the players, and the outcome state that the transition function gave us, and outputs a real number, also potentially involving some randomness; we often break the reward function down into a generic reward function and a learnable reward parameter that supplies the detail for how to modify the generic reward function, or what additional value to plug in, to get the “real” reward function.

By “inverse reinforcement learning” (as opposed to just plain “reinforcement learning”), we mean that our problem somehow involves the agent learning the principal’s reward function - or equivalently, their reward parameter - from a starting state of relative ignorance, and doing it purely through observation of the presumed-competent principal, or maybe by asking questions of the principal, rather than having the principal tell them. This is particularly valuable when we don’t have a good way of writing down, let alone communicating, what the reward function should be - which is often the case in both AI safety and ordinary life. (I often say that if anything in agent foundations - the approach to AI safety that starts by asking “what will any mind do here - not just a system of the most modern machine learning paradigm, but any reasonably capable mind at all?” - is to work, then it must work just as well for two humans as for a human and an AI.)

As a simple example of inverse reinforcement learning, we might have a principal navigating a room full of obstacles and a goal location, and an agent watching them who can’t see the goal location and might not even know that the reward function involves approaching a goal - early on, they might have guesses like “jump over as many red blocks as possible” or “never turn left” - but who by the end will hopefully have a strong understanding of why the principal is taking the actions they do.
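To make that gridworld example concrete, here is a minimal Python sketch - everything in it (the grid size, the noise model, the candidate goals) is invented for illustration, not taken from any particular paper - of an agent doing inverse reinforcement learning by Bayesian updating: it watches the principal move and reweights each candidate goal by how well that goal explains the observed actions.

```python
# A tiny 5x5 gridworld. The "reward parameter" is just which cell is the goal;
# the generic reward function is "stand on the goal cell".
GRID = 5
CANDIDATE_GOALS = [(x, y) for x in range(GRID) for y in range(GRID)]
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    """Transition rule: move one cell, clipped to the grid (deterministic here)."""
    dx, dy = ACTIONS[action]
    x, y = state
    return (min(GRID - 1, max(0, x + dx)), min(GRID - 1, max(0, y + dy)))

def distance(a, b):
    """Manhattan distance, used both by the principal and in the agent's model of them."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def principal_policy(state, goal):
    """A competent principal: pick an action that moves them closer to their goal."""
    return min(ACTIONS, key=lambda a: distance(step(state, a), goal))

def likelihood(state, action, goal, eps=0.1):
    """The agent's model: how probable is this action if `goal` were the true parameter?
    Mostly moves that reduce distance to the goal, with a little noise for everything else."""
    closer = [a for a in ACTIONS if distance(step(state, a), goal) < distance(state, goal)]
    if action in closer:
        return (1 - eps) / len(closer)
    return eps / len(ACTIONS)

# The agent starts in relative ignorance: a uniform prior over every candidate goal.
posterior = {g: 1.0 / len(CANDIDATE_GOALS) for g in CANDIDATE_GOALS}

true_goal = (4, 2)   # the principal's reward parameter, hidden from the agent
state = (0, 0)
for _ in range(8):
    action = principal_policy(state, true_goal)
    # Bayesian update: reweight each candidate goal by how well it explains the action.
    posterior = {g: p * likelihood(state, action, g) for g, p in posterior.items()}
    total = sum(posterior.values())
    posterior = {g: p / total for g, p in posterior.items()}
    state = step(state, action)

best_guess = max(posterior, key=posterior.get)
print("agent's best guess at the goal:", best_guess, "with posterior", round(posterior[best_guess], 2))
```

The same skeleton works for fancier reward parameters (“jump over red blocks”, “never turn left”); only the likelihood model changes.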

For an important and barely-tangential concept, we also talk about Boltzmann rationality as a model of bounded irrationality, like what a fallible human or fault-prone AI system might display. The concept is simply this: a Boltzmann-rational agent picks each action with probability proportional to the exponential of some multiple of its reward, so they will pick an incorrect action - the one with less reward, that sets them up incorrectly, whatever - with probability that falls off exponentially in the regret from picking it; the multiple is a parameter that says how close to optimal or rational the agent is, with larger numbers being better. For example, if a Boltzmann-rational agent could pick action A for 10 reward or action B for 0 reward, they will pick action B with probability 1/(1 + e^(10k)): if k = 0.1, we have a very irrational agent who will pick the wrong choice ~27% of the time, but if k = 0.5, they will pick the wrong action only ~0.7% of the time - the parameter makes a big difference! We care about models of irrationality because of the fallibility of any bounded agent, humans very much included, and because we care about the incentives that a principal’s irrationality sets up for an agent - even one that cares about the principal getting what they want! - to manipulate what the principal sees, knows, can do, or even wants. The Off-Switch Game is a toy model of this dynamic, but is out of scope for this post.
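As a sanity check on those numbers, here is a small Python sketch of Boltzmann-rational choice; the rationality parameter is called `beta` here, and the two-action example is just the one from the paragraph above.

```python
import math

def boltzmann_probabilities(rewards, beta):
    """Boltzmann-rational choice: each action is picked with probability
    proportional to exp(beta * reward), so higher beta means closer to optimal."""
    weights = [math.exp(beta * r) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

# Action A is worth 10 reward, action B is worth 0 (regret of 10 for picking B).
rewards = [10.0, 0.0]
for beta in (0.1, 0.5, 2.0):
    p_wrong = boltzmann_probabilities(rewards, beta)[1]
    print(f"beta = {beta}: picks the wrong action {p_wrong:.1%} of the time")
# beta = 0.1 gives roughly 27%, beta = 0.5 roughly 0.7%, beta = 2.0 essentially never.
```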

My research direction goes a little further into so-named cooperative inverse reinforcement learning, which additionally assumes that the agent actively wants to help the principal and that both principal and agent share a reward signal - both of them reap the outcome of their joint action. For example, the principal might be a baker who hasn’t tried many types of fruit, and the agent might be their sous-chef with a good understanding of flavor who the principal sends to market to pick up ingredients for pies; both of them will benefit equally from selling good pies. In one approach to this subject, called “assistance games”, we model the principal as part of the environment that the agent can ask questions of; the nice thing about doing so and removing the artificial boundary between training and deployment in the process is that we can prove that the agent has an incentive to ask its questions only as they become relevant, and to make helpful plans that try to preserve optionality in anticipation of future feedback. If you want to look more into the formal foundations of my work, go look up “partially observable Markov decision process”, but if you don’t know what those are, don’t worry about it for now - you just need to know that the way that the principal and agent make decisions without perfect knowledge of the consequences of their actions involves both using an observation function that looks at the actions that they both took and the state of the world afterwards and returns an observation that they can update on, and also introducing a discount parameter that represents the value of getting right answers - and metaphorically getting paid - as quickly as possible.
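Here is a toy sketch of the baker-and-sous-chef story under those assumptions - the fruit list, the rationality parameter, and the one-shot reward for buying the right fruit are all invented for illustration, and real assistance games are solved with far more sophisticated machinery than this. The principal demonstrates Boltzmann-rationally, the agent updates a belief over the reward parameter, and then the agent acts on the shared (and discounted) reward.

```python
import math
import random

# Toy CIRL setup: the reward parameter theta says which fruit makes the best pie.
FRUITS = ["apple", "cherry", "rhubarb"]
TRUE_THETA = "cherry"          # known to the principal (the baker), not the agent
BETA = 2.0                     # how Boltzmann-rational the principal's demonstrations are
DISCOUNT = 0.95                # shared reward is worth less the longer the pair takes

def shared_reward(fruit, theta):
    """Both principal and agent reap this: 1 for the right fruit, 0 otherwise."""
    return 1.0 if fruit == theta else 0.0

def principal_demonstration(theta, beta):
    """The baker picks a fruit to bake with, Boltzmann-rationally in the shared reward."""
    weights = [math.exp(beta * shared_reward(f, theta)) for f in FRUITS]
    total = sum(weights)
    return random.choices(FRUITS, weights=[w / total for w in weights])[0]

def update_belief(belief, observed_fruit, beta):
    """The sous-chef's Bayesian update: how likely was that demonstration under each theta?"""
    new_belief = {}
    for theta in FRUITS:
        weights = [math.exp(beta * shared_reward(f, theta)) for f in FRUITS]
        likelihood = math.exp(beta * shared_reward(observed_fruit, theta)) / sum(weights)
        new_belief[theta] = belief[theta] * likelihood
    total = sum(new_belief.values())
    return {theta: p / total for theta, p in new_belief.items()}

random.seed(0)
belief = {theta: 1.0 / len(FRUITS) for theta in FRUITS}   # uniform prior

# The agent watches a few demonstrations before being sent to market.
for _ in range(3):
    demo = principal_demonstration(TRUE_THETA, BETA)
    belief = update_belief(belief, demo, BETA)

# At market, the agent buys the fruit that maximizes expected shared reward under its belief.
purchase = max(FRUITS, key=lambda f: sum(belief[theta] * shared_reward(f, theta) for theta in FRUITS))
payoff = (DISCOUNT ** 3) * shared_reward(purchase, TRUE_THETA)   # discounted for the 3 turns spent watching
print("belief:", {k: round(v, 2) for k, v in belief.items()}, "-> buys", purchase, "for discounted reward", round(payoff, 3))
```

In a full assistance game the agent would also weigh whether watching (or asking) another question is worth the discounting it costs, which is where the incentive to ask questions only as they become relevant comes from.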

Now, cooperative inverse reinforcement learning - what I privately call helpfulness theory - is great, but it makes a few assumptions about the nature of the world that are much too strong. It assumes that principals’ reward functions are fixed, never changing over time with the passing of seasons or random environmental shocks, nor being influenced by their agents; it assumes that the state of the world is perfectly visible to both principal and agent; it assumes that the principal and agent always get perfectly accurate reward signal; and it assumes that the principal already knows the reward function for certain. The last of these is my current research direction: helpfulness theory with equivocal principals; that is, cooperative inverse reinforcement learning with principals who start off not knowing the reward function, though connotationally with better priors than the agent has.

We can break down equivocal principals into two major classes: inexperienced ones, who only need to put in sufficient time and effort to learn the true reward parameter to within any bound on degree of certainty or size of error, and irresolute ones, who are stuck with some meaningfully imperfect understanding of the reward parameter - be it because they get stuck in uncertainty, because they cannot approach closely enough to the ground truth, or because they simply don’t get enough time or effort to allocate to figuring the reward parameter out. The former case is much easier to tackle than the latter: all we need to instruct the principal to do is to start off by learning the reward function themself, and then - as appeals to any mathematician - we have reduced the problem to a previously solved case, that of classical CIRL, which has an effectively computable solution.

The latter case is much more difficult, and is the topic of my active research; the questions I find most relevant are “how much worse will the reward payoffs be, if the principal has bounded error in their guess at the reward parameter?” and “what methods can the principal use to cope with a permanent lack of certain knowledge of the reward parameter while educating an agent?”. My current hope is to get good regret bounds on reward in terms of the error in the principal’s understanding of the reward parameter, expressed as a distance between the ground truth reward parameter value and either the principal’s best guess or the credence-weighted expected value of their guesses; a paper about a pseudometric on reward space called “STARC”, which has some nice properties, has given me hope in that direction. My best guesses at some interestingly useful tactics for an irresolute principal to try include avoiding teaching their agent to optimize overly sharply, practicing epistemic humility in how strongly they teach and correct, trying to teach them a distribution - “distributional learning” - rather than a point estimate, and trying to teach some compromise reward function incorporating best guesses.
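To give a flavor of the kind of question I mean - this is not a result, just an invented toy with a linear reward model and a naive Euclidean distance, nothing like the STARC pseudometric - here is a sketch of measuring how much reward is lost as the reward parameter the principal teaches drifts away from the ground truth:

```python
import random

# Toy question: if the principal teaches the agent a slightly wrong reward parameter,
# how much reward do they lose? Here the reward parameter is a weight vector over
# features, the agent simply picks the option that looks best under the taught weights,
# and "regret" is measured against what the true weights would have earned.
# Everything below (features, options, noise model) is invented for illustration.

N_FEATURES = 3
N_OPTIONS = 5

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def regret_from_error(true_theta, taught_theta, options):
    """Reward lost because the agent optimizes the taught parameter, not the true one."""
    best_for_true = max(options, key=lambda o: dot(true_theta, o))
    chosen = max(options, key=lambda o: dot(taught_theta, o))
    return dot(true_theta, best_for_true) - dot(true_theta, chosen)

random.seed(1)
true_theta = [random.uniform(-1, 1) for _ in range(N_FEATURES)]
options = [[random.uniform(-1, 1) for _ in range(N_FEATURES)] for _ in range(N_OPTIONS)]

# Vary how far the principal's (permanently uncertain) estimate sits from the ground truth.
for error_scale in (0.0, 0.1, 0.5, 1.0):
    taught = [t + random.gauss(0, error_scale) for t in true_theta]
    error = sum((a - b) ** 2 for a, b in zip(taught, true_theta)) ** 0.5
    print(f"error ~ {error:.2f}: regret = {regret_from_error(true_theta, taught, options):.3f}")
```

A regret bound of the kind I am hoping for would control exactly this quantity as a function of that error term, ideally under a better-behaved distance on reward space than the naive one used here.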

All of this is still in flux; it’s terribly early stages for any of my thinking here, and I’ve only just got started on sketching out proofs. I just wanted to write something accessible to point people to when they ask me about what I’m working on. Hopefully it helps them understand my research direction, or at least puts them in a position to ask some good questions!


