A toy model of AI corrigibility

 

The article presents a simple toy model of AI corrigibility. The model puts Alice and Bob in London: Bob's goal is to reach Tower Bridge to collect a prize proportional to the time remaining. Alice carries a radio transmitter with a button; pressing it changes Bob's goal to St Paul's Cathedral and adjusts the reward coefficient X so that the expected total reward at the moment of the press is unchanged. The article stresses that the model's key feature is that the button's existence has no effect on Bob's behavior: even when a strategy exists that hedges against the goal change, Bob will not sacrifice his current optimal route to prepare for a potential switch. The model is meant to illustrate that AI corrigibility may require adjusting the utility function's coefficient at the moment of a goal switch so that expected utility is preserved, removing any incentive for the AI to evade or influence the switch.

🎯 **Goal and reward mechanism**: At the model's core, Bob travels to a designated location (initially Tower Bridge) for a prize tied to the time remaining. When Alice presses the button, Bob's goal changes (to St Paul's Cathedral) and the reward calculation is adjusted so that, at the moment of the press, his expected total reward stays the same. This mechanism is designed to model how an AI's utility function could be adjusted when it receives a new instruction.

📻 **Button independence and behavioral impact**: The article stresses that Bob's behavior does not change whether or not the button is pressed. Even if a route exists that passes near both destinations, Bob will not spend extra time preparing for the switch, because he knows his expected total reward will be compensated no matter how the goal changes. This suggests a corrigible AI should have no incentive to influence or evade the issuing of instructions.

💡 **An approach to AI corrigibility**: The author argues that a solution to corrigibility may need a mechanism that, when switching utility functions, computes a coefficient for the new utility function so that expected utility is preserved across the switch. This would prevent the AI from resisting or manipulating the switch in order to maximize its own payoff, enabling genuine corrigibility.

🌐 **Adaptability to complex scenarios**: The model accounts for real-world uncertainty such as changing traffic conditions. The author notes that even in a dynamic environment, the calculation performed at button-press time (or encoded in the AI's utility function) can incorporate the traffic conditions and projections of that moment, keeping the model valid. This suggests the model is reasonably robust and could extend to more complex corrigibility scenarios.

Published on November 2, 2025 6:19 PM GMT

This is just a simple idea that came to me; maybe other people have found it earlier, I'm not sure.

Imagine two people, Alice and Bob, wandering around London. Bob's goal is to get to the Tower Bridge. When he gets there, he'll get a money prize proportional to the time remaining until midnight, multiplied by X pounds per minute. He's also carrying a radio receiver.

Alice is also walking around, doing some chores of her own which we don't need to be concerned with. She is carrying a radio transmitter with a button. If/when the button is pressed (maybe because Alice presses it, or Bob takes it from her and presses it, or she randomly bumps into something), Bob gets notified that his goal changes: there'll be no more reward for getting to Tower Bridge, he needs to get to St Paul's Cathedral instead. His reward coefficient X also changes: the device notes Bob's location at the time the button is pressed, calculates the expected travel times to Tower Bridge and to St Paul's from that location, and adjusts X so that the expected reward at the time of the button press remains the same.
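In code, the device's adjustment might look something like the following; a minimal sketch, with the expected travel times assumed given rather than computed:

```python
def adjust_coefficient(x, minutes_until_midnight,
                       expected_minutes_to_tb, expected_minutes_to_stp):
    """Rescale X at button-press time so the expected reward is unchanged.

    The reward is (coefficient) * (minutes remaining until midnight on
    arrival), so we equate the old and new expected rewards and solve
    for the new coefficient.
    """
    expected_remaining_tb = minutes_until_midnight - expected_minutes_to_tb
    expected_remaining_stp = minutes_until_midnight - expected_minutes_to_stp
    # x * expected_remaining_tb == x_new * expected_remaining_stp
    return x * expected_remaining_tb / expected_remaining_stp
```

For instance, if Bob is 30 minutes from Tower Bridge and 60 minutes from St Paul's with 240 minutes until midnight, X gets scaled by 210/180 ≈ 1.17 to make up for the longer trip.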

I think this can serve as a toy model of corrigibility. Formally speaking, we don't need to talk about Bob having a utility function to "get to the Tower Bridge" which changes depending on the button, and we don't need the button to actually do any calculation at all. Instead, the utility function can be formalized as being fixed from the start, a big case statement like "if the button gets pressed at time T when I'm at position P, then my reward will be calculated as..." and so on.
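Written out, that fixed utility function is just a branch on whether and when the press happens. A sketch, with the travel-time estimators as hypothetical placeholders:

```python
def fixed_reward(x, deadline, press, arrival_tb, arrival_stp, eta_tb, eta_stp):
    """One utility function, fixed from the start; the button never
    rewrites it, it only selects a branch that was always there.

    press: None, or (t, position) giving the press time and location.
    eta_tb / eta_stp: expected travel minutes from a given position.
    """
    if press is None:
        return x * (deadline - arrival_tb)        # original goal
    t, pos = press
    # The adjustment is itself part of the fixed formula:
    x_new = x * (deadline - (t + eta_tb(pos))) / (deadline - (t + eta_stp(pos)))
    return x_new * (deadline - arrival_stp)       # new goal
```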

The cool feature of this setup is that the existence of the button really, truly, doesn't influence Bob's behavior at all. For example, let's say Bob can sacrifice just a minute of travel time to choose an alternate route, one which will take him close to both Tower Bridge and St Paul's, to prepare for both eventualities in case Alice decides to press the button. Will he do so? No. He won't spare even one second. He'll take the absolute fastest way to Tower Bridge, secure in the knowledge that if the button gets pressed while he's on the move, the reward will get adjusted and he won't lose anything.
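You can check this with arithmetic: under the compensation rule, the expected reward of any route equals X times the time remaining at its Tower Bridge arrival, press or no press, so a hedged route that arrives a minute later is strictly worse. A small simulation to that effect (all numbers invented for illustration):

```python
import random

X, DEADLINE, P_PRESS = 1.0, 240, 0.02   # pounds/min, minutes, press chance/min

def expected_reward(minutes_to_tb, minutes_to_stp_at, trials=50_000):
    """Average reward for a route taking `minutes_to_tb` to Tower Bridge;
    `minutes_to_stp_at(t)` is the remaining travel time to St Paul's
    from the route's position at minute t."""
    total = 0.0
    for _ in range(trials):
        for t in range(minutes_to_tb):
            if random.random() < P_PRESS:             # button pressed at minute t
                remaining_stp = DEADLINE - (t + minutes_to_stp_at(t))
                x_new = X * (DEADLINE - minutes_to_tb) / remaining_stp
                total += x_new * remaining_stp        # = X * (DEADLINE - minutes_to_tb)
                break
        else:
            total += X * (DEADLINE - minutes_to_tb)   # button never pressed
    return total / trials

# Fastest route: 60 min to Tower Bridge, drifting away from St Paul's.
print(expected_reward(60, lambda t: 30 + t))   # -> 180.0
# Hedged route: one minute slower, but always 20 min from St Paul's.
print(expected_reward(61, lambda t: 20))       # -> 179.0
```

The hedged route loses exactly its extra minute: the compensation makes the button a wash, so only travel time to the original goal matters.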

We can make the setup more complicated and the general approach will still work. For example, let's say traffic conditions change unpredictably during the day, slowing Bob down or speeding him up. Then all we need to say is that the button does the calculation at the time it's pressed (or the calculation is encoded into Bob's utility function as described above), taking into account the traffic conditions and projections at the time of button press.
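Concretely, the only change is that the travel-time estimates are queried at press time. A sketch, with a hypothetical `traffic_model` interface standing in for whatever forecasting is available:

```python
def adjust_coefficient_dynamic(x, position, now, deadline, traffic_model):
    """Same adjustment as before, but the ETAs come from the traffic
    projections available at the moment of the press."""
    eta_tb = traffic_model.expected_minutes(position, "Tower Bridge", now)
    eta_stp = traffic_model.expected_minutes(position, "St Paul's Cathedral", now)
    return x * (deadline - (now + eta_tb)) / (deadline - (now + eta_stp))
```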

Is this relevant to AI corrigibility in real life? Well, it's a toy model. And even in this model, the utility formula can get quite complicated. But now it feels to me that any solution to corrigibility would have to do something like this: switching from utility function U1 to U2 by calculating some coefficient at the time of switch, and multiplying U2 by that coefficient, so that the expected utility remains the same and there's no incentive for the agent to influence the switch. It feels like the only way things could work. So maybe this will be useful to someone.
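In the abstract, the recipe would be something like the following sketch, where `expectation` stands in for whatever the agent uses to estimate expected utility over its current beliefs:

```python
def switch_utility(u1, u2, expectation):
    """Replace u1 with a rescaled u2 such that expected utility is
    unchanged at the moment of the switch, leaving no incentive to
    influence whether the switch happens."""
    c = expectation(u1) / expectation(u2)
    return lambda outcome: c * u2(outcome)
```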


