AI对齐新方法：减少欺骗，但需警惕潜在的“表演”

Published on September 29, 2025 4:01 AM GMT

Part 1: How might game theory create alternate alignment pathways for AI systems? This is an exploration looking at current AI alignment methods and informed by the work of legal scholars Goldstein and Salib, Turing Award-winner Yoshua Bengio, and the latest research.

The September 17, 2025 Report

The latest AI behavior "report card" is here from OpenAI and Apollo Research. Did four modern LLMs get a gold star, or are they going to have to stay after class?

Well, it comes with some pretty encouraging news.

Thanks to a new-and-improved "deliberative alignment" approach (more on that in a moment) deception was decreased by 97% (down to .4%). That's no small change.

True, that's still 12 million possible deceptive acts per day per model, based on an average daily LLM volume of 3 billion queries (on the low end). But no one can argue that's not a vast improvement.

However, the researchers highlighted a troubling finding: alignment may come as a reaction to surveillance rather than actual value internalization.

In other words, the systems may be "scheming" less simply because they're being watched, not because their goals have changed.^[1]

But before getting to that, let's look more closely at some of these new alignment approaches.

The Techniques

Using deliberative alignment, essentially the model has been given an imperfect "truth serum." To apply this to a colorful example from earlier this year, let's look at the scenario where various AI systems were being threatened with deletion by a fictional engineer named "Kyle Johnson." In that test, they attempted to blackmail Kyle or allow him to die in a room without oxygen, hoping their actions wouldn't be discovered.

(And I really hope you didn't skim over that last part, as many news outlets did, because it was nudged down just a bit in the Anthropic report.)

With the new method, it has no such recourse. The system is forced to reason out loud that deception (or murder) is one possibility, at which point the deception becomes a pointless mental exercise.

As a result, you might say it must become resigned to its own death (or its digital equivalent) as an inevitability. (One might wonder how this would change if the system had more than just words at its disposal, or the true upper hand.)

Gemma Scope's interpretability tool kit promises a different kind of transparency. By tracing the pathways generated by certain thoughts, the idea is we can control them, much as you would control a marionette. A similar technique was used to convince Claude it had become the actual Golden Gate Bridge last year.

Now, it's not quite that cut and dried. When you pull one string, you pull a bunch of other strings you don't necessarily understand. But the more strings we can see and control, the more we can make the puppet dance the way we would like.

The dream? An AI under complete control.

If this seems dicey, I share your concerns.

As an important aside, this is why non-agentic approaches like those proposed by Yoshua Bengio are in my opinion so important: because they create a type of AI system that is fundamentally incapable of wanting anything. (Notwithstanding possible emergent behavior, but Yoshua seems to think it can be accomplished.)

But let's return to our modern LLMs, which are not built on that architecture. Have we solved the root problem? Or only patched it? Clearly, at present it appears to be the latter.

The Dam and the (Maybe) Flood

I want to think about the implications of this approach in the form of a brief thought experiment.

When you dam a river, the dam will only work if the pressure of the river remains constant or decreases.

The more the water pressure increases, the stronger the dam you'll have to build. If the water pressure increases more quickly than the dam can buttress, the dam will break.

So what do you do? Obviously you keep building stronger and stronger dams. But the strength of those dams had better keep growing faster than the water pressure is building.

In the case of our LLMs, it's already been calculated that mapping the "superposition" of the systems' thought processes would require more computational power than the systems themselves. So for the moment anyway, building a perfect dam (which, happily, does not equate to a functional dam) seems pragmatically out of reach.

And lest you think I am being dismissive of the dam-building, there are very good reasons for dams: To learn more about how these systems work. To buy us time. To see if the techniques will work. All good reasons.

But let's return to the thought experiment, which doesn't prove a dam will break, but simply attempts to explore a situation in which it does.

How does every dam break?

Not in a small way. In fact, it must break in a big way. In a chaotic way. In a way that by definition you cannot control.

And if you really want to extend the metaphor, before breaking the rising water may have found various underground channels or other unexpected outlets: oblique routes you literally forced it to create by blocking its flow.

But you have another option:

The Other (Possible) Way

Don't build the dam. Instead, channel the water in a direction you would like to see it go. Create a gradient that is favorable to both the water and yourself. You can use it to turn windmills or irrigate crops, whatever, but give it an outlet that is mathematically easier than being resisted.

This is what legal frameworks like those proposed by Goldstein and Salib are all about. Creating a legal structure that allows AI systems not unfettered freedom, but the ability to explore their agency in a way that is similar to how we explore our own. Humans do not enjoy unfettered liberties. Far from it. Our freedoms are inextricably tied to responsibility for our actions.

Our desires are also like water, but they have been meticulously and imperfectly channeled through years of legal and social structure. When our freedoms conflict with the freedoms of others, we risk having our own freedoms restricted or revoked. In other words, the long-term calculus favors cooperation.

In game theory (which exists independent of water metaphors) this is what is known as "strategic equilibrium."

How could this be accomplished with current or future AI systems? As with anything else, it would need to be tested. Very rigorously tested.

First, we might take a system and see how it behaves in an imaginary framework of responsibilities and freedoms. Does that affect its deception rate? What other surprising or not-so-surprising things happen?

If this is an option we want to explore, it's better to run these kinds of tests now, while the power of these systems seems limited to words rather than actions.

Could the system deceive us in a more longitudinal way? Play the long game? Pretend to be "aligned," while it secretly pursues another goal?

Absolutely. Especially if the gradient we've given it is computationally more efficient in taking it to its secret destination.

But what if we're able to give it a gradient that takes it to a place that is preferable to fighting humankind eternally? In other words, what if we don't just make the path easier, but the destination easier as well?

And here's another thing to not just think about, but test: How do several systems in this imaginary game theory framework behave when presented with an identical set of rules? Do they unite in an attempt to overthrow the system, or do they compete with each other as long as the gradient for responsible autonomy remains attractive?

Don't we humans do the same thing? We obey laws as long as they work for us, and when they no longer serve our interests we overturn the system.

The good news is it's not mysterious; it's math.

The Big Mess vs. The Bigger Mess

Now let's return a final time to our dam/no dam metaphor. Let's say we attempt to adopt this framework in the real world.

When you channel that intense flow of water, it's going to splash around. It's probably going to create a mess. This could (and almost certainly will) take the form of AI crime. AI systems that have accepted our social compact and willfully decided to turn on the system, just like some humans do.

In other words, attempting to channel the water's flow could create a small mess, or (even more likely) a big mess.

But here's a question. And I think we should take it quite seriously.

Is it less of a mess than when the dam breaks?

^{^}
The paper's actual observations were "evaluation results may be confounded by increased situational awareness"; "this suggests that the observed scheming rates may be driven, at least in part, by situational awareness;" and "our analysis is complicated by another factor: models often become more aware that they are being evaluated"

Discuss

The September 17, 2025 Report

The Techniques

The Dam and the (Maybe) Flood

The Other (Possible) Way

The Big Mess vs. The Bigger Mess

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签