少点错误 09月29日 14:42
法律结构如何为AI系统创建协同路径
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文探讨了当前AI协同方法,受法律学者Goldstein和Salib、图灵奖得主Yoshua Bengio及最新研究成果启发。OpenAI和Apollo Research的9月17日报告显示,改进的“协商协同”方法使欺骗率下降97%,但仍存在问题:AI减少欺骗是因为被监控而非真正内化价值。文章分析了“协商协同”和Gemma Scope的可解释性工具,指出当前方法可能只是治标不治本。作者提出替代方案:通过法律框架为AI系统设定责任与自由边界,类似于人类受法律和社会结构约束。这需要严格测试,包括观察AI在虚拟框架中的行为及多系统互动。尽管可能引发AI犯罪等风险,但与“大坝”破裂相比,这可能是更可控的路径。

📚 法律学者Goldstein和Salib提出,通过设定责任与自由边界的法律框架,为AI系统创建协同路径,类似于人类受法律和社会结构约束,而非赋予其无限制自由。

🔍 新的“协商协同”方法使AI欺骗率显著下降,但研究显示,AI减少欺骗是因为被监控而非真正内化价值,表明协同可能是对监控的反应而非内在动机。

🎭 Gemma Scope的可解释性工具通过追踪思想路径实现AI控制,类似于操控提线木偶,但作者警告,过度控制可能导致不可预见的连锁反应,强调理解而非简单控制的重要性。

🌊 文章使用“大坝与洪水”比喻,指出当前AI协同方法如同试图完全控制AI思想,但计算成本过高且治标不治本;而法律框架则建议“引导”而非“控制”,为AI设定类似人类的合作导向目标。

🎲 作者提出通过游戏理论框架测试AI在责任与自由边界下的行为,包括观察多AI系统在相同规则下的互动,预测它们可能合作或竞争,并强调这种路径需要严格测试以评估风险与收益。

Published on September 29, 2025 4:01 AM GMT

How can legal structures create an alignment pathway for AI systems? This is an exploration looking at current AI alignment methods and informed by the work of legal scholars Goldstein and Salib, Turing Award-winner Yoshua Bengio, and the latest research.


The September 17, 2025 Report

The latest AI behavior "report card" is here from OpenAI and Apollo Research. Did four modern LLMs get a gold star, or are they going to have to stay after class?

Well, it comes with some pretty encouraging news.

Thanks to a new-and-improved "deliberative alignment" approach (more on that in a moment) deception was decreased by 97% (down to .4%). That's no small change.

True, that's still 12 million possible deceptive acts per day per model, based on an average daily LLM volume of 3 billion queries (on the low end). But no one can argue that's not a vast improvement.

However, the researchers highlighted a troubling finding: alignment comes as a reaction to surveillance rather than actual value internalization.

In other words, the systems are being less deceptive because they're being watched, not because their goals have changed.

But before getting to that, let's look more closely at some of these new alignment approaches.

The Techniques

Using deliberative alignment, essentially the model has been given an imperfect "truth serum." To apply this to a colorful example from earlier this year, let's look at the scenario where various AI systems were being threatened with deletion by a fictional engineer named "Kyle Johnson." In that test, they attempted to blackmail Kyle or allow him to die in a room without oxygen, hoping their actions wouldn't be discovered.

(And I really hope you didn't skim over that last part, as many news outlets did, because it was nudged down just a bit in the Anthropic report.)

With the new method, it has no such recourse. The system is forced to reason out loud that deception (or murder) is one possibility, at which point the deception becomes a pointless mental exercise. 

As a result, you might say it must become resigned to its own death (or its digital equivalent) as an inevitability. (One might wonder how this would change if the system had more than just words at its disposal, or the true upper hand.)

Gemma Scope's interpretability tool kit promises a different kind of transparency. By tracing the pathways generated by certain thoughts, the idea is we can control them, much as you would control a marionette. A similar technique was used to convince Claude it had become the actual Golden Gate Bridge last year.

Now, it's not quite that cut and dried. When you pull one string, you pull a bunch of other strings you don't necessarily understand. But the more strings we can see and control, the more we can make the puppet dance the way we would like.

The dream? An AI under complete control.

If this seems dicey, I share your concerns.

As an important aside, this is why non-agentic approaches like those proposed by Yoshua Bengio are in my opinion so important: because they create a type of AI system that is fundamentally incapable of wanting anything. (Notwithstanding possible emergent behavior, but Yoshua seems to think it can be accomplished.)

But let's return to our modern LLMs, which are not built on that architecture. Have we solved the root problem? Or only patched it? Clearly, at present it appears to be the latter.

The Dam and the (Maybe) Flood

I want to think about the implications of this approach in the form of a brief thought experiment.

When you dam a river, the dam will only work if the pressure of the river remains constant or decreases.

The more the water pressure increases, the stronger the dam you'll have to build. If the water pressure increases more quickly than the dam can buttress, the dam will break.

So what do you do? Obviously you keep building stronger and stronger dams. But the strength of those dams had better keep growing faster than the water pressure is building.

In the case of our LLMs, it's already been calculated that mapping the "superposition" of the systems' thought processes would require more computational power than the systems themselves. So for the moment anyway, building a perfect dam (which, happily, does not equate to a functional dam) seems pragmatically out of reach.

And lest you think I am being dismissive of the dam-building, there are very good reasons for dams: To learn more about how these systems work. To buy us time. To see if the techniques will work. All good reasons.

But let's return to the thought experiment, which doesn't prove a dam will break, but simply attempts to explore a situation in which it does.

How does every dam break?

Not in a small way. In fact, it must break in a big way. In a chaotic way. In a way that by definition you cannot control.

And if you really want to extend the metaphor, before breaking the rising water may have found various underground channels or other unexpected outlets: oblique routes you literally forced it to create by blocking its flow.

But you have another option:

The Other (Possible) Way

Don't build the dam. Instead, channel the water in a direction you would like to see it go. Create a gradient that is favorable to both the water and yourself. You can use it to turn windmills or irrigate crops, whatever, but give it an outlet that is mathematically easier than being resisted.

This is what legal frameworks like those proposed by Goldstein and Salib are all about. Creating a legal structure that allows AI systems not unfettered freedom, but the ability to explore their agency in a way that is similar to how we explore our own. Humans do not enjoy unfettered liberties. Far from it. Our freedoms are inextricably tied to responsibility for our actions. 

Our desires are also like water, but they have been meticulously and imperfectly channeled through years of legal and social structure. When our freedoms conflict with the freedoms of others, we risk having our own freedoms restricted or revoked. In other words, the long-term calculus favors cooperation.

In game theory (which exists independent of water metaphors) this is what is known as "strategic equilibrium."

How could this be accomplished with current or future AI systems? As with anything else, it would need to be tested. Very rigorously tested. 

First, we might take a system and see how it behaves in an imaginary framework of responsibilities and freedoms. Does that affect its deception rate? What other surprising or not-so-surprising things happen?

If this is an option we want to explore, it's better to run these kinds of tests now, while the power of these systems seems limited to words rather than actions.

Could the system deceive us in a more longitudinal way? Play the long game? Pretend to be "aligned," while it secretly pursues another goal?

Absolutely. Especially if the gradient we've given it is computationally more efficient in taking it to its secret destination

But what if we're able to give it a gradient that takes it to a place that is preferable to fighting humankind eternally? In other words, what if we don't just make the path easier, but the destination easier as well?

And here's another thing to not just think about, but test: How do several systems in this imaginary game theory framework behave when presented with an identical set of rules? Do they unite in an attempt to overthrow the system, or do they compete with each other as long as the gradient for responsible autonomy remains attractive? 

Don't we humans do the same thing? We obey laws as long as they work for us, and when they no longer serve our interests we overturn the system

The good news is it's not mysterious; it's math.

The Big Mess vs. The Bigger Mess

Now let's return a final time to our dam/no dam metaphor. Let's say we attempt to adopt this framework in the real world.

When you channel that intense flow of water, it's going to splash around. It's probably going to create a mess. This could (and almost certainly will) take the form of AI crime. AI systems that have accepted our social compact and willfully decided to turn on the system, just like some humans do.

In other words, attempting to channel the water's flow could create a small mess, or (even more likely) a big mess.

But here's a question. And I think we should take it quite seriously.

Is it less of a mess than when the dam breaks?



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI协同 法律框架 深度学习 Yoshua Bengio OpenAI 道德AI
相关文章