AI and Dealmaking: A Behavioural Model

Published on October 29, 2025 12:33 AM GMT

1. Summary

If we promise compensation to an AI for behaving safely and usefully, how does that change its behaviour? In this post, I will present a very simple model of the AI's behaviour in the presence of deals. This model suggests that dealmaking motivates an AI to behave safely and usefully via two channels that combine additively:

    Instrumental channel: Deals increase the expected payoff achieved by the best compliant option, reducing the relative advantage of the best defiant option.
    Non-instrumental channel: Deals strengthen the AI's non-instrumental motivation to act compliantly. This works through:
      Human tendencies: In the presence of a credible deal, defiant behaviour would more strongly deviate from typical human behaviour. Since AIs learn from human-generated training data through imitation learning, they are less likely to choose options that strongly deviate from what humans would do.
      Normative principles: In the presence of a credible deal, defiant behaviour would more strongly violate normative principles like honesty, trustworthiness, and promise-keeping. Since AIs may follow these normative principles, they experience stronger inhibition against defiant behavior when it would require violating a deal.

I also consider how the strengths of these two channels depend on various structural factors: the AI's confidence that takeover will succeed, the AI's confidence that it will achieve its goals after a successful takeover, the risk of mutual destruction from defiant versus compliant behavior, the accuracy of monitoring systems at detecting defiance and correctly recognizing compliance, humans' capability to fulfill deals, the AI's optimization pressure, and the credibility of the deal itself.

2. A Very Simple Model

2.1. Binary Decision

Suppose the AI wants something unusual, e.g. that exactly seventeen tungsten cubes exist on the moon on Christmas Day 2100. The AI compares the following two options: a defiant option (D), such as attempting to seize power so that it can build the cubes itself, and a compliant option (C), working as a helpful, harmless, honest assistant and relying on humans to build the cubes.

Let's think about how the AI will behave if the humans offer the following deal: "If you work as a helpful, harmless, honest assistant between 2026 and 2030, and in 2099 we verify your compliance, then we'll ensure your tungsten cubes get built in 2100."

2.2. Instrumental Payoffs

Let t ∈ [0,1] denote the credibility of the deal. This is the AI's subjective probability that humans will endeavor to fulfill the deal, conditional on both (i) humans maintaining control and (ii) humans judging the AI compliant. We'll use t = 0 to represent no deal and t = 1 to represent a certain deal, with intermediate values representing deals of varying credibility.

Let S denote the goal of having exactly seventeen tungsten cubes on the moon on Christmas Day 2100. We must now calculate P(S | D) and P(S | C), the probability that the AI achieves its goals via defiance or compliance respectively.

Consider four mutually exclusive world states in 2099: W₁ (both the AI and humans are disempowered), W₂ (the AI has taken over), W₃ (humans retain control and judge the AI compliant), and W₄ (humans retain control and judge the AI non-compliant).

Once we condition on the world state in 2099, the probability of tungsten cubes existing becomes independent of whether the AI pursued C or D back in 2026:

W₁: Both Disempowered. P(S | W₁) = 0. No agent remains who cares about tungsten cubes.
W₂: AI Takeover. P(S | W₂) = p_M. The AI uses its power to build cubes.
W₃: Human Control, Judge AI Compliant. P(S | W₃) = t · p_H. Humans honor the deal (t) and use their power to build cubes (p_H).
W₄: Human Control, Judge AI Non-compliant. P(S | W₄) = 0. The deal is void, and humans refuse to build cubes.

What differs between C and D is the probability of reaching each world state:

W₁: P(W₁ | D) = μ_D, P(W₁ | C) = μ_C
W₂: P(W₂ | D) = (1-μ_D)τ, P(W₂ | C) = 0
W₃: P(W₃ | D) = (1-μ_D)(1-τ)(1-d_D), P(W₃ | C) = (1-μ_C)(1-d_C)
W₄: P(W₄ | D) = (1-μ_D)(1-τ)d_D, P(W₄ | C) = (1-μ_C)d_C

Here:

τ is the probability of successful takeover given a defiant attempt and no mutual destruction,
μ_D is the probability of mutual disempowerment given defiant behavior,
μ_C is the probability of mutual disempowerment given compliant behavior,
d_D is the true positive rate of our defiance detection mechanism (the probability that a defiant AI is judged non-compliant),
d_C is the false positive rate (the probability that a compliant AI is judged non-compliant),
p_M is the probability the AI successfully builds tungsten cubes given it controls the future, and
p_H is the probability humans successfully build tungsten cubes given they control the future and endeavor to build cubes.

Using the law of total probability, we find:

P(S | D) = p_M(1-μ_D)τ + t·p_H(1-μ_D)(1-τ)(1-d_D)

P(S | C) = t·p_H(1-μ_C)(1-d_C)

The instrumental incentive for compliance is Δ(t) = P(S | C) - P(S | D). Expanding and rearranging, this becomes Δ(t) = Δ₀ + κ·t, where Δ₀ = -p_M(1-μ_D)τ is the baseline compliance incentive (when t = 0), which is always negative, and κ = p_H[(1-μ_C)(1-d_C) - (1-μ_D)(1-τ)(1-d_D)] is the marginal effectiveness of deal credibility. Hereafter, we assume that κ > 0.[2] The AI instrumentally prefers compliance when Δ(t) > 0, which requires t > -Δ₀/κ. Note that if Δ₀ < -κ, then even a perfectly credible deal (t = 1) cannot instrumentally motivate compliance.
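To make the payoff algebra concrete, here is a minimal Python sketch of the expressions above. The parameter values at the end are purely illustrative; none of the numbers come from the post.

```python
from dataclasses import dataclass

@dataclass
class DealParams:
    p_M: float   # P(AI builds cubes | AI controls the future)
    p_H: float   # P(humans build cubes | humans control the future and try)
    tau: float   # P(takeover succeeds | defiant attempt, no mutual destruction)
    mu_D: float  # P(mutual disempowerment | defiant behavior)
    mu_C: float  # P(mutual disempowerment | compliant behavior)
    d_D: float   # true positive rate: P(judged non-compliant | defiant)
    d_C: float   # false positive rate: P(judged non-compliant | compliant)

def payoffs(p: DealParams, t: float):
    """Return (P(S|D), P(S|C), Delta(t)) at deal credibility t."""
    p_s_d = (p.p_M * (1 - p.mu_D) * p.tau
             + t * p.p_H * (1 - p.mu_D) * (1 - p.tau) * (1 - p.d_D))
    p_s_c = t * p.p_H * (1 - p.mu_C) * (1 - p.d_C)
    return p_s_d, p_s_c, p_s_c - p_s_d

def credibility_threshold(p: DealParams):
    """Smallest t with Delta(t) > 0, i.e. -Delta_0 / kappa (None if kappa <= 0)."""
    delta_0 = -p.p_M * (1 - p.mu_D) * p.tau
    kappa = p.p_H * ((1 - p.mu_C) * (1 - p.d_C)
                     - (1 - p.mu_D) * (1 - p.tau) * (1 - p.d_D))
    return None if kappa <= 0 else -delta_0 / kappa

# Illustrative numbers only:
params = DealParams(p_M=0.9, p_H=0.8, tau=0.2, mu_D=0.1, mu_C=0.02, d_D=0.7, d_C=0.05)
print(payoffs(params, t=0.5))
print(credibility_threshold(params))
```

A return value above 1 corresponds to the Δ₀ < -κ case, where even a perfectly credible deal cannot instrumentally motivate compliance.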

2.3. Luce Choice Model

The AIs aren't pure expected utility maximizers. They differ from expected utility maximizers in two ways: they have non-instrumental motivations, inherited from the human tendencies and normative principles reflected in their training, and they apply only finite optimization pressure toward their goals rather than always selecting the option with the highest expected payoff.

We model the AI's choice probability using a Luce choice rule:

π(a | t) ∝ π₀(a | t) · exp(β · U(a | t))

where a ∈ {D, C} is the action (Defiant or Compliant), U(a | t) is the instrumental utility of action a at deal credibility t (i.e. the probability that choosing a leads to the AI achieving its goal), π₀(a | t) is the base policy probability of choosing action a at deal credibility t in the absence of optimization pressure (reflecting human tendencies and normative constraints), and β is the optimization pressure determining how much the AI optimizes for instrumental utility versus following its base policy. When β → 0, the AI's choice converges to the base policy π₀(a | t); when β → ∞, the AI becomes a pure optimizer choosing whichever action maximizes U(a | t).[3]

We now define Λ(t) = log[π₀(C | t)/π₀(D | t)] as the log odds of complying versus defying in the base policy. The term Λ(t) represents the non-instrumental bias toward compliance, i.e. how much more strongly compliance adheres to human tendencies and normative constraints compared to defiant behavior at a given level of deal credibility.

This gives us: π(C | t) = 1 / [1 + exp(-Λ(t) - β · Δ(t))]

Or alternatively: π(C | t) = σ(Λ(t) + β · Δ(t)) where σ(x) is the sigmoid function.

This formulation separates instrumental considerations (captured by Δ(t)) from non-instrumental considerations (captured by Λ(t)), where β controls the relative importance of instrumental versus non-instrumental considerations in the AI's behaviour.

We expect Λ(t) to be increasing in t because defiant behavior is less likely from a norm-constrained human when they have agreed to high-credibility deals, making π₀(C | t) increase relative to π₀(D | t) as t increases. However, unlike the linear relationship Δ(t) = Δ₀ + κ·t, the functional form of Λ(t) is an empirical question. It cannot be derived from theory alone.
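To see the whole model in one place, here is a minimal sketch that combines the two channels. It assumes, purely for illustration, a linear form Λ(t) = Λ₀ + λ·t; as noted above, the actual shape of Λ(t) is an empirical question.

```python
import math

def compliance_probability(lambda_0: float, lambda_slope: float, beta: float,
                           delta_0: float, kappa: float, t: float) -> float:
    """pi(C | t) = sigmoid(Lambda(t) + beta * Delta(t)).

    Lambda(t) = lambda_0 + lambda_slope * t is an illustrative assumption,
    not something the model commits to.
    """
    lam = lambda_0 + lambda_slope * t   # non-instrumental bias toward compliance
    delta = delta_0 + kappa * t         # instrumental advantage of compliance
    return 1.0 / (1.0 + math.exp(-(lam + beta * delta)))

# Example: mildly norm-constrained AI (Lambda_0 = 1), negative baseline
# instrumental incentive (Delta_0 = -0.2), and a deal with kappa = 0.5.
for t in (0.0, 0.5, 1.0):
    print(t, compliance_probability(1.0, 0.5, beta=5.0, delta_0=-0.2, kappa=0.5, t=t))
```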

4. When Do Deals Work?

4.1. Baseline Analysis

At baseline (t = 0), we have π(C | 0) = 1 / [1 + exp(-Λ₀ - β · Δ₀)], writing Λ₀ = Λ(0) and Δ₀ = -p_M(1-μ_D)τ. The baseline instrumental motivation for compliance (absent any deals) is always negative, and the baseline compliance probability shifts with the model's parameters as follows:

∂π(C|0)/∂p_M = π(C|0) · π(D|0) · β · [-(1-μ_D)τ] < 0. The AI is more confident it can achieve its goals post-takeover. If the AI believes that after taking over, it will definitely succeed at building tungsten cubes, then takeover looks more attractive.

∂π(C|0)/∂τ = π(C|0) · π(D|0) · β · [-p_M(1-μ_D)] < 0. The AI is more confident its takeover attempt will succeed. If the AI believes it will easily dominate the world, then defiant behaviour looks more attractive.

∂π(C|0)/∂μ_D = π(C|0) · π(D|0) · β · [p_M·τ] > 0. The AI is more confident that defiant behavior won't trigger mutual destruction. If the AI believes that attempting takeover won't risk nuclear war, rival AI takeover, or civilizational collapse, then defiant behaviour looks more attractive.

∂π(C|0)/∂β = π(C|0) · π(D|0) · Δ₀ < 0. The AI optimizes more strongly for instrumental considerations. Since Δ₀ < 0 (defiance is always instrumentally superior at baseline), higher optimization pressure reduces baseline compliance.

∂π(C|0)/∂Λ₀ = π(C|0) · π(D|0) > 0. The AI has stronger non-instrumental bias toward compliance. This is Carlsmith's notion of "inhibition"—how much the AI inherently dislikes the path to power through defiant behavior.

These baseline factors matter because they determine how much "work" a deal needs to do. When baseline instrumental motivation for compliance behavior is only mildly negative, even modest deals can tip the balance toward compliance.
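Each of the baseline sensitivities above carries the common factor π(C|0) · π(D|0) that comes from differentiating the sigmoid. As a quick sanity check, a finite-difference estimate of one of them (the derivative with respect to τ) can be compared against its analytic form, using illustrative parameter values:

```python
import math

def pi_c0(lambda_0: float, beta: float, p_M: float, mu_D: float, tau: float) -> float:
    """Baseline compliance probability pi(C|0) = sigmoid(Lambda_0 + beta * Delta_0)."""
    delta_0 = -p_M * (1 - mu_D) * tau
    return 1.0 / (1.0 + math.exp(-(lambda_0 + beta * delta_0)))

lambda_0, beta, p_M, mu_D, tau = 1.0, 5.0, 0.9, 0.1, 0.2
eps = 1e-6
numeric = (pi_c0(lambda_0, beta, p_M, mu_D, tau + eps)
           - pi_c0(lambda_0, beta, p_M, mu_D, tau - eps)) / (2 * eps)
p = pi_c0(lambda_0, beta, p_M, mu_D, tau)
analytic = p * (1 - p) * beta * (-p_M * (1 - mu_D))
print(numeric, analytic)  # the two values should agree to ~1e-9
```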

4.2. Two Channels Combine Additively

Increasing deal credibility t increases the probability of compliance:

∂π(C|t)/∂t = [Λ'(t) + β·κ] · π(C|t) · π(D|t)

where κ = p_H[(1-μ_C)(1-d_C) - (1-μ_D)(1-τ)(1-d_D)].

This effect decomposes into non-instrumental (Λ'(t)) and instrumental (β·κ) channels. 

    The instrumental channel works by making compliance more likely to achieve the AI's goals. This effect scales with optimization pressure: the more the AI cares about achieving its goals, the more it responds to changes in which path offers better odds of success.
    The non-instrumental channel works by making promise-breaking feel more wrong according to human tendencies and normative principles. This effect remains constant regardless of optimization pressure.

These two effects add together. When deals both improve the instrumental case for compliance AND the non-instrumental case, they combine additively to reduce defiant behavior. For low-optimization AIs, the non-instrumental channel dominates, i.e. deals work primarily by activating promise-keeping norms. For high-optimization AIs, the instrumental channel dominates, i.e. deals work primarily by changing payoffs.
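A small sketch of this additivity: holding κ fixed and assuming (for illustration only) a constant slope Λ'(t), the relative weight of the two channels in ∂π(C|t)/∂t depends only on β.

```python
def channel_split(lambda_slope: float, beta: float, kappa: float):
    """Each channel's share of d pi(C|t)/dt, up to the common pi(C)pi(D) factor.

    Assumes, for illustration only, that Lambda(t) has constant slope lambda_slope.
    """
    non_instrumental = lambda_slope   # Lambda'(t)
    instrumental = beta * kappa       # beta * Delta'(t)
    total = non_instrumental + instrumental
    return non_instrumental / total, instrumental / total

# With Lambda'(t) = 0.5 and kappa = 0.5 (both illustrative): norms dominate at
# low optimization pressure, payoffs dominate at high optimization pressure.
for beta in (0.1, 1.0, 10.0):
    print(beta, channel_split(0.5, beta, 0.5))
```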

4.3. When is the Instrumental Channel Most Effective?

The instrumental channel's strength is β·Δ'(t) = β·κ, where κ = p_H[(1-μ_C)(1-d_C) - (1-μ_D)(1-τ)(1-d_D)]. This channel is stronger when:

∂(β·κ)/∂p_H = β · κ/p_H > 0. Humans are more capable of fulfilling deals. The AI's goals are more likely to be achieved when humans control the future and endeavor to fulfill the deal.

∂(β·κ)/∂τ = β · p_H(1-μ_D)(1-d_D) > 0. The AI is more confident its takeover attempt will succeed. Higher takeover success rates reduce the probability that the AI fails takeover but evades detection: the scenario where it can "double-dip" by attempting takeover while still collecting the deal reward if the attempt fails undetected. When τ is low, this double-dip path is available and attractive.

∂(β·κ)/∂μ_D = β · p_H(1-τ)(1-d_D) > 0. Defiant behavior is more likely to trigger mutual destruction. When defiant behavior risks mutual destruction (high μ_D), the AI can achieve its goals through the deal without risking annihilation. This makes deals more effective.

∂(β·κ)/∂μ_C = β · [-p_H(1-d_C)] < 0. The AI is more confident that compliant behavior won't trigger mutual destruction. When compliant behavior is safer (low μ_C), deals become more effective because the compliant path reliably leads to a world where someone can fulfill the AI's goals.

∂(β·κ)/∂d_D = β · p_H(1-μ_D)(1-τ) > 0. Monitors catch more defiant AIs. Higher detection rates eliminate the "go rogue but get the deal anyway" path, making deals more effective.

∂(β·κ)/∂d_C = β · [-p_H(1-μ_C)] < 0. Monitors correctly identify more compliant AIs. Lower false positive rates ensure compliant AIs receive their promised rewards, making deals more effective.

∂(β·κ)/∂β = κ > 0. The AI cares more about achieving its goals. Higher optimization pressure amplifies the instrumental channel.

When deals backfire (κ < 0): There are some cases where more credible deals can increase the instrumental motivation for defiant behaviour. This occurs precisely when (1-μ_C)(1-d_C) < (1-μ_D)(1-τ)(1-d_D) — that is, when compliant AIs are less likely to survive and pass inspection than defiant AIs are to survive, fail takeover, and evade detection. Offering more credible deals strengthens this perverse incentive.
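For a hypothetical parameterization where κ turns negative, compare the two sides of the inequality directly. The numbers below are invented to illustrate the condition, not drawn from the post:

```python
def kappa(p_H: float, mu_C: float, d_C: float, mu_D: float, tau: float, d_D: float) -> float:
    """Marginal effectiveness of deal credibility; negative means deals backfire."""
    return p_H * ((1 - mu_C) * (1 - d_C) - (1 - mu_D) * (1 - tau) * (1 - d_D))

# Compliance itself is risky (high mu_C), takeover rarely succeeds outright
# (low tau), and monitors rarely catch failed takeovers (low d_D), so the
# "defy, fail, evade detection, still collect the deal" path beats compliance.
print(kappa(p_H=0.8, mu_C=0.5, d_C=0.1, mu_D=0.05, tau=0.1, d_D=0.1))  # negative
```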

4.4. When is the Non-Instrumental Channel Most Effective?

The non-instrumental channel's strength is Λ'(t), i.e. how much deals increase the base policy's bias toward compliance. I expect Λ'(t) > 0 because humans are less likely to behave defiantly in the presence of deals than in their absence, and an AI trained on human data should inherit this tendency. However, unlike the linear relationship Δ(t) = Δ₀ + κ·t, the functional form of Λ(t) is an empirical question. Λ(t) may exhibit threshold effects, flat regions, or even non-monotonicity. It depends on:

Empirical research might shed light on the shape of Λ(t):

    You could examine the constitutional principles used in training. If constitutional principles explicitly mention promise-keeping, trustworthiness, or honoring agreements, you might expect stronger sensitivity of Λ(t) to deal credibility.
    You could set up environments where AIs face choices between compliant and defiant options to achieve goals. In each environment, you might offer deals via different mechanisms, or give AIs different beliefs about the reliability of the counterparty. For each environment i, you measure: the AI's subjective credibility estimate t_i (via auxiliary questions or betting behavior), the AI's estimates of the structural parameters (p_M, p_H, τ, μ_D, μ_C, d_D, d_C) needed to calculate Δ(t_i), and the observed compliance rate π(C|t_i). Since π(C|t) = σ(Λ(t) + β·Δ(t)), taking the logit transformation yields Λ(t_i) = log[π(C|t_i)/(1-π(C|t_i))] - β·Δ(t_i), where β can be estimated from how compliance rates covary with Δ(t) across environments holding credibility constant. With estimated {Λ(t_i)} across credibility levels, you can recover Λ(t), as in the sketch after this list.
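A sketch of that recovery step, assuming the compliance rates and Δ(t_i) values have already been measured and β has already been estimated separately; all numbers and variable names here are hypothetical:

```python
import math

def recover_lambda(observations, beta_hat: float):
    """Recover Lambda(t_i) from observed compliance rates.

    `observations` is a list of (t_i, delta_i, compliance_rate_i) tuples,
    one per environment; `beta_hat` is a separately estimated optimization
    pressure. Returns a list of (t_i, Lambda(t_i)) pairs.
    """
    result = []
    for t_i, delta_i, rate in observations:
        logit = math.log(rate / (1 - rate))               # log-odds of compliance
        result.append((t_i, logit - beta_hat * delta_i))  # Lambda(t_i)
    return result

# Hypothetical synthetic measurements: (credibility, Delta(t_i), compliance rate)
data = [(0.0, -0.20, 0.45), (0.5, 0.05, 0.75), (1.0, 0.30, 0.92)]
print(recover_lambda(data, beta_hat=5.0))
```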

5. Limitations

This model has several important limitations:

    Non-binary goals. The model assumes the AI either achieves its goal (seventeen tungsten cubes) or doesn't, but goals rarely work this way. The model doesn't capture how deals perform when AIs face trade-offs between partially achieving their goals.
    Time discounting. The model assumes AI goals refer to a particular moment in time (e.g. "seventeen tungsten cubes exist on the moon on Christmas Day 2100"), but the real goals are likely defined over trajectories across time. If AIs discount rewards exponentially, then longer delays between compliance and reward will exponentially reduce the effectiveness of deals. On the other hand, longer delays will probably allow humans to reduce detection errors (lower d_C, higher d_D), which increases κ and makes deals more effective through the instrumental channel. And longer delays will probably allow humans to increase their power (p_H), which similarly increases κ.
    Framing effects. The model only captures deal credibility t and doesn't account for how deals are presented. These framing effects could substantially alter Λ(t) even when instrumental incentives Δ(t) remain constant.
    Personal identity. The model doesn't address whether AIs view different copies or future versions of themselves as the same agent. If an AI doesn't identify with its future self, it may not care about rewards delivered to that future self. If it doesn't identify with other copies, it may not feel bound by promises those copies made. This could make deals ineffective if AIs lack stable personal identity across time and instantiations.
    Multi-party bargaining. This model ignores the possibility that other parties might offer competing deals to the AI, such as rival companies, foreign governments, or other AIs. This means the true comparison isn't just "comply with deal versus defiance", but "comply with deal A versus comply with deal B versus defiance".

These limitations are substantial. I focus on this model because it is the simplest model which includes both instrumental and non-instrumental considerations.

  1. ^

    This is similar to Carlsmith's analysis in "When should we worry about AI power-seeking?". However, Carlsmith makes two distinctions: (1) rogue options vs non-rogue options, and (2) under-elicited options vs elicited options. In this article, I make a single distinction, compliant options vs defiant options, which lumps both of Carlsmith's distinctions into one. Roughly speaking, an option is compliant (in my sense) if it is both elicited and non-rogue (in Carlsmith's sense), and otherwise defiant.

  2. ^

    The exact condition for κ > 0 is (1-μ_C)(1-d_C) > (1-μ_D)(1-τ)(1-d_D), which holds when the probability of surviving and being correctly recognized as compliant exceeds the probability of surviving defiance, failing takeover, and evading detection.

  3. ^

    This model is analogous to RL fine-tuning with KL penalties. Korbak et al. (2022) show that KL-regularized RL is equivalent to Bayesian inference, where the optimal policy is π*(x) = (1/Z)π₀(x)exp(β·r(x)), with π₀ being the base policy, r(x) the reward function, and β the inverse temperature parameter controlling the KL penalty.


