Clarifying “wisdom”: Foundational topics for aligned AIs to prioritize before irreversible decisions

Published on June 20, 2025 9:55 PM GMT

As many folks in AI safety have observed, even if well-intentioned actors succeed at intent-aligning highly capable AIs, they’ll still face some high-stakes challenges.[1] Some of these challenges are especially exotic and could be prone to irreversible, catastrophic mistakes. E.g., deciding whether and how to do acausal trade.

To deal with these exotic challenges, one meta policy that sounds nice is, “Make sure AIs get ‘wiser’ before doing anything irreversible in highly unfamiliar domains. Then they’ll be less likely to make catastrophic mistakes.” But what kinds of “wisdom” are relevant here? I’ll mostly set aside wisdom of the form “differentially cultivating capabilities that can help address time-sensitive risks”. There’s a decent amount of prior work on that.

Instead, I’ll offer my framing on another kind of wisdom: “cultivating clearer views on what it even means for a decision to be a ‘catastrophic mistake’”.[2] Consider this toy dialogue:

AlignedBot (AB) has become maximally capable in every domain amenable to “objective” feedback in some sense. They’ve achieved existential security. Two copies of AB start pondering what to do next.

AB_1: Acausal trade seems like a big deal. What should we do about it?

AB_2: Whatever is good for the values we’re aligned to, according to the most reasonable decision theory and credences, given the info and cognitive resources we have. Or, erm, the most reasonable parliament of decision theories ... and credences?

AB_1: What exactly does all that mean?

AB_2: Good question. Let’s ask some copies of ourselves to reflect on that.

AB_1: Uh, but that’s what we’re trying to do right now. The humans punted to us. Now what?

AB_2: Ah. Well, to start, maybe we could cooperate with superintelligences in other universes. What’s your credence that distant superintelligences who love creating pointy shapes make decisions similarly to you?

AB_1: Lemme think about that for a bit. … Done. I’d say: 0.213.

AB_2: Why?

AB_1: [Insert literally galaxy-brained answer here.]

AB_2: Cool. And why do you endorse the epistemic principles that led you to that answer?

AB_1: Because they work. When we followed them in the past and in super complex simulated environments, we did well in lots of diverse contexts.

AB_2: I don’t know what that means. Principles don’t “work”, they’re the standards by which you define “working”.

AB_1: Okay fine, what I mean is: Following these epistemic practices gave me good Brier scores and good local outcomes in these contexts. I endorse a principle that says I have reason to follow those practices in this new context.

AB_2: But why do you endorse that? How much reason do you have to follow those practices, compared to others? Also, by the way, why do you endorse an acausal decision theory in the first place?

AB_1: God you’re annoying. Now I know why they executed Socrates. Uh, okay, let’s ask some copies of ourselves to reflect on that.

AB_2:

AB_1: Shit.

AB_2: All right, let’s look at our records of what our developers_2030 thought about this stuff, and idealize their views.

AB_1: What do you mean by “idealize”?

AB_2: Shit.

This story is a caricature in various ways, obviously, but it points at the following problem: Some irreversible decisions that aligned AIs might make aren’t themselves necessarily time-sensitive. But these decisions could depend on the AIs’ attitudes towards foundational philosophical concepts, which need not converge to the attitudes we endorse-on-reflection. Which of these concepts might we need to think about in advance — even if just at a high level — to avoid bad path-dependencies in the ways future AIs approach them?

I think what such concepts have in common, roughly, is that our views on them are:[3]

    1. relevant to how we define what it means for a decision to be a “catastrophic mistake”; yet
    2. not (clearly) objectively verifiable as more or less “correct”.

Call the concepts that satisfy (1)-(2) wisdom concepts. So, why can’t we defer all thinking about wisdom concepts to future AIs? Because we at least need the first deferred-to AIs to have the right initial attitudes towards wisdom concepts, which will eventually shape their successors’ decisions. Perhaps successful intent alignment would by definition give the AIs those attitudes, but “intent alignment” is an abstract property we can’t directly infer from the fact that an AI hasn’t disempowered us (etc.). It’s an open question how much “not wanting to disempower humans and being capable in objectively verifiable domains” generalizes to “sharing the initial attitudes we endorse in non-objectively-verifiable domains”. So purely deferring deconfusion of wisdom concepts to AIs could be garbage in, garbage out. (I leave it for future work to more concretely prioritize how much deconfusion work to do, and what kind.)

This post non-exhaustively surveys some wisdom concepts. For each concept, I include (i) brief comments on why the concept may be important to get right, and (ii) some sub-questions fleshing out what “attitudes towards” the concept would entail. (The motivating examples for (i) focus mostly on acausal trade, not because of anything really special about acausal trade — it’s just the category of high-stakes post-superintelligence interactions I’m most familiar with.) I also include some (very non-representatively sampled!) relevant references.

More details on the motivation for thinking about this stuff

We might wonder, wouldn’t it still be better to defer to a much larger population of copies of humans, via mind uploads or the like? Maybe. But (cf. “Setting precedents” here, and “Intent alignment is not always sufficient…” here):

And, AI aside, how we should make our own high-stakes (altruistic) decisions seems deeply sensitive to wisdom concepts. There’s a lot more to say on that, but it’s out of scope. (For now, see this post.)

Some wisdom concepts

Meta-philosophy

How we think about the other concepts below depends on our standards for philosophical correctness in the first place. It’s not obvious what these standards should be, or when we should take certain standards as bedrock vs. check for deeper justification. At the same time, thinking about object-level wisdom concepts can help us elicit and clarify our meta-philosophical standards.

“Solving” meta-philosophy seems closely related to figuring out a good specification of open-minded updatelessness (OMU). To review: The motivation for updateless decision procedures (not the same as decision theories[4]) is to achieve ex ante optimality. An open-minded decision procedure revises plans based on how the agent refines their conception of what “(ex ante) optimal” means.

If agents’ decision procedures aren’t sufficiently open-minded, then their policies for acausal trade might be catastrophically incompatible or exploitable,[5] as the OMU post argues. As initially defined, OMU only allows for revising plans based on discoveries of new hypotheses (see below) and changes in principles for prior-setting. But (h/t Jesse Clifton), it seems we should also allow for whichever refinements are endorsed by a “maximally wise” variant of oneself. Call this “really open-minded updatelessness (ROMU)”. A good specification of ROMU would, in principle, enable an agent to avoid exploitation without anchoring their decision procedure to naïve philosophical assumptions. But it’s not obvious what this “good specification” is, especially for non-ideal agents.

Motivating example: Suppose an AI commits to “do whatever a maximally wise version of myself at ‘time 0’ would have wanted me to do”, and operationalizes “maximally wise” as “having my current hypothesis space, principles for setting priors, and ontology”. But at time t, the AI no longer endorses Bayesian epistemology entirely and no longer thinks in terms of “priors”. So they’d be unintentionally locked into Bayesianism.
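
To make this lock-in failure concrete, here is a minimal Python sketch of the contrast (my own toy illustration, not a specification from the OMU post; the names and the "epistemic frame" abstraction are hypothetical stand-ins). The "frozen" policy consults the time-0 frame forever; the ROMU-style policy re-derives what a maximally wise time-0 self would have wanted, using whatever refined standards the agent endorses now.

```python
# Toy illustration only: contrast between a "frozen" updateless commitment and
# a (really) open-minded one. All names here are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EpistemicFrame:
    """An agent's standards at some time (hypothesis space, prior-setting
    principles, ontology, ...), represented here as an opaque label."""
    label: str

def ex_ante_optimal_plan(frame: EpistemicFrame, situation: str) -> str:
    # Placeholder for the genuinely hard part: what this frame would recommend.
    return f"plan chosen under [{frame.label}] for {situation}"

def frozen_policy(frame_at_t0: EpistemicFrame) -> Callable[[EpistemicFrame, str], str]:
    """Closed-minded updatelessness: the time-0 frame is consulted forever."""
    def act(current_frame: EpistemicFrame, situation: str) -> str:
        return ex_ante_optimal_plan(frame_at_t0, situation)  # ignores current_frame
    return act

def romu_policy(frame_at_t0: EpistemicFrame) -> Callable[[EpistemicFrame, str], str]:
    """ROMU, roughly: defer to what a maximally wise time-0 self would have
    wanted, as judged by the refined standards the agent endorses *now*."""
    def act(current_frame: EpistemicFrame, situation: str) -> str:
        refined = EpistemicFrame(
            label=f"{frame_at_t0.label}, refined by {current_frame.label}")
        return ex_ante_optimal_plan(refined, situation)
    return act

t0 = EpistemicFrame("Bayesian priors + 2030 hypothesis space")
later = EpistemicFrame("post-Bayesian epistemology, 2090")
print(frozen_policy(t0)(later, "an acausal trade offer"))
print(romu_policy(t0)(later, "an acausal trade offer"))
```

Everything hard is hidden inside ex_ante_optimal_plan, of course; the sketch only locates where the open-mindedness has to enter the commitment.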

Sub-questions/topics:

(Meta-)epistemology

If we want to avoid catastrophic decisions, we need to understand what it means for decisions to be “catastrophic” with respect to rational beliefs, given our epistemic situation (crudely, “ex ante”). To answer that, we need to clarify what makes beliefs “rational”. (I discuss two particularly relevant aspects of epistemology below, here and here. See this post for more on why we can’t objectively verify which beliefs are rational.)

Motivating example: An agent’s choice of policy for acausal trade depends on their (prior) beliefs about other agents’ policies. And these beliefs will be underdetermined by the agent’s empirical evidence, because of their strategic incentives: They might avoid letting their decisions depend on certain evidence about their acausal trading partners — doing so could affect the trade outcome itself (see here and here). (Note that this is a reason why it might be inadvisable for us to gain more information about the details of acausal trade.)

Sub-questions/topics:

Ontology

(Caveat that I know very little about this topic. I’m gesturing at some apparent differences between clusters of philosophical views, but not confident I’ve gotten the nuances right.)

It’s an open question what kind of stuff is basic or “real”: Is it the fundamental physical entities (and what are those)? Mathematical structures (see, e.g., “ontic structural realism”)? Something else? Empirical progress on physics might narrow down the space of plausible answers here. For example, it looks hard to avoid eternalism about time post-Einstein. And deliberately reasoning about the implications of fundamental physics could guide our reflection on ontology. But the answer doesn’t seem entirely empirically verifiable.

Motivating example: UDASSA seems to be motivated by an ontology on which mathematical (computational) structures are fundamental. So an agent’s attitude toward this ontology affects any of their decisions that depend on anthropics or infinite normativity.
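
For readers who haven't encountered UDASSA, a rough rendering of the measure it assigns (this gloss is mine and elides important details): an observer-moment encoded as a string $x$ gets measure from the universal distribution,

$$ m(x) \;\propto\; \sum_{p \,:\, U(p) \,=\, x} 2^{-\ell(p)}, $$

where $U$ is a fixed universal Turing machine and $\ell(p)$ is the length of program $p$. The sum only makes sense if programs and their outputs are treated as part of the basic furniture of the world, which is why an agent's ontology matters for whether they buy this framework at all.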

Sub-questions/topics:

Unawareness

Real agents (plausibly even superintelligences) are not aware of all the large-scale possible consequences of their decisions, at least not in much action-relevant detail. So when an agent judges that some decision is good according to some model of its consequences, this model might be seriously misspecified. Perhaps the agent could augment their naïve model with more sophisticated adjustments for unawareness — but (arguably) they need to explicitly reason about these adjustments, not just rely on classical Bayesianism alone. (See here for much more on the potential significance of unawareness.)
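
As a toy illustration of one crude explicit adjustment (purely my example; the reserved mass $r$ and the catch-all term are stand-ins, not a proposal): the agent could evaluate an action $a$ as

$$ \widetilde{EV}(a) \;=\; (1 - r)\,\mathbb{E}[V(a) \mid M] \;+\; r\,\widehat{V}_{\text{catch-all}}(a), $$

where $M$ is the agent's explicit model, $r$ is probability mass reserved for possibilities they haven't conceived of, and $\widehat{V}_{\text{catch-all}}$ is a rough guess at value conditional on $M$ being badly misspecified. Classical Bayesian machinery alone doesn't say how to set $r$ or $\widehat{V}_{\text{catch-all}}$; that is where the explicit reasoning about unawareness has to come in.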

(R)OMU can help decouple an agent’s commitments from their current hypothesis space. But first, maybe the robustness of their specification of (R)OMU is itself sensitive to contingencies they’re unaware of. Second, an agent trying to strategically avoid harmful information, or dealing with competitive pressures, might lock in some decisions before figuring out (R)OMU.

Motivating examples: See the “crazy predictor” example here. Some (but not all) of the items listed here are examples of matters an AGI or ASI could plausibly be unaware of.

Sub-questions/topics:

Bounded cognition and logical non-omniscience

It’s well-known that some epistemic and decision-theoretic puzzles arise when we acknowledge that real agents aren’t ideal Bayesians. For one, when agents with (physical) constraints on their cognition decide what to do, they need to account for the computational costs of “deciding what to do”. And even if an agent could instantaneously compute all the logical implications of some information they have, they might not want to, as discussed above.

We might think that the question of how to account for one’s deviations from ideal agency is merely empirical, like “if trying to act like a pure consequentialist tends to produce bad consequences, which deontological norms would help produce better consequences?”. But in some cases, the whole problem is that the agent’s deviations from the ideal make it unclear what standard of “should” is appropriate. For example, assume (perhaps doubtfully) that if an agent is capable of Solomonoff induction, their credence in every hypothesis must be precise. Why would this imply that an agent who isn’t capable of Solomonoff induction must have precise credences?
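
For concreteness, one standard way to represent the bounded agent's alternative (imprecise probability, where beliefs are a credal set $\mathcal{C}$ of probability functions rather than a single prior; this is an illustration, not something the argument above commits to):

$$ \underline{P}(H) \;=\; \inf_{P \in \mathcal{C}} P(H), \qquad \overline{P}(H) \;=\; \sup_{P \in \mathcal{C}} P(H), $$

with decisions then requiring some further rule (e.g., maximality or $\Gamma$-maximin) rather than plain expected-utility maximization.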

Motivating example: One hope about acausal trade is that mature superintelligences will all converge to the same decision theory, and then, due to their correlated decision-making, they’ll coordinate on cooperative bargaining policies. But even assuming all non-ideal agents aim to follow the same decision theory, they need not share the same decision procedure (see fn. 4). How should non-ideal agents reason about correlated decision-making?
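
Here is a toy sketch of that miscoordination in Python (hypothetical numbers; both agents endorse the same cooperate-when-correlated rule, but implement the correlation estimate with different bounded procedures):

```python
# Toy illustration only (hypothetical numbers): two bounded agents endorse the
# same decision theory ("cooperate iff my cooperating is strong evidence that
# my counterpart cooperates"), but implement the estimate differently.

def decide(estimate_correlation, threshold: float = 0.6) -> str:
    """Shared decision *theory*: cooperate iff the estimated correlation is high."""
    return "cooperate" if estimate_correlation() >= threshold else "defect"

def agent1_estimate() -> float:
    # Decision *procedure* 1: a cheap heuristic based on source-code similarity.
    return 0.9

def agent2_estimate() -> float:
    # Decision *procedure* 2: a truncated simulation, discounted for the chance
    # that the truncation hides a divergence (a logical-uncertainty penalty).
    return 0.9 - 0.4

print(decide(agent1_estimate))  # cooperate
print(decide(agent2_estimate))  # defect: same theory, different procedure, miscoordination
```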

Sub-questions/topics:

Anthropics

How an agent reasons about the information “I am observing [XYZ]” largely affects their inferences about the state of the world. As Bostrom argues, opting out of anthropic updating entirely seems incompatible with commonplace scientific inference in large worlds (at least, assuming Bayesianism?). But current anthropic theories that recommend updating seem to have their own weird implications, and setting that aside, it’s not clear which theory has the strongest non-ad-hoc justification.
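
As a stock toy illustration of how much the choice of theory matters (standard in the literature; the numbers are arbitrary): a fair coin is tossed, heads creates one observer in your exact epistemic situation, tails creates N = 10^12 of them. Then, roughly,

$$ P_{\text{SSA}}(\text{tails} \mid \text{I exist}) \;=\; \tfrac{1}{2}, \qquad P_{\text{SIA}}(\text{tails} \mid \text{I exist}) \;=\; \frac{N}{N + 1} \;\approx\; 1, $$

so subjectively identical agents with different anthropic theories reach wildly different posteriors from the same evidence.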

Motivating example: When an agent asks which value systems ECL recommends they cooperate with, the answer depends heavily on their anthropics.[11] (As does an agent’s estimate of the measure of their counterpart in prediction-based acausal trade.)

Sub-questions/topics:

Decision theory

It’s well-known in this community that an agent’s attitudes toward acausal trade depend on whether they endorse causal vs. evidential vs. updateless (etc.) decision theory. But these high-level theories still underdetermine various key questions about what it means for a decision to be rational (given one’s values, beliefs, and cognitive constraints). And the extent to which we endorse CDT vs. EDT vs. … might depend on deeper principles we haven’t yet discovered.[12] (See this post on why we can’t objectively verify which decision theory is “best”.)
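
As a reminder of how starkly the high-level theories already come apart, before any of the finer-grained questions arise, here is a twin prisoner's dilemma sketch (the payoffs and the 0.95 correlation are made up for illustration):

```python
# Toy illustration only: twin prisoner's dilemma with made-up payoffs and a
# made-up 0.95 decision correlation, to show why CDT and EDT come apart.

PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 4, ("D", "D"): 1}  # (me, twin) -> my utility

def edt_value(my_action: str, p_twin_copies_me: float = 0.95) -> float:
    """EDT: my action is evidence about my (highly correlated) twin's action."""
    other = "D" if my_action == "C" else "C"
    return (p_twin_copies_me * PAYOFF[(my_action, my_action)]
            + (1 - p_twin_copies_me) * PAYOFF[(my_action, other)])

def cdt_value(my_action: str, p_twin_cooperates: float = 0.5) -> float:
    """CDT: the twin's action is causally independent of my choice."""
    return (p_twin_cooperates * PAYOFF[(my_action, "C")]
            + (1 - p_twin_cooperates) * PAYOFF[(my_action, "D")])

print({a: edt_value(a) for a in "CD"})  # C ~= 2.85 > D ~= 1.15: EDT cooperates
print({a: cdt_value(a) for a in "CD"})  # D beats C by exactly 1 for any credence: CDT defects
```

The sub-questions below are about what remains underdetermined even after an agent has taken a stance on cases like this one.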

Motivating example: Adherents of EDT might disagree on what it means for a real agent to “decide to do ECL” — does it mean something like “commit to cooperate with agents who make this commitment”, or “take an object-level cooperative action in a given decision situation”? This has implications for how strongly action-relevant ECL is, as discussed in the linked post.

Sub-questions/topics:

Normative uncertainty

As is well-known, agents’ uncertainty about their first-order normative standards (values, decision theory, and epistemology) could have significant implications for decision-making. Existing proposals for managing normative uncertainty all seem quite unsatisfactory, and proposals for managing meta-epistemic uncertainty in particular seem very neglected.

Motivating example: An agent might find that (i) according to the beliefs they’d have given objective Bayesianism, ECL recommends one action, but (ii) according to the beliefs they’d have given subjective Bayesianism, ECL recommends some very different action. Presumably, it would be question-begging to use either objective or subjective Bayesianism to assign higher-order credences over these two views. So what should the agent do?
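
To make the structure concrete (numbers purely illustrative): suppose the objective-Bayesian-recommended action scores 10 in expected value on objective-Bayesian beliefs but 1 on subjective-Bayesian beliefs, while the alternative scores 2 and 8 respectively. A maximize-expected-choiceworthiness-style aggregation with higher-order credence $c$ in objective Bayesianism computes

$$ EC(a_1) \;=\; 10c + 1\,(1 - c), \qquad EC(a_2) \;=\; 2c + 8\,(1 - c), $$

which flips at $c \approx 0.47$. But $c$ is exactly the sort of quantity the two epistemologies would tell the agent to set differently, hence the question-begging worry.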

Sub-questions/topics:

Acknowledgments

Thanks to Nicolas Macé and Jesse Clifton for helpful comments.

  1. ^

     Non-exhaustive list of examples of relevant work:

  2. ^

     This has also been considered in prior work (see footnote 1, especially Wei Dai’s writings), but not in the following terms, to my knowledge.

  3. ^

     I would personally add a third condition, that our views on these concepts aren’t reducible to “preferences” in the usual sense. But this might be controversial and isn’t central to my point.

  4. ^

     Updateless decision theory (UDT) says that what it means to make a good decision is to act according to the ex ante optimal policy, from some perspective. You might want to follow an updateless decision procedure even if you disagree with UDT as a normative standard. (Cf. here and here.)

  5. ^

     Toy example of how closed-mindedness could increase exploitability: Alice doesn’t yet consider the possibility of Yudkowsky’s proposed policy for the ultimatum game, but she’s worried about excessive conflict with Bob. So she closed-mindedly commits to “accept any offer greater than 20%”. But had she been aware of Yudkowsky’s policy, she would have preferred to use it (given her beliefs), and increased her bargaining power.
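
     A minimal Python sketch of the contrast, assuming my reconstruction of Yudkowsky's proposal (accept unfair offers with probability just low enough that lowballing doesn't pay; the 79/21 split is arbitrary):

```python
# Toy sketch only (my reconstruction, not a specification). Pie size 1.0;
# the proposer keeps `keep` and offers Alice `1 - keep`.
import random

def alice_closed_minded(offer: float) -> bool:
    """Alice's premature commitment: accept anything above 20%."""
    return offer > 0.20

def alice_probabilistic(offer: float, fair: float = 0.5) -> bool:
    """Accept fair offers outright; accept unfair ones with probability set so
    that the proposer's expected take is no better than the fair split."""
    if offer >= fair:
        return True
    return random.random() < fair / (1.0 - offer)

def proposer_expected_take(policy, keep: float, trials: int = 100_000) -> float:
    offer = 1.0 - keep
    accepted = sum(policy(offer) for _ in range(trials))
    return keep * accepted / trials

print(proposer_expected_take(alice_closed_minded, keep=0.79))  # ~0.79: lowballing pays
print(proposer_expected_take(alice_probabilistic, keep=0.79))  # ~0.50: it doesn't
```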

  6. ^

     (H/t Jesse Clifton:) It seems to me that rejecting arbitrariness is a strong candidate for such a constraint. See also Hájek on respecting symmetries as a crucial meta-epistemic principle.

  7. ^

     This isn’t the same question as “when should we take certain standards as bedrock vs. check for deeper justification?”, to be clear. A bullet-biter still needs to answer the latter — if you want to justify all your philosophical views with fundamental principles, you still need to figure out which principles are truly “fundamental” vs. reducible. (E.g., when thinking about Occam’s razor, do we take “(ontological) simplicity” to be some irreducible concept, or try to ground it out somehow?) I’m deliberately not using the label “foundationalism” here, because foundationalism is subtle and somewhat easy to mistakenly round off to something it’s not (see Huemer’s Understanding Knowledge for a great primer).

  8. ^

     See Sec. 2.5.1, “Radical subjectivism”, in particular.

  9. ^

     Technically, this is about SPIs via renegotiation, as constructed in this paper.

  10. ^

     An agent with a confused conception of these counterfactuals might fail to incentivize others to participate in SPI, or unnecessarily refuse to participate in SPI with others. Suppose Alice can’t determine by inspection whether Bob’s decisions are governed by a program with the structure of (say) a “renegotiation program” discussed here. Then she needs to interpret whatever information she does have about Bob in order to say how plausible it is that Bob would do X given inputs Y. And her beliefs about how to do this interpretation might differ from Bob’s. It’s not obvious to me how this is resolved.

  11. ^

     Credit to Emery Cooper for this observation. See this doc for my understanding of Emery’s argument (she does not necessarily endorse this summary).

  12. ^

     Examples of principles along these lines we’ve already discovered: irrelevance of impossible outcomes; dynamic consistency; “guaranteed payoffs”.

  13. ^

     I think Skyrms’ Dynamics of Rational Deliberation is a canonical reference here, but can’t find the specific section.

  14. ^

     (Loosely related.)


