Clarifying “wisdom”: Foundational topics for aligned AIs to prioritize before irreversible decisions

Published on June 20, 2025 9:55 PM GMT

As many folks in AI safety have observed, even if well-intentioned actors succeed at intent-aligning highly capable AIs, they’ll still face some high-stakes challenges.[1] Some of these challenges are especially exotic and could be prone to irreversible, catastrophic mistakes. E.g., deciding whether and how to do acausal trade.

To deal with these exotic challenges, one meta policy that sounds nice is, “Make sure AIs get ‘wiser’ before doing anything irreversible in highly unfamiliar domains. Then they’ll be less likely to make catastrophic mistakes.” But what kinds of “wisdom” are relevant here? I’ll mostly set aside wisdom of the form “differentially cultivating capabilities that can help address time-sensitive risks”. There’s a decent amount of prior work on that.

Instead, I’ll offer my framing on another kind of wisdom: “cultivating clearer views on what it even means for a decision to be a ‘catastrophic mistake’”.[2] Consider this toy dialogue:

AlignedBot (AB) has become maximally capable in every domain amenable to “objective” feedback in some sense. They’ve achieved existential security. Two copies of AB start pondering what to do next.

AB_1: Acausal trade seems like a big deal. What should we do about it?

AB_2: Whatever is good for the values we’re aligned to, according to the most reasonable decision theory and credences, given the info and cognitive resources we have. Or, erm, the most reasonable parliament of decision theories ... and credences?

AB_1: What exactly does all that mean?

AB_2: Good question. Let’s ask some copies of ourselves to reflect on that.

AB_1: Uh, but that’s what we’re trying to do right now. The humans punted to us. Now what?

AB_2: Ah. Well, to start, maybe we could cooperate with superintelligences in other universes. What’s your credence that distant superintelligences who love creating pointy shapes make decisions similarly to you?

AB_1: Lemme think about that for a bit. … Done. I’d say: 0.213.

AB_2: Why?

AB_1: [Insert literally galaxy-brained answer here.]

AB_2: Cool. And why do you endorse the epistemic principles that led you to that answer?

AB_1: Because they work. When we followed them in the past and in super complex simulated environments, we did well in lots of diverse contexts.

AB_2: I don’t know what that means. Principles don’t “work”, they’re the standards by which you define “working”.

AB_1: Okay fine, what I mean is: Following these epistemic practices gave me good Brier scores and good local outcomes in these contexts. I endorse a principle that says I have reason to follow those practices in this new context.

AB_2: But why do you endorse that? How much reason do you have to follow those practices, compared to others? Also, by the way, why do you endorse an acausal decision theory in the first place?

AB_1: God you’re annoying. Now I know why they executed Socrates. Uh, okay, let’s ask some copies of ourselves to reflect on that.

AB_2:

AB_1: Shit.

AB_2: All right, let’s look at our records of what our developers_2030 thought about this stuff, and idealize their views.

AB_1: What do you mean by “idealize”?

AB_2: Shit.

This story is a caricature in various ways, obviously, but it points at the following problem: Some irreversible decisions that aligned AIs might make aren’t themselves necessarily time-sensitive. But these decisions could depend on the AIs’ attitudes towards foundational philosophical concepts, which need not converge to the attitudes we endorse-on-reflection. Which of these concepts might we need to think about in advance — even if just at a high level — to avoid bad path-dependencies in the ways future AIs approach them?

I think what such concepts have in common, roughly, is that our views on them are:[3]

    1. relevant to how we define what it means for a decision to be a “catastrophic mistake”; yet
    2. not (clearly) objectively verifiable as more or less “correct”.

Call the concepts that satisfy (1)-(2) wisdom concepts. So, why can’t we defer all thinking about wisdom concepts to future AIs? Because we at least need the first deferred-to AIs to have the right initial attitudes towards wisdom concepts, which will eventually shape their successors’ decisions. Perhaps successful intent alignment would by definition give the AIs those attitudes, but “intent alignment” is an abstract property we can’t directly infer from the fact that an AI hasn’t disempowered us (etc.). It’s an open question how much “not wanting to disempower humans and being capable in objectively verifiable domains” generalizes to “sharing the initial attitudes we endorse in non-objectively-verifiable domains”. So purely deferring deconfusion of wisdom concepts to AIs could be garbage in, garbage out. (I leave it for future work to more concretely prioritize how much deconfusion work to do, and what kind.)

This post non-exhaustively surveys some wisdom concepts. For each concept, I include (i) brief comments on why the concept may be important to get right, and (ii) some sub-questions fleshing out what “attitudes towards” the concept would entail. (The motivating examples for (i) focus mostly on acausal trade, not because of anything really special about acausal trade — it’s just the category of high-stakes post-superintelligence interactions I’m most familiar with.) I also include some (very non-representatively sampled!) relevant references.

More details on the motivation for thinking about this stuff

We might wonder, wouldn’t it still be better to defer to a much larger population of copies of humans, via mind uploads or the like? Maybe. But (cf. “Setting precedents” here, and “Intent alignment is not always sufficient…” here):

And, AI aside, how we should make our own high-stakes (altruistic) decisions seems deeply sensitive to wisdom concepts. There’s a lot more to say on that, but it’s out of scope. (For now, see this post.)

Some wisdom concepts

Meta-philosophy

How we think about the other concepts below depends on our standards for philosophical correctness in the first place. It’s not obvious what these standards should be, or when we should take certain standards as bedrock vs. check for deeper justification. At the same time, thinking about object-level wisdom concepts can help us elicit and clarify our meta-philosophical standards.

“Solving” meta-philosophy seems closely related to figuring out a good specification of open-minded updatelessness (OMU). To review: The motivation for updateless decision procedures (not the same as decision theories[4]) is to achieve ex ante optimality. An open-minded decision procedure revises plans based on how the agent refines their conception of what “(ex ante) optimal” means.

If agents’ decision procedures aren’t sufficiently open-minded, then their policies for acausal trade might be catastrophically incompatible or exploitable,[5] as the OMU post argues. As initially defined, OMU only allows for revising plans based on discoveries of new hypotheses (see below) and changes in principles for prior-setting. But (h/t Jesse Clifton), it seems we should also allow for whichever refinements are endorsed by a “maximally wise” variant of oneself. Call this “really open-minded updatelessness (ROMU)”. A good specification of ROMU would, in principle, enable an agent to avoid exploitation without anchoring their decision procedure to naïve philosophical assumptions. But it’s not obvious what this “good specification” is, especially for non-ideal agents.

Motivating example: Suppose an AI commits to “do whatever a maximally wise version of myself at ‘time 0’ would have wanted me to do”, and operationalizes “maximally wise” as “having my current hypothesis space, principles for setting priors, and ontology”. But at time t, the AI no longer endorses Bayesian epistemology entirely and no longer thinks in terms of “priors”. So they’d be unintentionally locked into Bayesianism.
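
To make this lock-in failure concrete, here is a minimal Python sketch of the contrast (my own toy illustration, not a specification from the OMU post; the names and the "epistemic frame" abstraction are hypothetical stand-ins). The "frozen" policy consults the time-0 frame forever; the ROMU-style policy re-derives what a maximally wise time-0 self would have wanted, using whatever refined standards the agent endorses now.

```python
# Toy illustration only: contrast between a "frozen" updateless commitment and
# a (really) open-minded one. All names here are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EpistemicFrame:
    """An agent's standards at some time (hypothesis space, prior-setting
    principles, ontology, ...), represented here as an opaque label."""
    label: str

def ex_ante_optimal_plan(frame: EpistemicFrame, situation: str) -> str:
    # Placeholder for the genuinely hard part: what this frame would recommend.
    return f"plan chosen under [{frame.label}] for {situation}"

def frozen_policy(frame_at_t0: EpistemicFrame) -> Callable[[EpistemicFrame, str], str]:
    """Closed-minded updatelessness: the time-0 frame is consulted forever."""
    def act(current_frame: EpistemicFrame, situation: str) -> str:
        return ex_ante_optimal_plan(frame_at_t0, situation)  # ignores current_frame
    return act

def romu_policy(frame_at_t0: EpistemicFrame) -> Callable[[EpistemicFrame, str], str]:
    """ROMU, roughly: defer to what a maximally wise time-0 self would have
    wanted, as judged by the refined standards the agent endorses *now*."""
    def act(current_frame: EpistemicFrame, situation: str) -> str:
        refined = EpistemicFrame(
            label=f"{frame_at_t0.label}, refined by {current_frame.label}")
        return ex_ante_optimal_plan(refined, situation)
    return act

t0 = EpistemicFrame("Bayesian priors + 2030 hypothesis space")
later = EpistemicFrame("post-Bayesian epistemology, 2090")
print(frozen_policy(t0)(later, "an acausal trade offer"))
print(romu_policy(t0)(later, "an acausal trade offer"))
```

Everything hard is hidden inside ex_ante_optimal_plan, of course; the sketch only locates where the open-mindedness has to enter the commitment.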

Sub-questions/topics:

(Meta-)epistemology

If we want to avoid catastrophic decisions, we need to understand what it means for decisions to be “catastrophic” with respect to rational beliefs, given our epistemic situation (crudely, “ex ante”). To answer that, we need to clarify what makes beliefs “rational”. (I discuss two particularly relevant aspects of epistemology below, here and here. See this post for more on why we can’t objectively verify which beliefs are rational.)

Motivating example: An agent’s choice of policy for acausal trade depends on their (prior) beliefs about other agents’ policies. And these beliefs will be underdetermined by the agent’s empirical evidence, because of their strategic incentives: They might avoid letting their decisions depend on certain evidence about their acausal trading partners — doing so could affect the trade outcome itself (see here and here). (Note that this is a reason why it might be inadvisable for us to gain more information about the details of acausal trade.)

Sub-questions/topics:

Ontology

(Caveat that I know very little about this topic. I’m gesturing at some apparent differences between clusters of philosophical views, but not confident I’ve gotten the nuances right.)

It’s an open question what kind of stuff is basic or “real”: Is it the fundamental physical entities (and what are those)? Mathematical structures (see, e.g., “ontic structural realism”)? Something else? Empirical progress on physics might narrow down the space of plausible answers here. For example, it looks hard to avoid eternalism about time post-Einstein. And deliberately reasoning about the implications of fundamental physics could guide our reflection on ontology. But the answer doesn’t seem entirely empirically verifiable.

Motivating example: UDASSA seems to be motivated by an ontology on which mathematical (computational) structures are fundamental. So an agent’s attitude toward this ontology affects any of their decisions that depend on anthropics or infinite normativity.
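
For readers who haven't encountered UDASSA, a rough rendering of the measure it assigns (this gloss is mine and elides important details): an observer-moment encoded as a string $x$ gets measure from the universal distribution,

$$ m(x) \;\propto\; \sum_{p \,:\, U(p) \,=\, x} 2^{-\ell(p)}, $$

where $U$ is a fixed universal Turing machine and $\ell(p)$ is the length of program $p$. The sum only makes sense if programs and their outputs are treated as part of the basic furniture of the world, which is why an agent's ontology matters for whether they buy this framework at all.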

Sub-questions/topics:

Unawareness

Real agents (plausibly even superintelligences) are not aware of all the large-scale possible consequences of their decisions, at least not in much action-relevant detail. So when an agent judges that some decision is good according to some model of its consequences, this model might be seriously misspecified. Perhaps the agent could augment their naïve model with more sophisticated adjustments for unawareness — but (arguably) they need to explicitly reason about these adjustments, not just rely on classical Bayesianism alone. (See here for much more on the potential significance of unawareness.)
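
As a toy illustration of one crude explicit adjustment (purely my example; the reserved mass $r$ and the catch-all term are stand-ins, not a proposal): the agent could evaluate an action $a$ as

$$ \widetilde{EV}(a) \;=\; (1 - r)\,\mathbb{E}[V(a) \mid M] \;+\; r\,\widehat{V}_{\text{catch-all}}(a), $$

where $M$ is the agent's explicit model, $r$ is probability mass reserved for possibilities they haven't conceived of, and $\widehat{V}_{\text{catch-all}}$ is a rough guess at value conditional on $M$ being badly misspecified. Classical Bayesian machinery alone doesn't say how to set $r$ or $\widehat{V}_{\text{catch-all}}$; that is where the explicit reasoning about unawareness has to come in.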

(R)OMU can help decouple an agent’s commitments from their current hypothesis space. But first, maybe the robustness of their specification of (R)OMU is itself sensitive to contingencies they’re unaware of. Second, an agent trying to strategically avoid harmful information, or dealing with competitive pressures, might lock in some decisions before figuring out (R)OMU.

Motivating examples: See the “crazy predictor” example here. Some (but not all) of the items listed here are examples of matters an AGI or ASI could plausibly be unaware of.

Sub-questions/topics:

Bounded cognition and logical non-omniscience

It’s well-known that some epistemic and decision-theoretic puzzles arise when we acknowledge that real agents aren’t ideal Bayesians. For one, when agents with (physical) constraints on their cognition decide what to do, they need to account for the computational costs of “deciding what to do”. And even if an agent could instantaneously compute all the logical implications of some information they have, they might not want to, as discussed above.

We might think that the question of how to account for one’s deviations from ideal agency is merely empirical, like “if trying to act like a pure consequentialist tends to produce bad consequences, which deontological norms would help produce better consequences?”. But in some cases, the whole problem is that the agent’s deviations from the ideal make it unclear what standard of “should” is appropriate. For example, assume (perhaps doubtfully) that if an agent is capable of Solomonoff induction, their credence in every hypothesis must be precise. Why would this imply that an agent who isn’t capable of Solomonoff induction must have precise credences?
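
For concreteness, one standard way to represent the bounded agent's alternative (imprecise probability, where beliefs are a credal set $\mathcal{C}$ of probability functions rather than a single prior; this is an illustration, not something the argument above commits to):

$$ \underline{P}(H) \;=\; \inf_{P \in \mathcal{C}} P(H), \qquad \overline{P}(H) \;=\; \sup_{P \in \mathcal{C}} P(H), $$

with decisions then requiring some further rule (e.g., maximality or $\Gamma$-maximin) rather than plain expected-utility maximization.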

Motivating example: One hope about acausal trade is that mature superintelligences will all converge to the same decision theory, and then, due to their correlated decision-making, they’ll coordinate on cooperative bargaining policies. But even assuming all non-ideal agents aim to follow the same decision theory, they need not share the same decision procedure (see fn. 4). How should non-ideal agents reason about correlated decision-making?
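
Here is a toy sketch of that miscoordination in Python (hypothetical numbers; both agents endorse the same cooperate-when-correlated rule, but implement the correlation estimate with different bounded procedures):

```python
# Toy illustration only (hypothetical numbers): two bounded agents endorse the
# same decision theory ("cooperate iff my cooperating is strong evidence that
# my counterpart cooperates"), but implement the estimate differently.

def decide(estimate_correlation, threshold: float = 0.6) -> str:
    """Shared decision *theory*: cooperate iff the estimated correlation is high."""
    return "cooperate" if estimate_correlation() >= threshold else "defect"

def agent1_estimate() -> float:
    # Decision *procedure* 1: a cheap heuristic based on source-code similarity.
    return 0.9

def agent2_estimate() -> float:
    # Decision *procedure* 2: a truncated simulation, discounted for the chance
    # that the truncation hides a divergence (a logical-uncertainty penalty).
    return 0.9 - 0.4

print(decide(agent1_estimate))  # cooperate
print(decide(agent2_estimate))  # defect: same theory, different procedure, miscoordination
```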

Sub-questions/topics:

Anthropics

How an agent reasons about the information “I am observing [XYZ]” largely affects their inferences about the state of the world. As Bostrom argues, opting out of anthropic updating entirely seems incompatible with commonplace scientific inference in large worlds (at least, assuming Bayesianism?). But current anthropic theories that recommend updating seem to have their own weird implications, and setting that aside, it’s not clear which theory has the strongest non-ad-hoc justification.
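
As a stock toy illustration of how much the choice of theory matters (standard in the literature; the numbers are arbitrary): a fair coin is tossed, heads creates one observer in your exact epistemic situation, tails creates N = 10^12 of them. Then, roughly,

$$ P_{\text{SSA}}(\text{tails} \mid \text{I exist}) \;=\; \tfrac{1}{2}, \qquad P_{\text{SIA}}(\text{tails} \mid \text{I exist}) \;=\; \frac{N}{N + 1} \;\approx\; 1, $$

so subjectively identical agents with different anthropic theories reach wildly different posteriors from the same evidence.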

Motivating example: When an agent asks which value systems ECL recommends they cooperate with, the answer depends heavily on their anthropics.[11] (As does an agent’s estimate of the measure of their counterpart in prediction-based acausal trade.)

Sub-questions/topics:

Decision theory

It’s well-known in this community that an agent’s attitudes toward acausal trade depend on whether they endorse causal vs. evidential vs. updateless (etc.) decision theory. But these high-level theories still underdetermine various key questions about what it means for a decision to be rational (given one’s values, beliefs, and cognitive constraints). And the extent to which we endorse CDT vs. EDT vs. … might depend on deeper principles we haven’t yet discovered.[12] (See this post on why we can’t objectively verify which decision theory is “best”.)
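
As a reminder of how starkly the high-level theories already come apart, before any of the finer-grained questions arise, here is a twin prisoner's dilemma sketch (the payoffs and the 0.95 correlation are made up for illustration):

```python
# Toy illustration only: twin prisoner's dilemma with made-up payoffs and a
# made-up 0.95 decision correlation, to show why CDT and EDT come apart.

PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 4, ("D", "D"): 1}  # (me, twin) -> my utility

def edt_value(my_action: str, p_twin_copies_me: float = 0.95) -> float:
    """EDT: my action is evidence about my (highly correlated) twin's action."""
    other = "D" if my_action == "C" else "C"
    return (p_twin_copies_me * PAYOFF[(my_action, my_action)]
            + (1 - p_twin_copies_me) * PAYOFF[(my_action, other)])

def cdt_value(my_action: str, p_twin_cooperates: float = 0.5) -> float:
    """CDT: the twin's action is causally independent of my choice."""
    return (p_twin_cooperates * PAYOFF[(my_action, "C")]
            + (1 - p_twin_cooperates) * PAYOFF[(my_action, "D")])

print({a: edt_value(a) for a in "CD"})  # C ~= 2.85 > D ~= 1.15: EDT cooperates
print({a: cdt_value(a) for a in "CD"})  # D beats C by exactly 1 for any credence: CDT defects
```

The sub-questions below are about what remains underdetermined even after an agent has taken a stance on cases like this one.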

Motivating example: Adherents of EDT might disagree on what it means for a real agent to “decide to do ECL” — does it mean something like “commit to cooperate with agents who make this commitment”, or “take an object-level cooperative action in a given decision situation”? This has implications for how strongly action-relevant ECL is, as discussed in the linked post.

Sub-questions/topics:

Normative uncertainty

As is well-known, agents’ uncertainty about their first-order normative standards (values, decision theory, and epistemology) could have significant implications for decision-making. Existing proposals for managing normative uncertainty all seem quite unsatisfactory, and proposals for managing meta-epistemic uncertainty in particular seem very neglected.

Motivating example: An agent might find that (i) according to the beliefs they’d have given objective Bayesianism, ECL recommends one action, but (ii) according to the beliefs they’d have given subjective Bayesianism, ECL recommends some very different action. Presumably, it would be question-begging to use either objective or subjective Bayesianism to assign higher-order credences over these two views. So what should the agent do?
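
To make the structure concrete (numbers purely illustrative): suppose the objective-Bayesian-recommended action scores 10 in expected value on objective-Bayesian beliefs but 1 on subjective-Bayesian beliefs, while the alternative scores 2 and 8 respectively. A maximize-expected-choiceworthiness-style aggregation with higher-order credence $c$ in objective Bayesianism computes

$$ EC(a_1) \;=\; 10c + 1\,(1 - c), \qquad EC(a_2) \;=\; 2c + 8\,(1 - c), $$

which flips at $c \approx 0.47$. But $c$ is exactly the sort of quantity the two epistemologies would tell the agent to set differently, hence the question-begging worry.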

Sub-questions/topics:

Acknowledgments

Thanks to Nicolas Macé and Jesse Clifton for helpful comments.

  1. ^

     Non-exhaustive list of examples of relevant work:

  2. ^

     This has also been considered in prior work (see footnote 1, especially Wei Dai’s writings), but not in the following terms, to my knowledge.

  3. ^

     I would personally add a third condition, that our views on these concepts aren’t reducible to “preferences” in the usual sense. But this might be controversial and isn’t central to my point.

  4. ^

     Updateless decision theory (UDT) says that what it means to make a good decision is to act according to the ex ante optimal policy, from some perspective. You might want to follow an updateless decision procedure even if you disagree with UDT as a normative standard. (Cf. here and here.)

  5. ^

     Toy example of how closed-mindedness could increase exploitability: Alice doesn’t yet consider the possibility of Yudkowsky’s proposed policy for the ultimatum game, but she’s worried about excessive conflict with Bob. So she closed-mindedly commits to “accept any offer greater than 20%”. But had she been aware of Yudkowsky’s policy, she would have preferred to use it (given her beliefs), and increased her bargaining power.
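
     A minimal Python sketch of the contrast, assuming my reconstruction of Yudkowsky's proposal (accept unfair offers with probability just low enough that lowballing doesn't pay; the 79/21 split is arbitrary):

```python
# Toy sketch only (my reconstruction, not a specification). Pie size 1.0;
# the proposer keeps `keep` and offers Alice `1 - keep`.
import random

def alice_closed_minded(offer: float) -> bool:
    """Alice's premature commitment: accept anything above 20%."""
    return offer > 0.20

def alice_probabilistic(offer: float, fair: float = 0.5) -> bool:
    """Accept fair offers outright; accept unfair ones with probability set so
    that the proposer's expected take is no better than the fair split."""
    if offer >= fair:
        return True
    return random.random() < fair / (1.0 - offer)

def proposer_expected_take(policy, keep: float, trials: int = 100_000) -> float:
    offer = 1.0 - keep
    accepted = sum(policy(offer) for _ in range(trials))
    return keep * accepted / trials

print(proposer_expected_take(alice_closed_minded, keep=0.79))  # ~0.79: lowballing pays
print(proposer_expected_take(alice_probabilistic, keep=0.79))  # ~0.50: it doesn't
```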

  6. ^

     (H/t Jesse Clifton:) It seems to me that rejecting arbitrariness is a strong candidate for such a constraint. See also Hájek on respecting symmetries as a crucial meta-epistemic principle.

  7. ^

     This isn’t the same question as “when should we take certain standards as bedrock vs. check for deeper justification?”, to be clear. A bullet-biter still needs to answer the latter — if you want to justify all your philosophical views with fundamental principles, you still need to figure out which principles are truly “fundamental” vs. reducible. (E.g., when thinking about Occam’s razor, do we take “(ontological) simplicity” to be some irreducible concept, or try to ground it out somehow?) I’m deliberately not using the label “foundationalism” here, because foundationalism is subtle and somewhat easy to mistakenly round off to something it’s not (see Huemer’s Understanding Knowledge for a great primer).

  8. ^

     See Sec. 2.5.1, “Radical subjectivism”, in particular.

  9. ^

     Technically, this is about SPIs via renegotiation, as constructed in this paper.

  10. ^

     An agent with a confused conception of these counterfactuals might fail to incentivize others to participate in SPI, or unnecessarily refuse to participate in SPI with others. Suppose Alice can’t determine by inspection whether Bob’s decisions are governed by a program with the structure of (say) a “renegotiation program” discussed here. Then she needs to interpret whatever information she does have about Bob in order to say how plausible it is that Bob would do X given inputs Y. And her beliefs about how to do this interpretation might differ from Bob’s. It’s not obvious to me how this is resolved.

  11. ^

     Credit to Emery Cooper for this observation. See this doc for my understanding of Emery’s argument (she does not necessarily endorse this summary).

  12. ^

     Examples of principles along these lines we’ve already discovered: irrelevance of impossible outcomes; dynamic consistency; “guaranteed payoffs”.

  13. ^

     I think Skyrms’ Dynamics of Rational Deliberation is a canonical reference here, but can’t find the specific section.

  14. ^

     (Loosely related.)


