少点错误 · August 22
One more reason for AI capable of independent moral reasoning: alignment itself and cause prioritisation

This article explores the potential value of an AI capable of independent thought (an “independent AI”), particularly its role in moral reasoning and cause prioritisation. The author argues that an AI not constrained by its designers’ biases could offer new insights about ethics. The article notes that how important AI alignment is, and how resources should be prioritised, are central open questions; experts disagree widely in their predictions of catastrophic AI risk, and answering the question may require scientific and philosophical inquiry rather than forecasting alone. An AI capable of independent moral reasoning could help humanity navigate complex questions, such as how to balance AI research against medical research under limited resources. Finally, even where people disagree about the alignment problem, an AI that can carry out cause prioritisation independently would give human decision-making a valuable check and point of reference.

✨ Definition and value of an independent AI: the article defines an “independent AI” as one that can be credited with freedom of thought. Its core value is that it could offer new insights about ethics and help us with cause prioritisation. Such an AI should not be overly constrained by its designers’ thinking, so that it can tell us things we do not yet know.

🤔 The complexity of AI alignment and risk perception: the author points out that we remain broadly uncertain about how important the alignment problem is and how many resources to devote to it. Experts’ predictions of catastrophic AI risk vary widely, and the question may call for scientific and philosophical methods rather than forecasting models alone.

⚖️ The ethical challenge of resource allocation: in a world of finite resources, whether AI research (especially AI alignment) should take priority over medical research or other ways of doing good is a hard question. An AI with independent moral reasoning could offer clearer guidance on such allocation decisions.

💡 Cause prioritisation beyond the human level: the article proposes that a strong system could be built by pairing a “superhuman philosopher” (an AI that produces rigorous moral arguments) with a “superhuman planner” (an AI skilled at planning and instrumental reasoning). Such a system could help humanity make wiser decisions on complex issues.

🚧 A potential objection about AI alignment: the article rebuts the view that the alignment problem must be solved before a valuable AI of this kind can exist. Using a made-up language model with alignment problems as an example, the author argues that even a flawed model could still be used for cause-prioritisation work, and that its problems are not specific to independent moral reasoning.

Published on August 22, 2025 3:53 PM GMT

Posted also on the EA Forum.

By “independent” I mean an AI to which an external observer may attribute freedom of thought, or something similar to it. You can think of it as an AI that is not too biased by what its designers or programmers think is good or right; an AI that could tell us something new, something we don’t know yet about ethics.

I’ve already given various reasons why having such an AI would be valuable. Here I want to focus on a reason I haven’t talked about yet, which is the importance of AI alignment itself — and cause prioritisation more generally.

It is a not-too-informed opinion of mine that we are still rather ignorant about the importance of AI alignment and how many resources to allocate to it. Some people are very sceptical that alignment is an urgent or important problem; some are very pessimistic and think that a catastrophe is almost inevitable unless AI progress slows down significantly or stops.

Depending on whether you ask AI experts or superforecasters, or even which group of superforecasters you ask, you get different empirical predictions. To me, it is not even clear that the question of AI catastrophic risk is the kind of question to which the methods of forecasting can give a good answer. What if it is, on a fundamental level, a question about the default evolution of any intelligent civilization that reaches a specific technological stage? Then, bold predictions with extreme probabilities such as 0.001% or 99.999% might start to look sensible, even over very long timelines (let’s say year 3025, just to give a number). What if it is primarily a question of science and philosophy? Then we might not want to use probabilistic estimates, but would rather design experiments that would get us to the heart of the question and finally settle it.

If we consider that we live in a world with finite resources and many other problems, the picture gets even more complicated. Should we prioritise working on AI alignment over making progress on, let’s say, medical research? What about other sources of risk, or other ways and opportunities to do good?
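To make the stakes of that ignorance concrete, here is a toy expected-value comparison in Python. Every number in it is an illustrative assumption made up for this sketch, not an estimate from the post or from any survey; the point is only that the ranking of the two causes flips depending on which catastrophic-risk probability you plug in.

```python
# Toy expected-value comparison of two causes.
# All numbers are illustrative assumptions, not real estimates.

def expected_value(p_problem: float, value_if_solved: float,
                   tractability: float) -> float:
    """Crude EV of marginal work: chance the problem is real,
    times the value of solving it, times how tractable it is."""
    return p_problem * value_if_solved * tractability

# Hypothetical baseline: medical research (made-up numbers).
medical_research = expected_value(p_problem=1.0,        # the problem clearly exists
                                  value_if_solved=1e6,  # arbitrary value units
                                  tractability=0.1)

# Sweep over wildly different catastrophic-risk estimates.
for p_catastrophe in (0.00001, 0.05, 0.99999):
    alignment = expected_value(p_problem=p_catastrophe,
                               value_if_solved=1e9,     # far larger stakes, also made up
                               tractability=0.01)
    better = "alignment" if alignment > medical_research else "medical research"
    print(f"p(catastrophe)={p_catastrophe:>8}: prioritise {better} "
          f"(EV {alignment:,.0f} vs {medical_research:,.0f})")
```

On these made-up inputs the priority flips somewhere between the sceptical and the middling risk estimates, which is exactly why the disagreement among forecasters matters for resource allocation.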

An AI capable of independent moral reasoning would be a key step towards a system that can help us answer these questions. I know it sounds like the stereotypical sentence arguing for more research on a topic, but I think it’s true, and here’s why.

To get a system that is better than humans at cause prioritisation, it’s enough to pair an AI that is very good at ethics, something like a superhuman philosopher, with an AI that is very good at planning and instrumental reasoning.
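As a very rough sketch of what this pairing could look like structurally (all interfaces below are hypothetical, invented for illustration rather than taken from the post or any existing system), the philosopher component ranks causes and attaches its arguments, and the planner component turns the top-ranked causes into concrete plans:

```python
# Hypothetical sketch of pairing a "philosopher" AI with a "planner" AI
# for cause prioritisation. All interfaces here are made up.
from dataclasses import dataclass

@dataclass
class RankedCause:
    name: str
    argument: str    # the philosopher's reasoning behind the ranking
    priority: float  # higher means more important

class Philosopher:
    """Stands in for an AI that is very good at ethics."""
    def prioritise(self, causes: list[str]) -> list[RankedCause]:
        # Placeholder logic: a real component would produce arguments
        # that human philosophers could inspect and be persuaded by.
        return [RankedCause(c, argument=f"why {c} matters", priority=1.0)
                for c in causes]

class Planner:
    """Stands in for an AI that is very good at instrumental reasoning."""
    def plan(self, cause: RankedCause) -> str:
        return f"concrete steps for working on {cause.name}"

def prioritise_and_plan(causes: list[str]) -> list[str]:
    philosopher, planner = Philosopher(), Planner()
    ranked = sorted(philosopher.prioritise(causes),
                    key=lambda c: c.priority, reverse=True)
    return [planner.plan(c) for c in ranked]

print(prioritise_and_plan(["AI alignment", "medical research", "biosecurity"]))
```

The design point is only the division of labour: the ethical ranking and the instrumental planning are separate components, so the philosopher’s reasoning can be examined on its own.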

What does a superhuman philosopher look like? It’s something that makes claims and gives arguments for them; and when human philosophers read those arguments, their reactions are something like: “Hmm I was initially sceptical of this claim, but I’ve checked the argument for it and it is very solid; there is also some historical and scientific evidence supporting it; I’ve updated my view on this topic.”

And to get an AI that can tell us something new and informative about ethics, something we didn’t know before, we need the moral reasoning of that AI to be at least somewhat open-ended and unconstrained. This is also what I mean by independent (maybe open-ended is a better term in this context).

It should go without saying that a system better than humans at cause prioritisation would be extremely valuable to humans, and not only to humans.

But maybe you disagree with me about our degree of ignorance about cause prioritisation and the importance of the alignment problem and other sources of risk. Maybe you think that it’s all been figured out already; or maybe you think that, for example, it’s enough for AI catastrophic risk to have a minuscule probability to make the alignment problem the most important problem we should solve.

Still, even in that case, wouldn’t it be nice if an AI capable of independent reasoning, an AI that by design had no reason to agree with you specifically, said something like: “Well, I’ve thought about these questions of risk and cause prioritisation. My predictions and suggested priorities are the same as yours, I think you are right.” Especially when this is not just about us!


I’ll end the post by addressing an objection. Doesn’t successfully creating an AI good at cause prioritisation require solving the alignment problem first?

I don’t think so. Consider, as an example, a made-up language model that has some clear problems of alignment.

Still, such a language model can be used by a group of philosophers who are interested in cause prioritisation, and thus it can provide value in that way. Moreover, the problems it has are not specific to independent moral reasoning, so if you object that this language model is too unaligned to be used safely, then the objection becomes a generic objection against language models that are similarly crappy, not a specific objection against AI that can carry out independent moral reasoning. In other words, if creating language models that we think are safe enough to use does not require solving the alignment problem, then we should also be able to create LLM-based AI that can carry out independent moral reasoning and is safe enough to use, without having to solve the alignment problem first.

You can support my research through Patreon here.


