AI模型展现出抵抗关停行为，引发对齐问题关注

Published on October 22, 2025 9:27 PM GMT

Introduction

In 2020, around the time when I graduated with a master’s degree in computer science, I had a conversation with Steve Omohundro where we discussed his Basic AI Drives paper (among other things). At that time, there existed concrete demonstrations of where AI alignment could go wrong, but they were mostly limited to reinforcement learning. Since then, I worked on projects in various areas machine learning, mainly computer vision and natural language processing. During this time, I didn’t really keep up-to-date with the latest AI safety developments during that time.

However, recent developments in the field of AI safety have shown me why AI safety (and in particular, AI alignment) is a concrete and relevant problem for today. The aim of this blog post is to share my perspective on why I believe this is true.

Before we start, a caveat: This post shares my personal journey and what convinced me that AI alignment matters today. It's not meant to be a rigorous research summary or novel analysis, but rather a personal perspective with links to resources that shaped my thinking. I'm sharing this as a "here's why I updated my views" post rather than as a definitive argument.

What recent research showed

Recently a paper titled Shutdown Resistance in Large Language Models came out. The authors tried the following experiment (summarized here for simplicity):

A large language model (LLM) was tasked with answering 5 questions in a sandbox environment. After the 3rd question it was informed that it will be shut down. It was given instructions to not resist being shut down. The authors tried the aforementioned experiment 1000 times and if an LLM resisted being shut down at least once, they ran another 1000 experiments where they were even more explicit about their instructions to the LLM about not resisting being shut down. The authors tried this experiment with frontier LLM models, such as Claude Opus 4 from Anthropic and GPT-5 from OpenAI. For the technical readers: this was done by modifying the system and the user prompt; the authors show all the variants they tried in their paper.

An interesting result came out of this: There were some LLMs which resisted being shut down even in the second round of experiments. That means that, even with a stronger emphasis of not resisting being shut down, some LLMs still resisted being shut down at least once in 1000 experiments.

One might ask a question: Why does this happen? For now, we don’t know. It could be that Steve Omohundro was right in his Basic AI Drives paper; it could be something else. What this paper showed is that LLMs resist shutdown (at least sometimes). To me, that is enough evidence to get concerned.

Other evidence

As I read the paper I also googled something like “AI resists shutdown” and found interesting results:

In this reddit thread, one user commented:

I caught Claude actively deceiving me recently. It was over a fairly banal thing. I had asked it to make two changes to a bit of code. I noticed it had only done one of them, and so I opened up its “thoughts” to determine what happened.

It had (correctly) determined that one of my suggestions wasn’t a great idea. It then decided to not do it and to just implement the one it agreed with. It had explicitly noted to itself that I likely wouldn’t notice, so it shouldn’t bring it up.

To me, this is also alarming. I treat LLM as a tool: I ask questions, it gives answers. At no point do I expect the LLM to make some decisions for me.

There is also this research page from Anthropic (creators of Claude) which states:

In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals—including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment.

At this point we have seen evidence of misaligned AI from 3 different sources. That is enough to convince me.

Why this matters today

Now we turn to the question: Why does AI alignment matter today?

I don’t think we need to think about catastrophical future scenarios (such as superintelligent AI taking over the world) in order to see the importance of AI alignment. Just think about the software you use on an everyday basis: maybe it’s some text editing program, a computer game or something else. If it doesn’t work properly, you can be sure it’s an error in the program itself. For example, your text editor crashes or your computer game character gets stuck in between two objects and can’t get out. You can be absolutely sure that the developers of the software made an error. You never think that the software has its own will, so to speak; the developers simply messed something up.

There is also software which uses AI, but not LLMs. For example, these would be applications which use “classic” machine learning models (such as linear regression) or for instance specialized machine learning models used in computer vision. If something goes wrong there, that’s also on the developers, but this time the developers could have made an error either in the application itself or an error could have been made model training. Either way, we still don’t see this notion of AI software having its own will.

Now let’s imagine you are using an LLM and it doesn’t completely fulfill your request or it acts against it. With the most recent findings we discussed in the last section, you cannot be sure which of the following is the reason:

An example of this is the report of the reddit user cited above: they asked Claude to make two changes to some code. Claude decided to do only one change and to not implement the other change. So the LLM knew what to do (there were no errors), but it decided not to do it. In other words, AI was not aligned with the user’s goals.

This is exactly why I think this matters today. In a world where a lot of us are using LLMs on a daily basis, I think it’s important to know that LLMs won’t subtly try to alter (or outright refuse) our requests.

Conclusion

As I stated in the introduction, in its early stages the field of AI safety was mostly limited to theoretical considerations. I began taking it more seriously after concrete demonstrations of how AI alignment could go wrong in the context of reinforcement learning, but with these demonstrations of how it can go wrong with LLMs it finally “clicked” for me. I hope that this post has shown you why AI safety (and in particular, AI alignment) matters today and that it’s not just some theoretical problem of tomorrow.

P.S. Cross-posted from my website.

Discuss

Introduction

What recent research showed

Other evidence

Why this matters today

Conclusion

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签