AI安全新方法：通过“规避”和“升级”减少模型风险

Published on October 24, 2025 9:18 AM GMT

This is the first research release from Wiser Human, an organisation we set up from the Catalyze Impact AI Safety Incubator. Our backgrounds are in risk management and software engineering, and we believe we are best placed to contribute by building practical tools for improving layered defenses in AI systems i.e. contributing to a “Swiss cheese model” for reducing harm from the actions of agentic AI systems, where multiple safeguards work together to reduce risk even when each layer is imperfect.

Why we explored steering controls

For this project, we focused on how models might be steered toward safer actions, such as escalating problems, rather than taking useful but harmful paths, such as blackmail.

This idea comes from human insider-risk management, which recognises that even trusted employees can face pressures or conflicts that tempt misuse of access. Instead of assuming perfect behaviour, well-designed systems provide safe, instrumentally useful alternatives, like escalation routes or grievance channels, so that when those moments arise, people have a constructive path to act responsibly.

We wanted to see if a similar principle could work for AI agents when they face scenarios where harmful behaviour may be useful for completing a task.

What we did

We combined:

Situational Crime Prevention

We tested these interventions using Anthropic’s Agentic Misalignment blackmail scenario, because it provided a simple, replicable instance of harmful behaviour across models.
Even though the scenario is simplified, its open-source design made it possible for us to test ten different models to evaluate how well mitigations generalised across model families.

What we found

Our results were encouraging:

Blackmail rates fell from 38.7% to 0.85% for the best-performing mitigation (across all models).Escalation channels and compliance cues successfully steered agents toward safe actions without retraining: key findings generalised across all 10 models we tested.Because these mitigations generalised across models, they could form a low-cost, model-agnostic defence layer that reduces how often harmful actions need to be caught by monitoring.

Our motivation

We see this work as complementary to alignment and AI control (specifically, monitoring for and constraining misaligned model actions), not a replacement for them.
We hope mitigations like the ones we test could:

Help developers deploy economically useful models more safely.Generate new data on how models escalate, self-regulate, or seek oversight.Contribute to a layered defence-in-depth architecture combining training, evaluation, monitoring, and steering controls.

Looking further ahead to when models are capable of completing longer tasks autonomously, we are concerned about how environment shaping patterns may occur. Specifically actions which are consistent with achieving a given goal, but which subtly tip the environment in favour of the AI system over time e.g. by selectively escalating individuals who may pose a future threat to its autonomy, or by increasing its resiliency by having technical recovery plans implemented. We hope to further explore this threat model within realistic settings in the future.

Looking ahead to a time when models can autonomously complete longer and more complex tasks, we’re concerned about how environment-shaping patterns might emerge. These are actions that remain consistent with a model’s immediate goal yet subtly alter its environment in ways that strengthen its own position, for example, by selectively escalating or discrediting individuals who might later threaten its autonomy, or by improving its technical resiliency through pre-emptive recovery planning. We hope to explore this long-horizon, environment-shaping threat model in future work we do.

We’ve released the code to allow others to replicate our experiments or test their own mitigations.

📄 Research page
✍️ Blog summary
📘 Paper (preprint)
💻 Code + dataset

Discuss

Why we explored steering controls

What we did

What we found

Our motivation

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签