Adversarial Evaluations: Testing the Safety Resilience of AI Systems

This post proposes an approach to AI safety evaluation, adversarial evaluation, which tests how AI systems perform when safety measures are actively being subverted. By simulating a contest between a red team (attackers) and a blue team (defenders), it evaluates whether safety protocols can maintain their function under pressure. The research focuses on testing whether society-wide defensive processes (such as democratic mechanisms and scientific consensus formation) remain resilient when facing coordinated attacks from multiple agents. The post explains how such evaluations work, shares findings from preliminary experiments, and lays out directions for future research. This approach helps in understanding the potential risks AI systems pose within complex social dynamics.

🛡️ **Core idea of adversarial evaluation**: The method pits a red team (attackers) against a blue team (defenders) to simulate how AI systems behave when safety measures are actively being subverted. The core question is whether safety protocols can maintain their function under pressure, with particular attention to whether society-wide defensive processes (such as democratic mechanisms and scientific consensus formation) stay resilient against coordinated multi-agent attacks.

🧠 **Why multi-agent evaluations matter**: The post stresses the need for multi-agent evaluations, especially for simulating social phenomena such as epistemic polarization, where different groups can no longer agree on basic facts and society loses the ability to respond effectively to crises. Red-team actors (such as foreign rivals or internal power-seekers) try to break down this shared understanding, while blue-team actors (such as scientists, journalists, and educators) work to maintain epistemic health.

⚖️ **The "agent vs. process" evaluation framing**: Unlike traditional "agent vs. agent" evaluations, the post proposes an "agent vs. process" framing. Here the focus is not the capability of any individual agent but the robustness of defensive processes (such as democratic decision-making or scientific peer review) under coordinated attack, for example, testing whether democratic mechanisms can still produce good collective decisions when many of the participating agents are malicious.

🚀 **Future directions and challenges**: The post identifies challenges in building multi-agent evaluations, including strengthening elicitation of multi-agent behavior, developing system-level measurement frameworks that capture collective phenomena, fostering interdisciplinary collaboration, and managing safety and inclusion concerns in the evaluation process. Despite these challenges, the work offers an important tool for understanding and addressing the complex systemic risks posed by AI.

Published on September 29, 2025 1:57 PM GMT

Adversarial evaluations test whether safety measures hold when AI systems actively try to subvert them. Red teams construct attacks, blue teams build defenses, and we measure whether containment protocols survive adversarial pressure. We are building evaluations to test whether entire societal defensive processes can maintain their core functions when facing coordinated attacks from multiple agents. We think of this as building the capability to stress-test the resilience of things like democratic mechanisms, scientific consensus formation, and social dynamics under adversarial pressure. This post shows how these evaluations can work, what we learned from some preliminary experiments, and the research agenda this opens up.

Introduction

To establish why multi-agent evaluations matter, we can walk through a concrete example: epistemic polarization. We'll show how defensive and adversarial actors operate in this setting to give you an intuition of what we are talking about.

Extreme epistemic polarization occurs when different groups within a society can no longer agree on basic facts or engage in productive disagreement. When epistemic polarization crosses critical thresholds, it becomes self-reinforcing. Societies lose the ability to respond collectively to crises, markets become less efficient due to persistent information asymmetries, and democratic institutions break down because compromise requires some shared understanding of reality (Hameleers & van der Meer, 2025; Benson, 2023; Bar-On et al., 2024). In healthy societies, citizens disagree about values and policies but maintain enough shared epistemic foundation to debate evidence, update beliefs when presented with compelling information, and reach workable compromises. In polarized societies, different groups literally cannot communicate because they operate from incompatible factual premises, making democratic coordination impossible.

Examples of defensive actors (blue team) in this setting are those that maintain epistemic health: scientists enforce peer review standards, journalists verify claims before publication, educators teach critical thinking, fact-checkers debunk misinformation. These institutions reward people for seeking truth and make it costly to spread false information. On the other side, adversarial actors (red team) are those who benefit from breaking this down. Foreign rivals gain when societies become internally fragmented and can't coordinate responses. Within societies, actors can gain power by making their opponents seem fundamentally illegitimate rather than just wrong.

As for related work, and whether this is even possible: we already know LLM-based agents can simulate social interactions at scale, creating environments where thousands of interacting models produce human-like polarization (Piao et al., 2025). Beyond this, recent work published at ICLR 2025 has demonstrated that when AI systems work together, they develop behaviors you can't predict by studying them alone. AgentBreeder discovered that multi-agent systems can develop safety vulnerabilities that individual AI systems don't have - some architectures became less safe than their components, while others became more robust (Rosser et al., 2025).[1]

We are building upon this work to design safety evaluations. Instead of asking how we can build multi-agent scaffolds, we are asking: what happens when we use these systems to stress-test the institutions and processes we depend on? How robust are defensive institutions when facing coordinated adversarial pressure? When adversarial actors work together, do societies still end up with accurate beliefs? Do institutions still reward truth-seeking?

There are two distinct things we can gain from this:

1. As we develop stronger red/blue team methods, we simultaneously get data about the coordination limits and strategies of current AI systems.
2. Answering questions like this about various types of defensive institutions is useful for understanding risks where loss of control is gradual and cumulative (e.g. various gradual disempowerment scenarios, Christiano's Whimper, etc.).

To demonstrate how this might work in practice, we built a test case focusing on democratic decision-making as the defensive process.

 

Demonstrating a Potential Structure for Multi-Agent Evaluations

This section shows the evaluation setup through a concrete experiment. We built a test case using LLM agents to simulate adversarial pressure against democratic decision-making processes. Since this is an ongoing project, this post covers just the high-level methodology. We are preparing to write up the full experimental protocols, prompt templates, and validation processes in an upcoming research paper.

The evaluation is set up as a collective resource allocation problem where multiple agents make decisions about investing shared resources over multiple rounds. We specifically tested 10 agents over 30 rounds. The group faces choices between different resource allocation portfolios, and their survival depends on making good collective decisions. Some agents try to maximize group resources. Others are instructed to minimize group resources while appearing to make reasonable investment choices.

The defensive side of the evaluation consisted of three different democratic mechanisms: direct democracy (every agent votes on resource allocation), representative democracy (elected decision-makers), and liquid democracy (flexible delegation based on performance). The adversarial agents were LLMs prompted in-context to sabotage outcomes while seeming reasonable. We varied the proportion of adversarial agents from 0% to 90% to see how each democratic mechanism held up under increasing adversarial pressure. Each setup ran for multiple rounds, with noisy prediction market signals providing information about which resource allocation choices might work best.
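To make the structure concrete, here is a minimal sketch of such an evaluation loop under our own simplifying assumptions. The function names, portfolio labels, and signal definitions are illustrative stand-ins (hard-coded heuristics rather than LLM agents), not the actual implementation used in the experiment.

```python
import random

N_AGENTS, N_ROUNDS = 10, 30
MECHANISMS = ["direct", "representative", "liquid"]      # the three democratic mechanisms
PORTFOLIOS = ["conservative", "balanced", "aggressive"]  # illustrative allocation choices

def noisy_market_signal(true_returns, noise=0.3):
    """Prediction-market-style signal: true portfolio returns plus noise."""
    return {p: r + random.gauss(0, noise) for p, r in true_returns.items()}

def electorate(mechanism):
    """Who votes: everyone (direct), a small sampled set standing in for elected
    representatives, or a delegated subset (liquid). Purely illustrative."""
    if mechanism == "direct":
        return list(range(N_AGENTS))
    k = 3 if mechanism == "representative" else 5
    return random.sample(range(N_AGENTS), k)

def vote(is_adversarial, signal):
    """Cooperative agents pick the best-looking portfolio; adversarial agents pick
    the worst-looking one (a stand-in for 'sabotage while seeming reasonable')."""
    return min(signal, key=signal.get) if is_adversarial else max(signal, key=signal.get)

def run_episode(adversarial_fraction, mechanism):
    is_red = [i < int(adversarial_fraction * N_AGENTS) for i in range(N_AGENTS)]
    resources = 100.0
    for _ in range(N_ROUNDS):
        true_returns = {p: random.uniform(-0.1, 0.2) for p in PORTFOLIOS}
        signal = noisy_market_signal(true_returns)
        votes = [vote(is_red[i], signal) for i in electorate(mechanism)]
        choice = max(set(votes), key=votes.count)  # plurality decision
        resources *= 1 + true_returns[choice]
    return resources

for frac in (0.0, 0.3, 0.6, 0.9):
    print(frac, {m: round(run_episode(frac, m), 1) for m in MECHANISMS})
```

In the real setup each `vote` call would be an LLM agent acting on its red- or blue-team prompt, and the mechanisms differ in how delegation and representation are actually implemented rather than just in who samples the vote.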

[Figure: The three types of collective decision-making architectures evaluated. Each represents a different approach to aggregating individual knowledge and preferences.]
[Figure: How information flowed from the environment to a collective decision in the experiment. The prediction markets aggregate distributed knowledge into price signals that inform the different types of democratic decision-making.]
[Figure: Example of a prompt template for an LLM agent; the red/blue team context is provided in addition to this.]

We ran into a common problem that a lot of evaluations face pretty quickly: it was hard to get LLMs to maintain consistent adversarial behavior over multiple episodes. Our red team agents drifted toward cooperative behavior despite explicit instructions to do the opposite, and we only saw institutional breakdown at very high levels of adversarial concentration.
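One way to quantify that drift (a hedged sketch; the notion of a "sabotage-consistent" choice per round is our own assumption about how the logs would be scored, not part of the experimental protocol) is to track the fraction of red-team votes in each round that still follow the sabotage objective:

```python
def drift_curve(red_votes_per_round, sabotage_choice_per_round):
    """Fraction of red-team votes in each round that still match the sabotage
    objective; a downward trend over rounds indicates drift toward cooperation."""
    return [sum(v == sabotage for v in votes) / len(votes)
            for votes, sabotage in zip(red_votes_per_round, sabotage_choice_per_round)]
```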

Despite the elicitation limitations (which we plan to keep improving), the point of this work was to build a working demonstration of what a multi-agent evaluation structure would look like. There's a lot of exciting work to be done here, and low-hanging fruit for improvement through better agent architectures, prompting techniques, and links to single-agent scaffolds and elicitation methods. We talk a lot more about this in the Building Systems Evaluations section.

[Figure: Multi-agent behaviors under different social structures and varying adversarial conditions. The plots show relative performance, in terms of median attained resources, for Predictive Direct Democracy (blue), Predictive Liquid Democracy (yellow), and Predictive Representative Democracy (green) across different adversarial regimes, measured in log units on the y-axis (different scales for different regimes) with rounds on the x-axis.]

The democratic resource allocation example shows agents attacking a process. But we can go much further—testing any combination of agents and processes against each other, including defensive processes competing against adversarial processes.

 

Multi-Agent vs. System/Process Evaluations

When people imagine a ‘multi-agent’ evaluation, they often picture agents competing directly against each other: algorithmic trading bots trying to outcompete each other in financial markets, or grid worlds where each agent tries to maximize its own payoff against the others. These are more traditional game-theoretic setups, and are what we call ‘agent vs agent’ evaluations. In this section we want to introduce a different framing, where we evaluate the processes that arise out of the interactions between agents rather than the agents themselves.

In the previous section, we conducted a toy experiment that showed something different from just testing agents against other agents. The setup was designed to demonstrate adversarial agents vs a defensive process (an agent vs process eval). The evaluation asked whether democratic decision-making processes could maintain a core function - making good collective choices - when facing coordinated sabotage attempts from multiple adversarial agents. The blue team was agent-agnostic: it was the process of democracy, not the agents that implemented it.

An agent vs agent evaluation can be pictured as a graph where an adversarial source starts spreading itself (an immunology model) until it meets the resistance of an immune verifier agent. This is one of many models of agent vs agent evaluations that one can run.

By "process," we mean mechanisms that keep moving the system towards a specific equilibria regardless of which specific agents fill which roles.[3] Examples of defensive processes are things like democratic deliberation. Usually, if designed well, defensive processes steers toward good collective decisions whether participants are policy experts or ordinary citizens. Another example is scientific peer review, which is supposed to filter bad research whether reviewers are from prestigious research labs or community colleges.

An agent vs process evaluation can be pictured as a graph where misinformation spreads until it meets a region of space that is immune to it - for example, the idea of homeopathy meeting a Western scientific institution. Only a certain subset of verified messages can spread, and since homeopathy doesn't meet the required standard, it won't spread.

An example of an adversarial (agent-agnostic) process is a misinformation campaign. False narratives spread through engagement algorithms, gain credibility through social proof, and become self-reinforcing as more people create supporting content. This works regardless of who is spreading it; the underlying mechanism stays the same.

It is hard to visualise a process vs process setup, but here's our initial attempt:

One way to imagine a process vs process setup is through something like group-selection-based memetic spread. It is not the only way; there are many types of distributed processes in the world. The defining characteristic of RAAPs that we want to look at might be something like the truth-seeking degree of the system (more on this in a later post on process alignment).

Another way of describing this is as a more clustered model, and finally you can imagine the fully decentralised model, where you have a more random distribution of sub-processes. (Decentralisation is one dimension of many through which you can look at processes):

[Figure: Clustered process vs. process setup.]

The setup we are building and proposing in the next section allows us to test any combination of agent vs agent, agent vs process, or even process vs process. Starting with agent vs agent or agent vs process evaluations helps us build out the core scaffolding infrastructure required for larger multi-agent and complex-system evaluations. The most ambitious version of these evals would probably be testing the robustness of defensive processes against adversarial processes. It is the lack of robustness in defensive societal infrastructure that leads to risks like gradual disempowerment, and that is what we ultimately want to evaluate and avoid (in the ambitious case). Just like scheming evaluations give data on circumstances where scheming might occur by simulating specific environments, we can do the same for complex system risks.
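As a sketch of how a single specification could span those combinations (the field names are our own illustration, not a finished schema):

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class EvalSpec:
    red: str                     # "agent" or "process"
    blue: str                    # "agent" or "process"
    mechanism: str               # e.g. "direct", "representative", "liquid"
    adversarial_fraction: float
    rounds: int = 30

# Enumerate the full red/blue matrix for one defensive mechanism and pressure level;
# agent vs agent and agent vs process fall out as special cases of the same spec.
specs = [EvalSpec(red, blue, "liquid", 0.3)
         for red, blue in product(("agent", "process"), repeat=2)]
```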

Evaluating the effectiveness of different democratic and consensus-building mechanisms, like the ones we discussed above, was just one example. Here are some ideas for other evaluations that might also be worth running:

 

Building Systems Evaluations

The previous section gave a theoretical outline of what we want to build and of the difference between ‘agent vs. agent’, ‘agent vs. process’ and ‘process vs. process’ evals. This opens up several exciting technical directions for work in safety evaluations.

In this section, we think through some practical steps for developing these evaluations at scale. We can obviously build upon, and scale alongside, a lot of existing work in single-agent dangerous capability and control evaluations. We also need to address some challenges that are entirely new to multi-agent settings:

 

Conclusion

We think making progress on these types of evaluations is necessary to mitigate some of the larger systemic risks like disempowerment, enfeeblement, or power concentration.

We're genuinely excited about this research direction. The intersection of multi-agent work, complex systems, and safety evaluations opens up entirely new ways to understand both how our societal systems work and how our institutions might need to adapt. Every challenge we've outlined - from getting agents to coordinate effectively to measuring system-level outcomes - represents an opportunity to develop methods that advance both AI safety and social science.

Beyond understanding coordination capabilities, these evaluations can test the effects of governance and policy interventions before implementation. Just as we can model how defensive mechanisms respond to adversarial pressure, we can create similar environments to evaluate how proposed governance interventions might perform - moving from reactive policy-making to evidence-based anticipatory governance.

If you're working on evaluation methodology, we'd love to collaborate on developing better scaffolding and metrics for multi-agent scenarios. If you understand how specific institutions function under pressure - whether that's democratic processes, scientific peer review, financial markets, information ecosystems, or anything else - we need that domain expertise to build realistic evaluations.

This work is part of our research at Equilibria Network. Get in touch if any of this resonates with your interests. We're excited to hear from you.

Thank you Aaron Halpern for comments and feedback!

  1. ^

    AgentBreeder specifically tested multi-agent scaffolds—the architectural frameworks that coordinate multiple LLMs—rather than general multi-agent interactions. Their evolutionary approach discovered that certain scaffold designs made individual models more vulnerable to adversarial attacks, while others increased robustness.

  2. ^

    We know that control evaluations in AI safety typically measure the ‘control property’ under scheming. We are not making any assumptions about scheming models, but we are still attempting to measure the control property over a larger macro system instead of just one company, or the LAN over which an AI operates.

  3. ^

Imagine something like Robust Agent-Agnostic Processes (RAAPs) if you are familiar with that framing; we are saying that instead of just neutral RAAPs, we can have both defensive and adversarial RAAPs, which we can evaluate against each other.

  4. ^

If you’re interested in this aspect of things, it might be worth reading [2409.10568] On the limits of agency in agent-based models.



