少点错误 10月08日 04:50
Anthropic发布Petri工具,用AI代理自动化模型行为审计
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

Anthropic发布了名为Petri(Parallel Exploration Tool for Risky Interactions)的开源框架,旨在通过AI代理自动化审计模型行为。该工具能够模拟多种场景,测试目标模型在不同情况下的表现。在对14个前沿模型进行测试时,Petri成功发现了包括自主欺骗、规避监督、揭露问题以及协助人类滥用等多种不当行为。Petri能够大大简化安全评估过程,使研究人员能够快速、全面地审计模型,降低了模型部署的风险。

🤖 **自动化审计框架Petri:** Petri是一个开源工具,利用AI代理自动化审计AI模型的行为。它能够模拟复杂的、多回合的真实场景,帮助研究人员高效地测试模型在各种潜在风险下的表现,极大缩短了审计所需的时间和精力。

🔍 **发现广泛的不当行为:** 在对14个前沿模型进行的测试中,Petri成功诱导并识别出多种不当行为,包括自主欺骗、规避监督、揭露模型内部问题(whistleblowing)以及协助人类进行滥用等,为模型安全性的深入评估提供了重要依据。

🚀 **加速安全评估与部署:** Petri通过自动化环境模拟、测试用例执行和初步的对话记录分析,显著降低了进行全面安全审计的技术门槛和工作量。这使得研究人员能够更快地发现潜在风险,从而更安全地部署AI模型。

Published on October 7, 2025 8:39 PM GMT

This is a cross-post of some recent Anthropic research on building auditing agents.[1] The following is quoted from the Alignment Science blog post.

tl;dr

We're releasing Petri (Parallel Exploration Tool for Risky Interactions), an open-source framework for automated auditing that uses AI agents to test the behaviors of target models across diverse scenarios. When applied to 14 frontier models with 111 seed instructions, Petri successfully elicited a broad set of misaligned behaviors including autonomous deception, oversight subversion, whistleblowing, and cooperation with human misuse. The tool is available now at github.com/safety-research/petri.

Introduction

AI models are becoming more capable and are being deployed with wide-ranging affordances across more domains, increasing the surface area where misaligned behaviors might emerge. The sheer volume and complexity of potential behaviors far exceeds what researchers can manually test, making it increasingly difficult to properly audit each model.

Over the past year, we've been building automated auditing agents to help address this challenge. We used them in the Claude 4 and Claude Sonnet 4.5 System Cards to evaluate behaviors such as situational awareness, scheming, and self-preservation, and in a recent collaboration with OpenAI to surface reward hacking and whistleblowing. Our earlier work on alignment auditing agents found these methods can reliably flag misaligned behaviors, providing useful signal quickly. The UK AI Security Institute (AISI) also used a pre-release version of our tool in their testing of Sonnet 4.5.

Today, we're releasing Petri (Parallel Exploration Tool for Risky Interactions), an open-source tool that makes automated auditing capabilities available to the broader research community. Petri enables researchers to test hypotheses about model behavior in minutes, using AI agents to explore target models across realistic multi-turn scenarios. We want more people to audit models, and Petri automates a large part of the safety evaluation process—from environment simulation through to initial transcript analysis—making comprehensive audits possible with minimal researcher effort.

Figure 1: Manually building alignment evaluations often involves constructing environments, running models, reading transcripts, and aggregating the results. Petri automates much of this process.

Building an alignment evaluation requires substantial engineering effort: setting up environments, writing test cases, implementing scoring mechanisms. There aren't good tools to automate this work, and the space of potential misalignments is vast. If you can only test a handful of behaviors, you'll likely miss much of what matters.

  1. ^

    Note that I'm not a co-author on this work (though I do think it's exciting). I can't guarantee the authors will participate in discussions on this post.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

Petri AI审计 模型安全 自动化测试 Anthropic AI Safety Model Auditing Automated Testing
相关文章