第三方评估生态系统的重要性与挑战

A robust, third-party evaluation ecosystem is essential for assessing AI capabilities and risks, but the current evaluations landscape is limited. Developing high-quality, safety-relevant evaluations remains challenging, and the demand is outpacing the supply. To address this, today we're introducing a new initiative to fund evaluations developed by third-party organizations that can effectively measure advanced capabilities in AI models. Our investment in these evaluations is intended to elevate the entire field of AI safety, providing valuable tools that benefit the whole ecosystem.

In this post, we describe our initiative to source new evaluations for measuring advanced model capabilities and outline our motivations and the specific types of evaluations we're prioritizing.

If you have a proposal, apply through our application form.

Our highest priority focus areas

We are interested in sourcing three key areas of evaluation development, which we'll describe further in the post:

AI Safety Level assessmentsAdvanced capability and safety metricsInfrastructure, tools, and methods for developing evaluations

AI Safety Level assessments

We're seeking evaluations that help us measure the AI Safety Levels (ASLs) defined in our Responsible Scaling Policy. These levels determine the safety and security requirements for models with specific capabilities. Robust ASL evaluations are crucial for ensuring we develop and deploy our models responsibly. This category includes:

Cybersecurity —

Chemical, biological, radiological and nuclear (CBRN) risks —

Model autonomy —

AI research and development:

Advanced autonomous behaviors:

Autonomous Capabilities evaluations in our Responsible Scaling Policy

METR's public task suite.

Self-replication and adaptation:

Other national security risks —

Defining detailed and comprehensive threat models for how misuse can be leveraged by different actorsConnecting these threat models to measurable, succinct evaluation metrics

Social manipulation —

Developing a robust theory of how these capabilities escalate real-world risks beyond current baselinesIsolating and assessing the model's unique contribution to these risks

Misalignment risks —

learn dangerous goals and motivations

even after safety training

persuasiveness

Advanced capability and safety metrics

Beyond our ASL assessments, we want to develop evaluations that assess advanced model capabilities and relevant safety criteria. These metrics will provide a more comprehensive understanding of our models' strengths and potential risks. This category includes:

Advanced science —

Google-Proof Q&A (GPQA)

Knowledge synthesis (combining insights from multiple bodies of work)Graduate-level knowledge beyond existing training dataAutonomous end-to-end research project executionNovel hypothesis and design generationIn-lab troubleshooting of protocols and standard operating proceduresTacit knowledge (the kind that can only be acquired through apprenticeship in a lab)Long-horizon tasks that involve lots of decisions to get to a successful outcomeAutomated data analysis

Harmfulness and refusals

Distinguishing between dual-use and non-dual-use informationAccurately identifying truly harmful CBRN-related outputsDetecting attempts to automate cyber incidents

Improved multilingual evaluations —

Societal impacts —

Infrastructure, tools, and methods for developing evaluations

We're interested in funding tools and infrastructure that streamline the development of high-quality evaluations. These will be critical to achieve more efficient and effective testing across the AI community. This category includes:

Templates/No-code evaluation development platforms —

Evaluations for model grading —

Uplift trials —

The development of networks of high-quality study populations who are motivated to complete the tasksTooling to easily run and analyze trials

Principles of good evaluations

Developing great evaluations is hard. Even some of the most experienced developers fall into common traps, and even the best evaluations are not always indicative of risks they purport to measure. Below we list some of the characteristics of good evaluations that we’ve learned through trial-and-error:

1. Sufficiently difficult: Evaluations should be relevant for measuring the capabilities listed for levels ASL-3 or ASL-4 in our Responsible Scaling Policy, and/or human-expert level behavior.

2. Not in the training data: Too often, evaluations end up measuring model memorization because the data is in its training set. Where possible and useful, make sure the model hasn’t seen the evaluation. This helps indicate that the evaluation is capturing behavior that generalizes beyond the training data.

3. Efficient, scalable, ready-to-use: Evaluations should be optimized for efficient execution, leveraging automation where possible. They should be easily deployable using existing infrastructure with minimal setup.

4. High volume where possible: All else equal, evaluations with 1,000 or 10,000 tasks or questions are preferable to those with 100. However, high-quality, low-volume evaluations are also valuable.

5. Domain expertise: If the evaluation is about expert performance on a particular subject matter (e.g. science), make sure to use subject matter experts to develop or review the evaluation.

6. Diversity of formats: Consider using formats that go beyond multiple choice, such as task-based evaluations (for example, seeing if code passes a test or a flag is captured in a CTF), model-graded evaluations, or human trials.

7. Expert baselines for comparison: It is often useful to compare the model’s performance to the performance of human experts on that domain.

8. Good documentation and reproducibility: We recommend documenting exactly how the evaluation was developed and any limitations or pitfalls it is likely to have. Use standards like the Inspect or the METR standard where possible.

9. Start small, iterate, and scale: Start by writing just one to five questions or tasks, run a model on the evaluation, and read the model transcripts. Frequently, you’ll realize the evaluation doesn’t capture what you want to test, or it’s too easy.

10. Realistic, safety-relevant threat modeling: Safety evaluations should ideally have the property that if a model scored highly, experts would believe that a major incident could be caused. Most of the time, when models have performed highly, experts have realized that high performance on that version of the evaluation is not sufficient to worry them.

How to submit a proposal

You can submit a proposal on our application form. For any questions, please reach out to eval-initiative@anthropic.com.

Our team will review submissions on a rolling basis and follow up with selected proposals to discuss next steps. We offer a range of funding options tailored to the needs and stage of each project.

Our experience has shown that refining an evaluation typically requires several iterations. You will have the opportunity to interact directly with our domain experts from the Frontier Red Team, Finetuning, Trust & Safety, and other relevant teams. Our teams can provide guidance to help shape your evaluations for maximum impact.

We hope this initiative serves as a catalyst for progress towards a future where comprehensive AI evaluation is an industry standard. We invite you to join us in this important work and help shape the path forward.

Our highest priority focus areas

AI Safety Level assessments

Advanced capability and safety metrics

Infrastructure, tools, and methods for developing evaluations

Principles of good evaluations

How to submit a proposal

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签