AI Snake Oil · September 12
The AI Scientist Faces Challenges: The Limits of Automated Scientific Discovery

Sakana AI recently released an "AI Scientist" aimed at fully automatic scientific discovery, but in practice it has shown many shortcomings. The system has no check for research novelty, so it may duplicate existing work, and its output underwent no human or expert peer review, leaving the quality of the generated papers in doubt. These problems are not unique to Sakana: most AI agents are hard to evaluate for real-world impact. This post introduces CORE-Bench, a new benchmark for measuring how well AI can reproduce computational research, and discusses what it implies about "general intelligence" and the economic impact of AI. Automating the reproduction of research, rather than expecting AI to make scientific discoveries entirely on its own, is the more feasible path for now.

🤖 **The reality of AI-driven scientific discovery:** Sakana AI's "AI Scientist" has ambitious goals but serious flaws in practice. It does not check whether its research is novel, so its generated papers may rehash existing work. And without any human or expert peer review, the quality of its output cannot be guaranteed and falls short of expectations. Fully automated scientific discovery still faces major obstacles.

🎯 **CORE-Bench, a new benchmark for reproducing computational research:** To assess more accurately what AI can actually do in research, the authors introduce CORE-Bench, a benchmark for measuring how well AI can automatically reproduce existing computational research. By mirroring real reproduction workflows, CORE-Bench offers a more objective and reliable way to evaluate how well AI understands and executes computational research tasks.

💡 **Rethinking general intelligence, and the economic impact:** The CORE-Bench work prompts a rethink of the concept of "general intelligence." The researchers found that targeted, task-specific adaptation of a generalist AI agent (such as AutoGPT) can significantly improve its performance on a given task. Rather than pursuing out-of-the-box generality in one leap, it may be more productive to focus on adapting AI so that it performs specific, valuable tasks well. From an economic perspective, even if some development effort is required, automating repetitive work that consumes millions of researcher-hours every year (such as computational reproduction) would still yield enormous benefits.

Last month, Sakana AI released an "AI Scientist", which the company called "the first comprehensive system for fully automatic scientific discovery". It was touted as being able to accelerate science without suffering from human limitations.

Unfortunately, the "AI Scientist" has many shortcomings. It has no checks for novelty, so generated papers could rehash earlier work. And Sakana did not perform any human review (let alone expert “peer” review) of the generated papers—so it is unclear if the papers are any good (apparently they are not). While these flaws are particularly flagrant in Sakana's case, the lack of good evaluation affects most AI agents, making it hard to measure their real-world impact.

Today, we introduce a new benchmark for measuring how well AI can reproduce existing computational research. We also share how this project has changed our thinking about “general intelligence” and the potential economic impact of AI. Read the paper.

CORE-Bench: A new benchmark for evaluating AI agents on reproducing research

Visions of AI automating science are enticing, but they aren't within reach, and chasing them leads to flawed science. In contrast, using AI for well-scoped tasks such as verifying computational reproducibility can save a lot of time and redirect effort towards more productive scientific activity. AI could also help find relevant literature, write code to rapidly test ideas, and perform other computational tasks.

In a new paper, we introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark for measuring how well AI can automate computational reproducibility, that is, reproducing a paper’s findings when the code and data are available. The authors are Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, and Arvind Narayanan. CORE-Bench is a first step in a larger project to rigorously evaluate progress in automating research tasks of increasing difficulty.

Computationally reproducing a study is a far more limited task than replication, which requires re-running experiments that might involve human subjects. Even this limited reproducibility task is hard: in the 2022 Machine Learning Reproducibility Challenge, over a third of the papers could not be reproduced even when the experts attempting reproduction had the code and data.

If AI could automate this mundane yet important task, researchers could automate the implementation of baselines, reviewers could more easily assess if a paper has flaws, and journals and conferences could more easily verify if submitted and published papers are reproducible.

We created CORE-Bench using scientific papers and their accompanying code and data repositories. We used Code Ocean to source papers that were likely to be reproducible. We manually reproduced 90 papers from computer science, medicine, and social science, and curated a set of questions for each paper so that an agent's answers can be verified against our reproduction.

We release CORE-Bench with three difficulty levels. Tasks in all three levels require the use of both language and vision capabilities. The hardest version closely resembles real-world reproduction attempts, and we expect that improvements on the benchmark will translate to agents that are actually useful to scientists.
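
To make the setup concrete, each benchmark instance can be thought of as a code/data repository plus a set of verification questions at a given difficulty level. The sketch below is our own illustration of that idea; the field names, difficulty encoding, and grading rule are assumptions, not the benchmark's actual schema or scorer.

```python
# A minimal, illustrative sketch of a CORE-Bench-style task record.
# Field names and the grading rule are our own simplification,
# not the benchmark's actual schema or scorer.
from dataclasses import dataclass, field


@dataclass
class ReproductionTask:
    paper_id: str                  # identifier of the source paper
    repo_url: str                  # code/data capsule to reproduce
    difficulty: str                # "easy" | "medium" | "hard"
    questions: list[str] = field(default_factory=list)         # e.g. "What accuracy is reported in Table 2?"
    expected_answers: list[str] = field(default_factory=list)  # gold answers from the manual reproduction


def is_correct(task: ReproductionTask, answers: list[str]) -> bool:
    """Count a task as solved only if every question is answered correctly."""
    return len(answers) == len(task.expected_answers) and all(
        a.strip().lower() == gold.strip().lower()
        for a, gold in zip(answers, task.expected_answers)
    )
```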

To implement baselines, we tested the generalist AutoGPT agent and also implemented a task-specific modification to AutoGPT, which we call CORE-Agent. While the task-specific version improved accuracy significantly, there is still massive room for improvement: the best agent (CORE-Agent with GPT-4o) has an accuracy of 22% on CORE-Bench-Hard.
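
For context on how such accuracy numbers are computed, a scoring loop along the following lines (continuing the illustrative sketch above; `agent` is a placeholder for whatever system is under test) would report the fraction of tasks an agent fully solves:

```python
# Illustrative scoring loop: accuracy = fraction of tasks fully solved.
# `agent` stands in for the system under test (AutoGPT, CORE-Agent, ...).
from typing import Callable


def evaluate(agent: Callable[[ReproductionTask], list[str]],
             tasks: list[ReproductionTask]) -> float:
    solved = sum(is_correct(task, agent(task)) for task in tasks)
    return solved / len(tasks)

# e.g. evaluate(core_agent, hard_tasks) would yield roughly 0.22 for the
# best reported configuration (CORE-Agent with GPT-4o) on CORE-Bench-Hard.
```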

Rethinking generality

Computational reproducibility requires setting up the code environment correctly, running the code, and seeing if it produces the same results as reported in the paper. Using the shell and other tools correctly is still tricky for LLMs. When we evaluated generalist agents like AutoGPT, we weren't surprised by their poor accuracy (less than 10% on CORE-Bench-Hard). 
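
Concretely, the loop an agent has to get right looks roughly like this: install the environment, run the paper's entry point, and compare the regenerated numbers against the reported ones. The sketch below is a hand-written illustration of that workflow, with assumed file names (`requirements.txt`, `run_experiments.py`, `results.json`); it is not the procedure of any particular agent or paper.

```python
# A hand-written sketch of the reproduction workflow an agent must automate:
# set up the environment, run the code, and check outputs against the paper.
# Paths, commands, and the tolerance are illustrative assumptions.
import json
import subprocess


def reproduce(repo_dir: str, reported: dict[str, float], tol: float = 1e-2) -> bool:
    # 1. Set up the code environment (often the hardest step in practice).
    subprocess.run(["pip", "install", "-r", "requirements.txt"],
                   cwd=repo_dir, check=True)

    # 2. Run the paper's entry point; assume it writes metrics to results.json.
    subprocess.run(["python", "run_experiments.py"], cwd=repo_dir, check=True)

    # 3. Compare the regenerated numbers with those reported in the paper.
    with open(f"{repo_dir}/results.json") as f:
        regenerated = json.load(f)
    return all(abs(regenerated[key] - value) <= tol
               for key, value in reported.items())
```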

Yet, with a few person-days of effort, we were able to build CORE-Agent by modifying AutoGPT, which more than doubled accuracy on the hardest level. We also built a task-specific agent from scratch, but modifying AutoGPT was far less time consuming while also resulting in a stronger agent. We are cautiously optimistic that this approach can be pushed to yield agents that perform well enough to be useful in practice. 

Figure: Simple task-specific modifications allow CORE-Agent to outperform AutoGPT.
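
The kind of task-specific modification involved is light-touch: keep the generalist agent loop, but seed it with a reproduction-specific prompt, restrict it to the relevant tools, and template the goal from the task. The sketch below is our own schematic of that idea, not CORE-Agent's actual implementation; the constructor arguments are hypothetical.

```python
# Schematic of turning a generalist agent into a task-specific one:
# keep the agent loop, swap in a task-specific prompt and tool allow-list.
# This is our own illustration; the argument names are hypothetical.

REPRODUCTION_PROMPT = (
    "You are reproducing a scientific paper's computational results. "
    "First read the README, then install dependencies, run the code, "
    "and answer each question using only the regenerated outputs."
)

REPRODUCTION_TOOLS = ["read_file", "run_shell_command", "inspect_image"]


def make_task_specific_agent(generalist_agent_cls):
    """Wrap a generalist agent class with reproduction-specific defaults."""
    def build(task):
        return generalist_agent_cls(
            system_prompt=REPRODUCTION_PROMPT,   # replaces the generic prompt
            tools=REPRODUCTION_TOOLS,            # restrict to relevant tools
            goal=f"Reproduce {task.repo_url} and answer: {task.questions}",
        )
    return build
```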

If this pattern of being able to easily adapt a generalist agent to produce a task-specific agent holds in other areas, it should make us rethink generality. Generality roughly translates to being able to use the same model or agent without modification to perform a variety of tasks. This notion of generality underpins how Artificial General Intelligence (or AGI) is usually understood and the hopes and fears that accompany it. 

But at least from the point of view of economic impacts, generality might be a red herring. For a task such as computational reproducibility on which expert humans collectively spend millions of hours every year, being able to automate it would be hugely impactful — regardless of whether the AI system did so out of the box, or after a few person days (or even a person year) of programmer effort. 

In the AI Snake Oil book, we define generality as the inverse of task-specificity, and analyze how the history of AI (and computing) can be seen as the pursuit of gradually increasing generality. Increasing generality means decreasing the human effort it takes to build an AI system to perform a given task. From this perspective, systems like AutoGPT may be more general than most people (including us) gave them credit for.

Yet, definitions of AGI typically insist that a single system be able to do everything out of the box. There is no systematic effort to track how the human effort needed to build task-specific AI is changing over time. Just as we’ve argued against flawed conceptions of generality that overestimate AI progress, we should avoid flawed conceptions of generality that underestimate it. 

Read the CORE-Bench paper here.

Further reading
