LessWrong, September 30
GDPval Benchmark Evaluates Real-World AI Capabilities

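OpenAI has released GDPval, a benchmark that evaluates AI capabilities on real-world tasks drawn from the sectors contributing most to U.S. GDP. Human experts compare an AI deliverable with one produced by industry experts and choose the better one, and model performance is reported as a win rate. The analysis suggests frontier models are close to surpassing human experts: the AI win rate is projected to reach 75% by mid-2026 and 95% by late 2027. It also notes that most of U.S. GDP could be automated as early as 2027, while cautioning that new economic tasks may emerge in the future.

🔍 GDPval has human experts compare AI output with industry-expert output and uses the resulting win rate to assess AI capability on real-world tasks from the sectors contributing most to U.S. GDP.

📈 The analysis shows frontier models are close to surpassing human experts, with the AI win rate projected to reach 75% by mid-2026 and 95% by late 2027, suggesting AI capability is rapidly approaching human level.

⚠️ The post cautions that, although AI capability is improving quickly, new economic tasks that AI handles poorly may emerge, and that most of U.S. GDP could be automated as early as 2027.

💡 GDPval offers a new tool for assessing AI capability, and its extrapolated figures (such as the 2026-2027 win-rate trend) provide a reference point for future technical development.

🔬 The method focuses on frontier models and uses regression to extrapolate the trend in AI capability. The data points are few, but the trend is clear and serves as a warning.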

Published on September 30, 2025 11:53 AM GMT

OpenAI's new GDPval benchmark measures AI capabilities on real-world tasks from the sectors contributing most to the U.S. GDP. Given a task on GDPval, a human industry expert compares the model deliverable to a deliverable by industry experts and chooses the preferred one. Model performances are thus reported as win rates.

Example GDPval tasks from full set. Figure from the paper.

In the spirit of METR's task horizon study, here is a plot of model performance on GDPval against model release date.

The plot above uses a logit scale. Other scales can be found in the appendix below. Since only the best models concern us, the regression only uses models that outperform all previously released models, labelled as the green "Frontier models" in the plot. There are only three datapoints, so this extrapolation is highly uncertain even if the $R^2$ is high.
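As a rough illustration of the procedure described above, here is a minimal sketch of selecting frontier models and fitting a straight line to the logit of the win rate against release date. The function names, the model names, and every number in `models` are hypothetical placeholders, not GDPval results, and the ordinary least-squares fit is an assumption about how such a regression could be done rather than the post's actual code.

```python
import math
from datetime import date


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def frontier(models):
    """Keep only models whose win rate beats every earlier release."""
    best = float("-inf")
    kept = []
    for name, released, win_rate in sorted(models, key=lambda m: m[1]):
        if win_rate > best:
            kept.append((name, released, win_rate))
            best = win_rate
    return kept


def fit_logit_line(points):
    """Least-squares fit of logit(win_rate) on release date (ordinal days)."""
    xs = [released.toordinal() for _, released, _ in points]
    ys = [logit(win_rate) for _, _, win_rate in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical placeholder data for illustration only (NOT GDPval numbers).
models = [
    ("model-a", date(2024, 1, 1), 0.10),
    ("model-b", date(2024, 7, 1), 0.08),  # below model-a, so not frontier
    ("model-c", date(2025, 1, 1), 0.25),
    ("model-d", date(2025, 7, 1), 0.45),
]

front = frontier(models)
slope, intercept = fit_logit_line(front)
print([name for name, _, _ in front])
print(f"log-odds change per day: {slope:.5f}")
```

With only a handful of frontier points, as in the post, the fitted slope is very sensitive to each point, which is why the extrapolation should be read with wide error bars.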

Under this regression, a hypothetical model released today would already outperform an industry expert on an average GDPval task. By mid-2026, human experts would prefer AI deliverables 75% of the time. By late 2027, human experts would prefer AI deliverables 95% of the time.
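Dates like these come from inverting the fitted line: if logit(win rate) = slope × day + intercept, then the day on which a target win rate is reached is (logit(target) − intercept) / slope. The sketch below shows that inversion. The slope and the 50% anchor date are made-up values chosen for illustration, not the fitted parameters from this post, and `crossing_date` is a hypothetical helper name.

```python
import math
from datetime import date


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def crossing_date(slope, intercept, target_win_rate):
    """Date on which the fitted logit line reaches the target win rate."""
    ordinal = (logit(target_win_rate) - intercept) / slope
    return date.fromordinal(round(ordinal))


# Hypothetical fit for illustration only: the line crosses 50% on the
# anchor date and the log-odds rise by 0.004 per day.
slope = 0.004
anchor = date(2025, 10, 1)
intercept = -slope * anchor.toordinal()

for target in (0.50, 0.75, 0.95):
    print(f"{target:.0%} win rate reached on {crossing_date(slope, intercept, target)}")
```

Because the logit grows without bound as the win rate approaches 100%, each additional step toward certainty takes longer under a linear trend, which is consistent with the gap between the 75% and 95% dates being larger than the gap between the 50% and 75% dates.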

On job loss: GDPval contains tasks that are currently economically useful. The usual argument holds: novel, economically useful tasks could emerge in the future that models do not perform well on.

The more interesting point is that most of the current U.S. GDP could be automated as early as 2027.

Thanks to Nikola Jurkovic for helpful comments.

Appendix

Repo Link

Here.

Next Steps

Fill in the data gap between GPT-4o and o3-high. This will involve running Sonnets 3.5, 3.7, 4, and possibly o1, on GDPval. Evaluate Sonnet 4.5, which is forecasted to score above 50%.

Extra Plots

Extrapolated dates

Raw win rate on the y-axis

Log win rate on the y-axis

Odds on the y-axis

Table of Extrapolated Dates

The table here includes results from regressing on all models reported in the paper, and not just the frontier ones.


