LessWrong, September 30

GDPval Benchmark Evaluates AI's Real-World Capabilities


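OpenAI has released the GDPval benchmark, which evaluates AI capabilities on real-world tasks drawn from the sectors that contribute most to U.S. GDP. Human experts compare the AI's output with that of industry experts and choose the better one, and model performance is reported as a win rate. The study shows that frontier models are close to surpassing human experts, with AI win rates projected to reach 75% by mid-2026 and 95% by late 2027. The report also notes that most of U.S. GDP could be automated as early as 2027, while cautioning that new economic tasks may emerge in the future.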

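🔍 GDPval has human experts compare AI output with industry-expert output and reports a win rate, measuring AI capability on real-world tasks from the sectors that contribute most to U.S. GDP.

📈 The study shows that frontier models are close to surpassing human experts, with AI win rates projected to reach 75% by mid-2026 and 95% by late 2027, suggesting that AI capability is rapidly approaching human level.

⚠️ The report cautions that, although AI capability is improving quickly, new economic tasks that AI handles poorly may emerge in the future, and that most of U.S. GDP could be automated as early as 2027.

💡 GDPval offers a new tool for evaluating AI capability, and its projections (such as the 2026-2027 win-rate trend) provide a useful reference for future technical development.

🔬 The method focuses on frontier models and uses regression to forecast the trend in AI capability. The data points are few, but the trend is clear and serves as a warning.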

Published on September 30, 2025 11:53 AM GMT

OpenAI's new GDPval benchmark measures AI capabilities on real-world tasks from the sectors contributing most to U.S. GDP. Given a GDPval task, a human industry expert compares the model's deliverable to a deliverable produced by industry experts and chooses the preferred one. Model performance is thus reported as a win rate.

Example GDPval tasks from full set. Figure from the paper.

In the spirit of METR's task horizon study, here is a plot of model performance on GDPval against model release date.

The plot above uses a logit scale. Other scales can be found in the appendix below. Since only the best models concern us, the regression uses only models that outperform all previously released models, labelled as the green "Frontier models" in the plot. There are only three datapoints, so this extrapolation is highly uncertain even though the $R^2$ is high.
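As a sketch of what a straight-line fit on a logit scale amounts to, here is the model written out. The symbols $p$ (win rate), $t$ (release date), $a$ and $b$ (fitted intercept and slope) and $p^*$ (a target win rate) are notation introduced here for illustration, not taken from the post or the paper.

$$
\operatorname{logit}(p) = \ln\frac{p}{1-p} = a + b\,t
\qquad\Longrightarrow\qquad
t^* = \frac{\operatorname{logit}(p^*) - a}{b}
$$

Since $\operatorname{logit}(0.5) = 0$, $\operatorname{logit}(0.75) = \ln 3 \approx 1.10$ and $\operatorname{logit}(0.95) = \ln 19 \approx 2.94$, any such line takes about $\ln(19/3)/\ln 3 \approx 1.7$ times as long to go from 75% to 95% as it takes to go from 50% to 75%.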

Under this regression, a hypothetical model released today would already outperform an industry expert on an average GDPval task. By mid-2026, human experts would prefer AI deliverables 75% of the time. By late 2027, human experts would prefer AI deliverables 95% of the time.
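The extrapolation above can be sketched in a few lines of Python. This is a minimal illustration and not the code from the linked repo: the function names (`frontier`, `fit_line`, `date_reaching`) are hypothetical, the fit is an ordinary least-squares line through logit-transformed win rates, and the entries in `PLACEHOLDER_MODELS` are made-up values, not GDPval results.

```python
import math
from datetime import date


def logit(p):
    """Log-odds of a win rate p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def frontier(models):
    """Keep only models that beat every earlier-released model.

    `models` is a list of (name, release_date, win_rate) tuples.
    """
    kept, best = [], float("-inf")
    for name, released, win_rate in sorted(models, key=lambda m: m[1]):
        if win_rate > best:
            kept.append((name, released, win_rate))
            best = win_rate
    return kept


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def date_reaching(target, slope, intercept):
    """Date at which the fitted line reaches the target win rate."""
    ordinal = (logit(target) - intercept) / slope
    return date.fromordinal(round(ordinal))


# Placeholder data for illustration only; NOT actual GDPval win rates.
PLACEHOLDER_MODELS = [
    ("model-a", date(2024, 5, 1), 0.12),
    ("model-b", date(2024, 11, 1), 0.10),  # not frontier: worse than model-a
    ("model-c", date(2025, 4, 1), 0.30),
    ("model-d", date(2025, 8, 1), 0.45),
]

if __name__ == "__main__":
    front = frontier(PLACEHOLDER_MODELS)
    # Release dates are converted to day ordinals so they can be regressed on.
    xs = [released.toordinal() for _, released, _ in front]
    ys = [logit(win_rate) for _, _, win_rate in front]
    slope, intercept = fit_line(xs, ys)
    for target in (0.50, 0.75, 0.95):
        print(f"{target:.0%} win rate reached around",
              date_reaching(target, slope, intercept))
```

Swapping the placeholder tuples for the reported win rates and release dates would reproduce this kind of forecast. With only three frontier points the fitted dates move substantially if any single point changes, which is the uncertainty noted above.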

On job loss: GDPval contains tasks that are currently economically useful. The usual argument holds: there could be novel tasks in the future that are economically useful and that models do not perform well on.

The more interesting point is that most of the current U.S. GDP could be automated as early as 2027.

Thanks to Nikola Jurkovic for helpful comments.

Appendix

Repo Link

Here.

Next Steps

Fill in the data gap between GPT-4o and o3-high. This will involve running Sonnet 3.5, 3.7, and 4, and possibly o1, on GDPval. Evaluate Sonnet 4.5, which is forecast to score above 50%.