MarkTechPost@AI · September 26
OpenAI Introduces GDPval to Evaluate AI on Real-World Economic Tasks

OpenAI has released a new evaluation suite called GDPval, designed to measure how AI models perform on real-world, economically valuable tasks across 44 occupations in nine GDP-dominant U.S. sectors. Unlike academic benchmarks, GDPval focuses on authentic deliverables such as presentations, spreadsheets, briefs, CAD files, and audio/video, graded by occupational experts through blinded pairwise comparisons. The suite comprises 1,320 tasks created by industry professionals, covers multi-modal file handling, and includes a 220-task "gold" subset plus an experimental automated grader. Results show frontier models approaching expert quality on a substantial fraction of tasks, and time-cost analyses point to potential gains from AI-assisted workflows.

📊 **GDPval's evaluation approach**: GDPval moves beyond traditional academic benchmarks to measure the economic value AI delivers across 44 real occupations. By simulating actual work scenarios and having experienced occupational experts perform blinded pairwise comparisons, it more accurately reflects AI's real-world potential, especially for producing deliverables such as presentations, spreadsheets, and CAD files.

🛠️ **Realistic, diverse task design**: The suite contains 1,320 tasks designed by industry professionals. Tasks mirror real work requirements and support multi-modal file handling, including documents, slides, images, audio, video, spreadsheets, and CAD files. This high-fidelity design, with some tasks supplying up to dozens of reference files, effectively tests AI's ability to process complex information and produce structured outputs.

📈 **Model performance and improvement directions**: Evaluation on the "gold" subset shows current frontier models approaching human-expert quality on many tasks. Analysis finds shortfalls in instruction following, formatting, data usage, and hallucinations, while increased reasoning effort and stronger scaffolding (e.g., format checks and rendering outputs for self-inspection) yield predictable performance gains.

💰 **Economic and time-cost analysis**: GDPval uses scenario analyses to quantify the time and cost differences between AI-assisted and human-only workflows. Accounting for human completion time, reviewer time, model latency, and API cost, it finds that many task classes can achieve significant time and cost savings even after review overhead is included, providing quantitative grounding for AI adoption in business settings.

💡 **An automated grader as a useful proxy**: To improve accessibility and iteration speed, GDPval ships an experimental automated grader. Its agreement with human experts is reasonably strong; while it cannot fully replace expert review, it serves as a useful proxy that helps researchers iterate on models quickly.

OpenAI introduced GDPval, a new evaluation suite designed to measure how AI models perform on real-world, economically valuable tasks across 44 occupations in nine GDP-dominant U.S. sectors. Unlike academic benchmarks, GDPval centers on authentic deliverables—presentations, spreadsheets, briefs, CAD artifacts, audio/video—graded by occupational experts through blinded pairwise comparisons. OpenAI also released a 220-task “gold” subset and an experimental automated grader hosted at evals.openai.com.

From Benchmarks to Billables: How GDPval Builds Tasks

GDPval aggregates 1,320 tasks sourced from industry professionals averaging 14 years of experience. Tasks map to O*NET work activities and include multi-modal file handling (docs, slides, images, audio, video, spreadsheets, CAD), with up to dozens of reference files per task. The gold subset provides public prompts and references; primary scoring still relies on expert pairwise judgments due to subjectivity and format requirements.
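
To make the task structure concrete, here is a minimal sketch of what one task record might look like. The field names and example values are hypothetical illustrations, not the actual schema of the released dataset:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GDPvalTask:
    """Hypothetical layout for one GDPval task record.

    Field names are illustrative only; consult the released dataset on
    Hugging Face for the real schema.
    """
    task_id: str
    occupation: str                   # one of the 44 O*NET-mapped occupations
    sector: str                       # one of the 9 GDP-dominant U.S. sectors
    prompt: str                       # the full, precisely specified request
    reference_files: List[str] = field(default_factory=list)  # up to dozens per task
    deliverable_format: str = "docx"  # e.g. docx, xlsx, pptx, cad, audio, video

# Example record in the spirit of the gold subset (contents invented):
task = GDPvalTask(
    task_id="gold-0001",
    occupation="Mechanical Engineer",
    sector="Manufacturing",
    prompt="Update the attached CAD assembly per the engineering change notice.",
    reference_files=["change_notice.pdf", "assembly_v3.step"],
    deliverable_format="cad",
)
```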

https://openai.com/index/gdpval/

What the Data Says: Model vs. Expert

On the gold subset, frontier models approach expert quality on a substantial fraction of tasks under blind expert review, with model progress trending roughly linearly across releases. Reported model-vs-human win/tie rates are near parity for the top models; error profiles cluster around instruction following, formatting, data usage, and hallucinations. Increased reasoning effort and stronger scaffolding (e.g., format checks, artifact rendering for self-inspection) yield predictable gains.
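
As a rough illustration of what such scaffolding can look like, the sketch below wraps a model call in a format check plus a render-and-self-inspect loop. Every helper here (`generate`, `check_format`, `render_and_critique`) is a hypothetical stand-in, not part of any published GDPval harness:

```python
# Minimal scaffolding loop: format check + artifact rendering for self-inspection.
# All model-facing helpers are hypothetical placeholders.

def generate(prompt: str) -> bytes:
    """Stand-in for a model/agent call that returns the deliverable as bytes."""
    return b"PK\x03\x04 ...fake xlsx bytes..."

def check_format(artifact: bytes, expected_ext: str) -> bool:
    """Cheap validity check, e.g. magic bytes or a real parse attempt."""
    return artifact.startswith(b"PK\x03\x04") if expected_ext == "xlsx" else bool(artifact)

def render_and_critique(artifact: bytes, prompt: str) -> str | None:
    """Render the artifact (e.g. a slide to PNG) and ask the model to critique it.
    Returns a revision note, or None if the output passes self-inspection."""
    return None  # placeholder: accept everything

def scaffolded_attempt(prompt: str, expected_ext: str, max_rounds: int = 3) -> bytes:
    artifact = generate(prompt)
    for _ in range(max_rounds):
        if not check_format(artifact, expected_ext):
            artifact = generate(prompt + "\n\nThe last output was not a valid "
                                f".{expected_ext} file; regenerate it.")
            continue
        note = render_and_critique(artifact, prompt)
        if note is None:
            break  # passed both the format check and self-inspection
        artifact = generate(prompt + f"\n\nRevise per this critique: {note}")
    return artifact

result = scaffolded_attempt("Build the Q3 budget spreadsheet.", "xlsx")
```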

Time–Cost Math: Where AI Pays Off

GDPval runs scenario analyses comparing human-only to model-assisted workflows with expert review. It quantifies (i) human completion time and wage-based cost, (ii) reviewer time/cost, (iii) model latency and API cost, and (iv) empirically observed win rates. Results indicate potential time/cost reductions for many task classes once review overhead is included.
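
To make the scenario arithmetic concrete, here is a small sketch of the expected-cost comparison such an analysis implies. All numeric inputs (wage, hours, win rate, API cost) are invented assumptions for illustration, not figures from the GDPval report:

```python
# Expected cost of human-only vs. model-assisted-with-review workflows.
# All numeric inputs below are illustrative assumptions, not GDPval results.

def human_only_cost(hours: float, wage: float) -> float:
    return hours * wage

def model_assisted_cost(review_hours: float, wage: float, api_cost: float,
                        win_rate: float, redo_hours: float) -> float:
    # Review time and API cost are always paid; if the model output fails
    # expert review (probability 1 - win_rate), a human redoes the task.
    expected_redo = (1.0 - win_rate) * redo_hours * wage
    return review_hours * wage + api_cost + expected_redo

wage = 60.0          # $/hour (assumed)
task_hours = 6.0     # human completion time (assumed)
review_hours = 1.0   # expert review time (assumed)
api_cost = 2.0       # model inference cost in $ (assumed)
win_rate = 0.45      # fraction of tasks where the model output is accepted (assumed)

baseline = human_only_cost(task_hours, wage)
assisted = model_assisted_cost(review_hours, wage, api_cost, win_rate, task_hours)
print(f"human-only: ${baseline:.0f}, model-assisted: ${assisted:.0f}")
# -> human-only: $360, model-assisted: $260 under these assumptions
```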

Automated Judging: Useful Proxy, Not Oracle

For the gold subset, an automated pairwise grader shows ~66% agreement with human experts, within ~5 percentage points of human–human agreement (~71%). It’s positioned as an accessibility proxy for rapid iteration, not a replacement for expert review.
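
For intuition, grader-human agreement of this kind reduces to the fraction of pairwise comparisons where two judges pick the same winner. A minimal sketch on toy labels (not the released judgments):

```python
# Agreement = fraction of pairwise comparisons where two graders agree.
# "m" = model deliverable preferred, "h" = human deliverable preferred.
# The label lists below are toy data for illustration only.

def agreement(a: list[str], b: list[str]) -> float:
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

human_expert  = ["m", "h", "h", "m", "h", "m", "h", "h", "m", "h"]
auto_grader   = ["m", "h", "m", "m", "h", "h", "h", "h", "h", "h"]
second_expert = ["m", "h", "h", "m", "m", "m", "h", "h", "h", "h"]

print(f"auto vs. expert:   {agreement(human_expert, auto_grader):.0%}")   # 70% on toy data
print(f"expert vs. expert: {agreement(human_expert, second_expert):.0%}") # 80% on toy data
```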


Why This Isn’t Yet Another Benchmark

Unlike static academic benchmarks, GDPval grades authentic, file-centric deliverables through blinded expert preference, tying measured progress directly to economically valuable occupational work.

Boundary Conditions: Where GDPval Doesn’t Reach

GDPval-v0 targets computer-mediated knowledge work. Physical labor, long-horizon interactivity, and organization-specific tooling are out of scope. Tasks are one-shot and precisely specified; ablations show performance drops with reduced context. Construction and grading are resource-intensive, motivating the automated grader—whose limits are documented—and future expansion.

Fit in the Stack: How GDPval Complements Other Evals

GDPval augments existing OpenAI evals with occupational, multi-modal, file-centric tasks and reports human preference outcomes, time/cost analyses, and ablations on reasoning effort and agent scaffolding. The suite is versioned; v0 is expected to broaden coverage and realism over time.

Summary

GDPval formalizes evaluation for economically relevant knowledge work by pairing expert-built tasks with blinded human preference judgments and an accessible automated grader. The framework quantifies model quality and practical time/cost trade-offs while exposing failure modes and the effects of scaffolding and reasoning effort. Scope remains v0—computer-mediated, one-shot tasks with expert review—yet it establishes a reproducible baseline for tracking real-world capability gains across occupations.


Check out the Paper, technical details, and dataset on Hugging Face. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post OpenAI Introduces GDPval: A New Evaluation Suite that Measures AI on Real-World Economically Valuable Tasks appeared first on MarkTechPost.
