Is the gap between open and closed models growing? Evidence from WeirdML

This article introduces the WeirdML benchmark, which evaluates large language models (LLMs) on weird machine learning tasks that require careful thinking and an understanding of the data and its properties. The study runs the major historical models released since GPT-4 in June 2023 to trace how these hard coding capabilities have evolved. Using scatter plots and a timeline analysis, it shows the relationship between model release dates and accuracy, and measures the delay with which open-weights models catch up to capabilities first reached by closed models. It finds that although models such as Claude Opus-3 and Sonnet-3.5 made notable advances, they were soon surpassed by o1-preview and other reasoning models. Open models lag behind closed models, and the gap has not shrunk. The article also discusses upcoming datapoints that may matter, such as the comparison between DeepSeek's R2 and GPT-5, and Meta's commitment to open models.

🚀 The WeirdML benchmark challenges large language models (LLMs) with weird machine learning tasks that demand original thinking and an understanding of the data. Through a set of carefully designed tasks, it probes how models perform at complex reasoning and data-sensitive problem solving.

📈 By analyzing how the major models since GPT-4 (June 2023) perform on WeirdML, the article traces how LLM capabilities on hard coding tasks have evolved. A scatter plot shows the relationship between each model's release date and its accuracy, giving a clear view of the development trajectory.

📊 The comparison shows that closed models generally lead open models on these capabilities: for example, it took more than a year for Llama-3-405B to match GPT-4. Claude Opus-3 and Sonnet-3.5 improved markedly but were quickly surpassed by o1-preview and the reasoning models. OpenAI's o3 pushed the closed frontier further, and the pace at which open models are catching up does not appear to be closing the gap.

⏳ A timeline analysis quantifies the delay open-weights models experience in reaching accuracy thresholds already achieved by closed-weights models. These delays reflect the current capability gap between open and closed models and suggest that the gap has not shrunk significantly over time.

💡 The article notes that evaluations of upcoming models (such as the comparison between DeepSeek's R2 and GPT-5), along with the progress of Meta's superintelligence team and its commitment to open models, will be key to understanding where the trend is heading. These factors may significantly shape the future landscape of AI capabilities.

Published on August 5, 2025 8:20 AM GMT

Previous post: Introducing the WeirdML Benchmark 

WeirdML is a benchmark challenging LLMs to solve a set of weird and unusual machine learning tasks designed to require careful thinking and understanding of the data and its properties. We have recently run all the major historical models we could find, going back to GPT-4 in June 2023[1], to study the evolution of these hard coding capabilities over time.

This scatter plot shows the relationship between model release dates and their achieved accuracy. Each icon corresponds to a model positioned according to its release date on the X-axis and its overall accuracy on the Y-axis, providing an overview of how model performance has progressed chronologically. The actual datapoints are in the middle of each company logo, not the text.

We also compared the frontier among open-weights models over time with the closed models. We find that it took more than a year for GPT-4's capability to be replicated by Llama-3-405B. The Claude models Opus-3 (Mar 2024) and Sonnet-3.5 (Jun 2024) made significant advances over GPT-4, but were completely eclipsed by the arrival of o1-preview and the reasoning models (Sept 2024). Four months later, DeepSeek-R1 provided the first really capable open reasoning model. However, it did not actually catch up to the closed reasoning models, at least not on these kinds of tasks. o3 (Apr 2025) advanced the closed frontier further, and a new DeepSeek-R1 version (May 2025) improved on the original but still did not catch up to o1. Now several Chinese models (Qwen3, Kimi-K2 and GLM-4.5) have roughly caught up to DeepSeek, but not advanced further. So 11 months later, open models have still not caught up to the o1 advance, suggesting that the gap between open and closed models is not shrinking over time.

This scatter plot compares the progress of open-weights versus closed-weights models over time, plotting their release dates against their achieved accuracy. The step-like frontier curves represent the best accuracy achieved by open-weights (blue) and closed-weights (green) models, respectively, at any given time. Shaded regions emphasize periods where closed models held a clear accuracy advantage, highlighting the competitive dynamics between open and closed approaches. The actual datapoints are in the middle of each company logo, not the text.

This timeline visualizes the delays experienced by open-weights models in reaching various accuracy thresholds initially achieved by closed-weights models. Each horizontal line indicates the period from when a closed-weights model first reached an accuracy threshold (left icon) until an open-weights model achieved the same level (right icon). The annotated delays, in months, quantify the gap between closed-weights and open-weights models over time.
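
As a concrete illustration, here is a minimal sketch of how step-like frontier curves and catch-up delays like those in the figures could be computed from release-date/accuracy records. The data, variable names, and the 30.44-day month length are assumptions for illustration only, not the actual WeirdML analysis code.

```python
# Minimal sketch (not the benchmark's actual analysis code) of computing
# the step-like frontier curves and open-vs-closed catch-up delays from
# (model, release date, accuracy, open-weights?) records.
# All data below are illustrative placeholders, not real WeirdML numbers.
from datetime import date

models = [
    # (name, release date, overall accuracy, open weights?)
    ("closed-a", date(2023, 6, 13), 0.30, False),
    ("open-a",   date(2024, 7, 23), 0.31, True),
    ("closed-b", date(2024, 9, 12), 0.45, False),
    ("open-b",   date(2025, 1, 20), 0.40, True),
]

def frontier(records):
    """Best accuracy achieved so far, as a list of (date, accuracy, name) steps."""
    best, steps = 0.0, []
    for name, day, acc, _ in sorted(records, key=lambda r: r[1]):
        if acc > best:
            best = acc
            steps.append((day, best, name))
    return steps

closed_frontier = frontier([m for m in models if not m[3]])
open_frontier = frontier([m for m in models if m[3]])

# For each closed-frontier step, find how long it took an open-weights
# model to first reach the same accuracy threshold.
for c_day, c_acc, c_name in closed_frontier:
    match = next(((d, n) for d, a, n in open_frontier if a >= c_acc and d >= c_day), None)
    if match is None:
        print(f"{c_name} ({c_acc:.2f}) not yet matched by an open-weights model")
    else:
        months = (match[0] - c_day).days / 30.44  # approximate month length
        print(f"{c_name} ({c_acc:.2f}) matched by {match[1]} after {months:.1f} months")
```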

We have to extrapolate based on very little data here, and any conclusions are tentative. The future R2 from DeepSeek and the comparison to GPT-5 will probably be the next important datapoint in this discussion, as well as the open questions regarding the new Meta superintelligence team, its capability as an AI lab and its commitment to open models going forward.

Appendix: Full Model Performance Table

This figure provides a comprehensive summary of model performance across multiple metrics for models that have results for all included tasks. The 'Average Accuracy Across Tasks' column shows the overall mean accuracy (bold number), calculated as the average of the mean max accuracy for each task. That is, for each model, we take the maximum accuracy of the 5 iterations per run, average these values over all the runs for a given task (typically 5 runs/model/task), and then average these results over all 19 tasks. The thick horizontal bar represents the 95% confidence interval around this mean, estimated using bootstrap, reflecting the variability between different runs on each task. Individual task means are shown as scatter points, with distinct markers for each task type (see legend on top). The 'Cost per Run (USD)' column presents the average total cost incurred per model run, including all iterations. The 'Code Length (#Lines)' column illustrates the distribution of Python code lengths generated by each model, with the thick horizontal bar spanning the 10th to 90th percentile and a vertical line indicating the median number of lines. Finally, the 'Code Exec Time (s)' column displays a histogram of the execution times for all Python scripts generated by the model across all tasks.
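
To make the aggregation concrete, here is a small sketch of how the mean-of-mean-max accuracy and a bootstrap confidence interval over runs could be computed. The data layout (accuracies[task][run] as a list of per-iteration accuracies), the function names, and the resampling scheme are assumptions for illustration, not the benchmark's actual pipeline.

```python
# Illustrative sketch of the aggregation described in the caption above,
# assuming results are stored as accuracies[task][run] = list of
# per-iteration accuracies. Layout and names are assumptions.
import random

def average_accuracy(accuracies):
    """Mean over tasks of (mean over runs of (max over the run's iterations))."""
    task_means = []
    for runs in accuracies.values():
        run_maxes = [max(iterations) for iterations in runs]   # best of ~5 iterations per run
        task_means.append(sum(run_maxes) / len(run_maxes))     # mean over ~5 runs per task
    return sum(task_means) / len(task_means)                   # mean over the 19 tasks

def bootstrap_ci(accuracies, n_boot=2000, alpha=0.05):
    """Approximate 95% CI by resampling runs within each task with replacement."""
    stats = sorted(
        average_accuracy(
            {task: [random.choice(runs) for _ in runs] for task, runs in accuracies.items()}
        )
        for _ in range(n_boot)
    )
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

# Example with made-up numbers: 2 tasks, 2 runs each, 3 iterations per run.
example = {
    "task_a": [[0.2, 0.5, 0.4], [0.3, 0.6, 0.1]],
    "task_b": [[0.7, 0.8, 0.6], [0.9, 0.5, 0.4]],
}
print(average_accuracy(example), bootstrap_ci(example))
```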
  1. ^

    GPT-4 was released before this time, but the first snapshot with a context length sufficient to solve this benchmark was from June 13, 2023. At least among the snapshots still available on the API.



