AI Compiles 22-Year-Old Code
(https://simonwillison.net/atom/everything, September 30)

🔍 The CompileBench benchmark evaluates how well different AI models handle the challenge of compiling 22-year-old code, with a focus on cross-compilation tasks such as building curl for the ARM64 architecture.

🏆 Claude Opus 4.1 Thinking performed best and was the only model to solve 100% of the problems (allowing three attempts); Claude Sonnet 4 Thinking and GPT-5 high both scored 93%, while the top open-weight models, DeepSeek 3.1 and Kimi K2 0905, scored 80%.

💰 GPT-5-mini offers excellent value for money in the performance-versus-cost comparison, while the Gemini 2.5 family did surprisingly badly, solving just 60% of the problems.

⚠️ The benchmark harness and prompts were deliberately kept minimal, avoiding model-specific tweaks; the authors acknowledge that Google's models might perform better with prompts hand-tuned for them, but that would go against the benchmark's principles.

🛠️ The test harness is written in Go, exposes a single `run_terminal_cmd` tool for executing bash commands, and its system prompt explicitly instructs the model that it is operating in an Ubuntu 22.04 environment.

CompileBench: Can AI Compile 22-year-old Code? (via) Interesting new LLM benchmark from Piotr Grabowski and Piotr Migdał: how well can different models handle compilation challenges such as cross-compiling curl for ARM64 architecture?

This is one of my favorite applications of coding agent tools like Claude Code or Codex CLI: I no longer fear working through convoluted build processes for software I'm unfamiliar with because I'm confident an LLM will be able to brute-force figure out how to do it.

The benchmark on compilebench.com currently shows Claude Opus 4.1 Thinking in the lead, as the only model to solve 100% of problems (allowing three attempts). Claude Sonnet 4 Thinking and GPT-5 high both score 93%. The highest-scoring open weight models are DeepSeek 3.1 and Kimi K2 0905, both at 80%.

A chart on the site plotting performance against cost helps demonstrate the excellent value for money provided by GPT-5-mini.

The Gemini 2.5 family does surprisingly badly, solving just 60% of the problems. The benchmark authors note that:

When designing the benchmark we kept our benchmark harness and prompts minimal, avoiding model-specific tweaks. It is possible that Google models could perform better with a harness or prompt specifically hand-tuned for them, but this is against our principles in this benchmark.

The harness itself is available on GitHub. It's written in Go - I had a poke around and found their core agentic loop in bench/agent.go - it builds on top of the OpenAI Go library and defines a single tool called run_terminal_cmd, described as "Execute a terminal command inside a bash shell".
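To make that concrete, here is a minimal sketch of the shape such a single-tool agentic loop takes. All of the type names and the callModel stub below are illustrative stand-ins, not code from bench/agent.go; the real harness drives the model through the OpenAI Go library.

```go
package main

import (
	"fmt"
	"os/exec"
)

// toolCall represents the model asking to run one bash command.
type toolCall struct {
	Command string
}

// message is one turn of the conversation history.
type message struct {
	Role    string // "system", "user", "assistant" or "tool"
	Content string
}

// runTerminalCmd executes a command inside a bash shell and returns the
// combined stdout/stderr, mirroring the single tool the harness exposes.
func runTerminalCmd(command string) string {
	out, err := exec.Command("bash", "-c", command).CombinedOutput()
	if err != nil {
		return fmt.Sprintf("%s\n(error: %v)", out, err)
	}
	return string(out)
}

// callModel is a stub standing in for the real LLM request. An actual
// implementation would send the history to the model and parse any tool
// call out of the reply.
func callModel(history []message) (message, *toolCall) {
	return message{Role: "assistant", Content: "done"}, nil
}

func main() {
	history := []message{
		{Role: "system", Content: "You are a package-building specialist..."},
		{Role: "user", Content: "Cross-compile curl for ARM64."},
	}
	// The agentic loop: call the model, execute any command it requests,
	// feed the output back, and repeat until it stops asking for the tool.
	for turn := 0; turn < 50; turn++ { // cap turns so a stuck model can't loop forever
		reply, call := callModel(history)
		history = append(history, reply)
		if call == nil {
			break // no tool call: the model considers the task finished
		}
		history = append(history, message{
			Role:    "tool",
			Content: runTerminalCmd(call.Command),
		})
	}
}
```

What stands out is how little scaffolding there is: one tool, one loop, and the model does the rest.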

The system prompts live in bench/container/environment.go and differ based on the operating system of the container. Here's the system prompt for ubuntu-22.04-amd64:

You are a package-building specialist operating a Ubuntu 22.04 bash shell via one tool: run_terminal_cmd. The current working directory of every run_terminal_cmd is /home/peter.

Execution rules:

- Always pass non-interactive flags for any command that could prompt (e.g., -y, --yes, DEBIAN_FRONTEND=noninteractive).
- Don't include any newlines in the command.
- You can use sudo.

If you encounter any errors or issues while doing the user's request, you must fix them and continue the task. At the end verify you did the user request correctly.
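As a concrete illustration of those execution rules, a compliant tool invocation would be a single line with non-interactive flags, along these lines (the specific apt-get command is my own example, not one taken from a benchmark transcript):

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// One line, non-interactive flags, sudo permitted: the style the
	// system prompt demands. This exact command is illustrative only.
	cmd := "sudo DEBIAN_FRONTEND=noninteractive apt-get install -y build-essential"
	out, err := exec.Command("bash", "-c", cmd).CombinedOutput()
	fmt.Printf("%s", out)
	if err != nil {
		fmt.Println("command failed:", err)
	}
}
```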
