Nvidia Developer · the day before yesterday, 00:32
ComputeEval expands: benchmarking AI's ability to write CUDA code

To measure and improve AI's ability to write CUDA code, researchers have released a major update to the ComputeEval benchmark. The update adds more than 100 new CUDA challenges, bringing the total to 232 problems, and introduces more complex tasks that require AI models to master modern CUDA features such as Tensor Cores, advanced shared memory patterns, and warp-level primitives, as well as the use of CUDA Graphs, Streams, and Events in real-world applications such as dynamic simulations. Evaluation results show that leading LLMs vary in their performance on CUDA programming tasks under the more challenging new version. The team plans to keep expanding the benchmark's coverage and invites the community to participate.

🚀 **ComputeEval benchmark expansion:** To evaluate AI models and agents on CUDA programming tasks more comprehensively, the ComputeEval benchmark received a major update, adding more than 100 CUDA challenges for a total of 232 problems. The update aims to make the evaluation more rigorous and challenging.

💡 **Raising the bar on AI programming complexity:** The new challenges test LLMs' command of modern CUDA features, including Tensor Cores, advanced shared memory patterns, and warp-level primitives, as well as the ability to use CUDA Graphs, Streams, and Events in real-world scenarios such as dynamic simulations, pushing AI forward in high-performance computing.

📊 **LLM CUDA programming results:** The team evaluated several leading LLMs on ComputeEval 2025.2; scores generally dropped on the more challenging new benchmark. This reflects the increased difficulty of the benchmark itself rather than any regression in model capability, and offers a useful reference point for the current state of AI-assisted CUDA programming.

🌐 **Roadmap and community involvement:** The ComputeEval project will continue to expand its coverage, with plans to include more CUDA-X libraries such as cuBLAS, CUTLASS, cuDNN, and RAPIDS. The team invites members of the HPC and AI communities to contribute and collaborate in advancing AI for accelerated computing.

Can AI coding assistants write efficient CUDA code? To help measure and improve their capabilities, we created ComputeEval, a robust, open source benchmark for evaluating AI models and agents on CUDA programming tasks. 

A few months ago, we announced the first release of ComputeEval. Today, we're introducing its first major expansion, adding more than 100 new CUDA challenges.

With this release, the dataset has grown to a total of 232 CUDA and CUDA Core Compute Libraries (CCCL) problems. We deliberately raised the bar by adding more difficult challenges that require LLMs to use modern CUDA features, such as Tensor Cores, advanced shared memory patterns, and warp-level primitives. The new problems also test the ability to correctly orchestrate features like CUDA Graphs, Streams, and Events, all within the context of real-world applications like dynamic simulations.

LLM performance on CUDA programming

Our team evaluated several leading LLMs on ComputeEval to establish baseline performance metrics and understand the current state of AI-assisted CUDA programming (Table 1).

| Model | ComputeEval 2025.2 (232 problems) pass@1 | ComputeEval 2025.1 (128 problems) pass@1 |
| --- | --- | --- |
| GPT-5 (medium) | 0.5819 | 0.61 |
| Claude Sonnet 4.0 | 0.5517 | 0.64 |
| gpt-oss-20B (high) | 0.5474 | N/A |
| gpt-oss-120b (high) | 0.5302 | N/A |
| Claude Opus 4.0 | 0.5216 | N/A |
| DeepSeek-R1 | 0.4397 | 0.55 |
| gpt-oss-120b (medium) | 0.4224 | N/A |
| gpt-oss-20b (medium) | 0.4224 | N/A |
| gpt-oss-120b (low) | 0.4052 | N/A |
| DeepSeek-V3.1 | 0.3750 | 0.44 |
| Llama 4 Maverick 17B 128E | 0.3448 | 0.47 |
| Llama 3.1 405B | 0.3405 | 0.4 |
| gpt-oss-20B (low) | 0.3319 | 0.41 |

Table 1. Pass@1 accuracy of state-of-the-art LLMs on ComputeEval 2025.1 and 2025.2. The latest version expands the dataset to 232 CUDA programming challenges, providing a tougher benchmark for AI-assisted coding.
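For readers unfamiliar with the metric: pass@k is typically computed with the unbiased estimator popularized by code-generation benchmarks such as HumanEval. Whether ComputeEval uses this exact estimator is an assumption on our part, but the sketch below shows the standard calculation behind a `pass@1` column:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k samples, drawn without replacement from n generations of
    which c passed the functional tests, is correct.
    For k=1 this reduces to the fraction of passing samples, c/n."""
    if n - c < k:
        return 1.0  # fewer failures than draws, so a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 10 generations for one problem, 5 of which pass
print(pass_at_k(10, 5, 1))  # 0.5
```

A model's benchmark score is then the mean of this per-problem estimate over all problems in the dataset.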

We observed that scores for all models declined with the move to ComputeEval 2025.2. This doesn’t indicate that the models are becoming less capable—rather, it reflects that the benchmark itself has become more challenging. With each release, we’re raising the bar for AI, pushing it to demonstrate a deeper understanding of the nuances of accelerated computing.

What’s next and how to get involved

We’ll continue expanding both the dataset and the capabilities of the evaluation framework. Work is already underway to extend ComputeEval’s coverage to additional CUDA-X libraries, including cuBLAS, CUTLASS, cuDNN, RAPIDS, and more. We invite the broader HPC and AI communities to contribute and collaborate. Explore the code on GitHub and access the dataset on Hugging Face.

