TechCrunch News, January 25
People are benchmarking AI by having it make balls bounce in rotating shapes

The article discusses how various AI models perform on a specific programming task: writing a Python script that bounces a yellow ball inside a rotating shape. Results vary widely, with some models handling it well and others struggling. It also notes that the test is not a very practical AI benchmark, and that building useful systems for measuring AI models remains difficult.

🎯 How various AI models perform on the "ball in rotating shape" benchmark

💻 DeepSeek's R1 model performs well, while models such as OpenAI's o1 pro fall short

📋 Some models in the test misjudge the physics, while others handle it well

🤔 The test is not a practical AI benchmark, and building useful measurement systems for AI remains difficult

The list of informal, weird AI benchmarks keeps growing.

Over the past few days, some in the AI community on X have become obsessed with a test of how different AI models, particularly so-called reasoning models, handle prompts like this: “Write a Python script for a bouncing yellow ball within a shape. Make the shape slowly rotate, and make sure that the ball stays within the shape.”

Some models manage better on this “ball in rotating shape” benchmark than others. According to one user on X, Chinese AI lab DeepSeek’s freely available R1 swept the floor with OpenAI’s o1 pro mode, which costs $200 per month as a part of OpenAI’s ChatGPT Pro plan.

Per another X poster, Anthropic’s Claude 3.5 Sonnet and Google’s Gemini 1.5 Pro models misjudged the physics, resulting in the ball escaping the shape. Other users reported that Google’s Gemini 2.0 Flash Thinking Experimental, and even OpenAI’s older GPT-4o, aced the evaluation in one go.

But what does it prove that an AI can or can’t code a rotating, ball-containing shape?

Well, simulating a bouncing ball is a classic programming challenge. Accurate simulations incorporate collision detection algorithms, which try to identify when two objects (e.g. a ball and the side of a shape) collide. Poorly written algorithms can affect the simulation’s performance or lead to obvious physics mistakes.
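For illustration only, here is a minimal, headless sketch of what that collision step can look like for a convex polygon: check whether the ball's center has come within one radius of an edge, push it back inside, and reflect its velocity about the edge normal. The names (regular_polygon, step, GRAVITY) are hypothetical and not taken from any model's actual output, and rendering is left out entirely.

```python
import math

GRAVITY = -9.8  # units per second squared, pulling the ball downward

def regular_polygon(n, radius, angle=0.0):
    """Vertices of a regular n-gon centered on the origin, rotated by `angle` radians."""
    return [(radius * math.cos(angle + 2 * math.pi * i / n),
             radius * math.sin(angle + 2 * math.pi * i / n))
            for i in range(n)]

def step(pos, vel, vertices, ball_radius, dt):
    """Advance the ball one time step and bounce it off any edge it has crossed."""
    vel = (vel[0], vel[1] + GRAVITY * dt)
    x, y = pos[0] + vel[0] * dt, pos[1] + vel[1] * dt
    for i in range(len(vertices)):
        (x1, y1), (x2, y2) = vertices[i], vertices[(i + 1) % len(vertices)]
        # Unit normal of this edge, flipped if needed so it points toward the center.
        ex, ey = x2 - x1, y2 - y1
        length = math.hypot(ex, ey)
        nx, ny = -ey / length, ex / length
        if nx * (0 - x1) + ny * (0 - y1) < 0:
            nx, ny = -nx, -ny
        # Signed distance from the ball center to the edge line (positive = inside).
        dist = nx * (x - x1) + ny * (y - y1)
        if dist < ball_radius:
            # Push the ball back inside, and reflect only if it is moving outward.
            x += (ball_radius - dist) * nx
            y += (ball_radius - dist) * ny
            dot = vel[0] * nx + vel[1] * ny
            if dot < 0:
                vel = (vel[0] - 2 * dot * nx, vel[1] - 2 * dot * ny)
    return (x, y), vel

# Example: a few frames inside a slowly rotating heptagon. A real answer to the
# prompt would also draw each frame, e.g. with pygame or matplotlib.
pos, vel = (0.0, 0.0), (3.0, 1.5)
for frame in range(1000):
    verts = regular_polygon(7, 5.0, angle=0.002 * frame)  # slow rotation
    pos, vel = step(pos, vel, verts, ball_radius=0.3, dt=0.01)
```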

X user n8programs, a researcher in residence at AI startup Nous Research, says it took him roughly two hours to program a bouncing ball in a rotating heptagon from scratch. “One has to track multiple coordinate systems, how the collisions are done in each system, and design the code from the beginning to be robust,” n8programs explained in a post.
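One common way to handle those multiple coordinate systems, sketched here under the same assumptions as the snippet above, is to rotate the ball into the polygon's rest frame each frame, run the static-polygon collision step there, and rotate the result back to world coordinates. This deliberately ignores the momentum a moving wall would impart to the ball, which is one of the subtleties that makes a truly robust solution harder; the rotate and step_rotating helpers are hypothetical.

```python
import math

def rotate(vec, angle):
    """Rotate a 2D vector by `angle` radians about the origin."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * vec[0] - s * vec[1], s * vec[0] + c * vec[1])

def step_rotating(pos, vel, angle, omega, n_sides, radius, ball_radius, dt):
    """One frame of a ball inside a polygon spinning at `omega` radians per second."""
    # Work in the polygon's rest frame, where the walls are static.
    local_pos, local_vel = rotate(pos, -angle), rotate(vel, -angle)
    # Reuse the static-polygon collision step from the sketch above.
    local_pos, local_vel = step(local_pos, local_vel,
                                regular_polygon(n_sides, radius),
                                ball_radius, dt)
    # Back to world coordinates, then advance the shape's rotation for the next frame.
    return rotate(local_pos, angle), rotate(local_vel, angle), angle + omega * dt
```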

But while bouncing balls and rotating shapes are a reasonable test of programming skills, they're not a very empirical AI benchmark. Even slight variations in the prompt can, and do, yield different outcomes. That's why some users on X report having more luck with o1, while others say that R1 falls short.

If anything, viral tests like these point to the intractable problem of creating useful systems of measurement for AI models. It's often difficult to tell what differentiates one model from another, outside of esoteric benchmarks that aren't relevant to most people.

Many efforts are underway to build better tests, like the ARC-AGI benchmark and Humanity’s Last Exam. We’ll see how those fare — and in the meantime watch GIFs of balls bouncing in rotating shapes.
