Latent Space · August 8
GPT-5's Vision Checkup: a frontier VLM, but not a new SOTA

This article takes a deep look at GPT-5's vision and visual reasoning capabilities. Compared with earlier models, GPT-5 makes notable progress in visual understanding, particularly in reasoning. On harder visual tasks such as object localization and counting, however, its performance still falls short, likely because its pretraining data lacked object detection tasks. Using benchmarks such as Vision Checkup and RF100-VL, the article highlights the challenges today's large language models face when processing real-world visual information, and argues that the path forward lies in combining deeper thinking with more precise perception.

🎯 GPT-5 shows clear progress in vision, driven above all by its strong reasoning, which puts it near the top of general vision leaderboards such as Vision Checkup and demonstrates how important multi-step thinking is for extracting more information from pixels.

💡 While GPT-5 is strong at understanding image content, such as recognizing which objects are present, it falls noticeably short at precise object localization and size matching. The author suspects this stems from the lack of dedicated object detection data in GPT-5's pretraining.

⚖️ Current large language models face a speed-versus-capability trade-off on real-world visual tasks. For applications that need real-time responses, such as autonomous robots, the time a model needs to understand an image (often more than 10 seconds) and the variability of its answers limit its usefulness.

🚀 Although GPT-5 trails models such as Gemini 2.5 Pro on object recognition tasks, its gains in visual reasoning point the way forward for multimodal LLMs: not just seeing more clearly, but thinking more deeply about what they see in order to tackle harder real-world challenges.

Editor’s note: Welcome Matvei of Roboflow to the Latent Space guest author crew! Notice how GPT-5-Mini’s vision scores the same as GPT-5’s vision. This is an expected consequence of a very good model router at work.


Something that went under the radar in OpenAI’s release today is GPT-5’s vision and visual reasoning. Adding visual understanding to LLMs has proven difficult, with most models unable to correctly count 4 coins in a photo or locate where specific items are in an image.

LLMs being able to understand the world around them in real time is the breakthrough people are looking for to enable the autonomous robotics or computer-use revolution and open up the era of personal superintelligence.

Current state of LLMs and visual understanding

The combination of reasoning and visual capabilities is already present in several model families: OpenAI’s GPT and o-series models, Google’s Gemini models, Anthropic’s Claude models, and Meta’s Llama models. Models tend to have specific strengths and weaknesses depending on the task type:

The variance across task types makes general comparison difficult, which is why leaderboards like Vision Checkup have recently launched. Vision Checkup is our open source vision model evaluation leaderboard, and it gives us insight into frontier performance on hard tasks. As you can see, OpenAI dominates in vision capabilities, and the launch of GPT-5 adds another model to the top 5.

note that mini performs the same as main! good job, router

The leaderboard makes it clear – top models are all reasoning models.

Reasoning is driving OpenAI dominance in general vision tasks

The top models on the Vision Checkup leaderboard are configurations of models with reasoning capabilities.

We suspect that the strong results of these models can be attributed largely to their reasoning capabilities, built up in pretraining and exercised at test time. This marks the continuation of an important development in multimodal large language models: the ability to reason over both text and vision modalities.

With that said, scores vary significantly between updates and you can attribute this to multiple reasons. The biggest one is the nondeterminism of reasoning mode for OpenAI models; prompting a reasoning model with the same question twice can lead to both correct and incorrect answers.

When it comes to real-world use, reasoning on images currently takes too much time to be useful, and the variability of answers makes them difficult to rely on. For most developers, 10+ seconds to understand an image won’t unlock real-time use cases. There is a trade-off to be made between speed and capability: depending on the task, a faster model with a narrower range of knowledge could be the best choice.
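To make the trade-off concrete, here’s a minimal sketch of timing repeated calls on the same image question, which is enough to surface both the 10+ second latencies and the answer variability. It assumes the official `openai` Python SDK and that the model is exposed under the identifier "gpt-5"; the image URL and prompt are placeholders, not part of any benchmark.

```python
# Minimal sketch: time repeated calls on the same image question to observe
# latency and answer variability. Assumes the official `openai` Python SDK
# and a "gpt-5" model identifier; the image URL and prompt are placeholders.
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

IMAGE_URL = "https://example.com/coins.jpg"  # hypothetical test image
PROMPT = "How many coins are in this photo? Answer with a single number."

answers, latencies = [], []
for _ in range(5):
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url", "image_url": {"url": IMAGE_URL}},
            ],
        }],
    )
    latencies.append(time.perf_counter() - start)
    answers.append(response.choices[0].message.content.strip())

print("answers:  ", answers)                            # identical prompt, answers can differ
print("latencies:", [f"{t:.1f}s" for t in latencies])   # reasoning runs often exceed 10 seconds
```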

Beyond Vision Vibe Checks

We are still far from a world where autonomous robotics can interact with the world around them in real time. Simple tasks like counting, spatial understanding, and object localization are key for any robot to perform general tasks outside of a controlled environment.

To get beyond leaderboards built on lightweight vibe checks, we need to test LLMs across a wide range of domains and track their progress. We released a new benchmark, RF100-VL, at this year’s CVPR conference to provide a more difficult set of visual understanding and grounding benchmarks. The benchmark asks: “How well does your LLM understand the real world?” RF100-VL consists of 100 open source datasets from the Roboflow Universe community, containing object detection bounding boxes and multimodal few-shot instructions with visual examples and rich textual descriptions across novel image domains.
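As a rough illustration of what a grounding benchmark like this asks of a model, here’s a minimal sketch of prompting a vision LLM to return bounding boxes for a dataset’s class list. This is not the actual RF100-VL evaluation harness; the "gpt-5" model identifier, class names, image URL, and JSON prompt format are all placeholder assumptions.

```python
# Minimal sketch: ask a vision LLM for bounding boxes over a dataset's class
# list. Not the actual RF100-VL harness; model id, classes, URL, and prompt
# format are placeholders. Assumes the official `openai` Python SDK.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CLASSES = ["ball", "blocker", "defender"]  # hypothetical class list for one dataset
PROMPT = (
    "Detect every instance of these classes: " + ", ".join(CLASSES) + ". "
    'Respond with only a JSON list of {"label": str, "box": [x_min, y_min, x_max, y_max]} '
    "in pixel coordinates."
)

response = client.chat.completions.create(
    model="gpt-5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url", "image_url": {"url": "https://example.com/volleyball.jpg"}},
        ],
    }],
)

# Assumes the model returns bare JSON; a real harness would also strip code
# fences and validate coordinates before scoring.
detections = json.loads(response.choices[0].message.content)
for det in detections:
    print(det["label"], det["box"])
```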

Top LLMs generally score below 10 mAP50:95 for identifying novel objects in real-world contexts. The current SOTA among LLMs is Gemini 2.5 Pro, which achieves a zero-shot mAP50:95 of 13.3:

We suspect that the key distinction between OpenAI models and models like Gemini or Qwen, when it comes to object detection tasks, is that OpenAI models don’t include any object detection data in their pretraining.

After running GPT-5 on RF100-VL, we got an mAP50:95 of 1.5, significantly lower than Gemini 2.5 Pro’s current SOTA of 13.3. We largely attribute this shortcoming to the lack of object detection pre-training for GPT-5.
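For reference, the mAP50:95 numbers above come from standard COCO-style evaluation. Here’s a minimal sketch of how such a score is computed with pycocotools, assuming ground truth in COCO JSON format and predictions in the standard COCO results format; both file names are hypothetical.

```python
# Minimal sketch: score detections with COCO-style mAP50:95, the metric used
# for RF100-VL. Assumes ground truth in COCO JSON format and predictions in
# the standard COCO results format; both file names are hypothetical.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO("annotations.json")                  # ground-truth boxes for one dataset
preds = gt.loadRes("gpt5_predictions.json")    # [{"image_id", "category_id", "bbox", "score"}, ...]

evaluator = COCOeval(gt, preds, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

# stats[0] is AP averaged over IoU thresholds 0.50:0.95, i.e. mAP50:95
print("mAP50:95 =", evaluator.stats[0])
```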

Let’s inspect some of its results below to develop more intuition into why the score may be so low.

Example: Localization

This is an example from a volleyball dataset, and it demonstrates the key issue behind the low scores. The model has a good understanding of the objects in the image: you can see it correctly recognizing that there is a ball, two blockers, and some defenders present.

However, it is unable to localize them; none of the boxes match the objects’ locations and sizes. The model seems good at understanding the image but not at grounding the specific objects within it, which again we attribute to the lack of object detection tasks in pre-training.
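To see why this hurts the score so much, recall that mAP50:95 only credits a prediction whose IoU with a ground-truth box clears thresholds from 0.50 up to 0.95. A quick sketch with made-up coordinates shows how a box with the right label but the wrong position and size contributes essentially nothing:

```python
# Illustrative sketch: why mislocalized boxes score near zero. mAP50:95 only
# credits a prediction whose IoU with a ground-truth box reaches the threshold
# (0.50 up to 0.95), so a box with the right label but the wrong position and
# size contributes nothing. The box coordinates below are made up.

def iou(a, b):
    """Intersection-over-union of two [x_min, y_min, x_max, y_max] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

ground_truth = [420, 170, 470, 220]   # the actual ball
prediction   = [300, 200, 500, 400]   # "ball" predicted, but far too large and offset

print(f"IoU = {iou(ground_truth, prediction):.2f}")  # ≈ 0.02, below every mAP50:95 threshold
```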

We can see a similar situation with a sheep dataset below:

Example: UI Elements

With recent developments in LLMs for tool use and vision-powered agentic workflows, let’s look at GPT-5’s performance there. We don’t see any uplift in quality on the UI elements dataset either:

Next up, let’s see how GPT-5 compares to previous OpenAI models:

GPT-5 has edged slightly ahead of o3, and in both cases providing additional information, such as detailed instructions, helps the model.

Interestingly, raising the reasoning effort to high doesn’t improve the RF100-VL scores. So the benefit of reasoning is less obvious in the RF100-VL case, and we suspect this is due to a lack of object detection capabilities arising from the absence of object detection tasks in pretraining.
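For completeness, here’s a minimal sketch of what re-running the same detection prompt at different effort levels looks like. It assumes the `openai` SDK’s reasoning-effort setting (`reasoning_effort`) applies to a "gpt-5" model identifier; treat the parameter name, model string, prompt, and image as placeholder assumptions.

```python
# Minimal sketch: re-run the same detection prompt at different reasoning
# effort levels. The `reasoning_effort` parameter and the "gpt-5" model
# identifier are assumptions; the prompt and image URL are placeholders.
from openai import OpenAI

client = OpenAI()

def detect_sheep(effort: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5",
        reasoning_effort=effort,  # e.g. "low", "medium", "high"
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Return bounding boxes for every sheep as JSON."},
                {"type": "image_url", "image_url": {"url": "https://example.com/sheep.jpg"}},
            ],
        }],
    )
    return response.choices[0].message.content

for effort in ("low", "high"):
    # The article's finding: higher effort did not raise RF100-VL scores.
    print(effort, "->", detect_sheep(effort)[:200])
```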

GPT-5 is a slight improvement for simple visual tasks

GPT-5’s enhanced reasoning brings it high up the Vision Checkup leaderboard, proving that multi-step thinking lets a model squeeze more juice out of the pixels. That’s a great outcome for the everyday use cases people are likely relying on ChatGPT to help them solve.

RF100-VL helps highlight that comprehension ≠ localization. Without object-detection priors, the detections still miss the mark. Even so, GPT-5’s jump in vision-reasoning spotlights a clear path forward: models that don’t just see better but think more deeply about what they’re seeing.
