GUI-Spotlight: A Multimodal Model for More Accurate Visual Grounding

cs.AI updates on arXiv.org, October 7

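This article introduces GUI-Spotlight, a multimodal model trained for image-grounded reasoning. It dynamically invokes multiple tools to pinpoint the relevant region of the screen, which substantially improves visual grounding accuracy, and it reaches 52.8% accuracy on the ScreenSpot-Pro benchmark.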

arXiv:2510.04039v1. Announce type: cross.

Abstract: Multimodal large language models (MLLMs) have markedly expanded the competence of graphical user-interface (GUI) systems, propelling them beyond controlled simulations into complex, real-world environments across diverse platforms. However, practical usefulness is still bounded by the reliability of visual grounding, i.e., mapping textual references to exact on-screen elements. This limitation prevents the system from accurately performing pointer-level actions such as clicking or dragging. To address it, we introduce GUI-Spotlight, a model trained for image-grounded reasoning that dynamically invokes multiple specialized tools to iteratively narrow its focus to the relevant region of the screen, thereby substantially improving visual grounding accuracy. On the ScreenSpot-Pro benchmark, GUI-Spotlight trained with only 18.5K training samples achieves 52.8% accuracy, surpassing V2P-7B (50.6% with 9.6M training samples) and GTA-1-7B (50.1% with 1.56M training samples).
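The abstract does not say which tools GUI-Spotlight calls or how it chooses between them, so the following is only an illustrative sketch of the general idea of iteratively narrowing focus on a screen. Everything in it is an assumption made for illustration: the `Region` type, the quadrant-cropping step, the `narrow_focus` loop, and the `oracle_chooser` stand-in (which already knows the target and takes the place of a model's decision) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Region:
    """Axis-aligned box expressed in full-screen pixel coordinates."""
    x: float
    y: float
    w: float
    h: float

    def center(self) -> Tuple[float, float]:
        return (self.x + self.w / 2, self.y + self.h / 2)

    def contains(self, px: float, py: float) -> bool:
        return self.x <= px < self.x + self.w and self.y <= py < self.y + self.h


def quadrants(r: Region) -> List[Region]:
    """Hypothetical 'crop' tool: split a region into four equal sub-regions."""
    hw, hh = r.w / 2, r.h / 2
    return [
        Region(r.x, r.y, hw, hh),            # top-left
        Region(r.x + hw, r.y, hw, hh),       # top-right
        Region(r.x, r.y + hh, hw, hh),       # bottom-left
        Region(r.x + hw, r.y + hh, hw, hh),  # bottom-right
    ]


def narrow_focus(
    screen: Region,
    choose: Callable[[List[Region]], int],
    max_steps: int = 4,
    min_size: float = 64.0,
) -> Region:
    """Repeatedly crop to the sub-region picked by `choose`.

    Regions stay in full-screen coordinates, so the final center can be
    used directly as a click point without any coordinate remapping.
    """
    region = screen
    for _ in range(max_steps):
        if min(region.w, region.h) <= min_size:
            break  # already small enough to act on
        options = quadrants(region)
        region = options[choose(options)]
    return region


def oracle_chooser(target: Tuple[float, float]) -> Callable[[List[Region]], int]:
    """Stand-in for a model decision: picks the sub-region containing `target`."""
    tx, ty = target

    def choose(options: List[Region]) -> int:
        for i, r in enumerate(options):
            if r.contains(tx, ty):
                return i
        return 0  # fallback if the target lies outside every option

    return choose


if __name__ == "__main__":
    screen = Region(0, 0, 1920, 1080)
    final = narrow_focus(screen, oracle_chooser((1500, 200)))
    print(final, final.center())
```

Keeping every region in full-screen coordinates is a deliberate simplification: a real pipeline that feeds cropped and rescaled images to a model would also need to map crop-local predictions back to screen pixels before clicking or dragging.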

Related tags

Multimodal models, visual grounding, GUI systems, image reasoning, ScreenSpot-Pro
