cs.AI updates on arXiv.org, September 3
FlashAdventure: Challenging LLMs' Ability to Complete Full Story Arcs in Adventure Games

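This article introduces FlashAdventure, a benchmark of 34 Flash-based adventure games designed to evaluate LLMs' ability to complete full story arcs in adventure games. It also proposes CUA-as-a-Judge, an automated gameplay evaluator, and COAST, a framework that leverages long-term clue memory to improve GUI agent performance. Experiments show that current GUI agents struggle to complete full story arcs, while COAST raises milestone completion rates by bridging the observation-behavior gap.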

arXiv:2509.01052v1

Abstract: GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap: the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.
