FlashAdventure：挑战LLM在冒险游戏中的全剧情完成能力

cs.AI updates on arXiv.org 09月03日

FlashAdventure：挑战LLM在冒险游戏中的全剧情完成能力

本文介绍了FlashAdventure，一个由34款Flash冒险游戏组成的基准测试，旨在评估LLM在冒险游戏中的全剧情完成能力。同时，提出了自动化游戏评估器CUA-as-a-Judge和利用长期线索记忆的COAST框架，以改善GUI代理的表现。实验表明，当前的GUI代理在完成全剧情方面存在困难，而COAST通过弥合观察与行为之间的差距，提高了里程碑的完成率。

arXiv:2509.01052v1 Announce Type: new Abstract: GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap: the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.

Fish AI Reader

AI辅助创作，多种专业模板，深度分析，高质量内容生成。从观点提取到深度思考，FishAI为您提供全方位的创作支持。新版本引入自定义参数，让您的创作更加个性化和精准。

FishAI

鱼阅，AI 时代的下一个智能信息助手，助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

LLM 冒险游戏全剧情完成自动化评估 COAST框架

相关文章

Import AI 368: 500% faster local LLMs; 38X more efficient red teaming; AI21’s Frankenmodel

Learn AI Together — Towards AI Community Newsletter #23

This AI newsletter is all you need #98

Patterns and Middleware for LLM Applications with Kyle Roche - #659

Building LLM-Based Applications with Azure OpenAI with Jay Emery - #657

Mental Models for Advanced ChatGPT Prompting with Riley Goodside - #652

FastGen: Cutting GPU Memory Costs Without Compromising on LLM Quality

Intel Releases a Low-bit Quantized Open LLM Leaderboard for Evaluating Language Model Performance through 10 Key Benchmarks

Vidur: A Large-Scale Simulation Framework Revolutionizing LLM Deployment Through Cost Cuts and Increased Efficiency

Llama 3 + Llama.cpp is the local AI Heaven