MarkTechPost@AI  October 14, 15:53
DRBench: A Benchmark and Environment for Evaluating Enterprise Deep Research AI

ServiceNow Research has released DRBench, a benchmark and runnable environment for evaluating "deep research" AI agents. The environment focuses on open-ended enterprise tasks that require an AI to synthesize facts from the public web and private organizational data and to produce reports with proper citations. Unlike web-only testbeds, DRBench simulates heterogeneous enterprise workflows, including files, emails, chat logs, and cloud storage, forcing agents to retrieve, filter, and attribute information across multiple applications before writing a coherent research report. DRBench contains 15 deep research tasks across 10 enterprise domains and evaluates key metrics such as insight recall, distractor avoidance, factual accuracy, and report quality.

💡 DRBench is a benchmark and runnable environment designed to evaluate enterprise "deep research" AI agents. It simulates real-world tasks that require integrating public web and private organizational data, and requires agents to produce reports with accurate citations.

📦 The environment ships as a containerized enterprise setup that integrates commonly used services such as Nextcloud, Mattermost, and Roundcube, and deliberately injects large amounts of distractor information to test retrieval and filtering in realistic "needle-in-a-haystack" scenarios.

📊 DRBench evaluates agents along four key axes: Insight Recall, Distractor Avoidance, Factuality, and Report Quality, aiming to provide an end-to-end evaluation framework rather than single-tool micro-scores.

🚀 With 15 deep research tasks across 10 enterprise domains, each specifying the task context, objectives, and ground-truth insights, DRBench provides a standardized platform for building and evaluating AI agents capable of complex enterprise research.

ServiceNow Research has released DRBench, a benchmark and runnable environment to evaluate “deep research” agents on open-ended enterprise tasks that require synthesizing facts from both public web and private organizational data into properly cited reports. Unlike web-only testbeds, DRBench stages heterogeneous, enterprise-style workflows—files, emails, chat logs, and cloud storage—so agents must retrieve, filter, and attribute insights across multiple applications before writing a coherent research report.

https://arxiv.org/abs/2510.00172

What does DRBench contain?

The initial release provides 15 deep research tasks across 10 enterprise domains (e.g., Sales, Cybersecurity, Compliance). Each task specifies a deep research question, a task context (company and persona), and a set of groundtruth insights spanning three classes: public insights (from dated, time-stable URLs), internal relevant insights, and internal distractor insights. The benchmark explicitly embeds these insights within realistic enterprise files and applications, forcing agents to surface the relevant ones while avoiding distractors. The dataset construction pipeline combines LLM generation with human verification and totals 114 groundtruth insights across tasks.
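To make the task structure concrete, here is a minimal, hypothetical sketch of how one task record could be represented; the class names, fields, and labels are illustrative assumptions, not DRBench's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GroundTruthInsight:
    text: str    # the insight an ideal report should surface
    source: str  # a dated public URL or an internal artifact (file, email, chat)
    kind: str    # "public" | "internal_relevant" | "internal_distractor"

@dataclass
class DRBenchTask:
    question: str   # the open-ended deep research question
    company: str    # enterprise context the agent operates in
    persona: str    # role the agent acts as, e.g., a compliance analyst
    domain: str     # e.g., "Sales", "Cybersecurity", "Compliance"
    insights: List[GroundTruthInsight] = field(default_factory=list)

    def relevant(self) -> List[GroundTruthInsight]:
        # insights the report should recall (public + internal relevant)
        return [i for i in self.insights if i.kind != "internal_distractor"]

    def distractors(self) -> List[GroundTruthInsight]:
        # injected insights the report should avoid citing
        return [i for i in self.insights if i.kind == "internal_distractor"]
```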


Enterprise environment

A core contribution is the containerized enterprise environment that integrates commonly used services behind authentication and app-specific APIs. DRBench’s Docker image orchestrates: Nextcloud (shared documents, WebDAV), Mattermost (team chat, REST API), Roundcube with SMTP/IMAP (enterprise email), FileBrowser (local filesystem), and a VNC/NoVNC desktop for GUI interaction. Tasks are initialized by distributing data across these services (documents to Nextcloud and FileBrowser, chats to Mattermost channels, threaded emails to the mail system, and provisioned users with consistent credentials). Agents can operate through web interfaces or programmatic APIs exposed by each service. This setup is intentionally “needle-in-a-haystack”: relevant and distractor insights are injected into realistic files (PDF/DOCX/PPTX/XLSX, chats, emails) and padded with plausible but irrelevant content.
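As an illustration of how artifacts might be seeded into (or fetched from) these services programmatically, the sketch below uses the standard Nextcloud WebDAV endpoint and the Mattermost REST API; the hostnames, ports, credentials, and channel ID are placeholders, not DRBench's actual configuration.

```python
import requests

# Placeholder endpoints and accounts for a local, DRBench-style deployment;
# the real hostnames, ports, and provisioned users come from the benchmark's
# Docker setup, not from this sketch.
NEXTCLOUD = "http://localhost:8080"
MATTERMOST = "http://localhost:8065"
USER, PASSWORD = "analyst", "analyst-password"
MM_TOKEN = "personal-access-token"

def upload_document(local_path: str, remote_name: str) -> None:
    """Place a document into Nextcloud via its standard WebDAV endpoint."""
    url = f"{NEXTCLOUD}/remote.php/dav/files/{USER}/{remote_name}"
    with open(local_path, "rb") as fh:
        resp = requests.put(url, data=fh, auth=(USER, PASSWORD))
    resp.raise_for_status()

def post_chat_message(channel_id: str, message: str) -> None:
    """Post a message into a Mattermost channel via its REST API."""
    resp = requests.post(
        f"{MATTERMOST}/api/v4/posts",
        headers={"Authorization": f"Bearer {MM_TOKEN}"},
        json={"channel_id": channel_id, "message": message},
    )
    resp.raise_for_status()

if __name__ == "__main__":
    upload_document("q3_findings.docx", "Research/q3_findings.docx")
    post_chat_message("team-channel-id", "Q3 findings uploaded to Nextcloud.")
```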

Evaluation: what gets scored

DRBench evaluates four axes aligned to analyst workflows: Insight Recall, Distractor Avoidance, Factuality, and Report Quality. Insight Recall decomposes the agent’s report into atomic insights with citations, matches them against groundtruth injected insights using an LLM judge, and scores recall (not precision). Distractor Avoidance penalizes inclusion of injected distractor insights. Factuality and Report Quality assess the correctness and structure/clarity of the final report under a rubric specified in the paper.
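A rough sketch of what the recall and distractor-avoidance scores could look like is given below; in DRBench the matching step is performed by an LLM judge, which is abstracted here into a placeholder predicate, and the exact aggregation may differ from the paper.

```python
from typing import Callable, List

def insight_recall(report_insights: List[str],
                   groundtruth: List[str],
                   matches: Callable[[str, str], bool]) -> float:
    """Fraction of ground-truth insights covered by at least one report insight.

    `matches` stands in for the LLM judge described in the paper.
    """
    if not groundtruth:
        return 1.0
    hit = sum(any(matches(r, g) for r in report_insights) for g in groundtruth)
    return hit / len(groundtruth)

def distractor_avoidance(report_insights: List[str],
                         distractors: List[str],
                         matches: Callable[[str, str], bool]) -> float:
    """1.0 when no injected distractor surfaces in the report, lower otherwise."""
    if not distractors:
        return 1.0
    leaked = sum(any(matches(r, d) for r in report_insights) for d in distractors)
    return 1.0 - leaked / len(distractors)
```

For instance, calling insight_recall(report, gt, lambda r, g: g.lower() in r.lower()) would approximate the judge with a crude substring check.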


Baseline agent and research loop

The research team introduces a task-oriented baseline, DRBench Agent (DRBA), designed to operate natively inside the DRBench environment. DRBA is organized into four components: research planning, action planning, a research loop with Adaptive Action Planning (AAP), and report writing. Planning supports two modes: Complex Research Planning (CRP), which specifies investigation areas, expected sources, and success criteria; and Simple Research Planning (SRP), which produces lightweight sub-queries. The research loop iteratively selects tools, processes content (including storage in a vector store), identifies gaps, and continues until completion or a max-iteration budget; the report writer synthesizes findings with citation tracking.
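The skeleton below sketches one way such a plan, retrieve, and assess-gaps loop could be wired together under a max-iteration budget; every helper function is an illustrative stub, not DRBA's actual component or interface.

```python
from typing import Dict, List

def select_tool(item: str) -> str:
    """Placeholder router; DRBA would choose among web search and enterprise apps."""
    return "web_search"

def execute(tool: str, item: str) -> List[Dict]:
    """Placeholder retrieval step returning evidence snippets with source metadata."""
    return [{"text": f"stub finding for: {item}", "source": tool}]

def find_gaps(question: str, findings: List[Dict]) -> List[str]:
    """Placeholder gap analysis; Adaptive Action Planning would re-plan here."""
    return []

def write_report(question: str, findings: List[Dict]) -> str:
    """Placeholder report writer; the real one tracks citations while synthesizing."""
    cited = "\n".join(f"- {f['text']} [{f['source']}]" for f in findings)
    return f"{question}\n\n{cited}"

def run_deep_research(question: str, plan: List[str], max_iters: int = 10) -> str:
    findings: List[Dict] = []
    open_items = list(plan)  # sub-queries / investigation areas still open
    for _ in range(max_iters):
        if not open_items:
            break
        item = open_items.pop(0)
        evidence = execute(select_tool(item), item)
        findings.extend(evidence)  # a real agent would also index into a vector store
        open_items.extend(find_gaps(question, findings))
    return write_report(question, findings)
```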

Why does this matter for enterprise agents?

Most “deep research” agents look compelling on public-web question sets, but production usage hinges on reliably finding the right internal needles, ignoring plausible internal distractors, and citing both public and private sources under enterprise constraints (login, permissions, UI friction). DRBench’s design directly targets this gap by: (1) grounding tasks in realistic company/persona contexts; (2) distributing evidence across multiple enterprise apps plus the web; and (3) scoring whether the agent actually extracted the intended insights and wrote a coherent, factual report. This combination makes it a practical benchmark for system builders who need end-to-end evaluation rather than single-tool micro-scores.



Editorial comments

From an enterprise evaluation standpoint, DRBench is a useful step toward standardized, end-to-end testing of “deep research” agents: the tasks are open-ended, grounded in realistic personas, and require integrating evidence from the public web and a private company knowledge base, then producing a coherent, well-structured report—precisely the workflow most production teams care about. The release also clarifies what’s being measured—recall of relevant insights, factual accuracy, and report quality—while explicitly moving beyond web-only setups that overfit to browsing heuristics. The 15 tasks across 10 domains are modest in scale but sufficient to expose system bottlenecks (retrieval across heterogeneous artifacts, citation discipline, and planning loops).


Check out the Paper and GitHub page.

The post ServiceNow AI Research Releases DRBench, a Realistic Enterprise Deep-Research Benchmark appeared first on MarkTechPost.
