Improving and Evaluating Open Deep Research Agents

cs.AI updates on arXiv.org, August 15

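This paper studies Deep Research Agents (DRAs), comparing the open-source Open Deep Research (ODR) system against closed-source systems from Anthropic and Google. With three targeted improvements, the resulting ODR+ model reaches a 10% success rate on the BrowseComp-Small benchmark, the best result among the open-source and closed-source systems tested.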

arXiv:2508.10152v1. Abstract: We focus here on Deep Research Agents (DRAs), which are systems that take a natural language prompt from a user and then autonomously search for, and use, internet-based content to address the prompt. Recent DRAs have demonstrated impressive capabilities on public benchmarks; however, recent research largely involves proprietary closed-source systems. At the time of this work, we found only one open-source DRA, termed Open Deep Research (ODR). In this work we adapt the challenging recent BrowseComp benchmark to compare ODR to existing proprietary systems. We propose BrowseComp-Small (BC-Small), comprising a subset of BrowseComp, as a more computationally tractable DRA benchmark for academic labs. We benchmark ODR and two proprietary systems on BC-Small: one system from Anthropic and one system from Google. We find that all three systems achieve 0% accuracy on the test set of 60 questions. We introduce three strategic improvements to ODR, resulting in the ODR+ model, which achieves a state-of-the-art 10% success rate on BC-Small among both closed-source and open-source systems. We report ablation studies indicating that all three of our improvements contributed to the success of ODR+.
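To make the reported percentages concrete, here is a minimal sketch of how a success rate over a fixed question set could be computed. It is not the authors' evaluation code: the `GradedAnswer` record, the `success_rate` helper, and the synthetic grading outcomes are all hypothetical, and the sketch assumes the 10% figure refers to the same 60-question test set, which would correspond to 6 correct answers.

```python
from dataclasses import dataclass


@dataclass
class GradedAnswer:
    """Hypothetical record: one benchmark question and whether it was judged correct."""
    question_id: int
    correct: bool


def success_rate(results: list[GradedAnswer]) -> float:
    """Fraction of questions answered correctly (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(r.correct for r in results) / len(results)


# Synthetic outcomes for a 60-question test set (illustrative only, not real data).
baseline = [GradedAnswer(i, False) for i in range(60)]   # no question solved
improved = [GradedAnswer(i, i < 6) for i in range(60)]   # 6 of 60 solved

print(f"baseline: {success_rate(baseline):.0%}")
print(f"improved: {success_rate(improved):.0%}")
```

With so few questions, each additional correct answer moves the score by roughly 1.7 percentage points, so small differences between systems on a set of this size should be read with caution.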
