MarkTechPost@AI · 19 hours ago
PokeeResearch-7B: An Open-Source 7B-Scale Deep-Research AI Agent

Pokee AI has released PokeeResearch-7B, a 7-billion-parameter deep-research AI agent. The model executes a full research-and-verification loop: it decomposes a user query, invokes search and reading tools, verifies candidate answers, and synthesizes multiple research threads into a final response. Training uses RLAIF (Reinforcement Learning from AI Feedback) combined with the RLOO (REINFORCE Leave-One-Out) algorithm, rewarding semantic accuracy, citation faithfulness, and instruction adherence rather than simple lexical overlap. The model performs strongly across multiple benchmarks and is open-sourced under the Apache-2.0 license, with both code and weights publicly available.

💡 **PokeeResearch-7B model overview**: Pokee AI has open-sourced a 7-billion-parameter AI agent named PokeeResearch-7B whose core capability is executing complete deep-research tasks. It can decompose complex user queries, gather information by calling external tools (such as web search and page reading), verify the collected information, and finally synthesize multiple independent research threads into a comprehensive, accurate final answer.

⚙️ **Innovative training and optimization**: The model is fine-tuned with Reinforcement Learning from AI Feedback (RLAIF) combined with the REINFORCE Leave-One-Out (RLOO) algorithm. This training specifically emphasizes semantic correctness, citation faithfulness, and instruction adherence, ensuring that generated content is accurate in meaning, genuinely cited, and strictly follows user instructions, rather than merely maximizing lexical overlap with reference text.

🧠 **Strong reasoning and synthesis capabilities**: PokeeResearch-7B's reasoning scaffold includes three key mechanisms: self-correction, self-verification, and Research Threads Synthesis. With Research Threads Synthesis, the model runs several independent research threads in parallel and then merges their results, significantly improving accuracy on complex questions and reducing the errors a single research path might introduce.

📊 **Leading evaluation results and an open-source commitment**: In a broad evaluation spanning 10 datasets including NQ, TriviaQA, and PopQA, PokeeResearch-7B achieves the best mean accuracy among deep-research agents at the 7B-parameter scale. On Humanity's Last Exam (HLE), for example, Research Threads Synthesis (RTS) lifts accuracy from 15.2% to 17.6%. The project is open-sourced under the Apache-2.0 license, including model code and weights, making it easy for researchers and developers to use and build upon.

🛠️ **Detailed evaluation protocol and tool stack**: The evaluation protocol is rigorous, covering 10 benchmark datasets and 1,228 questions in total, with 4 research threads run per question and Gemini-2.5-Flash-lite judging answer correctness. The model supports up to 100 interaction turns, and its tool stack is straightforward: Serper for search and Jina for document reading. It runs on a single A100 80GB GPU and supports scaling beyond that.

Pokee AI has open-sourced PokeeResearch-7B, a 7B-parameter deep-research agent that executes full research loops: it decomposes a query, issues search and read calls, verifies candidate answers, and then synthesizes multiple research threads into a final response.

The agent runs a research-and-verification loop. In the research phase, it calls external tools for web search and page reading, or proposes an interim answer. In the verification phase, it checks that answer against retrieved evidence and either accepts it or restarts research. This structure reduces brittle trajectories and catches obvious errors before finalization. The research team formalizes this loop and adds a test-time synthesis stage that merges several independent research threads.
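A minimal sketch of this loop in Python, where `policy`, `tools`, and `verify` are hypothetical callables standing in for the agent's model calls and tool stack (these names are illustrative, not from the released code):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    kind: str      # "search", "read", or "answer"
    payload: str   # query, URL, or candidate answer text

def research_loop(
    question: str,
    policy: Callable[[str, list[str]], Action],     # hypothetical model call
    tools: dict[str, Callable[[str], str]],         # e.g. {"search": ..., "read": ...}
    verify: Callable[[str, str, list[str]], bool],  # hypothetical verification prompt
    max_turns: int = 100,
) -> str | None:
    evidence: list[str] = []
    for _ in range(max_turns):
        action = policy(question, evidence)
        if action.kind in tools:
            # Research phase: issue a search or read call and collect evidence.
            evidence.append(tools[action.kind](action.payload))
        elif action.kind == "answer":
            # Verification phase: accept the candidate only if the retrieved
            # evidence supports it; otherwise keep researching.
            if verify(question, action.payload, evidence):
                return action.payload
    return None  # turn budget exhausted without a verified answer
```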

Training recipe: RLAIF with RLOO

PokeeResearch-7B is fine-tuned from Qwen2.5-7B-Instruct using annotation-free Reinforcement Learning from AI Feedback (RLAIF) with the REINFORCE Leave-One-Out (RLOO) algorithm. The reward targets semantic correctness, citation faithfulness, and instruction adherence, not token overlap. The model's Hugging Face card lists a batch size of 64, 8 research threads per prompt during RL, a learning rate of 3e-6, 140 steps, a 32,768-token context, bf16 precision, and a checkpoint near 13 GB. The research team emphasizes that RLOO provides an unbiased on-policy gradient, and contrasts it with the PPO family, which is approximately on-policy and biased.

Paper: https://arxiv.org/pdf/2510.15862
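The leave-one-out baseline is simple to state: each sampled thread's reward is compared against the mean reward of the other K-1 threads for the same prompt. A minimal PyTorch sketch, assuming K = 8 threads per prompt as listed on the model card (this illustrates the estimator, not the released training code):

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Leave-one-out advantages for K rollouts of the same prompt.

    rewards: shape (K,), one scalar reward per research thread.
    The baseline for thread i is the mean reward of the other K-1 threads,
    which keeps the REINFORCE gradient unbiased without a learned critic.
    """
    k = rewards.numel()
    baselines = (rewards.sum() - rewards) / (k - 1)
    return rewards - baselines

# Example with 8 threads per prompt, matching the card's RL setting.
rewards = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
advantages = rloo_advantages(rewards)
# The policy update then weights each thread's summed log-probs by its advantage:
# loss = -(advantages.detach() * thread_logprobs).mean()
```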

Reasoning scaffold and Research Threads Synthesis

The scaffold includes three mechanisms: self-correction, where the agent detects malformed tool calls and retries; self-verification, where the agent inspects its own answer against evidence; and Research Threads Synthesis, where the agent runs several independent threads per question, summarizes them, and then synthesizes a final answer. The research team reports that synthesis improves accuracy on difficult benchmarks.
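As a sketch, the synthesis stage might look like the following, where `run_thread` and `synthesize` stand in for the agent's rollout and synthesis prompts (illustrative names, not the released API):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def answer_with_rts(
    question: str,
    run_thread: Callable[[str], str],             # one full research thread -> summary
    synthesize: Callable[[str, list[str]], str],  # merge summaries -> final answer
    n_threads: int = 4,
) -> str:
    # Run independent research threads; each explores its own trajectory.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        summaries = list(pool.map(lambda _: run_thread(question), range(n_threads)))
    # Synthesize the threads' summaries into a single final answer.
    return synthesize(question, summaries)
```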


Evaluation protocol

The research team evaluates text-only questions from 10 benchmarks: NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle, GAIA, BrowseComp, and Humanity's Last Exam. They sample 125 questions per dataset, except GAIA with 103, for a total of 1,228 questions. For each question, they run 4 research threads, then compute mean accuracy (mean@4), using Gemini-2.5-Flash-lite to judge correctness. The maximum number of interaction turns is set to 100.
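Under this protocol, mean@4 is simply the per-question fraction of threads judged correct, averaged over the benchmark. A small sketch, where `run_thread` and `judge` are placeholders for the agent rollout and the Gemini-2.5-Flash-lite correctness call:

```python
from typing import Callable

def mean_at_k(
    questions: list[tuple[str, str]],        # (question, gold answer) pairs
    run_thread: Callable[[str], str],        # one research thread -> answer
    judge: Callable[[str, str, str], bool],  # LLM judge: is the answer correct?
    k: int = 4,
) -> float:
    per_question = []
    for question, gold in questions:
        verdicts = [judge(question, gold, run_thread(question)) for _ in range(k)]
        per_question.append(sum(verdicts) / k)    # fraction correct among k threads
    return sum(per_question) / len(per_question)  # average over the benchmark
```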

Code: https://github.com/Pokee-AI/PokeeResearchOSS

Results at 7B scale

PokeeResearch-7B reports the best mean@4 accuracy among 7B deep-research agents across the 10 datasets. On HLE the model scores 15.2 without RTS and 17.6 with RTS. On GAIA it scores 36.9 without RTS and 41.3 with RTS. On BrowseComp it scores 5.4 without RTS and 8.4 with RTS. On the seven QA benchmarks (Bamboogle, 2WikiMultiHopQA, TriviaQA, NQ, PopQA, Musique, HotpotQA), the model improves over recent 7B baselines. Gains from RTS are largest on HLE, GAIA, and BrowseComp, and smaller on the QA sets.

Key Takeaways

- Training: PokeeResearch-7B fine-tunes Qwen2.5-7B-Instruct with RLAIF using the RLOO estimator, optimizing rewards for factual accuracy, citation faithfulness, and instruction adherence, not token overlap.
- Scaffold: The agent runs a research-and-verification loop with Research Threads Synthesis, executing multiple independent threads, then synthesizing evidence into a final answer.
- Evaluation protocol: Benchmarks span 10 datasets with 125 questions each, except GAIA with 103; 4 threads per question; mean@4 accuracy judged by Gemini-2.5-Flash-lite; a 100-turn cap.
- Results and release: PokeeResearch-7B reports state-of-the-art results among 7B deep-research agents, for example HLE 17.6 with RTS, GAIA 41.3 with RTS, and BrowseComp 8.4 with RTS, and is released under Apache-2.0 with code and weights public.

Editorial Comments

PokeeResearch-7B is a useful step for practical deep-research agents. It aligns training with RLAIF using RLOO, so the objective targets semantic correctness, citation faithfulness, and instruction adherence. The reasoning scaffold includes self-verification and Research Threads Synthesis, which improves results on difficult benchmarks. The evaluation uses mean@4 with Gemini-2.5-Flash-lite as the judge across 10 datasets. The release ships code and weights under Apache-2.0 with a clear tool stack: Serper for search and Jina for document reading. The setup runs on a single A100 80 GB GPU and scales.
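For readers wiring up a similar stack, both tools are plain HTTP services. A minimal sketch of how they might be called, using the public Serper and Jina Reader endpoints (the exact request shapes the agent itself uses are assumptions on my part):

```python
import os
import requests

def serper_search(query: str) -> dict:
    """Web search via Serper's Google-search endpoint."""
    resp = requests.post(
        "https://google.serper.dev/search",
        headers={"X-API-KEY": os.environ["SERPER_API_KEY"]},
        json={"q": query},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # organic results, answer boxes, etc.

def jina_read(url: str) -> str:
    """Fetch a page as LLM-friendly text via Jina Reader."""
    resp = requests.get(f"https://r.jina.ai/{url}", timeout=60)
    resp.raise_for_status()
    return resp.text
```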


Check out the Paper, the Model on Hugging Face, and the GitHub Repo.

