VentureBeat · November 12, 06:23
Meta proposes SPICE, a framework for self-improving AI systems

Researchers at Meta FAIR and the National University of Singapore have developed a new reinforcement learning framework called SPICE, designed to let AI systems improve themselves. The framework pits two AI agents against each other so that the system generates its own challenges and gradually improves without human intervention. One agent plays a "Challenger" that constructs problems from a large document corpus, while the other plays a "Reasoner" that solves those problems without access to the source documents. This design breaks the information symmetry that hampers existing self-play methods, and it prevents fabricated content by anchoring tasks in a broad range of real-world material. Experiments show that SPICE significantly outperforms baseline models on mathematical and general reasoning tasks, laying the groundwork for more adaptive and robust AI systems.

🧰 **How SPICE works:** SPICE (Self-Play In Corpus Environments) is a reinforcement learning framework in which one AI agent plays the "Challenger," drawing on a large document corpus to generate a stream of challenging problems, while another agent plays the "Reasoner," solving those problems independently without seeing the source documents. This self-play mechanism drives the system to keep learning through adversarial interaction, removing the traditional reliance on hand-designed problems and reward engineering.

💡 **Overcoming the limits of existing self-play:** Conventional self-play methods for language models tend to amplify hallucinations, because factual errors in generated questions and answers compound, and when the problem generator and solver share the same information they fall into repetitive challenge patterns. SPICE decouples problem generation from reasoning and denies the Reasoner access to the source documents, breaking that information symmetry, preventing errors from accumulating, and producing genuinely novel challenges.

🌍 **Grounding in real-world content:** Another key advantage of SPICE is that it anchors tasks in a vast and diverse document corpus. Because questions and answers are derived from real-world knowledge, hallucination is curbed: the system learns by interacting with an external, verifiable knowledge source rather than relying solely on its own outputs, making its learning more reliable and robust.

🚀 **An automatic curriculum with broad applicability:** The adversarial dynamic between the Challenger and the Reasoner forms an automatically generated learning curriculum. The Challenger is rewarded for producing diverse problems at the frontier of the Reasoner's ability, while the Reasoner is rewarded for answering correctly, and this symbiotic interaction pushes both roles to keep discovering and overcoming new challenges. Because SPICE can generate multiple task formats, it can be applied to any domain rather than being confined to narrow fields like math or code, and it reduces dependence on expensive human-curated datasets.

Researchers at Meta FAIR and the National University of Singapore have developed a new reinforcement learning framework for self-improving AI systems.

Called Self-Play In Corpus Environments (SPICE), the framework pits two AI agents against each other, creating its own challenges and gradually improving without human supervision.

While currently a proof-of-concept, this self-play mechanism could provide a basis for future AI systems that can dynamically adapt to their environments, making them more robust against the unpredictability of real-world applications.

The challenge of self-improving AI

The goal of self-improving AI is to create systems that can enhance their capabilities by interacting with their environment.

A common approach is reinforcement learning with verifiable rewards (RLVR), where models are rewarded for providing the correct answers to problems. This is often limited by its reliance on human-curated problem sets and domain-specific reward engineering, which makes it difficult to scale.
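For intuition, a verifiable reward in this setting usually amounts to an automatic check of the model's final answer against a known solution. The sketch below is illustrative only; the helper name and example pair are assumptions, not code from the paper.

```python
# Minimal sketch of a verifiable reward as used in RLVR-style training.
# Illustrative only; the function and the example pair are not taken from the paper.
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 when the model's final answer matches the known solution, else 0.0."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

# The scaling bottleneck: every (problem, ground_truth) pair must be curated by humans,
# and the checking rule is usually domain-specific (math answers, unit tests for code).
example_problem, example_truth = "What is 17 * 24?", "408"
print(verifiable_reward("408", example_truth))  # 1.0
```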

Self-play, where a model improves by competing against itself, is another promising paradigm. But existing self-play methods for language models are often limited by two critical factors.

1. Factual errors in generated questions and answers compound, leading to a feedback loop of hallucinations.

2. When the problem generator and solver have information symmetry (i.e., they share the same knowledge base), they fail to generate genuinely new challenges and fall into repetitive patterns.

As the researchers note in their paper, “These systematic empirical failures indicate that self-improvement requires interaction with an external source providing diverse, verifiable feedback, rather than closed-loop pure introspection.”

How SPICE works

SPICE is a self-play framework where a single model acts in two distinct roles: a "Challenger" that mines a large corpus of documents to construct problems, and a "Reasoner" that attempts to solve those problems without ever seeing the source documents.

This setup breaks the information symmetry that limits other self-play methods, as the Reasoner does not have access to the documents and knowledge that the Challenger uses to generate the problems.
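A minimal sketch of one such self-play round, assuming a generic `model.generate` API, might look like the following; the role prompts and helper names are hypothetical stand-ins, not the paper's actual implementation.

```python
import random

def spice_round(model, corpus):
    """One illustrative SPICE-style round: the same model plays both roles.

    Assumes a hypothetical `model.generate(...)` API; the Challenger is shown a
    raw document, while the Reasoner sees only the resulting question.
    """
    document = random.choice(corpus)

    # Challenger role: grounded in the sampled document.
    challenge = model.generate(
        role="challenger",
        prompt=f"Read the passage and write a hard question with a checkable answer:\n{document}",
    )  # assumed to return a dict like {"question": ..., "answer": ...}

    # Reasoner role: information asymmetry -- it never sees `document`.
    attempt = model.generate(
        role="reasoner",
        prompt=f"Answer the question:\n{challenge['question']}",
    )

    solved = attempt.strip() == challenge["answer"].strip()
    return challenge, attempt, solved
```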

Grounding the tasks in a vast and diverse corpus of documents prevents hallucination by anchoring questions and answers in real-world content. This matters because AI systems need an external grounding source to self-improve reliably: LLM agents should learn from interactions with humans and the real world, not just from their own outputs, so that errors do not compound.

The adversarial dynamic between the two roles creates an automatic curriculum.

The Challenger is rewarded for generating problems that are both diverse and at the frontier of the Reasoner's capability (neither too easy nor impossible).

The Reasoner is rewarded for answering correctly. This symbiotic interaction pushes both agents to continuously discover and overcome new challenges. 
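One simple way to express these two reward signals is sketched below: the Reasoner's reward is plain correctness, while the Challenger's "frontier" reward is written here as an illustrative pass-rate curve, an assumption rather than the paper's exact formula.

```python
def reasoner_reward(correct: bool) -> float:
    """The Reasoner is rewarded simply for answering correctly."""
    return 1.0 if correct else 0.0

def challenger_reward(pass_rate: float) -> float:
    """Illustrative 'frontier' reward for the Challenger (an assumption, not the paper's formula).

    `pass_rate` is the fraction of several Reasoner attempts that succeed on a question.
    The reward vanishes for questions that are trivial (pass_rate ~ 1.0) or impossible
    (pass_rate ~ 0.0) and peaks at pass_rate = 0.5, the edge of the Reasoner's ability.
    """
    return pass_rate * (1.0 - pass_rate)
```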

Because the system uses raw documents instead of pre-defined question-answer pairs, it can generate diverse task formats, such as multiple-choice and free-form questions.

This flexibility allows SPICE to be applied to any domain, breaking the bottleneck that has confined previous methods to narrow fields like math and code. It also reduces dependence on expensive human-curated datasets for specialized domains like legal or medical analysis.
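Because the answers are grounded in the source documents, both task formats can be checked automatically. The sketch below shows one plausible way such checks could look; the function and matching rules are assumptions, not the paper's verification code.

```python
def check_answer(task_format: str, prediction: str, reference: str) -> bool:
    """Illustrative automatic checks for the two task formats mentioned above."""
    if task_format == "multiple_choice":
        # Compare the selected option letter, e.g. "B".
        return prediction.strip().upper()[:1] == reference.strip().upper()[:1]
    if task_format == "free_form":
        # Normalized exact match on the final short answer.
        return prediction.strip().lower() == reference.strip().lower()
    raise ValueError(f"unknown task format: {task_format}")
```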

SPICE in action

The researchers evaluated SPICE on several base models, including Qwen3-4B-Base and OctoThinker-3B-Hybrid-Base.

They compared its performance against baselines such as the base model with no training, a Reasoner model trained with a fixed "Strong Challenger" (Qwen3-32B-Instruct), and pure self-play methods like R-Zero and Absolute Zero. The evaluation covered a wide range of mathematical and general reasoning benchmarks.

Across all models, SPICE consistently outperformed the baselines, delivering significant improvements in both mathematical and general reasoning tasks.

The results show that the reasoning capabilities developed through corpus-grounded self-play transfer broadly across different models, thanks to the diverse external knowledge corpus they used.

A key finding is that the adversarial dynamic creates an effective automatic curriculum. As training progresses, the Challenger learns to generate increasingly difficult problems.

In one experiment, the Reasoner's pass rate on a fixed set of problems increased from 55% to 85% over time, showing its improved capabilities.

Meanwhile, later versions of the Challenger were able to generate questions that dropped the pass rate of an early-stage Reasoner from 55% to 35%, confirming that both roles co-evolve successfully.

The researchers conclude that this approach presents a paradigm shift in self-improving reasoning methods from “closed-loop self-play that often stagnates due to hallucination drift, to open-ended improvement through interaction with the vast, verifiable knowledge embedded in web document corpora.”

Currently, the corpus used for SPICE represents human experience captured in text. The ultimate goal is for self-improving systems to generate questions based on interactions with reality, including the physical world, the internet, and human interactions across multiple modalities like video, audio, and sensor data.
