Microsoft Research Blog - Microsoft Research, September 12
Controllable, Explainable, Self-Adaptive AI for Scientific Discovery

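Microsoft has introduced CLIO (Cognitive Loop via In-situ Optimization), a new AI system designed to deliver self-adaptive cognitive behavior that is more controllable and explainable than traditional reasoning models, particularly in complex domains such as scientific discovery. CLIO generates the data it needs by creating internal reflection loops at runtime, so AI behavior can be finely steered and customized without any model post-training. In biomedicine, the system has already shown a substantial performance gain over the GPT-4.1 baseline, and it clearly surfaces uncertainty in its reasoning process, which strengthens scientists’ trust and paves the way for broader use of AI in scientific research.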

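💡 **CLIO: a new paradigm for controllable, explainable, self-adaptive AI**: CLIO (Cognitive Loop via In-situ Optimization) is an AI system that generates data by creating reflection loops at runtime, enabling self-adaptive reasoning without model post-training. This gives scientific discovery and similar fields AI behavior that is easier to control and understand than that of traditional models, and it addresses the limits of current AI models in user control and explainability.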

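🚀 **Performance gains**: On the biology and medicine questions of Humanity’s Last Exam (HLE), CLIO raised the accuracy of OpenAI’s GPT-4.1 from 8.55% to 22.37%, a relative improvement of 161.64%, even surpassing a model that had undergone further optimization. This shows that an optimization-based, self-adaptive AI system can match or exceed post-trained models in domains that demand adaptability, explainability, and control.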

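⚖️ **Built-in uncertainty control and trust building**: One of CLIO’s core strengths is its built-in control over uncertainty. It shows not only its reasoning process and final results but also the uncertainty thresholds along the reasoning path, which is essential for reproducibility and error correction in scientific research. In addition, prompt-free control options let users set the threshold for uncertainty flags, so they are alerted at critical moments and can build trust in the AI system.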

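🔬 **Model-agnostic generality and outlook**: CLIO is model-agnostic by design and can improve the performance of multiple models, including GPT-4o, on specific problems. On immunology questions, for example, CLIO lifts GPT-4o’s base performance to a level on par with top reasoning models. Microsoft plans to apply CLIO to broader scientific domains such as drug discovery, and believes its generality lets it serve industries such as financial analysis, engineering, and legal services as a key control layer in hybrid AI stacks.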

Unlocking self-adaptive cognitive behavior that is more controllable and explainable than reasoning models in challenging scientific domains

Long-running LLM agents equipped with strong reasoning, planning, and execution skills have the potential to transform scientific discovery with high-impact advancements, such as developing new materials or pharmaceuticals. As these agents become more autonomous, ensuring effective human oversight and clear accountability becomes increasingly important, presenting challenges that must be addressed to unlock their full transformative power. Today’s approaches to long-term reasoning are established during the post-training phase, prior to end-user deployment and typically by the model provider. As a result, the expected actions of these agents are pre-baked by the model developer, offering little to no control from the end user.

At Microsoft, we are pioneering a vision for a continually steerable virtual scientist. In line with this vision, we created the ability to have a non-reasoning model develop thought patterns that allow for control and customizability by scientists. Our approach, a cognitive loop via in-situ optimization (CLIO), does not rely on reinforcement learning post-training to develop reasoning patterns, yet it still yields comparable performance, as demonstrated by our evaluation on Humanity’s Last Exam (HLE). Notably, we increased OpenAI GPT-4.1’s base model accuracy on text-only biology and medicine from 8.55% to 22.37%, an absolute increase of 13.82 percentage points (161.64% relative), surpassing o3 (high). This demonstrates that an optimization-based, self-adaptive AI system developed without further post-training can rival post-trained models in domains where adaptability, explainability, and control matter most.
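The absolute and relative figures quoted above follow directly from the two reported accuracies. The short check below is only a worked calculation; the function name `gains` is ours and is not part of CLIO.

```python
def gains(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = improved - baseline
    relative = absolute / baseline * 100.0
    return absolute, relative


# Accuracies reported in the post for GPT-4.1 without and with CLIO.
absolute, relative = gains(8.55, 22.37)
print(f"absolute: {absolute:.2f} points, relative: {relative:.2f}%")
```

The distinction matters when reading the rest of the results: a gain stated in percentage points is a difference between two accuracies, while a relative gain is that difference divided by the baseline.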

Figure 1. Head-to-head comparison of OpenAI’s GPT-4.1 with CLIO, o3, and GPT-4.1 with no tools on HLE biology and medicine questions

In-situ optimization with internal self-reflection to enable self-adaptive reasoning

Model development has advanced from reinforcement learning from human feedback (RLHF) for answer alignment to reinforcement learning with verifiable rewards (RLVR), which relies on external grading. Recent approaches show promise in using intrinsic rewards to train reasoning models (RLIR). Traditionally, these reasoning processes are learned during post-training, before any user interaction. Today’s reasoning models require additional data in the training phase and limit user control while the reasoning is generated. CLIO’s approach instead lets users steer reasoning from scratch without additional data: CLIO generates the data it needs by creating reflection loops at runtime. These reflection loops serve a wide array of activities that CLIO defines for itself, including idea exploration, memory management, and behavior control. Most interesting is CLIO’s ability to use prior inferences to adjust future behavior, handling uncertainties and raising flags for correction when necessary. This open-architecture approach to reasoning removes the need for further model post-training to achieve the desired reasoning behavior. That matters because novel scientific discovery often has no established reasoning patterns to draw on, much less a large enough corpus of high-quality data to train on.
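To make the idea of a runtime reflection loop concrete, here is a minimal sketch. It is not CLIO’s implementation, which this post does not describe at code level. Every name here (`Step`, `LoopState`, `reflect`, `toy_propose`) and both thresholds are assumptions made for illustration, and `toy_propose` is a stand-in for a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    thought: str
    uncertainty: float  # 0.0 = fully confident, 1.0 = no confidence


@dataclass
class LoopState:
    question: str
    memory: list[Step] = field(default_factory=list)  # prior inferences
    flags: list[str] = field(default_factory=list)    # items raised for review


def reflect(
    state: LoopState,
    propose: Callable[[str, list[Step]], Step],
    max_iters: int = 5,
    flag_threshold: float = 0.6,
    stop_threshold: float = 0.25,
) -> LoopState:
    """Propose a step, record it, flag it if uncertain, stop once confident."""
    for _ in range(max_iters):
        # Each proposal sees the memory of earlier steps, so prior
        # inferences can shape later behavior without any retraining.
        step = propose(state.question, state.memory)
        state.memory.append(step)
        if step.uncertainty >= flag_threshold:
            state.flags.append(step.thought)
        if step.uncertainty <= stop_threshold:
            break
    return state


def toy_propose(question: str, memory: list[Step]) -> Step:
    # Hypothetical stand-in for a model call: uncertainty falls as the
    # loop accumulates more prior steps.
    uncertainty = max(0.0, 0.75 - 0.25 * len(memory))
    return Step(f"hypothesis {len(memory) + 1} for: {question}", uncertainty)


state = reflect(LoopState("placeholder question"), toy_propose)
print(len(state.memory), "steps,", len(state.flags), "flag(s)")
```

The point of the sketch is only that the data driving later steps is produced inside the loop at runtime, rather than learned beforehand.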

CLIO reasons by continuously reflecting on progress, generating hypotheses, and evaluating multiple discovery strategies. For the HLE test, CLIO was specifically steered to follow the scientific method as a guiding framework. Our research shows that equipping language models with self-adapting reasoning enhances their problem-solving ability. It yields a net quality gain on science questions while exposing the reasoning to the end user and giving them control over it.

Figure 2. CLIO can raise key areas of uncertainty within its self-formulated reasoning process, balancing multiple different viewpoints using graph structures.

Control over uncertainty: Building trust in AI 

Orchestrated reasoning systems like CLIO are valuable for scientific discovery because they provide features beyond accuracy alone. Explaining the outcomes of internal reasoning is standard practice in science and is already present in current reasoning model approaches. However, showing complete work (final outcomes, internal thought processes, and uncertainty thresholds) to support reproducibility and correction, and explicitly indicating uncertainty, are not yet universally implemented. Current models and systems lack this kind of innate humility. Instead, we are left with models that produce confident results whether they are correct or not. A confident result is valuable when correct and dangerous to the scientific process when incorrect. Hence, understanding a model or system’s uncertainty is a crucial capability that we have built natively into CLIO.

On the other end of the spectrum, orchestrated reasoning systems tend to overwhelm the user by raising too many flags. We provide prompt-free control knobs within CLIO to set the thresholds for raising uncertainty flags. This allows CLIO to flag uncertainty for itself and the end user at the proper point in time. It also enables scientists to revisit CLIO’s reasoning path with critiques, edit beliefs during the reasoning process, and re-execute the reasoning from the desired point in time. Ultimately, this builds a foundational level of trust, so that scientists can use CLIO in a scientifically defensible and rigorous way.
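The two controls described above, a flagging threshold and re-execution from an edited belief, can be illustrated with a small sketch. This is a hypothetical illustration, not CLIO’s interface: `Belief`, `flagged`, `rerun_from`, and `toy_continue` are names we made up, and the trace contents are placeholders.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Belief:
    claim: str
    uncertainty: float  # 0.0 = fully confident, 1.0 = no confidence


def flagged(trace: list[Belief], threshold: float) -> list[Belief]:
    """Return only the beliefs uncertain enough to surface to the user."""
    return [b for b in trace if b.uncertainty >= threshold]


def rerun_from(
    trace: list[Belief],
    index: int,
    edited: Belief,
    continue_fn: Callable[[list[Belief]], list[Belief]],
) -> list[Belief]:
    """Keep steps before `index`, substitute the edited belief, regenerate the rest."""
    prefix = trace[:index] + [edited]
    return prefix + continue_fn(prefix)


def toy_continue(prefix: list[Belief]) -> list[Belief]:
    # Hypothetical stand-in for resuming the reasoning loop from `prefix`.
    return [Belief(f"conclusion derived from: {prefix[-1].claim}", 0.25)]


trace = [
    Belief("observation A", 0.25),
    Belief("assumption B", 0.75),
    Belief("conclusion C", 0.5),
]

sensitive = flagged(trace, threshold=0.4)  # lower threshold, more flags
quiet = flagged(trace, threshold=0.7)      # higher threshold, fewer flags

corrected = Belief("assumption B, corrected by the scientist", 0.125)
revised = rerun_from(trace, 1, corrected, toy_continue)
```

Setting the threshold as a plain parameter rather than through prompt wording is what makes the control prompt-free in this sketch: the same trace yields two flags at 0.4 and one at 0.7.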

How does CLIO perform? 

We evaluate CLIO against text-based biology and medicine questions from HLE. For this domain, we demonstrate a 61.98% relative increase in accuracy (8.56 percentage points absolute) over OpenAI’s o3 and substantially outperform base completion models like OpenAI’s GPT-4.1, while enabling the requisite explainability and control. The technique is not tied to one model: we see similar increases with OpenAI’s GPT-4o, which we observe performs poorly on HLE-level questions. On average, GPT-4.1 is not considered competent for HLE-scale questions (<9%), and GPT-4o is natively at less than 2%. By utilizing CLIO, we bring these to near state-of-the-art performance against top reasoning models. CLIO’s recursive nature enables the system to think more broadly and more deeply, ensuring coverage of the question when answered. In GPT-4.1, we see an increase of 5.92 percentage points in overall accuracy using just the cognitive loop recursion. To think more deeply, we allow CLIO to ensemble different evolutions and choose the best approach using GraphRAG. This extension of the cognition pattern provides a further 7.90 percentage points over a non-ensembled approach.
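The figures in this section are mutually consistent, which a few lines of arithmetic confirm. Note that the o3 accuracy computed below is only implied by the reported gains; it is not stated in the text above, and the variable names are ours.

```python
# Figures reported in the post (accuracy in percent, gains in percentage points).
base_gpt41 = 8.55        # GPT-4.1 without CLIO
recursion_gain = 5.92    # cognitive loop recursion alone
ensemble_gain = 7.90     # further gain from ensembling with GraphRAG
gain_over_o3 = 8.56      # absolute gain over o3

# The two component gains add up to the full improvement.
with_recursion = base_gpt41 + recursion_gain
with_ensemble = with_recursion + ensemble_gain

# Derived, not reported: the o3 accuracy implied by the stated gain.
implied_o3 = with_ensemble - gain_over_o3
relative_over_o3 = gain_over_o3 / implied_o3 * 100.0

print(f"with recursion: {with_recursion:.2f}%")
print(f"with ensemble:  {with_ensemble:.2f}%")
print(f"implied o3:     {implied_o3:.2f}%")
print(f"relative gain over o3: {relative_over_o3:.2f}%")
```

In other words, roughly 43% of the total 13.82-point improvement comes from recursion and the remainder from ensembling.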

Figure 3. The impact of thinking effort on CLIO’s effectiveness.

Furthermore, CLIO’s design offers different knobs of control, for example, how much time to think and which technique to use for a given problem. In Figure 3, we demonstrate these knobs of control and their effect on GPT-4.1 and GPT-4o’s performance. In this case, we analyze performance for a subset of biomedical questions, those focused on immunology. CLIO increases GPT-4o’s base performance to be on par with the best reasoning models for immunology questions. We observe a 13.60% improvement over the base model, GPT-4o. This result shows CLIO to be model agnostic, similar to the approach and corresponding performance boost of Microsoft AI Diagnostic Orchestrator (MAI-DxO).

Implications for science and trustworthy discovery

The future of scientific discovery demands more than reasoning over knowledge and raw computational power alone. Here, we demonstrate how CLIO not only increases model performance but also establishes new layers of control for scientists. In our upcoming work, we will demonstrate how CLIO increases tool utility for highly valuable scientific questions in drug discovery, which requires precise tools designed for the language of science. While our experiments focus on scientific discovery, we believe CLIO can apply in a domain-agnostic fashion. Experts tackling problems in domains such as financial analysis, engineering, and legal services could potentially benefit from AI systems with a transparent, steerable reasoning approach. Ultimately, we envision CLIO as an enduring control layer in hybrid AI stacks that combine traditional completion and reasoning models with external memory systems and advanced tool calling. The continuous checks and balances that CLIO enables will remain valuable even as components within those stacks evolve. This combination of intelligent, steerable scientific decision making and tool optimization is the basis of the recently announced Microsoft Discovery platform.

At Microsoft, we’re committed to advancing AI research that earns the trust of scientists, empowering them to discover new frontiers of knowledge. Our work is a testament to what’s possible when we blend innovation with trustworthiness and a human-centered vision for the future of AI-assisted scientific discovery. We invite the research and scientific community to join us in shaping that future.

Further information:

To learn more details about our approach, please read our pre-print paper published alongside this blog. We are in the process of submitting this work for external peer review and encourage partners to explore the utilization of CLIO in Microsoft Discovery. To learn more about Microsoft’s research on this or contact our team, please reach out to discoverylabs@microsoft.com

Acknowledgements

We are grateful for Jason Zander and Nadia Karim’s support. We extend our thanks to colleagues both inside and outside Microsoft Discovery and Quantum for sharing their insights and feedback, including Allen Stewart, Yasser Asmi, David Marvin, Harsha Nori, Scott Lundberg, and Phil Waymouth. 

The post Self-adaptive reasoning for science appeared first on Microsoft Research.
