MarkTechPost@AI, June 12
How Do LLMs Really Reason? A Framework to Separate Logic from Knowledge

This article takes a close look at the reasoning abilities of modern large language models (LLMs), particularly their performance in domains such as math and medicine. The research highlights the limitations of evaluating only final answers and proposes a new framework that decomposes an LLM's reasoning process into two key components: knowledge and logic. A study of Qwen models reveals how different training methods affect reasoning ability, and underscores the importance of more fine-grained evaluation and training in future LLM development to build more reliable AI systems.

🧠 Researchers propose a new framework that decomposes an LLM's reasoning process into two key components: factual knowledge and logical steps. The framework uses the Knowledge Index (KI) to assess factual accuracy and Information Gain (InfoGain) to assess reasoning quality.

➕ On math and medical tasks, the researchers evaluate the Qwen2.5-7B and DeepSeek-R1 models. They decompose each model's responses into logical steps and assess them via Information Gain (how informative each step is) and the Knowledge Index (factual accuracy).

📚 The findings show that supervised fine-tuning (SFT) improves accuracy but can harm reasoning depth, whereas reinforcement learning (RL) helps improve reasoning by removing irrelevant information.

🩺 In the medical domain, SFT enhances medical knowledge more effectively than RL, though it may slightly reduce reasoning efficiency. When RL is applied after SFT, it improves both reasoning and knowledge.

Unpacking Reasoning in Modern LLMs: Why Final Answers Aren’t Enough

Recent advancements in reasoning-focused LLMs like OpenAI’s o1/o3 and DeepSeek-R1 have led to notable improvements on complex tasks. However, the step-by-step reasoning behind these models remains unclear. Most evaluations focus on final-answer accuracy, which hides the reasoning process and doesn’t reveal how models combine knowledge and logic. Some earlier methods attempt to measure reasoning by comparing answers to the original question, but this approach is flawed since models often rely on prior deductions or internal knowledge. Domains such as math and medicine differ in their reasoning needs, highlighting the importance of developing better, domain-aware evaluation methods for building trustworthy AI.

The Shortcomings of Final-Answer Evaluations in Math and Medicine

Recent LLMs have made impressive strides in reasoning tasks, especially in math and medicine, thanks to better training data and reward strategies. However, most of this progress focuses on boosting final answer accuracy rather than understanding how the model reasons step-by-step. Past work has flagged factual errors in reasoning chains or measured similarity between reasoning steps and the original question. But such similarity doesn’t guarantee logical soundness or factual correctness, since LLMs often draw on internal knowledge or earlier reasoning.

A New Framework for Separating Knowledge and Logic in LLM Reasoning

Researchers from UC Santa Cruz, Stanford, and Tongji University go beyond final-answer evaluation by breaking down LLM reasoning into two key parts: factual knowledge and logical steps. They introduce a detailed framework that utilizes two metrics: the Knowledge Index (KI) for factual accuracy and Information Gain (InfoGain) for reasoning quality. Their analysis of Qwen models across math and medical tasks reveals that reasoning skills don’t easily transfer between domains. While supervised fine-tuning improves accuracy, it often harms reasoning depth. Reinforcement learning, however, helps refine reasoning by removing irrelevant information. This work highlights the importance of evaluating and training LLMs more thoughtfully.
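One plausible way to formalize the two metrics, offered as an illustrative reading of the description above rather than the paper's exact definitions: for a question q with ground-truth answer y and a response decomposed into reasoning steps s_1, …, s_T,

```latex
% Illustrative formalization (assumption, not the paper's published equations):
% InfoGain_t = per-step reduction in uncertainty about the final answer,
% KI         = fraction of steps whose factual claims are correct.
\mathrm{InfoGain}_t \;=\; H\!\left(y \mid q,\, s_{1:t-1}\right) \;-\; H\!\left(y \mid q,\, s_{1:t}\right),
\qquad
\mathrm{KI} \;=\; \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}\!\left[\, s_t \text{ is factually correct} \,\right]
```

Under this reading, a step with high InfoGain moves the model measurably closer to the answer, while a low KI flags steps whose factual claims do not hold up against expert sources.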

Assessing Reasoning with Qwen2.5-7B and DeepSeek-R1 Models

The researchers evaluate reasoning in LLMs by analyzing Qwen2.5-7B and its DeepSeek-R1-distilled version, trained with SFT and RL. Using tasks from both math and medical domains, they decompose responses into logical steps and assess them using two key metrics: Information Gain (how much uncertainty is reduced with each reasoning step) and Knowledge Index (how factually accurate each step is, verified against expert sources). While InfoGain tracks the informativeness of each step, KI checks whether the knowledge aligns with real-world facts. This approach reveals how models reason and where they may falter in accuracy or logic.
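As a rough sketch of how such per-step scoring could be wired up, the snippet below computes InfoGain as the increase in a scorer's log-probability of the ground-truth answer as reasoning steps accumulate, and KI as the fraction of steps a verifier accepts. The helpers `answer_logprob` and `is_factually_correct` are hypothetical stand-ins for an LLM-based scorer and an expert-sourced fact checker; this is an illustration of the idea, not the authors' released code.

```python
from typing import Callable, List

def info_gain_per_step(
    steps: List[str],
    answer: str,
    answer_logprob: Callable[[List[str], str], float],
) -> List[float]:
    """Per-step InfoGain: how much each reasoning step raises the scorer's
    log-probability of the ground-truth answer (a proxy for reduced uncertainty)."""
    gains = []
    prev = answer_logprob([], answer)  # uncertainty before any reasoning step
    for t in range(1, len(steps) + 1):
        cur = answer_logprob(steps[:t], answer)
        gains.append(cur - prev)       # positive gain = informative step
        prev = cur
    return gains

def knowledge_index(
    steps: List[str],
    is_factually_correct: Callable[[str], bool],
) -> float:
    """KI: fraction of reasoning steps whose factual claims check out
    against an external knowledge source."""
    if not steps:
        return 0.0
    return sum(is_factually_correct(s) for s in steps) / len(steps)

if __name__ == "__main__":
    # Toy stand-ins for illustration only; a real setup would query an LLM
    # scorer and a domain-specific verifier (e.g. medical references).
    steps = [
        "The patient presents with fever and productive cough.",
        "These findings are consistent with community-acquired pneumonia.",
        "A chest X-ray should be ordered to confirm.",
    ]
    fake_logprob = lambda ctx, ans: -4.0 + 1.1 * len(ctx)  # pretends every step helps
    fake_verifier = lambda s: "fever" in s.lower() or "pneumonia" in s.lower()
    print(info_gain_per_step(steps, "chest X-ray", fake_logprob))
    print(knowledge_index(steps, fake_verifier))
```

In a real pipeline, `answer_logprob` would condition a language model on the question plus the first t steps and score the reference answer, and the verifier would check each step against domain references such as medical guidelines.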

Supervised Fine-Tuning vs. Reinforcement Learning in Domain-Specific Tasks

The study evaluates two variants of Qwen2.5-7B, Qwen-Base and the distilled Qwen-R1, on medical tasks. Results show that Qwen-Base consistently outperforms Qwen-R1 in accuracy, knowledge retention, and reasoning, especially after SFT and RL. The distilled model likely struggles due to prior training focused on math and code, resulting in a domain mismatch. Interestingly, SFT enhances medical knowledge more effectively than RL, although it may slightly compromise reasoning efficiency. RL, on the other hand, improves both reasoning and knowledge when applied post-SFT. Medical benchmarks tend to rely more on factual knowledge than abstract reasoning, unlike math-focused tasks.

Conclusion: Toward More Interpretable and Trustworthy LLMs

In conclusion, the study introduces a framework that separates knowledge from reasoning to better evaluate how LLMs think, particularly in high-stakes areas like medicine and math. Using Qwen models trained with SFT and RL, the researchers found that while SFT improves factual accuracy, which is essential in medicine, it often weakens reasoning. RL, however, enhances reasoning by trimming out incorrect information. The framework could be extended to fields such as law or finance, where structured thinking is crucial. Overall, this approach helps clarify how LLMs make decisions and suggests ways to tailor their training for specific domains.


Check out the Paper, Code and Project Page. All credit for this research goes to the researchers of this project.
