A Study of the Trade-offs of LLM Reasoning in Low False Positive Rate Tasks

cs.AI updates on arXiv.org, October 27, 14:23


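This paper systematically studies how suitable large language model (LLM) reasoning is for classification tasks under strict low false positive rate (FPR) regimes. Experiments on safety detection and hallucination detection show that reasoning-augmented generation improves overall accuracy but falls short at low FPR thresholds. The study finds that inference without reasoning dominates in precision-sensitive applications, while a simple ensemble of the two modes recovers the strengths of each.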

arXiv:2510.21049v1 Announce Type: cross

Abstract: Reasoning has become a central paradigm for large language models (LLMs), consistently boosting accuracy across diverse benchmarks. Yet its suitability for precision-sensitive tasks remains unclear. We present the first systematic study of reasoning for classification tasks under strict low false positive rate (FPR) regimes. Our analysis covers two tasks, safety detection and hallucination detection, evaluated in both fine-tuned and zero-shot settings, using standard LLMs and Large Reasoning Models (LRMs). Our results reveal a clear trade-off: Think On (reasoning-augmented) generation improves overall accuracy, but underperforms at the low-FPR thresholds essential for practical use. In contrast, Think Off (no reasoning during inference) dominates in these precision-sensitive regimes, with Think On surpassing it only when higher FPRs are acceptable. In addition, we find that token-based scoring substantially outperforms self-verbalized confidence for precision-sensitive deployments. Finally, a simple ensemble of the two modes recovers the strengths of each. Taken together, our findings position reasoning as a double-edged tool: beneficial for average accuracy, but often ill-suited for applications requiring strict precision.
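The abstract names three ingredients without spelling them out: evaluation at a fixed low FPR, token-based scoring, and an ensemble of the two modes. The Python sketch below illustrates each one under stated assumptions and is not the authors' implementation. The function names (`tpr_at_fpr`, `token_score`, `ensemble`) are hypothetical, and the toy scores are made up to exercise the code; they are not results from the paper. It also assumes that token-based scoring means renormalising the log-probabilities of two label tokens, and that the ensemble is a plain average of the two modes' scores. The abstract confirms neither detail.

```python
import math


def tpr_at_fpr(scores, labels, max_fpr):
    """Highest true positive rate reachable while keeping FPR <= max_fpr.

    scores: higher means "more likely positive".
    labels: 1 for positive, 0 for negative.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    # Lower the decision threshold step by step, from the highest score down.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    best_tpr = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg > max_fpr:
            break  # FPR only grows from here, so stop.
        best_tpr = tp / n_pos
    return best_tpr


def token_score(logprob_pos, logprob_neg):
    """Assumed token-based score: P(positive) renormalised over two label tokens."""
    m = max(logprob_pos, logprob_neg)  # subtract the max for numerical stability
    e_pos = math.exp(logprob_pos - m)
    e_neg = math.exp(logprob_neg - m)
    return e_pos / (e_pos + e_neg)


def ensemble(scores_a, scores_b):
    """Assumed ensemble rule: element-wise mean of two score lists."""
    return [(a + b) / 2 for a, b in zip(scores_a, scores_b)]


# Synthetic toy data (NOT from the paper): 4 positives followed by 5 negatives.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]
think_off = [0.90, 0.80, 0.70, 0.30, 0.60, 0.40, 0.20, 0.10, 0.05]
think_on = [0.95, 0.90, 0.60, 0.55, 0.99, 0.30, 0.20, 0.10, 0.05]

for name, scores in [
    ("think_off", think_off),
    ("think_on", think_on),
    ("ensemble", ensemble(think_off, think_on)),
]:
    print(name, tpr_at_fpr(scores, labels, 0.0), tpr_at_fpr(scores, labels, 0.2))
```

In this toy setup the `think_on` list contains one very confident false positive, so its TPR collapses at a zero-FPR budget and recovers only once a looser budget is allowed. That is the kind of behaviour a fixed-FPR metric exposes and plain accuracy can hide.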


Related tags

LLM reasoning, low false positive rate, classification tasks, safety detection, hallucination detection
