The Sources of AI Hallucinations: The Necessity of Expressing Uncertainty

 

This post reviews and compares two preprints on AI hallucinations. The first, "The Consistent Reasoning Paradox", argues from a theoretical computer science perspective that an artificial general intelligence (AGI) capable of passing the Turing test will inevitably produce wrong answers if it is to reason consistently, and therefore needs an "I don't know" capability. The second, "The Statistical Inevitability of Hallucinations", uses statistical learning theory to show that AI models performing density estimation with a cross-entropy loss will necessarily hallucinate; it stresses that current evaluation benchmarks tend to reward overconfidence and fail to incentivise models to reduce hallucinations, and proposes modifying evaluation standards to reward appropriate abstention.

💡 **The paradox of consistent reasoning:** Using a theoretical computer science framework, the first paper shows that an AGI which must answer differently phrased versions of the same question consistently will necessarily be wrong in some cases. An idealised, Turing-test-passing AI system facing hard problems therefore cannot avoid errors unless it can express uncertainty.

📊 **The statistical inevitability of hallucinations:** The second paper uses statistical learning theory to show that, under the standard training paradigm, and in particular for models performing density estimation with a cross-entropy loss, a non-zero probability of hallucination is unavoidable. Even with error-free training data, the model has an inherent tendency to produce incorrect information.

🏆 **Misaligned benchmark incentives:** Both papers conclude that current AI training and evaluation pipelines fall short. The second paper points out in particular that existing benchmarks reward "overconfidence": models are pushed to give an answer even when uncertain rather than admit they do not know. This hinders the development of hallucination-reducing behaviour; the evaluation system should be changed to explicitly penalise errors and reward appropriate abstention.

❓ **The importance of "I don't know":** Taken together, both papers stress that AI systems must be able to express uncertainty, i.e. to say "I don't know". The first paper shows that there exists an AI, operating within a computational budget, that either answers correctly or declares that it does not know; the second suggests modifying evaluation standards to train models towards this capability, so that they abstain appropriately when uncertain, improving AI reliability.

Published on September 23, 2025 11:06 AM GMT

A recent OpenAI preprint (and blogpost) examines the sources of AI hallucinations. This reminded me of a 2024 preprint which was similar in scope. These two papers use different mathematical lenses but arrive at complementary conclusions about the necessity of uncertainty expressions in AI systems. I briefly review and compare them here.

[disclaimer: I am not an author of either of those papers, so my understanding may not be fully correct]

Paper 1: The Consistent Reasoning Paradox

The first paper examines AI systems through the lens of theoretical computer science. The authors define an AGI as a system (implemented on a Turing machine) that passes the Turing test; this implies that it must reason consistently, answering the same question in the same way regardless of how it is phrased, just as humans ideally should.

Their formal setup considers arithmetic problems in which numerical constants can be expressed in different ways; each way of writing the constants gives a different "sentence" for the same problem. Varying the values of the constants yields a "collection" of problems.
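As a toy illustration of this terminology (mine, not the paper's formalism, and ignoring the computability subtleties that drive the actual results), here is one way to think of a problem family, a collection obtained by varying the constants, and several sentences expressing the same instance:

```python
from fractions import Fraction

# A "problem family": add two numerical constants x and y.
def solve(x: Fraction, y: Fraction) -> Fraction:
    return x + y

# Different "sentences" for the same instance: the constants are the same
# numbers, just written differently. A consistent reasoner must give the
# same answer to all of them.
sentences = [
    "What is 0.5 + 0.25?",
    "What is 1/2 + 1/4?",
    "What is 2/4 + 25/100?",
]
instance = (Fraction(1, 2), Fraction(1, 4))

# A "collection" of problems: vary the constants of the family.
collection = [
    (Fraction(1, 2), Fraction(1, 4)),
    (Fraction(3, 7), Fraction(2, 5)),
    (Fraction(22, 7), Fraction(1, 3)),
]

print(solve(*instance))                      # 3/4, the answer every sentence above should receive
print([solve(x, y) for x, y in collection])  # answers across the collection
```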

Working within the framework of Turing machines and computable numbers, they establish five main results, which together constitute what they call the Consistent Reasoning Paradox (CRP). The following are my summaries of the results; see the paper for the precise formulation:

CRP I: For any family of sentences (one fixed formulation of each problem in a collection), there exists a narrow AI system that answers those sentences correctly (though it may fail on other formulations of the same problems).

CRP II: An AGI that attempts to answer all sentences (or formulations) consistently will necessarily produce incorrect answers infinitely often on these same problems.

CRP III: There exist problems for which an AGI (in the sense above) cannot determine, with more than 50% confidence, whether its own solution is correct.

CRP IV: There are cases where both the narrow AI and the AGI will be correct, but where neither will be able to provide a logical explanation for the answer.

CRP V: Given a computational budget M, there exists an AI that can either provide a correct answer with "I know" or abstain with "I don't know." The thinking time affects the proportion of questions where abstention occurs.
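To illustrate the flavour of CRP V, here is a minimal sketch (my own toy construction, not the paper's): an answerer that tries to certify a result within a step budget M and otherwise says "I don't know".

```python
from typing import Callable, Optional

def answer_or_abstain(verify: Callable[[int], Optional[str]], budget: int) -> str:
    """Run a verification procedure for at most `budget` steps.

    `verify(steps)` is assumed to return a certified answer string if it can
    verify one within `steps` computation steps, and None otherwise.
    """
    result = verify(budget)
    return f"I know: {result}" if result is not None else "I don't know"

# Toy verifier: decides whether n is prime by trial division, but only if the
# budget allows checking every candidate divisor.
def make_primality_verifier(n: int) -> Callable[[int], Optional[str]]:
    def verify(steps: int) -> Optional[str]:
        divisors = range(2, int(n ** 0.5) + 1)
        if len(divisors) > steps:
            return None  # not enough budget to certify an answer
        is_prime = all(n % d for d in divisors)
        return f"{n} is {'prime' if is_prime else 'composite'}"
    return verify

print(answer_or_abstain(make_primality_verifier(10_000_019), budget=10))     # I don't know
print(answer_or_abstain(make_primality_verifier(10_000_019), budget=5_000))  # I know: ...
```

A larger budget shrinks the set of questions on which the answerer abstains, which is the qualitative behaviour CRP V describes.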

The paper is not concerned with how the AI/AGI systems in the statements above are trained (or programmed): the conclusions are absolute and follow from the assumptions and the considered set of problems. (Incidentally, I am not sure whether the results apply beyond problems that involve numerical constants or otherwise have a computational/mathematical nature.)

The paper's conclusion is that trustworthy AI must implicitly compute an "I don't know" function to avoid the inevitable errors that come with consistent reasoning.

Paper 2: Statistical Inevitability of Hallucinations

The second paper considers AI models that perform density estimation (a class that includes, but is not limited to, modern auto-regressive large language models). The authors then use statistical learning theory to establish theoretical bounds on the rate of hallucinations, showing that it is non-zero when the model is trained by minimising the cross-entropy loss on a training dataset.
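For reference, the training objective in question is the standard cross-entropy (negative log-likelihood) objective for a density model $p_\theta$ over a data distribution $\mathcal{D}$; in the auto-regressive case it factorises over tokens (notation mine, not taken from the papers):

$$
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{x \sim \mathcal{D}}\big[\log p_\theta(x)\big]
\;=\; -\,\mathbb{E}_{x \sim \mathcal{D}}\Big[\sum_{t=1}^{|x|} \log p_\theta\big(x_t \mid x_{<t}\big)\Big].
$$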

Their analysis has two main components:

Pre-training inevitability: The authors establish a reduction from the density estimation problem to a binary classification problem, in which the probability the model assigns to a sample is used to classify the sample as valid or erroneous. Via this reduction, they derive a bound on the probability of generating erroneous samples; the bound depends on the misclassification error and on some constants, which the authors argue are small when training minimises the cross-entropy loss on a dataset. The conclusion is therefore that a pure base model will necessarily produce hallucinations, even with error-free training data. In practice, of course, the training data is not perfect, the global optimum of the loss is hard to reach, and there may be distribution shift between training and test setups, all of which is likely to make the problem worse.
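Schematically, and in my own notation rather than the paper's exact statement, the bound has the following shape:

$$
\Pr_{x \sim p_\theta}\!\big[x \text{ is erroneous}\big] \;\ge\; c_1 \cdot \mathrm{err}_{\mathrm{cls}} \;-\; c_2,
$$

where $\mathrm{err}_{\mathrm{cls}}$ is the misclassification error of the induced valid-vs-erroneous classifier and $c_1, c_2$ are constants; the authors' argument is that the subtracted terms are small for cross-entropy training on a dataset, so a hard classification problem forces a non-zero hallucination rate.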

These findings are not really surprising and come from a relatively simple application of well-understood learning theory concepts and results. However, their clear application to the case of density estimation models is illuminating.

Post-training persistence: Of course, existing state-of-the-art models are not pure base models: post-training techniques could alleviate hallucinations. Nevertheless, the authors argue that evaluation benchmarks reward overconfidence and therefore do not incentivise post-training to eradicate hallucinations. Indeed, current benchmarks and evaluation procedures reward models for guessing when uncertain, similar to how standardised tests reward students for attempting every question. The authors argue that fixing hallucinations requires modifying all benchmarks to explicitly penalise errors and reward appropriate abstention. In their view, it is not sufficient to have hallucination evaluations complementary to the primary ones. Moreover, the penalisation should be explicitly mentioned in the instructions, to make it clear to the model. As a consequence, models will be incentivised to become more calibrated and learn to abstain.
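As a toy illustration of the kind of scoring change being proposed (the numbers and the threshold rule below are my own example, not the paper's scheme): award +1 for a correct answer, 0 for abstaining, and subtract a penalty λ for a wrong answer. Answering then has positive expected score only when the model's confidence exceeds λ/(1+λ), so an expected-score-maximising model learns to abstain when uncertain.

```python
def score(answered: bool, correct: bool, wrong_penalty: float) -> float:
    """Toy benchmark scoring: +1 if correct, -wrong_penalty if wrong, 0 if abstaining."""
    if not answered:
        return 0.0
    return 1.0 if correct else -wrong_penalty

def should_answer(confidence: float, wrong_penalty: float) -> bool:
    """Answering beats abstaining in expectation iff
    confidence * 1 + (1 - confidence) * (-wrong_penalty) > 0,
    i.e. confidence > wrong_penalty / (1 + wrong_penalty)."""
    return confidence > wrong_penalty / (1.0 + wrong_penalty)

# With no penalty (current-style benchmarks), guessing is always worthwhile;
# with a penalty, abstention becomes the better strategy below a confidence threshold.
for penalty in (0.0, 1.0, 3.0):
    threshold = penalty / (1.0 + penalty)
    print(f"penalty={penalty}: answer only if confidence > {threshold:.2f}")
```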

Overall message

The joint takeaway message is: models that never fail (and answer consistently) must be able to say "I don't know" (paper 1), but this does not occur by default within the current training pipeline (paper 2). 

In particular, paper 1 says that an AI with an "I don't know" function that operates within a computational budget exists. The statistical paper suggests modifying benchmark scoring to explicitly reward abstention when uncertain, essentially training models to develop this capability.

Interestingly, these results do not rely on the current model architecture: paper 1 assumes that the AI system is a computable program, while paper 2 assumes only that it is trained to perform density estimation.

Appendix: side-by-side comparison of approaches and findings

The papers differ significantly in their theoretical frameworks:

| | Paper 1 (Consistent Reasoning Paradox) | Paper 2 (Statistical inevitability) |
|---|---|---|
| Modelling assumptions | An AGI implemented as a computable program (Turing machine) that passes the Turing test and answers consistently across reformulations. | A model that performs density estimation, trained by minimising cross-entropy on a dataset (auto-regressive LLMs as a special case). |
| Mathematical tools | Theoretical computer science: computability theory and Turing machines. | Statistical learning theory: a reduction from density estimation to binary classification, plus the resulting error bounds. |
| Scope of results | Absolute results, independent of how the system is trained, stated for problems of a computational/mathematical nature. | Lower bounds on the hallucination rate of base models, plus an incentive argument about post-training and benchmark design. |



