Communications of the ACM - Artificial Intelligence, September 19, 09:16
Research on the Reasoning Abilities of Language Models: Exploring Beyond Statistical Matching

 

Recent research has taken a closer look at the reasoning abilities of large language models (LLMs), challenging the conventional view that they rely on statistical matching alone. Although LLMs frequently make errors such as "hallucinations" when processing information and generating content, studies suggest that their internal structures may build partial world models that improve accuracy. By probing models' internal states, scientists have found that when LLMs tackle multi-step reasoning tasks, the paths along which information propagates and the layered structure of the network are critical to performance. To make LLMs more reliable, researchers are exploring approaches such as coupling them with symbolic tools, chain-of-thought (CoT) prompting, and retrieval-augmented generation (RAG), in the hope of reducing errors and improving models' ability to express uncertainty, so that their output becomes more trustworthy.

💡 **Internal structure and reasoning ability:** Research suggests that large language models (LLMs) do not rely solely on statistical pattern matching; their internal structures may build some form of world model, which improves the accuracy of their reasoning and generated content to a degree. Despite problems such as "hallucination," these internal representations offer a new lens for understanding LLM behavior.

🚀 **Information propagation and multi-step reasoning:** When handling complex multi-step reasoning tasks, how efficiently information propagates between an LLM's layers is critical. Research finds that models tend to use their early layers to extract a single fact, which can limit their ability to combine multiple pieces of information in a single pass. Methods such as chain-of-thought (CoT) prompting improve this by decomposing the problem.

🛡️ **Strategies for improving reliability:** To overcome LLM "hallucinations," researchers are actively exploring several strategies. These include coupling models with symbolic tools (as in AlphaProof) and retrieval-augmented generation (RAG), which constrains the model to generate content using only the facts in a local knowledge base. Refining decoding strategies and probing model uncertainty more precisely are further key directions.

🤔 **Understanding and expressing uncertainty:** LLMs find it notably hard to express their own uncertainty; their probability patterns vary with how knowledge is encoded, which makes a universal "lie detector" difficult to build. Future research will need a deeper understanding of how models represent and communicate confidence in their outputs, in order to better manage users' expectations and trust.

If you want a job done well, you are probably better off not using a language model to do it. Thanks to the internal connections they create from terabytes of data ingested during pretraining, they produce results that can seem like rudimentary reasoning. Yet even a simple change to a query, such as switching the order of items, will cause the AI to provide different, sometimes wildly incorrect answers.

“People have called it ‘jagged intelligence’: it works when it works,” said Subbarao Kambhampati, professor of computer science at Arizona State University’s Fulton School of Engineering.

Yet there are hints that language models are, at least sometimes, doing more than basic statistical matching. The internal structures that language models build seem to improve their overall accuracy, although they do not prevent them from making egregious mistakes all too often. Ben Levinstein, associate professor of Philosophy at the University of Illinois at Urbana-Champaign, is among those who take the view that these AI systems are creating reasonable world models. These models help boost performance even though the pretraining regime is not designed to create them; it is only designed to find the most probable sequence of words generated one by one in response to a prompt.

“They aren’t master world models. They are abstracted hypotheses about how some components of the world work that help the language model decide what to say,” Levinstein explained, noting that current research points to these basic models working alongside shallower heuristics that arise from the statistical nature of the AI.

Harvard University postdoctoral fellow Keyon Vafa and coworkers at the Massachusetts Institute of Technology (MIT) and Cornell University trained a language model on New York taxi journeys to see if it would build an internal model of the streets of Manhattan. To some extent it did, delivering usable route plans. But a graph created from an analysis of the language model’s internal state showed flyovers and direct connections between streets that do not exist in the real world. These led the AI to “hallucinate” impossible routes when the prompt included closures and diversions.

Model capacity and training focus both seem to play a role in how well a language model can build logical representations. MIT Ph.D. student Charles Jin used the code from simple robot-control programs written in Karel, together with their inputs and outputs, to train a language model. The results paralleled the evolution of language models themselves. Tests after the early training steps showed it doing little more than babbling; it spat out sequences of random instructions that did not work.

As training reached the midway point, the model seemed to acquire the correct syntax for the language, but the model still failed to generate programs that controlled the virtual robot properly. Then, about three-quarters of the way through training, the model seemed to build a model of the language semantics good enough to generate correct programs in response to more than 90% of prompts. Even so, the question remains whether language models are doing more than implementing huge lookup tables.

To settle this question, researchers have come up with ways to probe changes in language models’ internal “state of mind.” Because of the huge number of weights that can generate each token the model outputs in its answers, these probes themselves rely on machine learning techniques to reduce the information to a set of human-readable states, such as which direction a simulated robot points in the case of Jin’s work. This training can lead to false positives, because the probe may learn the task by itself rather than showing the language model’s operation. Jin took the approach of tampering with the semantics of Karel to see if the probe changed its behavior to match. It did not, implying the language model was keeping track of the robot’s position and direction as it moved according to the program statements.
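In outline, such a probe is itself a small supervised model trained on the network's hidden activations. The sketch below is a minimal illustration of that recipe, using synthetic vectors in place of real activations and scikit-learn's logistic regression (both assumptions made for this example, not Jin's actual setup). The shuffled-label control at the end is a simpler stand-in for the kind of check, like Jin's semantics-tampering test, that guards against the probe solving the task on its own.

```python
# Minimal sketch of a linear probe on hidden states. Synthetic data stands in
# for real LLM activations; this illustrates the recipe, not Jin's actual code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend hidden states: 4,000 vectors of dimension 256, each labeled with the
# simulated robot's heading (0=N, 1=E, 2=S, 3=W). We plant a linear signal so
# the heading is largely recoverable from the activations.
n, d = 4000, 256
headings = rng.integers(0, 4, size=n)
class_directions = rng.normal(size=(4, d))
hidden_states = np.eye(4)[headings] @ class_directions + rng.normal(scale=8.0, size=(n, d))

X_tr, X_te, y_tr, y_te = train_test_split(
    hidden_states, headings, test_size=0.25, random_state=0)

probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))

# Control against the probe learning the task by itself: train on shuffled
# labels. If accuracy stays near chance (0.25), the signal lives in the states.
control = LogisticRegression(max_iter=2000).fit(X_tr, rng.permutation(y_tr))
print("control accuracy:", control.score(X_te, y_te))
```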

More evidence of the AI going beyond statistics to implement some basic reasoning comes from attempts to identify the path of information through the stack of feedforward layers used by all language models. Scientists in a group led by Tel Aviv University assistant professor of Computer Science Mor Geva looked at how signals propagate in a series of logical hops through the model’s stack of neuronal layers. They tested it using prompts like “find the mother of the singer of the hit song Superstition.”

The probes showed language models can readily find Stevie Wonder’s name by the end of the first hop. The process fails on the second hop when the information from the first hop does not propagate quickly enough, so the wrong answer is delivered at the output. With more layers, the chances of success improve, but Geva’s group found that bigger models tend to use the first half of their layer stack to extract a single fact, no matter how many layers they have in total. “There is nothing that pushes the model to do it with fewer layers,” she said, which seems to limit how many connections a model can make in a single pass through its internal, latent space.
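The mechanics of this kind of layer-by-layer probing can be sketched with a simplified "logit lens": project each layer's hidden state through the model's output head and watch where a target token rises to the top. The snippet below is only an illustration of that mechanism, not Geva's methodology; GPT-2 is used purely because it is small and public, and it is far too weak to answer the two-hop question itself.

```python
# Simplified logit-lens probe: where does the first-hop fact surface?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("The singer of the hit song Superstition is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# First sub-token of the name we hope to see emerge.
target_id = tok(" Stevie", add_special_tokens=False)["input_ids"][0]

# Project each layer's last-position hidden state through the unembedding
# matrix and track the rank of the target token (skipping the final layer
# norm, as the simplest form of the logit lens does).
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(h[0, -1])                      # (vocab,)
    rank = int((logits > logits[target_id]).sum().item()) + 1
    print(f"layer {layer:2d}: rank of ' Stevie' = {rank}")
```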

One way to give the models more time to come up with the right result is to use chain-of-thought (CoT) prompting. This decomposes multi-hop problems into a sequence of simpler requests that the language model has a better chance of answering correctly. Traditionally, this kind of prompting has been a manual process. OpenAI’s “Strawberry,” also known as OpenAI o1, instead uses a second language model to decompose a request into a sequence of CoT prompts. This model responds to failures in intermediate steps by backtracking and generating alternative paths to solving a problem.
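As a concrete illustration of the decomposition, the sketch below splits the two-hop question into two single-hop prompts and feeds the first answer into the second. The `ask_model` function and its canned answers are hypothetical stand-ins for a real LLM call; they are not OpenAI's pipeline.

```python
# Sketch of decomposing a two-hop question into chain-of-thought steps.
def ask_model(prompt: str) -> str:
    """Placeholder for an LLM completion call; canned answers for the demo."""
    canned = {
        "Who sang the hit song Superstition? Answer with a name only.":
            "Stevie Wonder",
        "Who is Stevie Wonder's mother? Answer with a name only.":
            "Lula Mae Hardaway",
    }
    return canned.get(prompt, "I don't know")

# One-shot phrasing that models often get wrong in a single pass:
# "Who is the mother of the singer of the hit song Superstition?"
# Decomposed version: each hop becomes its own prompt, and the intermediate
# answer is spliced into the next step.
hop1 = ask_model("Who sang the hit song Superstition? Answer with a name only.")
hop2 = ask_model(f"Who is {hop1}'s mother? Answer with a name only.")
print(hop2)  # -> Lula Mae Hardaway
```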

Kambhampati argues that the language models involved in a system like OpenAI o1 cannot provide guarantees of correctness. He sees the combination of language models with symbolic tools as a way of delivering more reliable results. An example of this “LLM-Modulo” architecture, named after a technique used by formal satisfiability solvers, is Google’s AlphaProof.

To solve Math Olympiad problems, the developers trained AlphaProof to write proofs in the formal language Lean. The verification engine designed for Lean then checked the solution, forcing the model to generate new attempts on every rejection until a working proof emerged. Kambhampati sees similar systems being used for more general program synthesis and in planning as, again, these applications can harness formal verification tools and solvers. However, the additional tools would have to be tuned to the target application, and the architecture is not a good fit for chatbots.
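The control flow behind such generate-and-verify systems fits in a short loop. The sketch below is deliberately a toy: a random "generator" stands in for the language model and a simple arithmetic check stands in for a formal verifier such as Lean's proof checker. It is not the AlphaProof code, only the shape of the architecture.

```python
# Toy generate-and-verify loop in the spirit of LLM-Modulo / AlphaProof.
import random

def propose_candidate(feedback: list[str]) -> int:
    # Stand-in for an LLM draft. This toy ignores the feedback; a real
    # system would condition the next attempt on the verifier's critique.
    return random.randint(1, 100)

def verify(candidate: int) -> tuple[bool, str]:
    # Stand-in for a sound external checker: accepts only the square root of 2025.
    if candidate * candidate == 2025:
        return True, "ok"
    return False, f"{candidate}^2 = {candidate * candidate}, not 2025"

def solve(max_attempts: int = 10_000) -> int | None:
    feedback: list[str] = []
    for _ in range(max_attempts):
        candidate = propose_candidate(feedback)
        ok, reason = verify(candidate)
        if ok:
            return candidate        # accepted only because the checker says so
        feedback.append(reason)     # backtrack and retry with the critique
    return None                     # never return an unverified answer

print(solve())  # -> 45
```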

For chatbots, retrieval augmented generation (RAG) can act as an external aid and is already in widespread commercial use. RAG works on the assumption that hallucinations will be less common if the LLM is constrained to generate text using only the facts stored in a local knowledge base, rather than relying on the data it ingested during training.

RAG remains a long way from preventing language models from delivering fictional answers. The interface to the knowledge base is very similar to that used to interpret and store text learned during pretraining: a vector into a huge multidimensional space. There is no guarantee of extracting the right data if several elements sit close to each other in that vector space. To deal with that, some researchers are trying to use a language model’s internal signals coupled with external tools to check consistency between the model’s decisions and what it retrieves from the knowledge base.
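In outline, the retrieval step is nearest-neighbour search in a vector space, as in the sketch below. The hash-based "embedding" here is a toy stand-in for a real embedding model (an assumption for illustration only); the point is that two passages can land very close together in that space, which is exactly the ambiguity described above.

```python
# Bare-bones retrieval step of a RAG pipeline with a toy embedding.
import re
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Toy hashed bag-of-words embedding, normalised to unit length.
    v = np.zeros(dim)
    for word in re.findall(r"[a-z]+", text.lower()):
        v[hash(word) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

knowledge_base = [
    "Stevie Wonder released Superstition in 1972.",
    "Stevie Ray Vaughan covered Superstition live in 1986.",
    "Lula Mae Hardaway was Stevie Wonder's mother.",
]
doc_vectors = np.stack([embed(d) for d in knowledge_base])

query = "Who sang Superstition?"
scores = doc_vectors @ embed(query)          # cosine similarity (unit vectors)
best = int(np.argmax(scores))

print(sorted(zip(scores.round(3), knowledge_base), reverse=True))
print("retrieved:", knowledge_base[best])
# The two Superstition passages typically score close together, so the
# generator can be handed the nearby but wrong fact.
```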

Design decisions seem to exacerbate the problem, which may point to tactics for reducing errors. By selecting only the most probable token at the end of each cycle, the commonly used greedy-decoding method may make the situation worse compared to a system that looks at a wider range of options. One way to spot such mismatches between the chosen token and the rest of the distribution is to probe the language model’s state at runtime.
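The contrast can be seen in a toy decoding step like the one below, where greedy decoding commits to the top token even when the runner-up is almost as probable, while nucleus (top-p) sampling keeps a wider slice of the distribution in play. The next-token distribution is invented for illustration and is not taken from any real model.

```python
# Greedy decoding versus nucleus (top-p) sampling over a made-up distribution.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["tokenA", "tokenB", "tokenC", "tokenD", "tokenE"]
next_token_probs = np.array([0.31, 0.29, 0.22, 0.10, 0.08])

# Greedy: always commit to the single most probable token, even though the
# margin over the runner-up here is only 0.02.
greedy_choice = vocab[int(np.argmax(next_token_probs))]

# Nucleus sampling: keep the smallest set of tokens whose probabilities sum to
# at least p, renormalise, and sample from that wider set.
p = 0.8
order = np.argsort(next_token_probs)[::-1]
cutoff = int(np.searchsorted(np.cumsum(next_token_probs[order]), p)) + 1
keep = order[:cutoff]
nucleus = next_token_probs[keep] / next_token_probs[keep].sum()
sampled_choice = vocab[int(rng.choice(keep, p=nucleus))]

print("greedy:", greedy_choice, "| nucleus sample:", sampled_choice)
```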

Ph.D. student Hadas Orgad in Yonatan Belinkov’s group at Israel’s Technion probed the way internal states changed as the model generated factual answers. They found significant changes in the calculated probabilities of candidate words when the model lacked confidence in an answer, and they could use the probe to recover the correct words and terms. The bad news for hallucination mitigation in general is that work by Orgad and others shows the probability patterns change depending on how the model encodes its knowledge. That will make it hard to build a white-box lie detector for all situations.
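A crude, output-level version of such a confidence signal can be read directly off a next-token distribution, as in the sketch below: a flat distribution with a small margin between the top candidates is a warning sign, a peaked one suggests confidence. This is far simpler than the internal-state probes Orgad's group used and is meant only to illustrate the idea.

```python
# Rough uncertainty signals computed from a (made-up) next-token distribution.
import numpy as np

def uncertainty_report(probs: np.ndarray) -> dict:
    probs = probs / probs.sum()
    top2 = np.sort(probs)[::-1][:2]
    return {
        "entropy": float(-(probs * np.log(probs + 1e-12)).sum()),
        "top1": float(top2[0]),
        "margin": float(top2[0] - top2[1]),   # small margin = shaky answer
    }

confident = np.array([0.92, 0.03, 0.02, 0.02, 0.01])
uncertain = np.array([0.24, 0.22, 0.20, 0.18, 0.16])

print("confident:", uncertainty_report(confident))
print("uncertain:", uncertainty_report(uncertain))
```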

“It is really hard for models to express their uncertainty about an output accurately,” Geva said. At the same time, the differences in representation may provide avenues to use more fine-grained approaches to detecting errors. “It will be interesting to think about different classes of hallucination,” she added.

There may be a deeper issue: whether probes are exploring the attributes computer scientists believe they are. The instruction tuning that follows pretraining may exacerbate the problem. Human raters may inadvertently reward agreeable responses rather than truthful ones, to the point where what a model appears to hold as fact no longer corresponds to what humans understand to be facts.

“Does it make sense to say that ChatGPT thought it was true? Or that it said it because it was what I thought I wanted to hear? And what do these questions mean? These aren’t even human minds,” Levinstein said.

Levinstein takes the view that work in this area may profit from greater interactions between computer scientists and philosophers who look more closely at what constitutes truth and falsity in the data language models store. That may, in turn, yield better signals that language models can use to indicate when they risk making a mistake, so users can decide whether to trust their answers or not.

Further Reading

  • Biran, E., Gottesman, D., Yang, S., Geva, M., and Globerson, A.
    Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 14113-14130. arXiv pre-print, arxiv.org/abs/2406.12775
  • Jin, C. and Rinard, M.
    Emergent Representations of Program Semantics in Language Models Trained on Programs. Proceedings of the 41st International Conference on Machine Learning. PMLR 235 (2024). arXiv pre-print, arxiv.org/abs/2305.11169
  • Kambhampati, S., Valmeekam, K., Guan, L., Verma, M., Stechly, K., Bhambri, S., Saldyt, L., and Murthy, A.
    Position: LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks. Proceedings of the 41st International Conference on Machine Learning. PMLR 235 (2024). arXiv pre-print, arxiv.org/abs/2402.01817
  • Levinstein, B.A. and Herrmann, D.A.
    Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks. Philosophical Studies (2024). arXiv pre-print, arxiv.org/abs/2307.00175
  • Orgad, H., Toker, M., Gekhman, Z., Reichart, R., Szpektor, I., Kotek, H., and Belinkov, Y.
    LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations. arXiv pre-print, arxiv.org/abs/2410.02707

 
