The Strange Science of Interpretability: Recent Papers and a Reading List for the Philosophy of Interpretability

This post introduces the field of Mechanistic Interpretability (MechInterp), which seeks to understand the internal mechanisms of neural networks. The authors argue that drawing on lessons from philosophy, neuroscience and the social sciences can provide a foundation for interpretability research on AI models. The post announces two papers on the Philosophy of Interpretability, covering a mathematical philosophy of explanations and a framework for evaluating explanations. The papers propose defining Mechanistic Interpretability as the practice of producing model-level, ontic, causal-mechanistic and falsifiable explanations, and introduce an Explanatory Virtues Framework for evaluating and improving explanations. This work aims to strengthen our ability to monitor, predict and steer AI systems, and the authors invite collaboration from experts in related fields.

✨ **The core goal of Mechanistic Interpretability (MechInterp) is to understand the internal workings of neural networks.** The field draws on theory and methods from neuroscience, the philosophy of science and the social sciences to build a firmer foundation for understanding AI models. The authors argue that this interdisciplinary perspective can make progress on AI interpretability more tractable.

📜 **The paper *A Mathematical Philosophy of Explanations in Mechanistic Interpretability* proposes the Explanatory View Hypothesis: that neural networks contain implicit explanations which can be extracted and understood.** It defines Explanatory Faithfulness to assess how well an explanation fits a model, and defines Mechanistic Interpretability (MI) as the practice of producing model-level, ontic, causal-mechanistic and falsifiable explanations, distinguishing MI from other interpretability paradigms and detailing its inherent limits. It also formulates the Principle of Explanatory Optimism as a necessary precondition for MI's success.

⚖️ **The paper *Evaluating Explanations* introduces an Explanatory Virtues Framework to systematically evaluate and improve explanations in MI.** The framework draws on four perspectives from the Philosophy of Science (Bayesian, Kuhnian, Deutschian and Nomological) to analyse the central question of what makes a good explanation. It finds that Compact Proofs consider many explanatory virtues and are hence a promising approach. It also points to future research directions, including clearly defining explanatory simplicity, focusing on unifying explanations, and deriving universal principles for neural networks.

🤝 **The project actively invites collaboration from interdisciplinary researchers,** including interpretability researchers, ML researchers, philosophers, neuroscientists, human-computer interaction researchers and social scientists, to jointly advance the field of AI interpretability. Contact details and collaboration opportunities are available via email, project applications and Slack.

Published on August 17, 2025 11:38 PM GMT

TL;DR: We recently released two papers about the Philosophy of (Mechanistic) Interpretability [here and here] and a reading list [here]. We believe that building a foundation for interpretability which leverages lessons from other disciplines (philosophy, neuroscience, social science) can help us understand AI models. We also believe this is a useful area for philosophers, neuroscientists and human-computer interaction (HCI) researchers to contribute to AI Safety. If you're interested in this project (especially if interested in contributing or collaborating) please reach out at koayon@gmail.com, apply to one of my SPAR projects, or message me on Slack if we share a channel.


Mechanistic Interpretability (MechInterp) is a field that aims to make progress on the problem of understanding the internal mechanisms of neural networks. Though MechInterp is a relatively new field, the shape of many of its problems and solutions has been studied in other contexts. For example, characterising and intervening on neural representations has a rich literature in (the Philosophy of) Neuroscience; understanding what makes causal-mechanistic explanations useful is a live topic in the Philosophy of Science; and how humans learn from explanations is an empirical question in the Social Sciences.

The Strange Science is a series of papers about the Philosophy of Interpretability and aims to adapt and develop theory from the Philosophies of Science, Neuroscience and Mind to help with practical problems in Mechanistic Interpretability. We recently released the first two papers in the series. 

 

The first paper is titled A Mathematical Philosophy of Explanations in Mechanistic Interpretability and has the following abstract:

Mechanistic Interpretability aims to understand neural networks through causal explanations. We argue for the Explanatory View Hypothesis: that Mechanistic Interpretability research is a principled approach to understanding models because neural networks contain implicit explanations which can be extracted and understood. We hence show that Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined. We propose a definition of Mechanistic Interpretability (MI) as the practice of producing Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations of neural networks, allowing us to distinguish MI from other interpretability paradigms and detail MI’s inherent limits. We formulate the Principle of Explanatory Optimism, a conjecture which we argue is a necessary precondition for the success of Mechanistic Interpretability.
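As a minimal illustration of the faithfulness idea (a toy sketch, not the paper's formal definition): one common way to operationalise how well an explanation "fits" a model is to check how often the behaviour the explanation predicts agrees with the model's actual behaviour on a set of inputs. The `faithfulness_score` helper and the length-based toy model below are hypothetical.

```python
# Minimal illustrative sketch, not the paper's formal definition of
# Explanatory Faithfulness: score an explanation by how often the behaviour
# it predicts agrees with the model's actual behaviour on a set of inputs.

from typing import Callable, Sequence


def faithfulness_score(
    model: Callable[[str], int],        # the full model's prediction function
    explanation: Callable[[str], int],  # behaviour predicted by the explanation
    inputs: Sequence[str],
) -> float:
    """Fraction of inputs on which the explanation predicts the model's output."""
    agreements = sum(model(x) == explanation(x) for x in inputs)
    return agreements / len(inputs)


if __name__ == "__main__":
    # Hypothetical example: the model labels long strings as 1, and the
    # candidate explanation claims exactly that mechanism.
    model = lambda x: int(len(x) > 5)
    explanation = lambda x: int(len(x) > 5)
    print(faithfulness_score(model, explanation, ["hi", "interpretability"]))  # 1.0
```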

The second paper is titled Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability and has the following abstract:

Mechanistic Interpretability (MI) aims to understand neural networks through causal explanations. Though MI has many explanation-generating methods, progress has been limited by the lack of a universal approach to evaluating explanations. Here we analyse the fundamental question “What makes a good explanation?” We introduce a pluralist Explanatory Virtues Framework drawing on four perspectives from the Philosophy of Science—the Bayesian, Kuhnian, Deutschian, and Nomological—to systematically evaluate and improve explanations in MI. We find that Compact Proofs consider many explanatory virtues and are hence a promising approach. Fruitful research directions implied by our framework include (1) clearly defining explanatory simplicity, (2) focusing on unifying explanations and (3) deriving universal principles for neural networks. Improved MI methods enhance our ability to monitor, predict, and steer AI systems.
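As a toy illustration of weighing explanatory virtues against each other (the candidate explanations, accuracies and penalty weight below are assumptions, and this is not the framework itself): one can prefer compact explanations that still account for the model's behaviour by trading off empirical accuracy against description length.

```python
# Toy sketch only, not the Explanatory Virtues Framework itself: trade off
# two virtues, empirical accuracy against simplicity (proxied here by
# description length), in the spirit of preferring compact explanations.
# The candidate explanations, accuracies and penalty weight are assumptions.

def virtue_score(accuracy: float, description_length: int, penalty: float = 1e-4) -> float:
    """Higher is better: reward accuracy, penalise long (non-compact) explanations."""
    return accuracy - penalty * description_length


candidates = {
    "memorised input-output lookup table": (1.00, 50_000),
    "one attention head copies the subject token": (0.95, 400),
}

best = max(candidates, key=lambda name: virtue_score(*candidates[name]))
print(best)  # -> "one attention head copies the subject token"
```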

We also released a reading list for the Philosophy of Interpretability here, which is open-source and accepting contributions.


We would be excited to hear from interpretability researchers, ML researchers, philosophers, neuroscientists, human-computer interaction researchers and social scientists who are interested in this topic. If you're interested in contributing or collaborating, please reach out at koayon@gmail.com, apply to one of my SPAR projects, or message me on Slack if we share a channel.


A huge thanks to Louis Jaburi, my co-conspirator for the first two papers! Also a massive thanks to everyone who read drafts of the paper including: Nora Belrose, Matthew Farr, Sean Trott, Elsie Jang, Evžen Wybitul, Andy Artiti, Owen Parsons, Kristaps Kallaste and Egg Syntax. We appreciate Daniel Filan and Joseph Miller’s helpful feedback. Thanks to Mel Andrews, Alexander Gietelink Oldenziel, Jacob Pfau, Michael Pearce, Samuel Schindler, Catherine Fist, Lee Sharkey, Jason Gross, Joseph Bloom, Nick Shea, Barnaby Crook, Eleni Angelou, Dashiell Stander, Geoffrey Irving and attendees of the ICML2024 MechInterp Social for useful conversations.


