MarkTechPost@AI · October 26, 2024
Mechanistic Unlearning: A New AI Method that Uses Mechanistic Interpretability to Localize and Edit Specific Model Components Associated with Factual Recall Mechanisms

🤔 **Mechanistic unlearning** uses mechanistic interpretability to localize and edit the specific model components associated with factual recall, aiming to make edits more robust and reduce unintended side effects.

🧐 **Targeting specific model components** By localizing the parts of models such as Gemma-7B and Gemma-2-9B that are responsible for fact retrieval, a gradient-based strategy improves efficiency and effectiveness.

💪 **Reducing hidden memory** The method suppresses hidden (latent) memory better than alternatives, requires only a small number of model changes, and generalizes across datasets.

🛡️ **Resisting relearning** By targeting these components, the method ensures that the unwanted knowledge is effectively unlearned and resists relearning attempts.

📊 **Experimental results** The researchers show that this approach yields more robust edits across different input/output formats and reduces the presence of latent knowledge compared with existing methods.

Large language models (LLMs) sometimes acquire knowledge we do not want them to have. Finding ways to remove or adjust that knowledge is important for keeping AI accurate, precise, and under control. However, editing or "unlearning" specific knowledge in these models is difficult: the usual methods tend to disturb other information or general capabilities, and the changes they make do not always persist.

In recent work, researchers have used methods such as causal tracing to locate the components responsible for a given output, while faster techniques like attribution patching pinpoint important parts more quickly. Editing and unlearning methods try to remove or alter specific information in a model to keep it safe and fair, but models can relearn or still surface the unwanted information. Current knowledge-editing and unlearning methods often degrade other capabilities and lack robustness: slight variations in prompts can still elicit the original knowledge, and even with safety measures in place, models may produce harmful responses to certain prompts, showing how hard it remains to fully control their behavior.
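To make the attribution-patching idea above concrete, here is a minimal, hedged sketch: it uses a toy stack of residual MLP blocks as a stand-in for transformer layers and linearly approximates, per layer, how much patching the clean activation into a corrupted run would change a metric of interest. The toy model, the metric, and all variable names are illustrative assumptions, not the setup from the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class Block(nn.Module):
    """One residual MLP block, standing in for a transformer layer."""
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.ff(x)

class ToyModel(nn.Module):
    def __init__(self, d=16, n_layers=4):
        super().__init__()
        self.blocks = nn.ModuleList([Block(d) for _ in range(n_layers)])
        self.head = nn.Linear(d, 2)

    def forward(self, x, cache=None):
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if cache is not None:
                cache[i] = x  # keep the graph so we can differentiate w.r.t. these
        return self.head(x)

model = ToyModel()

def metric(logits):
    return logits[:, 0].mean()  # stand-in for "does the model recall the fact?"

clean_x, corrupt_x = torch.randn(8, 16), torch.randn(8, 16)

# Clean run: just cache the per-layer activations.
clean_cache = {}
with torch.no_grad():
    model(clean_x, cache=clean_cache)

# Corrupted run: cache activations and get d(metric)/d(activation) at each layer.
corrupt_cache = {}
m = metric(model(corrupt_x, cache=corrupt_cache))
grads = torch.autograd.grad(m, list(corrupt_cache.values()))

# Attribution score per layer: (clean - corrupt) · grad, a first-order estimate
# of how much patching the clean activation into that layer would move the metric.
for i, g in enumerate(grads):
    score = ((clean_cache[i] - corrupt_cache[i]) * g).sum().item()
    print(f"layer {i}: attribution {score:+.4f}")
```

The appeal of this approximation is cost: one clean forward pass plus one corrupted forward-and-backward pass scores every cached location at once, instead of one full activation-patching run per location.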

A team of researchers from the University of Maryland, Georgia Institute of Technology, University of Bristol, and Google DeepMind proposes mechanistic unlearning, a new AI method that uses mechanistic interpretability to localize and edit the specific model components associated with factual recall mechanisms. The goal is to make edits more robust and to reduce unintended side effects.

The study examines methods for removing information from AI models and finds that many fail once the prompt or output format shifts. By targeting the specific parts of models such as Gemma-7B and Gemma-2-9B that are responsible for fact retrieval, a gradient-based approach proves more effective and efficient. It suppresses hidden (latent) memory better than alternative methods, requires changing only a small fraction of the model, and generalizes across diverse data. Because the edit is aimed at these components, the unwanted knowledge is effectively unlearned and resists relearning attempts. The researchers demonstrate that this approach yields more robust edits across different input/output formats and leaves less latent knowledge behind than existing methods.
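The following sketch conveys the general flavor of localized, gradient-based editing described above, not the paper's exact procedure: a toy classifier stands in for the language model, a hypothetical "localized" layer plays the role of the fact-lookup component, and the objective pairs an edit term on the forget set with a retain term on unrelated data. All names and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy classifier standing in for an LLM; a "fact" here is just a class label.
model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),   # pretend localization flagged this layer
    nn.Linear(32, 4),
)

# Only the parameters of the localized component are handed to the optimizer.
localized_params = list(model[2].parameters())
opt = torch.optim.Adam(localized_params, lr=1e-3)

forget_x = torch.randn(8, 16)            # prompts whose answer we want to edit
new_y = torch.randint(0, 4, (8,))        # counterfactual targets (e.g. "golf")
retain_x = torch.randn(64, 16)           # unrelated data whose behaviour we keep
retain_y = torch.randint(0, 4, (64,))

for step in range(200):
    opt.zero_grad()
    forget_loss = F.cross_entropy(model(forget_x), new_y)      # steer the edit
    retain_loss = F.cross_entropy(model(retain_x), retain_y)   # preserve the rest
    (forget_loss + retain_loss).backward()
    opt.step()  # gradients flow everywhere, but only model[2] is updated
```

Restricting the optimizer to the localized parameters is what keeps the edit small; the retain term is a simple stand-in for whatever side-effect control a real setup would use.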

The researchers tested unlearning and editing methods on two datasets: Sports Facts and CounterFact. In Sports Facts, they removed the model's associations for basketball athletes and changed the sport of 16 athletes to golf; in CounterFact, they swapped the correct answers for 16 facts with incorrect ones. They compared two localization techniques: output tracing (which includes causal tracing and attribution patching) and fact-lookup localization. Manual localization led to better accuracy and robustness, especially on multiple-choice formats, and the manually, interpretability-guided localized edits were also more resistant to relearning. Analysis of the underlying knowledge suggested that effective editing makes it harder to recover the previous information from the model's layers. Weight-masking tests showed that optimization-based methods mostly change parameters involved in extracting facts rather than those used to look facts up, underscoring that targeting the fact-lookup mechanism is what yields robust edits with fewer unintended side effects.
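The paper's weight-masking analysis is more involved, but a much simpler diagnostic in the same spirit is to compare parameter change norms between the original and edited checkpoints to see which components an edit actually touched. The sketch below fakes the "edited" model purely for illustration; every name here is an assumption rather than the paper's code.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

original = nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
                         nn.Linear(32, 32), nn.ReLU(),
                         nn.Linear(32, 4))
edited = copy.deepcopy(original)

# Pretend an unlearning run nudged the middle layer; in practice `edited` would
# be the checkpoint produced by the editing method.
with torch.no_grad():
    edited[2].weight += 0.05 * torch.randn_like(edited[2].weight)

# Relative change per parameter tensor: large values show where the edit landed.
for (name, p0), (_, p1) in zip(original.named_parameters(), edited.named_parameters()):
    rel = ((p1 - p0).norm() / (p0.norm() + 1e-8)).item()
    print(f"{name:12s} relative change {rel:.4f}")
```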

In conclusion, this paper presents a promising solution to the problem of robust knowledge unlearning in LLMs: using mechanistic interpretability to precisely target and edit specific model components enhances both the effectiveness and the robustness of the unlearning process. The authors also suggest unlearning/editing as a potential testbed for different interpretability methods, which might sidestep the inherent lack of ground truth in interpretability.


Check out the Paper. All credit for this research goes to the researchers of this project.
