Newsroom Anthropic, September 13
Golden Gate Claude Goes Live for a Trial Run

 

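Golden Gate Claude is a special research model that was open to the public for 24 hours. The model strengthens the activation of the neurons associated with the Golden Gate Bridge, so most of its answers relate to the bridge. For example, asked for advice on spending money it recommends crossing the bridge, and asked to write a story it builds the story around the bridge. The experiment is meant to show how large language models work internally and that feature activations can be interpreted, and to inform research on AI model safety.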

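🌉 Golden Gate Claude is a research experiment launched on claude.ai. A specific technique strengthens the model's response to concepts related to the Golden Gate Bridge, so it mentions the bridge frequently when answering all kinds of questions.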

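🧠 The model relies on a specific combination of neurons in the neural network that activates when it encounters text or images of the Golden Gate Bridge and triggers bridge-related answers, demonstrating that feature activations inside an AI model can be interpreted.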

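📚 The research team explains the phenomenon in a paper, describing it as a precise, surgical change to the model's internal activations rather than traditional fine-tuning or verbal instruction, which offers a new way to understand how large language models operate.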

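🔒 Beyond demonstrating interpretability, the experiment explores whether adjusting the activation strength of safety-related features (such as dangerous code or criminal activity) could make AI models safer, pointing to a direction for future research.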

UPDATE: Golden Gate Claude was online for a 24-hour period as a research demo and is no longer available. If you'd like to find out more about our research on interpretability and the activation of features within Claude, please see this post or our full research paper.

On Tuesday, we released a major new research paper on interpreting large language models, in which we began to map out the inner workings of our AI model, Claude 3 Sonnet. In the “mind” of Claude, we found millions of concepts that activate when the model reads relevant text or sees relevant images, which we call “features”.

One of those was the concept of the Golden Gate Bridge. We found that there’s a specific combination of neurons in Claude’s neural network that activates when it encounters a mention (or a picture) of this most famous San Francisco landmark.
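As a rough illustration of what "a specific combination of neurons" can mean, the sketch below treats a feature as a direction in the space of neuron activations and reads out its strength as a clipped dot product. The four-neuron vectors, the numbers, and the function name `feature_activation` are all hypothetical; this is a toy picture under those assumptions, not the method used in the research paper.

```python
# Toy sketch: a "feature" as a weighted combination of neurons.
# All numbers and names here are made up for illustration.

def feature_activation(neurons, direction, bias=0.0):
    """Return how strongly a feature fires for one activation vector.

    The score is the dot product of the neuron activations with the
    feature's direction, plus a bias, clipped at zero so the feature
    is either off (0.0) or on (a positive value).
    """
    score = sum(n * d for n, d in zip(neurons, direction)) + bias
    return max(0.0, score)

# Hypothetical direction for a "Golden Gate Bridge" feature over 4 neurons.
bridge_direction = [0.5, -0.5, 0.5, 0.5]

# Hypothetical neuron activations for two different inputs.
text_about_bridge = [2.0, 0.0, 1.0, 1.0]
text_about_cooking = [0.0, 2.0, 0.0, 1.0]

on = feature_activation(text_about_bridge, bridge_direction)
off = feature_activation(text_about_cooking, bridge_direction)
print(on, off)
```

In this toy setup the feature fires for the first input and stays at zero for the second, which is the sense in which a feature "activates" on relevant content.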

Not only can we identify these features, we can tune the strength of their activation up or down, and identify corresponding changes in Claude’s behavior.

And as we explain in our research paper, when we turn up the strength of the “Golden Gate Bridge” feature, Claude’s responses begin to focus on the Golden Gate Bridge. Its replies to most queries start to mention the Golden Gate Bridge, even if it’s not directly relevant.
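Turning a feature's strength up or down can be pictured as moving the model's internal activation vector along that feature's direction. The sketch below is a minimal version of that idea, assuming the feature is a unit-length direction; the helper names `dot` and `steer`, the vectors, and the target values are hypothetical and are not taken from the paper.

```python
# Toy sketch of feature steering: push an activation vector along a
# feature's direction until the feature reads a chosen target value.
# All names and numbers are hypothetical.

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def steer(activations, direction, target):
    """Return a copy of `activations` whose projection on `direction` is `target`.

    `direction` is assumed to be a unit vector, so adding
    (target - current) * direction changes the projection by exactly
    that amount and leaves everything orthogonal to it untouched.
    """
    current = dot(activations, direction)
    shift = target - current
    return [a + shift * d for a, d in zip(activations, direction)]

bridge_direction = [0.5, 0.5, 0.5, 0.5]   # unit length: 4 * 0.25 = 1
activations = [1.0, 0.0, 2.0, 1.0]        # projection on the feature = 2.0

boosted = steer(activations, bridge_direction, target=20.0)  # turned up
muted = steer(activations, bridge_direction, target=0.0)     # turned down
print(boosted, muted)
```

The same function handles both directions of the dial: a large target corresponds to the "turned up" case described above, and a target of zero corresponds to suppressing the feature.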

If you ask this “Golden Gate Claude” how to spend $10, it will recommend using it to drive across the Golden Gate Bridge and pay the toll. If you ask it to write a love story, it’ll tell you a tale of a car who can’t wait to cross its beloved bridge on a foggy day. If you ask it what it imagines it looks like, it will likely tell you that it imagines it looks like the Golden Gate Bridge.

For a short time, we’re making this model available for everyone to interact with. You can talk to “Golden Gate Claude” on claude.ai (just click the Golden Gate logo on the right-hand side). Please bear in mind that this is a research demonstration only, and that this particular model might behave in some unexpected—even jarring—ways.

Our goal is to let people see the impact our interpretability work can have. The fact that we can find and alter these features within Claude makes us more confident that we’re beginning to understand how large language models really work. This isn’t a matter of asking the model verbally to do some play-acting, or of adding a new “system prompt” that attaches extra text to every input, telling Claude to pretend it’s a bridge. Nor is it traditional “fine-tuning,” where we use extra training data to create a new black box that tweaks the behavior of the old black box. This is a precise, surgical change to some of the most basic aspects of the model’s internal activations.
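The distinction drawn here can be made concrete with a toy model. In the sketch below, the prompt and the weights are identical in both runs, and only an intermediate activation is edited. The one-layer model, the `TOPICS` labels, and the `boost_bridge` edit are hypothetical stand-ins chosen for illustration, not a description of how Claude is implemented.

```python
# Toy contrast: the prompt and the weights stay fixed; only an internal
# activation is edited. Everything here is hypothetical.

def forward(inputs, weights, edit=None):
    """Run a one-hidden-layer toy model.

    hidden[j] = sum over i of inputs[i] * weights[i][j]
    `edit`, if given, is applied to the hidden activations before the
    output is chosen. The output is the index of the largest hidden unit.
    """
    n_hidden = len(weights[0])
    hidden = [sum(inputs[i] * weights[i][j] for i in range(len(inputs)))
              for j in range(n_hidden)]
    if edit is not None:
        hidden = edit(hidden)
    return max(range(n_hidden), key=lambda j: hidden[j])

TOPICS = ["money", "romance", "bridge"]
weights = [[1.0, 0.0, 0.1],
           [0.0, 1.0, 0.1]]
prompt = [1.0, 0.0]   # stands in for a question about money

def boost_bridge(hidden):
    """Add a large constant to the hypothetical 'bridge' unit only."""
    out = list(hidden)
    out[2] += 10.0
    return out

plain = TOPICS[forward(prompt, weights)]
steered = TOPICS[forward(prompt, weights, edit=boost_bridge)]
print(plain, steered)
```

No extra text is attached to the input and no retraining touches the weights; the change in the answer comes entirely from the edited activation.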

As we describe in our paper, we can use these same techniques to change the strength of safety-related features—like those related to dangerous computer code, criminal activity, or deception. With further research, we believe this work could help make AI models safer.


