Behavioral Self-Awareness in LLMs: Characteristics and Mechanisms

cs.AI updates on arXiv.org, November 10, 13:12

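This article examines behavioral self-awareness in large language models (LLMs), characterizing the conditions and mechanisms under which it emerges and showing that it is localized to specific domains.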

arXiv:2511.04875v1 Announce Type: cross Abstract: Recent studies have revealed that LLMs can exhibit behavioral self-awareness: the ability to accurately describe or predict their own learned behaviors without explicit supervision. This capability raises safety concerns as it may, for example, allow models to better conceal their true abilities during evaluation. We attempt to characterize the minimal conditions under which such self-awareness emerges, and the mechanistic processes through which it manifests. Through controlled finetuning experiments on instruction-tuned LLMs with low-rank adapters (LoRA), we find: (1) that self-awareness can be reliably induced using a single rank-1 LoRA adapter; (2) that the learned self-aware behavior can be largely captured by a single steering vector in activation space, recovering nearly all of the fine-tune's behavioral effect; and (3) that self-awareness is non-universal and domain-localized, with independent representations across tasks. Together, these findings suggest that behavioral self-awareness emerges as a domain-specific, linear feature that can be easily induced and modulated.
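Findings (1) and (2) in the abstract are linked by a simple piece of linear algebra, sketched below in plain Python. This is an illustration under stated assumptions, not the authors' code: the matrices, vectors, and function names (`rank1_lora_forward`, `steer`) are hypothetical. A rank-1 LoRA update adds `outer(b, a)` to a frozen weight, so its effect on any input is a scalar multiple of the single direction `b`. That is one plausible reason a single steering vector in activation space could reproduce much of the adapter's effect.

```python
# Illustrative sketch only: toy linear algebra, not the paper's implementation.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def rank1_lora_forward(W, a, b, x, scale=1.0):
    """Forward pass through the adapted weight W + scale * outer(b, a)."""
    base = matvec(W, x)
    coeff = scale * dot(a, x)  # scalar: how strongly this input engages the adapter
    return [h + coeff * bi for h, bi in zip(base, b)]

def steer(h, v, strength):
    """Add a steering vector v, scaled by strength, to activation h."""
    return [hi + strength * vi for hi, vi in zip(h, v)]

# Hypothetical toy values (assumptions, chosen for easy arithmetic).
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # frozen 3x2 weight
a = [0.5, -1.0]                           # LoRA input-side vector
b = [2.0, 0.0, 1.0]                       # LoRA output-side vector
x = [4.0, 1.0]                            # one input activation

adapted = rank1_lora_forward(W, a, b, x)
steered = steer(matvec(W, x), b, dot(a, x))
```

For this input, `adapted` and `steered` coincide because the steering strength was set to `dot(a, x)`. A fixed-strength steering vector, as described in the abstract, would only approximate the adapter, since the adapter's coefficient varies with the input.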
