Hot Topics
Articles related to "refusal mechanism"
ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack
cs.AI updates on arXiv.org 2025-10-01T05:58:57.000000Z
MEUV: Achieving Fine-Grained Capability Activation in Large Language Models via Mutually Exclusive Unlock Vectors
cs.AI updates on arXiv.org 2025-09-17T04:57:12.000000Z
Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal
cs.AI updates on arXiv.org 2025-09-15T08:17:18.000000Z
Energy Landscapes Enable Reliable Abstention in Retrieval-Augmented Large Language Models for Healthcare
cs.AI updates on arXiv.org 2025-09-08T04:51:38.000000Z
Linearly Decoding Refused Knowledge in Aligned Language Models
cs.AI updates on arXiv.org 2025-07-02T04:03:49.000000Z
From Attribution Graphs to the "Biology" of AI: Exploring the Internal Mechanisms of Claude 3.5 Haiku (Part 2)
集智俱乐部 2025-06-01T14:13:01.000000Z
Finding Features Causally Upstream of Refusal
LessWrong 2025-01-14T02:37:03.000000Z