Refusal Mechanism_Fishai

Articles related to "Refusal Mechanism"

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

cs.AI updates on arXiv.org 2025-10-01T05:58:57.000000Z

MEUV: Achieving Fine-Grained Capability Activation in Large Language Models via Mutually Exclusive Unlock Vectors

cs.AI updates on arXiv.org 2025-09-17T04:57:12.000000Z

Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

cs.AI updates on arXiv.org 2025-09-15T08:17:18.000000Z

Energy Landscapes Enable Reliable Abstention in Retrieval-Augmented Large Language Models for Healthcare

cs.AI updates on arXiv.org 2025-09-08T04:51:38.000000Z

Linearly Decoding Refused Knowledge in Aligned Language Models

cs.AI updates on arXiv.org 2025-07-02T04:03:49.000000Z

From Attribution Graphs to the "Biology" of AI: Exploring the Internal Mechanisms of Claude 3.5 Haiku (Part 2)

集智俱乐部 (Swarma Club) 2025-06-01T14:13:01.000000Z

Finding Features Causally Upstream of Refusal

LessWrong 2025-01-14T02:37:03.000000Z
