Hot Topics
Articles related to "model defense"
Safeguarding Efficacy in Large Language Models: Evaluating Resistance to Human-Written and Algorithmic Adversarial Prompts
cs.AI updates on arXiv.org 2025-10-21T04:15:46.000000Z
Pharmacist: Safety Alignment Data Curation for Large Language Models against Harmful Fine-tuning
cs.AI updates on arXiv.org 2025-10-14T04:17:44.000000Z
Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models
cs.AI updates on arXiv.org 2025-10-13T04:11:48.000000Z
Cross-Modal Content Optimization for Steering Web Agent Preferences
cs.AI updates on arXiv.org 2025-10-07T04:05:06.000000Z
Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs
cs.AI updates on arXiv.org 2025-08-15T04:18:33.000000Z
Anthropic discovers an AI jailbreak method: safety guardrails collapse, with text, vision, and audio all compromised
夕小瑶科技说 2024-12-19T12:07:21.000000Z