"
模型防御
" 相关文章
Safeguarding Efficacy in Large Language Models: Evaluating Resistance to Human-Written and Algorithmic Adversarial Prompts
cs.AI updates on arXiv.org
2025-10-21T04:15:46.000000Z
Pharmacist: Safety Alignment Data Curation for Large Language Models against Harmful Fine-tuning
cs.AI updates on arXiv.org
2025-10-14T04:17:44.000000Z
Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models
cs.AI updates on arXiv.org
2025-10-13T04:11:48.000000Z
Cross-Modal Content Optimization for Steering Web Agent Preferences
cs.AI updates on arXiv.org
2025-10-07T04:05:06.000000Z
Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs
cs.AI updates on arXiv.org
2025-08-15T04:18:33.000000Z
Anthropic discovers an AI jailbreak method: safety guardrails collapse, with text, vision, and speech all compromised
夕小瑶科技说
2024-12-19T12:07:21.000000Z