Trending
Articles related to "Safety Training"
[CS 2881r] Can We Prompt Our Way to Safety? Comparing System Prompt Styles and Post-Training Effects on Safety Benchmarks
LessWrong 2025-10-28T07:07:49.000000Z
Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning
cs.AI updates on arXiv.org 2025-09-16T05:43:18.000000Z
From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training
cs.AI updates on arXiv.org 2025-08-14T04:19:07.000000Z
New Anthropic study shows AI really doesn’t want to be forced to change its views
TechCrunch News 2024-12-18T22:19:20.000000Z
Current safety training techniques do not fully transfer to the agent setting
LessWrong 2024-11-03T19:38:15.000000Z
Evaluating the Vulnerabilities of Unlearning Techniques in Large Language Models: A Comprehensive White-Box Analysis
MarkTechPost@AI 2024-10-03T07:21:38.000000Z
OpenAI's most powerful model, o1, still can't tell "which is bigger, 9.11 or 9.8"
Huxiu 2024-09-13T03:38:23.000000Z
OpenAI releases its most powerful model, o1, breaking through the AI bottleneck and opening a new era; GPT-5 may never arrive
36kr 2024-09-13T02:04:08.000000Z
Iterative Refinement Stages of Lying in LLMs
LessWrong 2024-08-22T09:06:58.000000Z