Trending
Articles related to "Alignment"
Summary and Comments on Anthropic's Pilot Sabotage Risk Report
LessWrong 2025-10-30T20:34:21.000000Z
ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents
LessWrong 2025-10-30T03:15:41.000000Z
RL remembers better while SFT forgets more? Danqi Chen's team at Princeton rewrites the understanding of post-training
PaperWeekly 2025-10-27T13:26:49.000000Z
FLORA: Unsupervised Knowledge Graph Alignment by Fuzzy Logic
cs.AI updates on arXiv.org 2025-10-24T04:18:25.000000Z
⿻ Symbiogenesis vs. Convergent Consequentialism
LessWrong 2025-10-21T11:46:14.000000Z
Will AI superintelligence kill us all? (with Nate Soares)
Clearer Thinking with Spencer Greenberg 2025-10-16T04:21:18.000000Z
GTAlign: Game-Theoretic Alignment of LLM Assistants for Mutual Welfare
cs.AI updates on arXiv.org 2025-10-13T04:09:06.000000Z
Realistic Reward Hacking Induces Different and Deeper Misalignment
LessWrong 2025-10-09T18:59:38.000000Z
How to Build an Advanced Voice AI Pipeline with WhisperX for Transcription, Alignment, Analysis, and Export?
MarkTechPost@AI 2025-10-03T04:09:00.000000Z
Robust Preference Optimization: Aligning Language Models with Noisy Preference Feedback
cs.AI updates on arXiv.org 2025-09-30T04:02:12.000000Z
DeepSeek r1 is an extremely unsafe AI model, and open-sourcing has put it beyond control
财猫 AI 2025-09-25T10:02:38.000000Z
When AI learns to deceive, how should we respond?
Tencent Research Institute 2025-09-18T07:23:21.000000Z
Ethics2vec: aligning automatic agents and human preferences
cs.AI updates on arXiv.org 2025-08-12T04:02:12.000000Z
Selective Generalization: Improving Capabilities While Maintaining Alignment
LessWrong 2025-07-16T21:37:00.000000Z
A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications
cs.AI updates on arXiv.org 2025-07-15T04:27:08.000000Z
Advanced fine-tuning methods on Amazon SageMaker AI
AWS Machine Learning Blog 2025-07-11T17:29:46.000000Z
Do Self-Perceived Superintelligent LLMs Exhibit Misalignment?
LessWrong 2025-06-29T12:52:43.000000Z
Foom & Doom 2: Technical alignment is hard
LessWrong 2025-06-23T17:22:35.000000Z
Case Studies in Simulators and Agents
LessWrong 2025-05-25T05:52:30.000000Z
Reward button alignment
LessWrong 2025-05-22T17:37:31.000000Z