We’ve fine-tuned the 774M-parameter GPT-2 language model using human feedback on a variety of tasks, successfully matching the preferences of external human labelers, though those preferences did not always match our own. Specifically, for summarization tasks the labelers preferred sentences copied wholesale from the input (we’d only asked them to ensure accuracy), so our models learned to copy. Summarization required 60k human labels; simpler tasks that continue text in various styles required only 5k. Our motivation is to move safety techniques closer to the general task of “machines talking to humans,” which we believe is key to extracting information about human values.
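
As context for how the human labels enter training: in reinforcement-learning-from-human-feedback setups like this one, labeler comparisons are typically used to train a reward model, whose scalar score then serves as the reward signal when fine-tuning the language model. The sketch below illustrates the standard pairwise preference loss for such a reward model; the toy pooled architecture, the names (`RewardModel`, `preference_loss`), and all dimensions are illustrative assumptions, not the actual GPT-2-based system described above.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy stand-in for a language-model-based reward model: embeds a
    token sequence and maps the mean-pooled representation to a scalar
    score. (Hypothetical architecture for illustration only.)"""
    def __init__(self, vocab_size=50257, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.score = nn.Linear(d_model, 1)

    def forward(self, tokens):                # tokens: (batch, seq_len)
        h = self.embed(tokens).mean(dim=1)    # mean-pool to (batch, d_model)
        return self.score(h).squeeze(-1)      # (batch,) scalar rewards

def preference_loss(reward_model, preferred, rejected):
    """Pairwise preference loss: maximize the modeled probability that
    the human-preferred sample outscores the rejected one,
    P(preferred > rejected) = sigmoid(r_p - r_r)."""
    r_p = reward_model(preferred)
    r_r = reward_model(rejected)
    return -torch.nn.functional.logsigmoid(r_p - r_r).mean()

# One training step on random token IDs standing in for labeled pairs.
model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
preferred = torch.randint(0, 50257, (8, 32))  # 8 comparison pairs, 32 tokens
rejected = torch.randint(0, 50257, (8, 32))
opt.zero_grad()
loss = preference_loss(model, preferred, rejected)
loss.backward()
opt.step()
```

Under this loss, a labeler preference for copied sentences (as in the summarization result above) is exactly what the reward model learns to score highly, which is why the fine-tuned models learned to copy.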
