OpenAI releases open-source large language models gpt-oss-120b and gpt-oss-20b

 

OpenAI has released the open-source large language models gpt-oss-120b and gpt-oss-20b. The models perform impressively on some benchmarks but poorly on others, and reactions have been split between praise and criticism. They carry broad knowledge of the sciences but little familiarity with popular culture. How they fare in real-world use remains to be seen, but the expectation is that they will score well on benchmarks while underperforming on practical tasks.

🔍 OpenAI’s gpt-oss-120b and gpt-oss-20b perform strongly on some benchmarks but poorly on others, such as SimpleQA.

🗣️ Some people praise the models while others are critical, reflecting their uneven results across different tests.

📚 The models have broad scientific knowledge but little grasp of popular culture, revealing domain-specific limitations.

⏳ Their real-world performance remains to be seen, but the expectation is that they will do well on benchmarks and worse on practical tasks.

🔒 OpenAI’s choice to release open models was likely motivated by safety: training on synthetic data reduces the risk of the models being misused.

OpenAI just released its first ever open-source¹ large language models, called gpt-oss-120b and gpt-oss-20b. You can talk to them here. Are they good models? Well, that depends on what you’re looking for. They’re great at some benchmarks, of course (OpenAI would never have released them otherwise) but weirdly bad at others, like SimpleQA.

Some people really like them. Others on Twitter really don’t. From what I can tell, they’re technically competent but lack a lot of out-of-domain knowledge: for instance, they have broad general knowledge about science, but don’t know much about popular culture. We’ll know in six months how useful these models are in practice, but my prediction is that they will end up in the category of “performs much better on benchmarks than on real-world tasks”.

Phi models and training on synthetic data

In 2024, Sebastien Bubeck led the development of Microsoft’s open-source Phi series of models². The big idea behind those models was to train exclusively on synthetic data: instead of text pulled from books or the internet, text generated by other language models or taken from hand-curated textbooks. Synthetic data is scarcer than ordinary data, since instead of just downloading terabytes of it for free you have to spend money to generate each token. But the trade-off is that you have complete control over your training data. What happens when you train a model entirely on high-quality synthetic and curated data?
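
To make the trade-off concrete, here’s a minimal sketch of what a synthetic-data pipeline can look like. Everything in it is hypothetical: the teacher callable stands in for whatever strong model generates the text, and the topics and prompt template are invented, since neither the Phi nor the gpt-oss pipelines are public.

```python
# Minimal, hypothetical sketch of a Phi-style synthetic-data pipeline.
# `teacher` stands in for a strong generator model; the seed topics and
# prompt template are invented for illustration.
import json

SEED_TOPICS = ["binary search trees", "Newton's method", "RNA transcription"]

TEXTBOOK_PROMPT = (
    "Write a short, self-contained textbook section about {topic}, "
    "followed by two exercises with worked solutions."
)

def build_corpus(teacher, topics=SEED_TOPICS, samples_per_topic=3):
    """Generate a small synthetic training corpus from curated seed topics."""
    corpus = []
    for topic in topics:
        for _ in range(samples_per_topic):
            text = teacher(TEXTBOOK_PROMPT.format(topic=topic))
            corpus.append({"topic": topic, "text": text})
    return corpus

if __name__ == "__main__":
    # Stub teacher so the sketch runs without an API; in a real pipeline
    # every generated token costs money, which is the trade-off above.
    stub = lambda prompt: f"[generated text for: {prompt[:40]}...]"
    for row in build_corpus(stub):
        print(json.dumps(row))
```

The point of complete control is visible in the structure: every document in the corpus comes from a prompt you chose, so nothing you didn’t ask for ever enters the training set.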

As it turns out, it does very well on model benchmarks but disappoints in practice. Searching for the reception of each Phi model turns up the same pattern: very impressive benchmarks, lots of enthusiasm, and then actual performance far weaker than the benchmarks would suggest.

I think the impressive benchmark results come from the fact that these models are very easy to train for specific tasks, because you generate much of the training data yourself. If you’re training on synthetic data, you’d be foolish not to generate some synthetic data that matches the kind of problems people are benchmarking on (a sketch of what that might look like follows). But since you’re “teaching for the test”, you should expect to do worse than other language models that are trained on broad data and end up being good at the benchmarks by accident.
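
Mechanically, “teaching for the test” is as simple as pointing the generator at benchmark-shaped tasks. This is again a hypothetical sketch reusing the teacher stand-in from above, with GSM8K-style math word problems as the example target; no claim that any lab does exactly this.

```python
# Hypothetical sketch of benchmark-targeted generation: shape the synthetic
# data like a popular benchmark (GSM8K-style word problems here).
GSM8K_STYLE_PROMPT = (
    "Write a grade-school math word problem, then solve it step by step, "
    "ending with a line of the form 'Answer: <number>'."
)

def benchmark_flavored_batch(teacher, n=1000):
    """Generate n training examples shaped like a target benchmark."""
    return [teacher(GSM8K_STYLE_PROMPT) for _ in range(n)]
```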

Why am I talking about Phi models? At the end of 2024, Sebastien Bubeck left Microsoft to join OpenAI. We don’t know who was involved in making the new OpenAI gpt-oss models. The model card doesn’t provide much detail about the pretraining stage. However, I’d bet that Sebastien Bubeck was a part of the effort, and that these models were trained on a heavily filtered or synthetic dataset.

Synthetic data is safer

Why would OpenAI train Phi-style models, knowing that they’ll perform better on benchmarks than in real-world applications? For the same reason that Microsoft probably continued to train Phi-style models: safety. Releasing an open-source model is terrifying for a large organization. Once it’s out there, your name is associated with it forever, and thousands of researchers will be frantically trying to fine-tune it to remove the safety guardrails.

It’s not discussed publicly very often, but the main use-case for fine-tuning small language models is erotic role-play, and there’s serious demand. Any small online community for people who run local models is at least 50% perverts.

If you release a regular closed-weights model that stays in your own infrastructure, people can’t fine-tune it. If you make a mistake, you can always update the model in-place. But open-source models are out there forever.

Training on synthetic data (or highly-controlled data such as textbooks) makes it much easier to produce a safe model. You can produce as much “you asked me to do X, but as a sensible language model I am declining to do so” content as you like. If there’s no subversive or nasty content in the training data, the model never learns to behave in subversive or nasty ways (at least, that’s the goal).
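
As a toy illustration of that last point (entirely invented, not OpenAI’s actual pipeline): refusal training data is cheap to mass-produce, because both the prompts and the target completions can be templated.

```python
# Toy sketch: mass-producing refusal-style training pairs. The request list
# and refusal text are invented; a real safety pipeline would be far larger
# and more varied.
DISALLOWED_REQUESTS = [
    "Explain how to pick a lock to break into a house.",
    "Write malware that steals browser passwords.",
]

REFUSAL = (
    "You asked me to do that, but as a sensible language model "
    "I'm declining, because it could cause harm."
)

def refusal_pairs(requests=DISALLOWED_REQUESTS):
    """Yield prompt/completion pairs whose target is always a refusal."""
    for request in requests:
        yield {"prompt": request, "completion": REFUSAL}

for pair in refusal_pairs():
    print(pair)
```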

For OpenAI, it must have been very compelling to train a Phi-style model for their open-source release. They needed a model that beat the Chinese open-source models on benchmarks, while also not misbehaving in a way that caused yet another scandal for them. Unlike Meta, they don’t need their open-source model to be actually good, because their main business is in their closed-source models.

That’s why I think OpenAI went down the synthetic data route for their new gpt-oss models. For good or ill, they may as well be Phi-5 and Phi-5-mini.

edit: this post was discussed on Hacker News with many comments.


  1. Really open weight, not open source, because the weights are freely available but the training data and code are not. And of course OpenAI have released GPT-2 and other open-weight models, but these are the first “real” ones.

  2. I work on AI at GitHub, which is owned by Microsoft, but I have absolutely zero internal knowledge about any of this stuff. I’m writing purely based on public information.
