Artificial Fintelligence · September 25
The Evolution of GPT Models and Their Technical Breakthroughs

This article takes a deep look at the development of the generative pre-trained transformer (GPT) family of models, focusing on the representative models and how their core techniques evolved: from the original GPT, to GPT-2's move to larger-scale training, to Kaplan et al.'s work laying the foundation for model scaling, and to GPT-3 marking the rise of truly large language models. It also covers Jurassic-1 and Megatron-Turing NLG's explorations of model parallelism and efficiency, Gopher's practical work on compute optimization, and Chinchilla's theory of the optimal ratio between model and data scale. By walking through these key milestones, the article shows how GPT models have iterated on parameter count, data scale, training methods, and architecture design, driving rapid progress in natural language processing.

🚀 **Evolution and foundations of the GPT architecture:** The article reviews the early models in the GPT series, GPT and GPT-2. GPT used a decoder-only Transformer with 12 layers, a 768-dimensional embedding, and 12 attention heads, for roughly 117M parameters. GPT-2 scaled this up further: the largest variant has 1.5B parameters, was trained on the much larger WebText dataset (9B tokens), and doubled the context length to 1024. These early models laid the groundwork for what followed, even though their training regimes differ markedly from modern practice (e.g. GPT was trained for 100 epochs).

💡 **Scaling research and the GPT-3 breakthrough:** Kaplan et al.'s work is key to understanding the effects of scale. It showed that model performance (test loss) depends heavily on the number of parameters and the number of training tokens, while architecture matters comparatively little, with performance following power laws in N (parameters), D (training tokens), and C (compute). This finding provided the justification for GPT-3's 175B-parameter scale. Architecturally, GPT-3 kept GPT-2's design but introduced sparse attention and explored model parallelism; although the details were not fully disclosed, its sheer scale and strong performance marked the arrival of the era of large language models.

🛠️ **Model parallelism and compute optimization:** The Megatron-Turing NLG and Gopher papers played an important role in addressing the computational challenges of large-scale training. Megatron introduced efficient tensor parallelism, enabling intra-layer model parallelism and working around single-GPU memory limits. Gopher combined techniques such as ZeRO optimizer state partitioning, Megatron-style model parallelism, and gradient checkpointing, and published its training practices in detail; these have since become the standard methods for training large models. Such compute optimizations are essential for training and deploying very large models.

📊 **Chinchilla's scaling laws and the optimal ratio:** The Chinchilla paper studied LLM scaling laws in depth and derived the optimal ratio between model parameters (N) and training tokens (D): for a given compute budget, model size and data size should be scaled roughly equally. This overturned the earlier assumption that model size was all that mattered, showing that training on more data is just as important for performance, and that a model of equal quality can be trained at lower cost. Chinchilla's findings remain an important guide for current LLM training strategies and underline the importance of data scale.

A note: Substack doesn’t do a great job rendering equations, so you might want to read this article on my blog (but first, subscribe on Substack to be notified of future articles).

I have started to support paid subscriptions. If you have found this newsletter professionally useful, and want to help me spend more time writing, please consider signing up for a paid subscription.

In this article, I discuss the generative pre-trained transformer (GPT) line of work, and how it has evolved over time. I focus on the SOTA models, and the differences between them. There are a bunch of different articles summarizing these papers, but nothing that I’m aware of that explicitly focuses on the differences between them. Until now.

I focus on the GPT line of research as that’s what’s driving the current fever pitch of development. There’s a ton of prior work before large GPTs (e.g. the n-gram models from the 2000s, BERT, etc.), and after (e.g. RWKV), but this post is super long, so I’m gonna save those for future articles.

I also don’t go into any detail about RLHF or other finetuning methods. I’m planning to write about that in the future. Those techniques are critical to the performance of the deployed LLM systems like ChatGPT, Claude, LaMDA etc., so they’re worth understanding. I don’t discuss any of the dialog specific systems, as I want to focus on the most general, pure language modelling transformers.


GPT

Abstract

The first GPT paper is interesting to read with hindsight. It doesn’t appear like anything special and doesn’t follow any of the conventions that have since developed. The dataset is described in terms of GB rather than tokens, and the number of parameters in the model isn’t explicitly stated. To a certain extent, I suspect that the paper was a side project at OpenAI and wasn’t viewed as particularly important; there are only four authors, and I don’t remember it particularly standing out at the time.

The architecture is remarkably unchanged compared to GPT-3:

The number of parameters isn’t explicitly discussed, but appears to be roughly 120M, easily small enough to fit on a single V100 or a standard consumer GPU (a rough estimate: 120M parameters for the model plus 240M for the optimizer state gives 360M values; assuming each is a float32, this takes up 4 bytes * 360M = 1440MB, or about 1.4GB).
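Here is that arithmetic as a tiny sketch. The 2x optimizer multiplier (Adam-style first and second moments) and the omission of gradients and activations are my simplifying assumptions, so treat it as a back-of-the-envelope estimate rather than a real memory profile:

```python
# Back-of-the-envelope memory estimate for a ~120M-parameter model.
# Assumes float32 values and Adam-style optimizer state (2 extra copies);
# gradients and activations are ignored for simplicity.
def training_memory_gb(n_params: float, bytes_per_value: int = 4,
                       optimizer_copies: int = 2) -> float:
    total_values = n_params * (1 + optimizer_copies)
    return total_values * bytes_per_value / 1e9

print(f"{training_memory_gb(120e6):.2f} GB")  # ~1.44 GB
```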

They use the BooksCorpus dataset (1B tokens[1], 4GB), training for 100 epochs with a batch size of 64. 1B tokens is a small dataset by modern standards, as is a batch size of 64.

The most surprising thing compared to modern GPTs is that they train for 100 epochs. Modern GPTs rarely ever see repeated data, and if they do, they typically only see certain datapoints a small number of times (2-4x), and the entire dataset is never repeated 100x.

GPT-2

Abstract

GPT-2 is where the language models start to get big. This is the first time that OpenAI trains a model with >1B parameters. We start to see scale as a primary concern; in GPT, the authors trained a single model, but here, the authors train a range of models, with sizes ranging from GPT to 10x GPT (which is the actual GPT-2 model).

The differences in architecture compared to GPT are as follows:

The dataset is much, much bigger, going from 4GB of data consisting of publicly available books, to 40GB (or 9B tokens)[2] of text scraped from the internet (WebText).

It’s unclear if they trained the model for 100 epochs as before; they say they followed the same training procedure, so presumably they did. Again, this is a significant departure from later work.

Nothing here is particularly different from GPT; most of the changes are related to making the model bigger. The only other changes are the layernorm changes and the weight scaling, which don’t seem to make a big difference (although, as always, more ablations would be nice).

Kaplan et al.

Abstract

I feel like there has to be a better name to refer to this paper, but I can’t find one, so I just call it Kaplan et al. This was one of the first (maybe the first?) scaling law papers for LLMs. In it, the authors train a large number of GPT-style models to make empirical predictions for how model characteristics vary with scale. This paper was highly influential, as it formed the basis for GPT-3, justifying scaling to 175B parameters (a hitherto unseen level of scale).

This paper is notable as it did real science, running a number of experiments and making predictions as to how models should scale. It stands up very well.

Some notable results from the paper:

This paper was, until Chinchilla came out, the gold standard for how to train large language models.
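The headline form of those predictions is a power law, L(X) ≈ (X_c/X)^α_X, for X being parameters, data, or compute. As a rough illustration (not the paper's code, and with made-up data points), fitting such a law is just linear regression in log-log space:

```python
import numpy as np

# Illustrative only: fit L(C) ~ (C_c / C)**alpha by linear regression in
# log-log space. The data points here are synthetic, not from the paper.
C = np.array([1e17, 1e18, 1e19, 1e20, 1e21])   # compute (FLOPs)
L = np.array([4.2, 3.6, 3.1, 2.7, 2.35])       # hypothetical test losses

slope, intercept = np.polyfit(np.log(C), np.log(L), 1)
alpha = -slope                      # power-law exponent
C_c = np.exp(intercept / alpha)     # constant such that L = (C_c / C)**alpha

print(f"alpha ~ {alpha:.3f}, predicted loss at 1e22 FLOPs: "
      f"{(C_c / 1e22) ** alpha:.2f}")
```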

GPT-3

Abstract

Here is where the era of truly large language models began, and where the current bubble of AI excitement took off. In the paper, the authors train 10 models, varying from 125M parameters ("GPT-3 Small") to 175B parameters ("GPT-3").

For each of the models, the architectures are identical to GPT-2 with the exception that they use “alternating dense and locally banded sparse attention patterns in the layers of the transformer.” The sparse attention here refers to the attention mechanism introduced in the Sparse Transformer, which lets the attention cost scale as O(n·√n) (where n is the context length). The standard dot-product attention mechanism scales as O(n^2), so this is a substantial gain. I would have loved a proper ablation to see what difference sparse vs. dense attention makes, but alas.

I’m very curious why they used sparse attention. Reproductions and later papers uniformly use dense attention. As this paper came before FlashAttention and some of the other algorithmic innovations that make dense attention faster, maybe sparse attention was needed to work around a computational bottleneck? It’s really unclear.
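For a sense of what "locally banded sparse attention" looks like, here is a rough sketch of a causal mask combining a local band with a strided pattern, in the spirit of the Sparse Transformer's fixed pattern. GPT-3's exact pattern isn't specified in the paper, so treat this as illustrative:

```python
import numpy as np

def banded_strided_mask(n: int, k: int) -> np.ndarray:
    """Causal mask combining a local band of width k with a strided pattern
    (every k-th position), roughly in the spirit of the Sparse Transformer's
    fixed pattern. With k ~ sqrt(n), each query attends to O(sqrt(n)) keys,
    so total work is O(n * sqrt(n)) rather than O(n^2)."""
    i = np.arange(n)[:, None]  # query positions
    j = np.arange(n)[None, :]  # key positions
    causal = j <= i
    local = (i - j) < k                  # the previous k tokens
    strided = (j % k) == (k - 1)         # periodic "summary" positions
    return causal & (local | strided)

mask = banded_strided_mask(n=16, k=4)
print(mask.sum(axis=1))  # how many keys each query position can attend to
```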

They don’t provide any detail about the computational architecture, i.e. how they distributed the model. The authors claim it’s because it doesn’t really matter, but I think the details were withheld for competitive reasons, as it makes the paper much more difficult to reproduce. Megatron, which I’ll discuss later, was highly influential because they went into detail about how they made model parallelism work for their GPT.

What I find really interesting about the GPT-3 paper is that it was an incredible advance without a lot of novelty. They took their existing methods and “just” scaled them up! Because of the need for novelty, there are many research projects that don’t get pursued because they’re “only” engineering projects, or they “only” do hyper-parameter tuning and wouldn’t be able to get published, even if they had impressive performance improvements. That OpenAI went against the grain here is a credit to them (and they were rewarded, with GPT-3 getting a best paper award at NeurIPS ‘20).

This is a strength of OpenAI (and Stability.ai, Midjourney, basically everywhere that’s not FAIR/Google Brain/Deepmind/etc). You could alternatively frame it as a weakness of the more academic labs that have promotion/performance review policies driven by publications.

Jurassic-1

PDF

I wasn’t sure whether or not to include Jurassic-1. It’s a model from the Israeli tech company AI21 Labs. I haven’t heard a lot about them, but the paper’s cited by a bunch of the papers later on in the article; they trained a 178B parameter model that outperformed GPT-3 in a few categories, and was faster for inference. It’s impressive that they’re competing with DeepMind, OpenAI, Nvidia, etc. despite only having raised <$10M at the time. They made a zero-shot and few-shot test suite publicly available.

Like many other papers, they don’t go into the engineering details behind training a large model (178B parameters) over 800 GPUs:

The paper is remarkably sparse on details, which I suspect was done for competitive reasons, just like GPT-4.

Facebook is the only company to go into detail about their experiences training a 175B parameter model, just like Nvidia is the only company to go into detail about the computational architecture required to train a LLM over many GPUs (see: the Megatron paper, next). In both cases, the companies are commoditizing their complements and strengthening their main lines of business by making it easier to train large models.

Jurassic uses a different architecture from GPT-3, but again, doesn’t go into much detail:

Neither of these changes is material, in my opinion. I think what we’re seeing is that there’s a relatively large degree of freedom in model architectures that produces similar results. This is borne out by their evaluation, which has results similar to GPT-3 (better in some categories, worse in others), although Jurassic-1 is faster for inference due to being shallower.

We’re starting to see a consistent pattern emerge:

GPT-2, GPT-3, Jurassic-1, etc. all did this.

Megatron-Turing NLG

Megatron was a highly influential paper that introduced efficient model-parallel architectures. If you’re interviewing for a LLM job today, you’re going to be expected to be familiar with it. Megatron introduced tensor parallelism, a variant of model parallelism that splits the model to allow for intra-layer model parallelism, achieving 76% of the efficiency of a single-GPU baseline (although that baseline is only 30% of peak FLOPS).

Prior to Megatron, the published SOTA for model parallelism was to use model pipelining, e.g. GPipe. However, this was difficult to do and not well supported by code. There were attempts to support tensor parallelism, e.g. Mesh-Tensorflow, which introduced a language for specifying a general class of distributed computations in TensorFlow, but nothing had really dominated. Interestingly, the first author had just left DeepMind 1 year before this was published, so this was possibly his first project at Nvidia.

Megatron has the realization that, if you have a neural network layer like

Y = GeLU(XA)

and you split the weight matrix

A = [A1, A2]

i.e. along the columns, then

Y = [GeLU(XA1), GeLU(XA2)]

so you don’t need to do any synchronization to calculate Y. Consequently, the only points where you need synchronization (all-reduces) in the transformer are:

    In the forward pass, to concatenate the model activations after the MLP block before adding dropout

    In the backwards pass, at the start of the self-attention block.
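Here's a minimal NumPy sketch of that column split, simulating two shards in one process; in a real implementation each shard lives on its own GPU and the synchronization points above become all-reduces:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))    # activations: (batch, d_model)
A = rng.standard_normal((8, 16))   # first MLP weight: (d_model, d_ff)

# Split A along its columns across two "devices".
A1, A2 = np.split(A, 2, axis=1)

# Each shard computes its half of Y independently -- no communication needed.
Y1 = gelu(X @ A1)
Y2 = gelu(X @ A2)

# Concatenating the shards matches the unsharded computation.
Y = np.concatenate([Y1, Y2], axis=1)
assert np.allclose(Y, gelu(X @ A))
```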

Now, I strongly suspect this is what GPT-3 and Jurassic-1 both did, but neither went into detail about the specific parallelism models they used, other than to say (from GPT-3):

To train the larger models without running out of memory, we use a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network.

Presumably, this style of parallelism is what is meant by “model parallelism within each matrix multiply,” as I find it hard to imagine what else they could mean.

Gopher

Abstract

Gopher was a LLM trained by DeepMind. Interestingly, the lead author joined OpenAI shortly after it was published, along with a few of the coauthors. The architecture was the same as GPT-2, except:

The paper was very interesting from a computational perspective, as they went into detail about how they trained their model and made it work:

These are all now the standard techniques used to train large models. To the best of my knowledge, Gopher was the first paper to put all of these together and release details about doing so publicly.

It’s interesting— often, big labs don’t include details for competitive reasons. Here, because DeepMind was (arguably) behind, they went into extensive detail. I think we’ll see this increase with LLM research from everyone that’s not OpenAI/Anthropic, as the others don’t live/die by the commercial success of their API, and have strong incentives to make it easier for others to train large models (and thereby commoditize their complements).

For the paper, DeepMind built a dataset called MassiveText, which was as follows

Interestingly, this is much smaller than the dataset OpenAI used for GPT-3. GPT-3 had roughly 45TB of text, while MassiveText “only” had about 10.5TB.

They used this dataset to train a large model on 300B tokens. The dataset consists of 2.343 trillion tokens, so this is only about 12.8% of it, a much smaller subset. This is interesting to compare to the earlier GPTs, which, if you recall, trained for 100 epochs (so they saw each token in the dataset 100 times), while Gopher saw only ~13% of its tokens, each of them exactly once!

The Gopher appendices have some great work; someone finally did ablations! They looked at:

It’s really nice to see detailed empirical work like this— it’s a welcome change from the other papers that failed to do this.

Chinchilla

Abstract

Chinchilla is an incredibly influential paper that established scaling laws. It’s one of my favorite papers from the last few years, as it actually does science in a way that physicists would agree with. One test of whether something is science is to ask: if you were to meet a historical scientist in your field, could you teach them something? And if you brought Chinchilla back to, say, Radford et al. in 2017, it would have advanced their work by several years.

Chinchilla trained over 400 GPT-style transformers, ranging in size from 70M to 16B parameters, and fit the following equation (N is the number of parameters in the LM, and D is the number of tokens in the dataset):

L(N, D) = E + A / N^α + B / D^β

choosing A, B, E, α, and β to minimize the error between the predicted and observed losses across their training runs.

Here, we can think of E as the “irreducible loss” from the dataset, i.e. the loss if we trained an infinitely large model on an infinite stream of tokens. Given a compute budget C, the fit implies a compute-optimal allocation of roughly N_opt ∝ C^0.5 and D_opt ∝ C^0.5 (see nostalgebraist's post on the implications of Chinchilla).

The implication here is that the model size & data size matter roughly equally, which is interesting, given how much attention & effort goes to scaling up the model, and how little attention is given to the dataset.

The authors then used this equation to determine the optimal model size for the Gopher compute budget, and trained it on more tokens— 1.4T tokens, 4.6x the number of tokens Gopher was trained on. This model, being 4x smaller, has a radically smaller memory footprint and is much faster/cheaper to sample from.
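To see what the fitted formula says about this trade, here's a small sketch that plugs Gopher's and Chinchilla's (N, D) allocations into the parametric loss, using the constants reported in the paper's fit; treat the exact numbers as illustrative rather than authoritative:

```python
# Chinchilla's fitted parametric loss: L(N, D) = E + A/N**alpha + B/D**beta.
# Constants as reported for the paper's parametric fit.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens  # standard ~6*N*D approximation

for name, n, d in [("Gopher (280B params, 300B tokens)", 280e9, 300e9),
                   ("Chinchilla (70B params, 1.4T tokens)", 70e9, 1.4e12)]:
    print(f"{name}: loss ~ {predicted_loss(n, d):.3f}, "
          f"compute ~ {train_flops(n, d):.2e} FLOPs")
```

At roughly comparable training compute, the smaller model trained on more tokens comes out ahead under the fit, which is exactly the paper's point.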

The Chinchilla paper has been highly influential. Almost every team that I’ve been talking to that is training a LLM right now talks about how they’re training a Chinchilla optimal model, which is remarkable given that basically everything in the LLM space changes every week.

The standard practice before Chinchilla was to train your model for 300B tokens, which is what GPT-3, Gopher, and Jurassic-1 all did. Chinchilla reveals how wasteful that was; basically, all of these papers made themselves more expensive to infer by training models that were too large.

Changes in Chinchilla (otherwise the same as Gopher):

All of the changes are ablated extensively in the appendix (finally).

PaLM

Speaking of training models that were too large: we have PaLM! PaLM was really, really big. As far as I’m aware, it’s the largest dense language model trained to date, at 540B parameters, requiring 6144 TPU v4 chips to train on (two entire TPU v4 pods). This is incredibly expensive! Probably only Google has the resources and infrastructure to do this.

… unfortunately, they were training PaLM at the same time Chinchilla was being written. Very suboptimal.

Changes from GPT-3:

So, a ton of changes! Again, a bunch of these are now common; e.g. using the learned positional embeddings that GPT-3 had is very passé, and almost no one does it now.

LLaMa

Abstract

LLaMa combined a bunch of the best features from PaLM and Chinchilla:

I think that LLaMa is the recipe to follow for the current SOTA in training large models.

Computational changes:

These are all similar to Gopher. The one obvious optimization they missed is to use lower precision, as Chinchilla did; I’m curious why they didn’t.

My one complaint is that I wish they would have trained the model for longer. The learning curve is very far from convergence! This paper is, in my mind, the shining example showing how well smaller models can do when trained well.

As I’ve written about elsewhere, while Chinchilla is great, it assesses optimality in a very narrow sense: “With a given compute budget, and ignoring inference costs, how do we choose between the number of parameters of our model and the number of tokens we train on?” It can make sense to train a model that’s smaller than Chinchilla optimal and train it for longer than Chinchilla would tell us, because if we’re going to deploy the model at mass scale, we care much more about inference cost than training cost.
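As a rough illustration of that argument, the sketch below reuses the Chinchilla parametric loss constants from above to find, for a fixed target loss, the model size that minimizes training plus lifetime inference compute. The ~2N FLOPs-per-token inference cost and the number of served tokens are assumptions I've made for the example, not numbers from any of these papers:

```python
import numpy as np

# Chinchilla parametric loss constants (as above; illustrative).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def tokens_needed(n_params: float, target_loss: float) -> float:
    """Tokens D required for a model of size N to reach target_loss,
    solved from target = E + A/N**alpha + B/D**beta."""
    gap = target_loss - E - A / n_params**alpha
    return (B / gap) ** (1 / beta) if gap > 0 else np.inf

def lifetime_flops(n_params, target_loss, served_tokens):
    D = tokens_needed(n_params, target_loss)
    return 6 * n_params * D + 2 * n_params * served_tokens  # train + serve

target = 1.95                      # a target predicted loss
served = 1e13                      # assume ~10T tokens served over the model's life
sizes = np.logspace(9.5, 12, 200)  # ~3B to 1T parameters
costs = [lifetime_flops(n, target, served) for n in sizes]
best = sizes[int(np.argmin(costs))]
print(f"Cheapest model for loss {target}: ~{best / 1e9:.0f}B params, "
      f"trained on ~{tokens_needed(best, target) / 1e12:.1f}T tokens")
```

Under these assumptions, the cheapest option is noticeably smaller than the compute-optimal model for that loss, and is trained on far more tokens, which is the LLaMa-style trade-off.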

GPT-4

This is where I’d include information about GPT-4, if there was any. Unfortunately, the GPT-4 technical report contains almost no information:

GPT-4 is a Transformer-style model [33] pre-trained to predict the next token in a document, using both publicly available data (such as internet data) and data licensed from third-party providers. The model was then fine-tuned using Reinforcement Learning from Human Feedback (RLHF) [34]. Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.

As a result, I’m not going to talk about it, as there’s not much to say. Hopefully OpenAI changes their mind and releases some information about their model.


Conclusion

This is it, as of March ‘23. I’m sure something new will come along and invalidate all of this.

I haven’t talked about RLHF/finetuning at all. I plan to write a future article about the various GPT variants that exist (ChatGPT, InstructGPT, WebGPT, etc.), and about how RLHF/finetuning have evolved.

What have I missed? Comment below and I’ll update this post.

Articles I’m reading:

[1] I calculated this directly by downloading BookCorpus and running tiktoken with the GPT-2 encoding on it.
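For reference, that count looks roughly like the following (assuming the corpus has been downloaded as a directory of plain-text files; the path is hypothetical):

```python
import os
import tiktoken

# Count GPT-2 tokens over a local text corpus, one file at a time.
enc = tiktoken.get_encoding("gpt2")
corpus_dir = "bookcorpus/"   # hypothetical path to the downloaded corpus

total = 0
for name in os.listdir(corpus_dir):
    if name.endswith(".txt"):
        with open(os.path.join(corpus_dir, name), encoding="utf-8") as f:
            # disallowed_special=() avoids errors if special-token strings
            # happen to appear in the raw text.
            total += len(enc.encode(f.read(), disallowed_special=()))

print(f"{total / 1e9:.2f}B tokens")
```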

[2] The paper itself doesn’t report the number of tokens, but OpenWebText, the open source reproduction, comes to nine billion tokens using tiktoken.
