Artificial Fintelligence · September 25
Exploring New Approaches to LLM Training and Efficiency Gains

This post rounds up several papers on frontier large language model (LLM) research, covering a range of new approaches to improving training efficiency and performance. It looks at training LLMs over neurally compressed text, finding that although compression can yield a more efficient representation, not every scheme beats existing subword tokenizers; it covers Mixture of Depths (MoD), which dynamically allocates compute to address the fact that every token in an LLM receives the same amount of computation; and it introduces Sparse Upcycling, which initializes MoE models from dense checkpoints and significantly improves training efficiency and performance. The post also discusses a correction to a data error in the Chinchilla paper that re-evaluates the compute/data tradeoff, suggesting that in some cases spending more on compute can pay off more than spending more on data.

🧠 **Training LLMs on compressed text, challenges and opportunities:** The researchers train LLMs over text compressed with GZip, with an LLM itself, and with arithmetic coding, looking for a representation more efficient than existing BPE tokenizers. The experiments show that while these compression schemes can be learned by standard LLMs, the best models still fall short of the subword baselines; arithmetic coding and the static AC setting are hard to learn, while "equal-info AC" performs best and approaches SentencePiece, but its tokens carry no semantic meaning, hinting that subword tokenizers occupy a distinctive sweet spot between compression ratio and semantic information.

⚡ **Mixture of Depths (MoD), dynamic compute allocation for more efficient LLMs:** The paper proposes Mixture of Depths (MoD) to address the fact that every token in an LLM gets the same amount of compute. MoD introduces a per-block router that dynamically allocates compute so that only a subset of the tokens in a sequence participates in a block's computation, making better use of the compute budget. Experiments show that MoD lowers loss compared to a standard model at the same compute, and that its gains compound with those of MoE models.

💡 **Sparse Upcycling, efficiently initializing MoE models from dense checkpoints:** This work proposes "sparse upcycling", a way to initialize a MoE model from a dense checkpoint. It expands a subset of the dense model's MLP layers into MoE layers, copies the remaining layers across, and initializes every expert as a copy of the original MLP. The resulting model outperforms the original dense model while spending only half as much compute, and also beats MoE models trained from scratch with the same compute budget, showing the potential of reusing and upgrading existing models.

📊 **The Chinchilla correction, revisiting the compute/data tradeoff:** By transcribing the data from a chart in the Chinchilla paper, the Epoch AI team found and corrected an error in the original work. The corrected results show that, in the compute/data tradeoff, the contribution of data had been overestimated and the contribution of model size underestimated. In some situations, spending more on GPUs (compute) is worth more than spending more on data, which offers a new perspective for LLM research and development and lends some support to GPT-3 style training strategies that use larger models with relatively less data.

This is a grab bag of papers. No theme, just what I found interesting. I’ve had a bunch of tabs open and finally (finally) got through them.

I hope to write more frequently going forward: the goal is once per month. My darling toddler has not been sleeping consistently, so my writing time has been exceptionally limited. That has recently improved and, with luck, will stay improved.


Training LLMs over Neurally Compressed Text

[abstract]

The authors train LLMs over compressed text. When training language models, the current paradigm doesn’t involve raw text, but instead, trains the model over sequences of tokens, which are, basically, compressed text. The most common tokenizer is BPE (used by GPT-3, Llama/Mistral, etc.). The idea behind tokenization is that tokenizers transform the raw text into a much more efficient representation: BPE is typically 4x more efficient than raw bytes, so the LLM sees 4x the data for a given computational budget.
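
As a rough sanity check on that ratio, you can measure bytes per token with any off-the-shelf BPE tokenizer (tiktoken and the cl100k_base vocabulary are my choices here, not something from the paper):

```python
# Rough check of the "BPE is ~4x more efficient than raw bytes" claim.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = open("some_corpus.txt").read()  # any reasonably large English text file
n_bytes = len(text.encode("utf-8"))
n_tokens = len(enc.encode(text))
print(f"bytes per token: {n_bytes / n_tokens:.2f}")  # typically around 4 for English prose
```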

The natural question, then, is: why stop at 4x? Is there something better than BPE? There really hasn't been: almost every LLM uses BPE for tokenization, although, as usual, there's a lack of detail about the latest foundation models. In the limit, a perfect compression algorithm should remove all predictable information from the sequence of bytes, leaving an output that is essentially unpredictable; but could a tokenizer that's, say, 8x more efficient than raw text be 2x as good as BPE?

The authors use a variety of compression techniques to train LLMs on ever more compressed text, looking at:

    GZip compression of the raw text.

    Arithmetic coding (AC), with a small language model supplying the probabilities, as well as a static AC setting.

They also use a technique of their own, which splits the text into equal-sized windows that each contain 32 bits of compressed information.

They find that all of the compression schemes (including GZip, which I found surprising) are learnable by standard LLMs, but their best models still underperform subword baselines like BPE. Their method does, however, outperform byte-level baselines.

To a certain extent, this is unsurprising; the goal behind compression is to remove any predictable patterns from the original sequence of bytes, so if we had a perfect compressor, the resulting output would be indistinguishable from random noise. What is surprising, though, is that BPE just happens to be the sweet spot for compression.

How arithmetic coding works is:

    A message is represented by an interval of real numbers between 0 and 1.

    As the message grows, the interval needed to represent it becomes smaller, so the number of bits needed grows.

    The encoder takes as inputs an alphabet, which assigns an ordering to the characters (i.e. a mapping from 0 to n), and a model that assigns probabilities to characters from the alphabet conditioned on the previous characters in the sequence, i.e. p(c_i | c_1, …, c_{i-1}).

    Finally, we get an interval of two floating point numbers that represents the message.

The original paper has a great example describing exactly how this works. The key takeaway is that arithmetic coding gives us a way to use a probability distribution to compress text: the better our model represents the underlying distribution over characters, the more efficient the encoding.
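
To make this concrete, here's a minimal float-based sketch of the encoding loop (my own illustration, not code from the paper; a real coder uses integer arithmetic and emits bits incrementally to avoid running out of floating point precision):

```python
def arithmetic_encode(message, model):
    """Narrow an interval in [0, 1) symbol by symbol.

    `model(prefix)` returns a dict mapping each symbol of the alphabet to its
    probability given the prefix; the alphabet's ordering fixes how the current
    interval is carved up.
    """
    low, high = 0.0, 1.0
    for i, symbol in enumerate(message):
        probs = model(message[:i])
        span = high - low
        cumulative = 0.0
        for sym in sorted(probs):  # fixed ordering over the alphabet
            if sym == symbol:
                low, high = low + cumulative * span, low + (cumulative + probs[sym]) * span
                break
            cumulative += probs[sym]
    return low, high  # any number in [low, high) identifies the message


# Toy usage: a uniform model over a two-character alphabet.
uniform = lambda prefix: {"a": 0.5, "b": 0.5}
print(arithmetic_encode("abba", uniform))  # interval of width 0.5**4 = 0.0625
```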

The authors train a 3M parameter decoder in a fairly standard way, and use it as the probability model for the arithmetic coding encoder.

They use equal information windows, where they encode text into a series of N-bit windows, resetting the AC encoder when it can no longer add bits without exceeding the target bit threshold. Windows represent variable amounts of text, but should represent the same amount of information.

Once they have the compressed sequence of bits, they then create tokens by grouping every N bits into a token, creating a vocabulary of size 2^N. They try with N = 8 and N = 16. This seems suboptimal to me: there's no semantic meaning to the tokens!
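
A sketch of that grouping step, assuming the compressed output is available as a string of bits (the function name and zero-padding are my choices):

```python
def bits_to_tokens(bits: str, n: int = 8) -> list[int]:
    """Group a '0'/'1' bitstring into n-bit token ids, giving a vocabulary of size 2**n."""
    padded_len = -(-len(bits) // n) * n        # round the length up to a multiple of n
    bits = bits.ljust(padded_len, "0")
    return [int(bits[i:i + n], 2) for i in range(0, len(bits), n)]


print(bits_to_tokens("1100101011110000", n=8))   # -> [202, 240]
print(bits_to_tokens("1100101011110000", n=16))  # -> [51952]
```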

The paper has a fascinating table showing how each of these changes weakens the compression ratios:

The authors make a point of using the same hyperparameters for every training run they do. I think this is a mistake; a proper benchmark would tune the hyperparameters for each setting.

Their results are interesting:

They have some really interesting ablations, which show that SentencePiece tokens are much more semantically relevant than EqualInfoAC tokens:

The other ablations are fascinating. This was an excellent paper with strong empirical work. I would encourage you to read it.


Mixture of Depths

[abstract]

A windmill that is constantly being tilted at in decoder-centric LLM research is the fact that each token receives the same amount of compute. This seems clearly wrong. This paper proposes a novel method, Mixture of Depths (MoD), that dynamically allocates FLOPs to specific positions in a sequence.

The obvious comparison is to Mixture of Experts (MoE) models. MoD can be thought of as using the routing logic from MoE models, but with a single expert that tokens can dynamically skip based on the router's decision.

At a high level, the algorithm is:

    Determine a compute budget which limits the number of tokens in a sequence that participate in a given block (say: 50% of the sequence participates in self-attention).

    Use a per-block router to emit scalar weights for each token.

    Select the top-k weights per block and sequence to participate in the block’s computation.

Note how similar this is to a MoE model.
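
To make the routing concrete, here's a rough training-time sketch of what a MoD-wrapped block could look like (the module, the names, and the sigmoid gating are my own assumptions, not the paper's exact recipe; the top-k here is the non-causal version discussed below):

```python
import torch
import torch.nn as nn


class MoDBlock(nn.Module):
    """Wrap any (B, T, D) -> (B, T, D) transformer block with MoD-style routing:
    only the top-k tokens (by router score) go through the block, the rest skip it."""

    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.5):
        super().__init__()
        self.block = block
        self.router = nn.Linear(d_model, 1)   # emits a scalar weight per token
        self.capacity = capacity              # fraction of tokens that participate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, d_model = x.shape
        k = max(1, int(self.capacity * seq_len))
        scores = self.router(x).squeeze(-1)                  # (B, T)
        top = scores.topk(k, dim=-1)                         # non-causal, training-time top-k
        idx = top.indices.sort(dim=-1).values                # keep selected tokens in sequence order
        gather_idx = idx.unsqueeze(-1).expand(bsz, k, d_model)
        selected = torch.gather(x, 1, gather_idx)            # (B, k, D)
        gate = torch.gather(scores, 1, idx).sigmoid().unsqueeze(-1)
        # Gating by the router score gives the router a gradient path.
        updated = selected + gate * self.block(selected)
        # Skipped tokens pass through on the identity path, unchanged.
        return x.scatter(1, gather_idx, updated)
```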

They use expert choice routing, as it removes the need for the oft-complicated auxiliary losses which are required for token-choice routing. The big problem, of course, is that the top-k operation is non-causal, which is why expert-choice routing isn't used in (published) autoregressive MoE models. They use a causal predictor-based approach, which has minimal degradation:

Their results are promising: MoD is a real improvement, lowering loss compared to the isoFLOP vanilla model. Additionally, they find that the MoD improvements compound with the improvements from training a MoE model:

The implications of this paper are fascinating; one can imagine a family of ever more complex routing mechanisms which let every decision become learned.

Sparse Upcycling: Training MoE models from dense checkpoints

[abstract]

The paper proposes a method to initialize a MoE model from a dense one, showing that the upcycled model outperforms the original dense model while using only 50% of the original compute, and also outperforms MoE models trained from scratch with the same compute budget.

The paper makes the fascinating observation that the vast majority of SOTA neural networks are trained from scratch, which, in many ways, is assumed to be the default way that models should be trained. Given the lottery ticket hypothesis, it's not at all clear that this is optimal. Just because the initial weights are chosen randomly doesn't mean they're good; the joke RL researchers like to make, that the random seed is a crucial hyperparameter, is actually a valid tactic when deploying systems into production. If OpenAI could produce a better ChatGPT by using seed 3 rather than seed 5, they'd absolutely do that.

In any case, the paper explores developing cheaper ways of training large models by using existing models. This is much more relevant now than when the paper was released (ICLR 2023, so the work was probably done in the second half of 2022): we’re training much larger models with much more compute and doing so much more often.

Generally speaking, Mixture of Experts models work by having N copies of each layer (we call each copy an expert), and learning to allocate tokens to each expert. The upcycling algorithm they propose creates a MoE model by expanding a subset of the MLP layers in the original, dense model into MoE layers, and copying the remaining layers of the dense model across to the MoE. Each expert is then initialized identically, as a copy of the original MLP, and the routers are randomly initialized. They continue training the model using the same hyperparameters (batch size, learning rate, etc.). They even find that resuming the optimizer state from the original dense checkpoint works well, which I find surprising.
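
Here's a minimal sketch of what that initialization looks like for a single layer (class and argument names are mine; the paper's actual recipe also deals with expert capacity, which layers to expand, and so on):

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class UpcycledMoE(nn.Module):
    """Turn a trained dense MLP into a MoE layer: every expert starts as an
    exact copy of the dense MLP, and only the router is freshly initialized."""

    def __init__(self, dense_mlp: nn.Module, d_model: int, num_experts: int = 8):
        super().__init__()
        self.experts = nn.ModuleList(copy.deepcopy(dense_mlp) for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts)   # random init

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_tokens, d_model)
        gate, expert_idx = F.softmax(self.router(x), dim=-1).max(dim=-1)  # top-1 routing, for brevity
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out
```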

The results are, bluntly, quite good. The slope of improvement in the validation metrics seems to change immediately. It seems as if by upcycling the model, new avenues of performance are immediately unlocked.

Given the earlier results from the Mixture of Depths paper, which suggested that MoD models compose with MoE models, this suggests a natural experiment: train a MoD model to some level of performance, and then upcycle it to a MoE/MoD model. As an engineer, I shudder at the resulting complexity, but it should have a significant jump in performance.

Replication of Chinchilla

https://twitter.com/borgeaud_s/status/1780988694163321250

[abstract]

Finally, this was quite short, but an exciting development that gave me faith in research. A team at Epoch AI, an independent research institute that does excellent work, tried reproducing Chinchilla by transcribing the data from the pixels of the chart. Let me repeat that: they reproduced Chinchilla by transcribing the data from the pixels of the original paper. And! What's more! They found an error in the paper that caused the original authors to issue a correction and promise to release the data. Truly excellent work on their part.

The paper stems from the observation that the reported confidence intervals for Approach 3 of the Chinchilla paper are extremely narrow ([0.454, 0.455] and [0.542, 0.543] for parameters a and b):

Given that they only had approximately 400 observations, that's implausibly tight. This led Epoch to recreate the data and fit their own model, which revealed that the original Chinchilla paper had an error in its estimation code, so the equation it reported didn't actually fit the data particularly well.

The revised results imply that you can lean much more on the compute side of the compute/data tradeoff:

If we revisit the excellent “chinchilla’s wild implications” article by Nostalgebraist and plug in the numbers for GPT-3, Chinchilla, and Gopher, we get that:

Across the board, the data term's contribution to the loss shrinks, while the relative contribution of model size grows. So the conclusion from the updated fit is that the number of tokens is less important than originally thought, though GPT-3 era models were still trained on not nearly enough data. This implies that a GPT-3 style approach, training a large model on less data, is more reasonable than the original Chinchilla paper suggested.
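
For anyone who wants to redo the arithmetic, the comparison just plugs model and data sizes into the fitted parametric loss. The sketch below uses the (approximate) original Hoffmann et al. coefficients; swapping in Epoch's corrected estimates is what shifts the balance between the two terms:

```python
# L(N, D) = E + A / N**alpha + B / D**beta, with the (approximate) original
# Chinchilla fit. Substitute Epoch AI's corrected coefficients to compare.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28


def loss_terms(n_params: float, n_tokens: float) -> tuple[float, float, float]:
    model_term = A / n_params**alpha   # reducible loss attributed to finite model size
    data_term = B / n_tokens**beta     # reducible loss attributed to finite data
    return model_term, data_term, E + model_term + data_term


for name, n, d in [("GPT-3", 175e9, 300e9), ("Chinchilla", 70e9, 1.4e12), ("Gopher", 280e9, 300e9)]:
    print(name, loss_terms(n, d))
```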

In practice, it’s not clear how much this matters, because everyone is trying to get as much data as possible and train the biggest model they can afford, but it does show that on the margin, spending more money on GPUs and less on data labellers is worth doing (did Jensen sponsor this work? 🤔).

Misc AI articles I’m reading:

