MIT News - Machine learning, September 25
Researchers develop a guide for optimizing the cost of training large language models

 

Training large language models (LLMs) is expensive, and researchers often rely on scaling laws to predict model performance. However, there are many possible ways to build a scaling law. Researchers at MIT and the MIT-IBM Watson AI Lab collected and analyzed hundreds of models and their performance metrics to construct a meta-analysis and guide for choosing small models and estimating scaling laws across different LLM model families. This helps practitioners optimize their compute budgets and obtain reliable performance predictions. The study highlights strategies such as using intermediate training checkpoints, training several model sizes, and partially training the target model to improve the predictive power and efficiency of scaling laws, and the researchers plan to extend the approach to model inference.

💰 **Optimizing the compute budget to predict LLM performance**: Training large language models (LLMs) is enormously costly, and researchers use scaling laws to predict model performance, but choosing an appropriate scaling law is challenging. A team from MIT and the MIT-IBM Watson AI Lab analyzed models at scale to produce a guide for selecting small models and estimating scaling laws, aiming to help developers obtain reliable performance predictions as efficiently as possible under limited computational and financial budgets.

📈 **Using intermediate training data and model diversity to improve prediction accuracy**: The study found that using checkpoints from partway through training, rather than relying only on final losses, and training more models of different sizes significantly improve the predictive power and robustness of scaling laws. In addition, a partially trained target model (about 30 percent of its dataset) can be used for extrapolation, saving cost. Very early training data (fewer than 10 billion tokens) introduces noise and should be discarded.

📚 **Generality of scaling laws and future directions**: By analyzing different model families, the study found strong correlations among hyperparameters across families, with a few hyperparameters explaining most of the variation, suggesting that scaling laws generalize to some degree across models. The researchers also plan to extend the approach to model inference, predicting how much compute a model will need at run time, to further improve the efficiency of deploying LLMs.

When researchers are building large language models (LLMs), they aim to maximize performance under a particular computational and financial budget. Since training a model can amount to millions of dollars, developers need to be judicious with cost-impacting decisions about, for instance, the model architecture, optimizers, and training datasets before committing to a model. To anticipate the quality and accuracy of a large model’s predictions, practitioners often turn to scaling laws: using smaller, cheaper models to try to approximate the performance of a much larger target model. The challenge, however, is that there are thousands of ways to create a scaling law.

New work from MIT and MIT-IBM Watson AI Lab researchers addresses this by amassing and releasing a collection of hundreds of models and metrics concerning training and performance to approximate more than a thousand scaling laws. From this, the team developed a meta-analysis and guide for how to select small models and estimate scaling laws for different LLM model families, so that the budget is optimally applied toward generating reliable performance predictions.

“The notion that you might want to try to build mathematical models of the training process is a couple of years old, but I think what was new here is that most of the work that people had been doing before is saying, ‘can we say something post-hoc about what happened when we trained all of these models, so that when we’re trying to figure out how to train a new large-scale model, we can make the best decisions about how to use our compute budget?’” says Jacob Andreas, associate professor in the Department of Electrical Engineering and Computer Science and principal investigator with the MIT-IBM Watson AI Lab.

The research was recently presented at the International Conference on Machine Learning by Andreas, along with MIT-IBM Watson AI Lab researchers Leshem Choshen and Yang Zhang of IBM Research.

Extrapolating performance

No matter how you slice it, developing LLMs is an expensive endeavor: from decision-making regarding the numbers of parameters and tokens, data selection and size, and training techniques to determining output accuracy and tuning to the target applications and tasks. Scaling laws offer a way to forecast model behavior by relating a large model’s loss to the performance of smaller, less-costly models from the same family, avoiding the need to fully train every candidate. Mainly, the differences between the smaller models are the number of parameters and the number of training tokens. According to Choshen, elucidating scaling laws not only enables better pre-training decisions, but also democratizes the field by enabling researchers without vast resources to understand and build effective scaling laws.

The functional form of scaling laws is relatively simple, incorporating components from the small models that capture the number of parameters and their scaling effect, the number of training tokens and their scaling effect, and the baseline performance for the model family of interest. Together, they help researchers estimate a target large model’s performance loss; the smaller the loss, the better the target model’s outputs are likely to be.
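For readers who want the formula made explicit: a widely used five-parameter form consistent with this description (a standard choice in the scaling-law literature, not necessarily the exact parameterization used in the new paper) predicts the loss L of a model with N parameters trained on D tokens as

$$ L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} $$

where E is the baseline (irreducible) loss for the model family, and (A, α) and (B, β) capture how the loss falls as parameters and training tokens grow. These five quantities are the hyperparameters fit from the smaller models.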

These laws allow research teams to weigh trade-offs efficiently and to test how best to allocate limited resources. They’re particularly useful for evaluating scaling of a certain variable, like the number of tokens, and for A/B testing of different pre-training setups.

In general, scaling laws aren’t new; however, in the field of AI, they emerged as models grew and costs skyrocketed. “It’s like scaling laws just appeared at some point in the field,” says Choshen. “They started getting attention, but no one really tested how good they are and what you need to do to make a good scaling law.” Further, scaling laws were themselves also a black box, in a sense. “Whenever people have created scaling laws in the past, it has always just been one model, or one model family, and one dataset, and one developer,” says Andreas. “There hadn’t really been a lot of systematic meta-analysis, as everybody is individually training their own scaling laws. So, [we wanted to know,] are there high-level trends that you see across those things?”

Building better

To investigate this, Choshen, Andreas, and Zhang created a large dataset. They collected LLMs from 40 model families, including Pythia, OPT, OLMO, LLaMA, Bloom, T5-Pile, ModuleFormer mixture-of-experts, GPT, and other families. These included 485 unique, pre-trained models, and where available, data about their training checkpoints, computational cost (FLOPs), training epochs, and the seed, along with 1.9 million performance metrics of loss and downstream tasks. The models differed in their architectures, weights, and so on. Using these models, the researchers fit over 1,000 scaling laws and compared their accuracy across architectures, model sizes, and training regimes, as well as testing how the number of models, the inclusion of intermediate training checkpoints, and partial training affected the predictive power of scaling laws for target models. They used absolute relative error (ARE): the absolute difference between the scaling law’s prediction and the observed loss of a large, trained model, relative to that observed loss. With this, the team compared the scaling laws, and after analysis, distilled practical recommendations for AI practitioners about what makes effective scaling laws.
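As a concrete, hedged illustration of this procedure, here is a minimal Python sketch (placeholder numbers and a standard least-squares fit, not the authors’ released code or data) of fitting the five-parameter law above and scoring it with ARE:

```python
# Minimal sketch (placeholder numbers, not the authors' released code): fit the
# five-parameter scaling law to small-model observations, then score the fit on
# a held-out large model with absolute relative error (ARE).
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(ND, E, A, alpha, B, beta):
    """Predicted loss for a model with N parameters trained on D tokens."""
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Hypothetical (parameters, tokens, loss) observations from one model family,
# including intermediate checkpoints of the smaller models.
N_obs = np.array([70e6, 160e6, 410e6, 1.0e9, 1.4e9, 2.8e9])
D_obs = np.array([20e9, 40e9, 80e9, 150e9, 300e9, 600e9])
loss_obs = np.array([3.10, 2.79, 2.53, 2.35, 2.26, 2.15])

# Fit the five hyperparameters from a rough initial guess.
popt, _ = curve_fit(scaling_law, (N_obs, D_obs), loss_obs,
                    p0=[2.0, 400.0, 0.3, 400.0, 0.3], maxfev=20000)

# Extrapolate to a hypothetical 7B-parameter target trained on 1T tokens and
# compare against its (placeholder) observed loss.
predicted = scaling_law((7e9, 1e12), *popt)
observed = 2.07  # placeholder value, for illustration only
are = abs(predicted - observed) / observed
print(f"predicted loss {predicted:.2f}, ARE {are:.1%}")
```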

Their shared guidelines walk developers through the steps and options to consider, and what to expect. First, it’s critical to decide on a compute budget and a target model accuracy. The team found that 4 percent ARE is about the best achievable accuracy one could expect due to random seed noise, but up to 20 percent ARE is still useful for decision-making. The researchers identified several factors that improve predictions, like including intermediate training checkpoints rather than relying only on final losses; this made scaling laws more reliable. However, very early training data, before about 10 billion tokens, is noisy, reduces accuracy, and should be discarded. They recommend prioritizing training models across a spread of sizes, not just larger ones, to improve the robustness of the scaling law’s predictions; selecting five models provides a solid starting point.
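In code, these data-selection guidelines might look roughly like the following (a hypothetical record format and helper, not part of the published work):

```python
# Sketch of the data-selection guidelines (hypothetical record format, not the
# authors' code): drop checkpoints before 10B tokens and keep intermediate
# checkpoints from a spread of roughly five model sizes.
import numpy as np

MIN_TOKENS = 10e9    # early-training checkpoints below ~10B tokens are noisy
TARGET_ARE = 0.04    # ~4% ARE is near the seed-noise floor; up to 20% is still usable

def select_fit_points(checkpoints, num_model_sizes=5):
    """checkpoints: iterable of dicts like {"params": 4.1e8, "tokens": 8e10, "loss": 2.53}."""
    usable = [c for c in checkpoints if c["tokens"] >= MIN_TOKENS]
    sizes = sorted({c["params"] for c in usable})
    # Spread the chosen sizes across the available range rather than taking
    # only the largest models.
    idx = np.linspace(0, len(sizes) - 1, num=min(num_model_sizes, len(sizes)), dtype=int)
    chosen = {sizes[i] for i in idx}
    return [c for c in usable if c["params"] in chosen]
```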

Generally, including larger models improves prediction, but costs can be saved by partially training the target model to about 30 percent of its dataset and using that for extrapolation. If the budget is considerably constrained, developers should consider training one smaller model within the target model family and borrow scaling law parameters from a model family with similar architecture; however, this may not work for encoder–decoder models. Lastly, the MIT-IBM research group found that when scaling laws were compared across model families, there was strong correlation between two sets of hyperparameters, meaning that three of the five hyperparameters explained nearly all of the variation and could likely capture the model behavior. Together, these guidelines provide a systematic approach to making scaling law estimation more efficient, reliable, and accessible for AI researchers working under varying budget constraints.
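Continuing the earlier fitting sketch, the partial-training idea amounts to adding the target model’s own early checkpoints to the fit and extrapolating to the full token budget (all numbers below are placeholders):

```python
# Continuing the earlier fitting sketch (reuses scaling_law, N_obs, D_obs,
# loss_obs, curve_fit, and np from above): add the target model's checkpoints
# from the first ~30% of its planned 1T-token run, refit, and extrapolate.
target_params = 7e9
planned_tokens = 1e12

partial_ckpts = [                      # (params, tokens, loss) at 10/20/30%
    (target_params, 0.1 * planned_tokens, 2.22),
    (target_params, 0.2 * planned_tokens, 2.15),
    (target_params, 0.3 * planned_tokens, 2.12),
]

N_fit = np.concatenate([N_obs, [c[0] for c in partial_ckpts]])
D_fit = np.concatenate([D_obs, [c[1] for c in partial_ckpts]])
loss_fit = np.concatenate([loss_obs, [c[2] for c in partial_ckpts]])

popt, _ = curve_fit(scaling_law, (N_fit, D_fit), loss_fit,
                    p0=[2.0, 400.0, 0.3, 400.0, 0.3], maxfev=20000)
print("predicted full-run loss:",
      round(scaling_law((target_params, planned_tokens), *popt), 2))
```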

Several surprises arose during this work: small models partially trained are still very predictive, and further, the intermediate training stages from a fully trained model can be used (as if they are individual models) for prediction of another target model. “Basically, you don’t pay anything in the training, because you already trained the full model, so the half-trained model, for instance, is just a byproduct of what you did,” says Choshen. Another feature Andreas pointed out was that, when aggregated, the variability across model families and different experiments jumped out and was noisier than expected. Unexpectedly, the researchers found that it’s possible to use scaling laws fit on large models to predict the performance of smaller models. Other research in the field has hypothesized that smaller models were a “different beast” compared to large ones; however, Choshen disagrees. “If they’re totally different, they should have shown totally different behavior, and they don’t.”

While this work focused on model training time, the researchers plan to extend their analysis to model inference. Andreas says it’s not, “how does my model get better as I add more training data or more parameters, but instead as I let it think for longer, draw more samples. I think there are definitely lessons to be learned here about how to also build predictive models of how much thinking you need to do at run time.” He says the theory of inference time scaling laws might become even more critical because, “it’s not like I'm going to train one model and then be done. [Rather,] it’s every time a user comes to me, they’re going to have a new query, and I need to figure out how hard [my model needs] to think to come up with the best answer. So, being able to build those kinds of predictive models, like we’re doing in this paper, is even more important.”

This research was supported, in part, by the MIT-IBM Watson AI Lab and a Sloan Research Fellowship. 
