arXiv:2509.25380v1 Announce Type: cross Abstract: Data curricula have become central to successful LLM training, yet the principles governing optimal data placement remain unclear. We introduce the training re-evaluation curve (TREC), a diagnostic that retrospectively evaluates training batches using the final model weights. The TREC characterizes how well a trained model retains training data as a function of when that data was encountered during training. Analyzing TRECs for models from 111M to 3.9B parameters, we show that placing high-quality data at low points on the TREC significantly improves performance. Importantly, although a TREC can only be observed after training, we demonstrate that it can be predicted in advance from AdamW's implicit EMA coefficients, enabling proactive curriculum design. By predicting TRECs for published training recipes, we explain prior ablations and reveal suboptimal data placements. We also align high-quality data with TREC minima to improve continual pre-training of a 3.9B-parameter LLM trained on 900B tokens.
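The abstract says a TREC can be predicted in advance from AdamW's implicit EMA coefficients, but does not give the formula. Below is a minimal sketch of one way such a prediction could look, assuming that under decoupled weight decay the contribution of the batch seen at step t to the final weights shrinks by roughly (1 - lr_s * wd) at every later step s. The `predicted_trec` helper, the warmup-plus-cosine schedule, and all numeric values are illustrative assumptions, not the paper's actual recipe or definition.

```python
import numpy as np

def predicted_trec(learning_rates, weight_decay):
    """Rough TREC proxy: the retained influence of the batch seen at step t,
    modeled as the product of AdamW's implicit per-step EMA factors
    (1 - lr_s * weight_decay) over all subsequent steps s > t.
    If the actual TREC is a re-evaluation loss, its low points (well-retained
    data) would plausibly correspond to maxima of this retention proxy."""
    lr = np.asarray(learning_rates, dtype=np.float64)
    decay = 1.0 - lr * weight_decay            # per-step shrink factor on older contributions
    tail_prod = np.cumprod(decay[::-1])[::-1]  # tail_prod[t] = prod_{s >= t} decay[s]
    retention = np.empty_like(tail_prod)
    retention[:-1] = tail_prod[1:]             # shift so the product runs over s > t
    retention[-1] = 1.0                        # last batch: empty product
    return retention

# Illustrative schedule: linear warmup then cosine decay (hypothetical values).
steps, warmup = 10_000, 500
peak_lr, wd = 3e-4, 0.1
t = np.arange(steps)
lrs = np.where(
    t < warmup,
    peak_lr * (t + 1) / warmup,
    0.5 * peak_lr * (1 + np.cos(np.pi * (t - warmup) / (steps - warmup))),
)

proxy = predicted_trec(lrs, wd)
for step in (0, warmup, steps // 2, steps - 1):
    print(f"step {step:>5d}: predicted retention {proxy[step]:.4f}")
```

Under this reading, the predicted curve depends only on the learning-rate schedule and weight decay, so it can be computed before training and used to decide where in the token stream to place high-quality data; how the paper maps this EMA view onto the measured TREC is not specified in the abstract.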
