GradES：基于梯度的Transformer早期停止方法

cs.AI updates on arXiv.org 09月03日

GradES：基于梯度的Transformer早期停止方法

本文提出GradES，一种基于梯度的早期停止方法，用于Transformer模型训练。通过监控梯度变化，GradES在训练过程中动态调整参数更新，有效减少验证计算量，提高训练效率，同时提升模型泛化能力。

arXiv:2509.01842v1 Announce Type: cross Abstract: Early stopping monitors global validation loss and halts all parameter updates simultaneously, which is computationally costly for large transformers due to the extended time required for validation inference. We propose GradES, a novel gradient-based early stopping approach that operates within transformer components (attention projections and Feed-Forward layer matrices). We found that different components converge at varying rates during fine-tuning. GradES tracks the magnitude of gradients in backpropagation for these matrices during training. When a projection matrix's gradients fall below a convergence threshold $\tau$, we exclude that projection matrix from further updates individually, eliminating costly validation passes while allowing slow converging matrices to continue learning. By strategically freezing parameters when their gradients converge, GradES speeds up training time by 1.57--7.22$\times$ while simultaneously enhancing generalization through early prevention of overfitting, resulting in 1.2% higher average accuracy.

Fish AI Reader

AI辅助创作，多种专业模板，深度分析，高质量内容生成。从观点提取到深度思考，FishAI为您提供全方位的创作支持。新版本引入自定义参数，让您的创作更加个性化和精准。

FishAI

鱼阅，AI 时代的下一个智能信息助手，助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

GradES Transformer 早期停止梯度训练效率

相关文章

Import AI 364: Robot scaling laws; human-level LLM forecasting; and Claude 3

Trends in Computer Vision with Georgia Gkioxari - #549

Social Commonsense Reasoning with Yejin Choi - #518

Trends in Natural Language Processing with Sameer Singh - #445

AI趨勢周報第252期：取代Transformer？LSTM之父發表新LLM架構

How ‘Chain of Thought’ Makes Transformers Smarter

This AI Paper by Toyota Research Institute Introduces SUPRA: Enhancing Transformer Efficiency with Recurrent Neural Networks

This AI Paper from Huawei Introduces a Theoretical Framework Focused on the Memorization Process and Performance Dynamics of Transformer-based Language Models (LMs)

Octo: An Open-Sourced Large Transformer-based Generalist Robot Policy Trained on 800k Trajectories from the Open X-Embodiment Dataset

惊喜发现又祛魅一项能力：读论文 CS 专业一路走来被论文折磨，现以为脱离苦海，但又不得不紧跟看 LLM SD 论文，痛点就是：看不下去，精神涣散?‍♂️啃能读完...