machinelearning apple · October 1, 03:43
Research on Optimization Strategies for Quantization-Aware Training

This work studies optimization strategies for quantization-aware training (QAT). Through experiments it determines the optimal ratio of QAT to full-precision training, and it proposes a new method that fuses the learning-rate cooldown with QAT to make effective use of compute.

Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior accuracy compared to QAT alone. However, the optimal allocation of compute between the FP and QAT phases remains unclear. We conduct extensive experiments with various compute budgets, QAT bit widths, and model sizes from 86.0M to 2.2B parameters to investigate how different QAT durations impact final performance. We demonstrate that, contrary to previous findings, the loss-optimal ratio of QAT to FP training increases with the total amount of compute. Moreover, the optimal fraction can be accurately predicted for a wide range of model sizes and quantization widths using the tokens-per-parameter-byte statistic. From experimental data, we derive a loss scaling law that predicts both optimal QAT ratios and final model performance across different QAT/FP compute allocation strategies and QAT bit widths. We use the scaling law to make further predictions, which we verify experimentally, including which QAT bit width is optimal under a given memory constraint and how QAT accuracy with different bit widths compares to full-precision model accuracy. Additionally, we propose a novel cooldown and QAT fusion approach that performs learning rate decay jointly with quantization-aware training, eliminating redundant full-precision model updates and achieving significant compute savings. These findings provide practical insights into efficient QAT planning and enable the training of higher-quality quantized models with the same compute budget.
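As background for the abstract above, the core QAT mechanic is "fake quantization": weights are quantized and immediately dequantized in the forward pass, so the network trains against quantization error. The sketch below is illustrative only; the symmetric per-tensor scheme, the function names, and the round-to-nearest choice are assumptions, not the paper's exact method. It also computes the tokens-per-parameter-byte statistic that the abstract uses to predict the optimal QAT fraction.

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Simulate signed b-bit symmetric per-tensor quantization.

    Returns dequantized weights for the forward pass; in real QAT the
    gradient bypasses the rounding via the straight-through estimator.
    (A common QAT building block; the paper's exact scheme may differ.)
    """
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit signed
    scale = max(np.max(np.abs(w)), 1e-12) / qmax  # guard against all-zero w
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # integer grid
    return q * scale                              # back to float

def tokens_per_parameter_byte(tokens, params, bits):
    """Training tokens divided by the quantized model's size in bytes."""
    return tokens / (params * bits / 8)
```

For example, a 2.2B-parameter model quantized to 4 bits occupies 1.1 GB, so training on 1T tokens gives roughly 909 tokens per parameter-byte; per the abstract, this single statistic predicts the loss-optimal QAT fraction across model sizes and bit widths.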

