Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
Dayal Singh Kalra, Maissam Barkeshli
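AI summary: This paper studies how to quantify hyperparameter transfer, evaluating transfer quality with three metrics. It finds that the advantage of the Maximal Update (μP) parameterization in hyperparameter transfer comes from maximizing the embedding layer learning rate, and that weight decay improves scaling law fits but reduces the robustness of extrapolation.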
Comments: 10+28 pages, 5+17 figures
Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update ($μ$P), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why $μ$P appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of $μ$P relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match $μ$P dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, while, in the fixed token-per-parameter setting, it hurts the robustness of the extrapolation.
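The embedding-layer adjustment described in the abstract can be illustrated with a minimal sketch. It assumes a single base learning rate shared by all layers under SP, and reads "increasing it by a factor of width" literally as multiplying the embedding learning rate by the model width. The function name, the parameter names, and the `embed_keyword` matching rule are hypothetical and chosen only for illustration; the abstract does not specify the exact normalization (for example, width relative to a base width), so this is one possible reading rather than the authors' implementation.

```python
from typing import Dict, List


def sp_lr_with_boosted_embedding(
    param_names: List[str],
    base_lr: float,
    width: int,
    embed_keyword: str = "embed",  # hypothetical naming convention
) -> Dict[str, float]:
    """Map each parameter name to a learning rate.

    All parameters keep the shared base learning rate, except those whose
    name contains `embed_keyword`, which get base_lr * width (an assumed
    literal reading of "a factor of width").
    """
    if width <= 0:
        raise ValueError("width must be a positive integer")
    lrs: Dict[str, float] = {}
    for name in param_names:
        if embed_keyword in name:
            # Embedding layer: learning rate scaled up by the width.
            lrs[name] = base_lr * width
        else:
            # Every other layer: unchanged shared learning rate.
            lrs[name] = base_lr
    return lrs


# Illustrative parameter names and values, not taken from the paper.
names = [
    "embed.weight",
    "blocks.0.attn.weight",
    "blocks.0.mlp.weight",
    "lm_head.weight",
]
lrs = sp_lr_with_boosted_embedding(names, base_lr=1e-3, width=256)
for name, lr in lrs.items():
    print(f"{name}: {lr}")
```

In practice such a mapping would typically be converted into per-parameter optimizer groups, for instance the list of dicts with `params` and `lr` keys that PyTorch optimizers accept. Keeping it as a plain dictionary here avoids any framework dependency.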