cs.LG

Why embedding layer learning rate matters more than you think

Dayal Singh Kalra, Maissam Barkeshli

May 20, 2026

Training massive language models requires finding hyperparameters that work across scales. This paper explains why μP, a popular method for transferring hyperparameters from small to large models, actually works: it simply increases the embedding layer's learning rate, which otherwise becomes a critical bottleneck in standard training. The authors quantify hyperparameter transfer with three metrics and show that this one change, boosting the embedding layer's learning rate by a factor of model width, smooths training and improves scaling. Weight decay, by contrast, improves fits but hurts robustness.
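To make the central change concrete, here is a minimal sketch in plain Python of assigning per-parameter learning rates so that the embedding layer's rate is the base rate multiplied by the model width. The function name `layerwise_learning_rates`, the `embed` name prefix, the parameter names, and the numeric values are all hypothetical choices for illustration. The summary above does not specify the optimizer or the exact parameterization, so treat this as an illustration of the idea rather than the authors' implementation.

```python
def layerwise_learning_rates(base_lr, width, param_names, embedding_prefix="embed"):
    """Map each parameter name to a learning rate.

    Assumption (not from the paper's code): embedding parameters are
    identified by a name prefix and get base_lr * width; every other
    parameter keeps base_lr.
    """
    rates = {}
    for name in param_names:
        if name.startswith(embedding_prefix):
            # Boost the embedding layer's learning rate by the model width.
            rates[name] = base_lr * width
        else:
            rates[name] = base_lr
    return rates


# Hypothetical parameter names for a small transformer.
names = ["embed.weight", "block0.attn.weight", "block0.mlp.weight", "head.weight"]
rates = layerwise_learning_rates(base_lr=1e-3, width=256, param_names=names)
for name, lr in rates.items():
    print(f"{name}: {lr:g}")
```

In a real training setup the same mapping would typically be expressed through the optimizer's per-parameter-group options, with the embedding parameters placed in their own group.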
Published as "Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate", arXiv:2605.21486