Why embedding layer learning rate matters more than you think
Dayal Singh Kalra, Maissam Barkeshli
May 20, 2026
Training massive language models requires hyperparameters that keep working as models scale. This paper explains why μP, a popular method for transferring hyperparameters from small models to large ones, works: in effect, it raises the embedding layer's learning rate, which is a critical bottleneck in standard training. The authors quantify hyperparameter transfer with three metrics and show that this one change, multiplying the embedding layer's learning rate by the model width, smooths training and improves scaling. Weight decay, by contrast, improves fits but hurts robustness.
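To make the central change concrete, here is a minimal sketch of giving the embedding layer its own learning rate, scaled by model width, while every other parameter keeps the base rate. The function name `embedding_lr_groups`, the `embed_key` substring match, the parameter names, and the numeric values are all hypothetical and chosen for illustration; the summary does not give the paper's exact parameterization, so this should not be read as the authors' implementation.

```python
def embedding_lr_groups(param_names, base_lr, width, embed_key="embed"):
    """Split parameter names into two groups with separate learning rates.

    Assumption: embedding parameters are identified by a substring match on
    their name, and their learning rate is base_lr multiplied by the width.
    """
    embed = [n for n in param_names if embed_key in n]
    other = [n for n in param_names if embed_key not in n]
    return [
        {"names": embed, "lr": base_lr * width},  # boosted embedding rate
        {"names": other, "lr": base_lr},          # all remaining parameters
    ]


# Hypothetical parameter names for a small transformer of width 256.
names = [
    "embed.weight",
    "blocks.0.attn.weight",
    "blocks.0.mlp.weight",
    "lm_head.weight",
]
groups = embedding_lr_groups(names, base_lr=1e-3, width=256)
for group in groups:
    print(group["lr"], group["names"])
```

In a real training setup the same grouping could be passed to an optimizer that accepts per-group options, with the name lists replaced by the corresponding parameter tensors.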