Why the embedding layer's learning rate matters more than you think

Training massive language models requires hyperparameters that keep working as the model scales. This paper explains why μP, a popular method for transferring hyperparameters from small models to large ones, actually works: it simply increases the embedding layer's learning rate, which becomes a critical bottleneck under standard training. The authors quantify hyperparameter transfer with three metrics and show that this one change, boosting the embedding layer's learning rate by a factor of the model width, smooths training and improves scaling. Weight decay, by contrast, improves the fits but hurts robustness.
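To make the "one change" concrete, here is a minimal sketch of per-layer learning rates in which only the embedding layer is boosted by the model width. The function name `build_lr_groups`, the parameter names, the `embed` prefix, and the values `base_lr=1e-4` and `width=256` are all illustrative assumptions, not taken from the paper. The paper's exact parameterization, including how it treats the output layer, may differ.

```python
def build_lr_groups(param_names, base_lr, width, embedding_prefix="embed"):
    """Map each parameter name to a learning rate.

    Assumption for illustration: parameters whose name starts with
    `embedding_prefix` get base_lr * width; all others keep base_lr.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    groups = {}
    for name in param_names:
        # Boost only the embedding layer by a factor of the model width.
        scale = width if name.startswith(embedding_prefix) else 1
        groups[name] = base_lr * scale
    return groups


# Hypothetical parameter names for a small transformer.
names = [
    "embed.weight",
    "blocks.0.attn.weight",
    "blocks.0.mlp.weight",
    "lm_head.weight",
]
lrs = build_lr_groups(names, base_lr=1e-4, width=256)
for name, lr in lrs.items():
    print(f"{name}: {lr:.4g}")
```

In a real training setup, a mapping like this would typically be turned into per-parameter-group options for the optimizer, so that the embedding group carries its own learning rate while the remaining layers share the base value.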