Why does AdaGrad work when gradients go haywire?
Zijian Liu
May 18, 2026
Training machine learning models often runs into extreme gradient noise that breaks standard optimizers. This work proves that AdaGrad converges reliably under such heavy-tailed noise without gradient clipping or normalization, and that it adapts automatically to the severity of the noise. The convergence rate is tight enough to show that AdaGrad cannot match the theoretically optimal rate for this setting, a gap between practice and theory worth understanding.
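For readers who want the setting in concrete form, here is a minimal sketch of the textbook per-coordinate AdaGrad update, run on a toy quadratic whose gradients are corrupted by Student-t noise. Everything in it is an illustrative assumption rather than the paper's setup: the summary does not say which AdaGrad variant is analyzed, and the names `adagrad` and `make_noisy_quadratic_grad`, the objective, and the choice of `df=1.5` (finite mean, infinite variance) are hypothetical.

```python
import numpy as np


def adagrad(grad_fn, x0, lr=1.0, steps=1000, eps=1e-8):
    """Textbook diagonal AdaGrad, with no clipping and no normalization.

    grad_fn returns a (possibly noisy) gradient estimate at x.
    """
    x = np.asarray(x0, dtype=float).copy()
    accum = np.zeros_like(x)  # running sum of squared gradients, per coordinate
    for _ in range(steps):
        g = grad_fn(x)
        accum += g * g
        # The current gradient is already inside accum, so |g| / sqrt(accum) <= 1
        # and each coordinate moves by at most lr, however large g is.
        x -= lr * g / (np.sqrt(accum) + eps)
    return x


def make_noisy_quadratic_grad(rng, df=1.5):
    """Gradient oracle for f(x) = 0.5 * ||x||^2 plus heavy-tailed noise.

    Assumption for illustration: Student-t noise with df=1.5, which has a
    finite mean but infinite variance.
    """
    def grad_fn(x):
        return x + rng.standard_t(df, size=x.shape)
    return grad_fn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = np.full(10, 5.0)
    x_final = adagrad(make_noisy_quadratic_grad(rng), x0, lr=1.0, steps=2000)
    print("initial distance to optimum:", np.linalg.norm(x0))
    print("final distance to optimum:  ", np.linalg.norm(x_final))
```

One property of this sketch that can be checked directly: because the accumulator includes the current squared gradient, a single extreme sample moves each coordinate by at most `lr`. That is an observation about the code, not a restatement of the paper's proof.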