Why optimizers trained on validation loss fail at deployment
Thomas T. Zhang, Alok Shah, Yifei Zhang, Vincent Zhang, Nikolai Matni, Max Simchowitz
June 4, 2026
Neural networks trained on next-step prediction, such as language models, perform worse at deployment than their validation loss suggests once they roll out predictions on their own outputs. This test-time feedback problem grows with task length. Double-preconditioning (DoPr) combines gradient-wise preconditioning (as in Adam) with activation-wise preconditioning (as in KFAC) to reduce error compounding. The method improves downstream performance (task success, generation quality) across language modeling, generative models, and robot control, yet surprisingly does not consistently improve validation loss, suggesting that standard evaluation metrics miss something critical.
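The gap between one-step validation error and rollout error can be illustrated with a toy example. The sketch below is not from the paper: it assumes a scalar linear system `x_{t+1} = a * x_t` and a hypothetical learned model whose coefficient is off by 1%. The names `one_step_error` and `rollout_error` and all numeric values are illustrative assumptions.

```python
# Toy illustration (assumed setup, not the paper's experiments):
# a model with small one-step error drifts when it consumes its own outputs.

def one_step_error(a_true, a_model, x):
    """Error of a single prediction made from a ground-truth input x."""
    return abs(a_model * x - a_true * x)


def rollout_error(a_true, a_model, x0, horizon):
    """Error after `horizon` steps when the model is fed its own predictions."""
    x_true, x_model = x0, x0
    for _ in range(horizon):
        x_true = a_true * x_true
        x_model = a_model * x_model  # feedback: the prediction becomes the next input
    return abs(x_model - x_true)


a_true, a_model, x0 = 1.0, 1.01, 1.0  # hypothetical values: 1% coefficient error

validation_like = one_step_error(a_true, a_model, x0)
errors = {h: rollout_error(a_true, a_model, x0, h) for h in (1, 10, 100)}

print(f"one-step error: {validation_like:.4f}")
for horizon, err in errors.items():
    print(f"rollout error at horizon {horizon}: {err:.4f}")
```

In this toy setting the one-step error stays fixed while the rollout error grows faster than linearly with the horizon, which is the kind of compounding that a validation loss measured on ground-truth inputs does not capture.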