Why neural networks suddenly learn: two hidden speeds exposed

Neural networks sometimes generalize only long after fitting their training data (grokking), or show test performance that worsens and then recovers (double descent). Neither behavior is what a single, steadily falling loss curve would predict. This work splits learning into two processes: building useful representations in the hidden layers, and tuning the final classifier to those representations. Using representation geometry and neural tangent kernels across diverse tasks, the authors show that both processes run throughout training, and that the anomalies arise when one speeds up or stalls relative to the other. The framework counters the "networks lazily memorize, then suddenly learn" story and identifies cases where apparent delayed generalization is actually representation degradation or misalignment.
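
One rough way to see the two processes separately is to score the hidden features with a freshly fitted linear probe and compare that with the accuracy of the network's own final layer. The sketch below is an illustration under stated assumptions, not the authors' method: it swaps in a closed-form ridge probe for their representation-geometry and neural-tangent-kernel analysis, and the names `fit_linear_probe`, `decompose`, and the synthetic data are hypothetical. A high probe score with a low head score would suggest the representation is usable but the classifier is not yet aligned to it.

```python
import numpy as np


def fit_linear_probe(feats, labels, num_classes, ridge=1e-3):
    """Closed-form ridge regression from features onto one-hot targets.

    Assumption: no intercept term, so features should be roughly centered.
    """
    onehot = np.eye(num_classes)[labels]
    gram = feats.T @ feats + ridge * np.eye(feats.shape[1])
    return np.linalg.solve(gram, feats.T @ onehot)  # shape (d, num_classes)


def accuracy(feats, weights, labels):
    """Fraction of rows whose argmax logit matches the label."""
    preds = np.argmax(feats @ weights, axis=1)
    return float(np.mean(preds == labels))


def decompose(train_feats, train_labels, test_feats, test_labels, head, num_classes):
    """Compare what the features could support with what the current head achieves."""
    probe = fit_linear_probe(train_feats, train_labels, num_classes)
    probe_acc = accuracy(test_feats, probe, test_labels)  # representation quality
    head_acc = accuracy(test_feats, head, test_labels)    # current classifier
    return {
        "representation": probe_acc,
        "head": head_acc,
        "alignment_gap": probe_acc - head_acc,
    }


# Synthetic stand-in for hidden-layer activations (hypothetical data):
# two well-separated classes along the first feature axis.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
means = np.array([[3.0, 0.0], [-3.0, 0.0]])
feats = means[labels] + 0.5 * rng.standard_normal((400, 2))

# A deliberately misaligned head: it reads the first feature with the wrong sign.
misaligned_head = np.array([[-1.0, 1.0], [0.0, 0.0]])

report = decompose(feats[:300], labels[:300], feats[300:], labels[300:],
                   misaligned_head, num_classes=2)
print(report)
```

Tracking the two scores at successive checkpoints would give one curve per process. In this toy setup the features are fixed and only the head is wrong, so the gap is large by construction.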