Why do momentum algorithms fail when data is sparse and uneven?

Momentum optimization works well when gradients arrive steadily, but real data (especially imbalanced classes or sparse architectures) delivers them unevenly. This paper solves the dynamics exactly for least squares and logistic regression with sparse inputs, showing that momentum's behavior depends on two competing timescales: how long the momentum buffer survives and how fast the model learns. When learning outpaces buffer decay, the system oscillates wildly; when the two are balanced, you get classical heavy-ball motion. The mismatch explains why a single global momentum parameter fails on sparse data whose features or classes appear at different frequencies.
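To make the two timescales concrete, here is a minimal toy sketch. It is not the paper's model: it assumes a one-dimensional quadratic loss f(w) = 0.5 * w^2 whose gradient is observed only with probability p per step, as a stand-in for a sparse feature. The names `heavy_ball_sparse` and `rough_timescales` are hypothetical, and the timescale formulas are crude proxies rather than the paper's definitions.

```python
import numpy as np


def heavy_ball_sparse(p, beta=0.9, lr=0.1, steps=2000, seed=0):
    """Heavy-ball updates on f(w) = 0.5 * w**2 with intermittent gradients.

    Assumption (illustrative only): at each step the gradient is observed
    with probability p; otherwise it is zero and only the buffer acts.
    """
    rng = np.random.default_rng(seed)
    w, v = 1.0, 0.0
    traj = np.empty(steps)
    for t in range(steps):
        g = w if rng.random() < p else 0.0  # gradient of 0.5*w**2 is w
        v = beta * v + g                    # momentum buffer decays by beta
        w -= lr * v                         # parameter step along the buffer
        traj[t] = w
    return traj


def rough_timescales(p, beta=0.9, lr=0.1):
    """Crude proxies (assumptions, not taken from the paper), in steps."""
    buffer_lifetime = 1.0 / (1.0 - beta)  # steps until the buffer has mostly decayed
    learning_time = 1.0 / (lr * p)        # steps for plain SGD to shrink w noticeably
    return buffer_lifetime, learning_time


if __name__ == "__main__":
    for p in (1.0, 0.1, 0.01):
        buf, learn = rough_timescales(p)
        traj = heavy_ball_sparse(p)
        print(f"p={p}: buffer~{buf:.0f} steps, learning~{learn:.0f} steps, "
              f"max |w| over run={np.abs(traj).max():.3f}")
```

Running the same `beta` and `lr` across several values of `p` shows how one global momentum setting meets very different arrival rates. A rarely active coordinate sees its buffer decay between gradient arrivals, while a frequently active one accumulates momentum continuously.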