Does predicting clean images work better in compressed space?

Diffusion models can be trained to regress toward the clean image or toward the noise. The two targets are mathematically equivalent, since either can be recovered from the other given the noisy input and the noise level. But a team tested whether the choice still matters once images are compressed into learned latent codes. Their 130M-parameter JLT model predicts clean latents rather than velocity and achieves an FID-50K of 2.50 on ImageNet 256×256. A local geometric analysis indicates that velocity regression amplifies low-variance directions while clean prediction dampens them, which suggests that the choice of prediction target depends on the representation and is not merely algebraic.
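The algebraic equivalence between prediction targets can be checked numerically. The sketch below is only an illustration under stated assumptions: the summary does not give the noising path, so a linear interpolation z_t = (1 - t) x + t ε with velocity v = ε - x is assumed, and the function names are hypothetical. It does not reproduce the JLT model or the geometric analysis. It shows only that a clean prediction and a velocity prediction can be converted into each other, and that under this assumed path a fixed error in the clean prediction is rescaled by 1/t when expressed as a velocity.

```python
import random

def interpolate(x, eps, t):
    """Noisy sample z_t = (1 - t) * x + t * eps (assumed linear path)."""
    return [(1.0 - t) * xi + t * ei for xi, ei in zip(x, eps)]

def velocity_from_clean(z_t, x_hat, t):
    """Velocity implied by a clean prediction: v = (z_t - x_hat) / t."""
    return [(zi - xi) / t for zi, xi in zip(z_t, x_hat)]

def clean_from_velocity(z_t, v_hat, t):
    """Clean prediction implied by a velocity: x = z_t - t * v."""
    return [zi - t * vi for zi, vi in zip(z_t, v_hat)]

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(4)]    # stand-in for a clean latent
eps = [rng.gauss(0.0, 1.0) for _ in range(4)]  # Gaussian noise
t = 0.25                                       # assumed noise level in (0, 1)
z_t = interpolate(x, eps, t)

# With exact targets, the two parameterizations carry the same information.
v_true = [ei - xi for xi, ei in zip(x, eps)]
v_from_x = velocity_from_clean(z_t, x, t)
x_from_v = clean_from_velocity(z_t, v_true, t)
max_v_gap = max(abs(a - b) for a, b in zip(v_from_x, v_true))
max_x_gap = max(abs(a - b) for a, b in zip(x_from_v, x))

# A small error in the clean prediction is rescaled by 1/t in velocity space.
delta = 0.01
x_noisy = [xi + delta for xi in x]
v_noisy = velocity_from_clean(z_t, x_noisy, t)
v_err = max(abs(a - b) for a, b in zip(v_noisy, v_true))

print(max_v_gap, max_x_gap, v_err)
```

The conversion is exact, but the regression losses are not interchangeable: the same prediction error is weighted differently depending on the target, which is the kind of asymmetry the reported analysis examines in latent space.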