Why post-training works: it's about which states you learn from

Most analyses of LLM post-training focus on loss functions, but this work argues that the more important question is where the supervision comes from. The authors formalize post-training as state-distribution shaping, with states defined as prompts plus the model's own prefixes, and report three findings on Qwen 0.6B: (1) mild fine-tuning improves math reasoning with minimal knowledge loss, (2) on-policy distillation from a degraded teacher outperforms that teacher on multiple benchmarks, and (3) lightweight RL improves performance while preserving knowledge. This perspective offers an explanation for why different post-training methods work, and it bears directly on how practitioners choose between SFT, RL, and distillation.
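To make the notion of a "state" concrete, here is a minimal sketch in Python that contrasts the states visited when learning from a fixed demonstration (as in SFT) with the states visited when the model samples its own prefixes (as in RL or on-policy distillation). Everything in it is an illustrative assumption rather than the authors' formalization: the toy policy table `TOY_POLICY`, the helper names, and the three-token vocabulary are all hypothetical.

```python
import random

# Hypothetical toy policy: maps the last token to a next-token distribution.
# This stands in for a language model; it is not the model from the paper.
TOY_POLICY = {
    "<q>": {"a": 0.5, "b": 0.5},
    "a": {"a": 0.2, "b": 0.3, "<eos>": 0.5},
    "b": {"a": 0.3, "b": 0.2, "<eos>": 0.5},
}


def sample_next(policy, last_token, rng):
    """Sample one next token from the toy policy."""
    dist = policy[last_token]
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def states_from_sequence(prompt, completion):
    """Return every (prompt, prefix) pair at which a next token was chosen."""
    return [(prompt, tuple(completion[:i])) for i in range(len(completion))]


def off_policy_states(dataset):
    """States taken from fixed demonstrations: prefixes come from the data."""
    states = []
    for prompt, completion in dataset:
        states.extend(states_from_sequence(prompt, completion))
    return states


def on_policy_states(policy, prompts, rng, max_len=5):
    """States taken from rollouts: prefixes are sampled from the policy itself."""
    states = []
    for prompt in prompts:
        completion, last = [], prompt
        for _ in range(max_len):
            tok = sample_next(policy, last, rng)
            completion.append(tok)
            if tok == "<eos>":
                break
            last = tok
        states.extend(states_from_sequence(prompt, completion))
    return states


# One demonstration versus 100 rollouts from the same prompt.
dataset = [("<q>", ["a", "b", "<eos>"])]
off = off_policy_states(dataset)
on = on_policy_states(TOY_POLICY, ["<q>"] * 100, random.Random(0))

only_on_policy = set(on) - set(off)  # states the demonstration never covers
```

In this toy setup the demonstration only ever supervises prefixes along its single path, while the rollouts also reach prefixes the demonstration never visits (collected in `only_on_policy`). Under the state-distribution view described above, that difference in where supervision lands is the quantity of interest, independent of which loss is applied at each state.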