cs.LG

Why post-training works: it's about which states you learn from

Dong Nie

May 21, 2026

Most analyses of LLM post-training focus on loss functions; this work argues that what matters more is where the supervision comes from. The authors formalize post-training as state-distribution shaping, where a state is a prompt plus the model's own generated prefix, and report three findings on Qwen 0.6B: mild fine-tuning improves math reasoning with minimal knowledge loss; on-policy distillation from a degraded teacher outperforms that teacher on multiple benchmarks; and lightweight RL improves performance while preserving knowledge. The state-distribution view offers an explanation for why different post-training methods behave as they do, and it bears directly on how practitioners choose among SFT, RL, and distillation.
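To make the notion of a "state" concrete, here is a minimal toy sketch, not the paper's code. It assumes a state is the pair (prompt, prefix of tokens so far) and contrasts the states seen under teacher forcing on a fixed reference answer (SFT-style) with the states a model visits when it conditions on its own samples (as in RL or on-policy distillation). The names `offline_states`, `on_policy_states`, and `toy_policy` are hypothetical, and `toy_policy` is a random stand-in for an LLM.

```python
import random
from typing import Callable, List, Tuple

# Assumption: a state is (prompt, tuple of tokens generated so far).
State = Tuple[str, Tuple[str, ...]]
Policy = Callable[[str, Tuple[str, ...], random.Random], str]


def offline_states(prompt: str, reference: List[str]) -> List[State]:
    """States under teacher forcing: every proper prefix of a fixed reference answer."""
    return [(prompt, tuple(reference[:t])) for t in range(len(reference))]


def on_policy_states(prompt: str, policy: Policy, max_len: int,
                     rng: random.Random) -> List[State]:
    """States the model visits when it conditions on its own sampled prefix."""
    states: List[State] = []
    prefix: Tuple[str, ...] = ()
    for _ in range(max_len):
        states.append((prompt, prefix))
        token = policy(prompt, prefix, rng)
        if token == "<eos>":
            break
        prefix = prefix + (token,)
    return states


def toy_policy(prompt: str, prefix: Tuple[str, ...], rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM: uniform over a tiny vocabulary."""
    return rng.choice(["2", "+", "4", "<eos>"])


if __name__ == "__main__":
    rng = random.Random(0)
    print(offline_states("What is 2+2?", ["2", "+", "2", "=", "4"]))
    print(on_policy_states("What is 2+2?", toy_policy, max_len=5, rng=rng))
```

Both functions return states for the same prompt, but the prefixes differ: the first set is fixed by the dataset, while the second depends on the current model. Under the assumption above, choosing which of these sets receives supervision is what "shaping the state distribution" refers to.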
Published as "Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation" (arXiv:2605.22731).