Four ways to distill language models, unified through KL decomposition

Language model distillation draws on several existing techniques (SFT, DAgger, offline RL, on-policy distillation), but the design choices underlying them remain unclear. This work decomposes the sequence-level KL divergence to show that standard distillation methods couple two independent decisions: whether to condition on teacher or student prefixes, and which KL direction (forward or reverse) to optimize. Decoupling these decisions yields four valid objectives with distinct empirical properties. On math reasoning, forward KL reduces entropy collapse in long sequences, student prefixes improve sample efficiency, and training length trades off accuracy against stability. The authors propose KL mixing (combining forward and reverse KL) and an entropy-gated curriculum that improves Pass@k by up to 5.8 points and cuts response length by 3× compared to fixed training.
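
The two decisions can be made concrete with a small enumeration. The sketch below is an illustration under stated assumptions, not the authors' implementation: the `teacher` and `student` functions are hypothetical toy models over a two-token vocabulary, the horizon of three tokens is arbitrary, and the function names are invented for this example. It relies only on the standard chain rule for KL divergence, under which the sequence-level KL equals a sum over positions of the per-token KL, weighted by the probability of each prefix under whichever model supplies the prefixes.

```python
import itertools
import math

VOCAB = (0, 1)   # toy two-token vocabulary (assumption for illustration)
HORIZON = 3      # toy sequence length (assumption for illustration)


def teacher(prefix):
    """Hypothetical teacher: next-token probabilities depend on the last token."""
    p = 0.8 if (prefix and prefix[-1] == 1) else 0.4
    return (1.0 - p, p)


def student(prefix):
    """Hypothetical student, deliberately different from the teacher."""
    p = 0.6 if (prefix and prefix[-1] == 1) else 0.5
    return (1.0 - p, p)


def token_kl(p, q):
    """KL(p || q) between two next-token distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def prefix_prob(model, prefix):
    """Probability that `model` generates `prefix` autoregressively."""
    prob = 1.0
    for t, tok in enumerate(prefix):
        prob *= model(prefix[:t])[tok]
    return prob


def objective(prefix_source, direction):
    """Per-token KL summed over positions, weighted by prefix probability.

    prefix_source: the model whose prefixes we condition on.
    direction: "forward" is KL(teacher || student),
               "reverse" is KL(student || teacher).
    """
    total = 0.0
    for t in range(HORIZON):
        for prefix in itertools.product(VOCAB, repeat=t):
            weight = prefix_prob(prefix_source, prefix)
            p_t, p_s = teacher(prefix), student(prefix)
            if direction == "forward":
                total += weight * token_kl(p_t, p_s)
            else:
                total += weight * token_kl(p_s, p_t)
    return total


def sequence_kl(p_model, q_model):
    """Brute-force KL(P || Q) over all full sequences of length HORIZON."""
    total = 0.0
    for seq in itertools.product(VOCAB, repeat=HORIZON):
        p = prefix_prob(p_model, seq)
        q = prefix_prob(q_model, seq)
        total += p * math.log(p / q)
    return total


# The four objectives: {teacher, student} prefixes x {forward, reverse} KL.
four = {
    (name, direction): objective(source, direction)
    for name, source in (("teacher", teacher), ("student", student))
    for direction in ("forward", "reverse")
}

for key, value in four.items():
    print(key, round(value, 6))
```

In this toy setting, teacher prefixes paired with forward KL reproduce the sequence-level KL(teacher || student), and student prefixes paired with reverse KL reproduce KL(student || teacher). The other two pairings do not correspond to either sequence-level divergence, yet each is still a non-negative quantity that is zero only when the two models agree on every prefix the source model can reach, which is one way to read the claim that all four objectives are valid.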