cs.LG

Four ways to distill language models, unified through KL decomposition

Anhao Zhao, Haoran Xin, Yingqi Fan, Junlong Tong, Wenjie Li, Xiaoyu Shen

May 16, 2026

Language model distillation draws on several existing techniques (SFT, DAgger, offline RL, on-policy distillation), but the design choices underlying them remain unclear. This work decomposes the sequence-level KL divergence to show that standard distillation methods couple two independent decisions: whether to condition on teacher or student prefixes, and which KL direction (forward or reverse) to optimize. Decoupling them yields four valid objectives with distinct empirical properties. On math reasoning, forward KL reduces entropy collapse in long sequences, student prefixes improve sample efficiency, and training length trades off accuracy against stability. The authors propose KL mixing (combining forward and reverse KL) and an entropy-gated curriculum that improves Pass@k by up to 5.8 points and cuts response length by 3× compared to fixed training.
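To make the two decisions concrete, here is a minimal sketch of the resulting 2×2 grid on a toy three-token vocabulary. It is not the authors' implementation: the `teacher` and `student` functions, their probabilities, and the Monte Carlo estimator `distill_loss` are all hypothetical stand-ins. The sketch relies only on the chain rule for KL divergence, under which a sequence-level KL is a sum of per-token KL terms averaged over sampled prefixes, so the prefix source and the per-token KL direction can be chosen separately.

```python
import math
import random

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities.
    Assumes q has no zero entries (true for the toy models below)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical toy models: map a prefix (tuple of token ids) to a
# next-token distribution that depends only on the last token.
def teacher(prefix):
    return [0.7, 0.2, 0.1] if not prefix or prefix[-1] == 0 else [0.1, 0.3, 0.6]

def student(prefix):
    return [0.5, 0.3, 0.2] if not prefix or prefix[-1] == 0 else [0.3, 0.3, 0.4]

def token_loss(t_probs, s_probs, direction):
    """Per-token KL in the chosen direction."""
    if direction == "forward":    # KL(teacher || student)
        return kl(t_probs, s_probs)
    if direction == "reverse":    # KL(student || teacher)
        return kl(s_probs, t_probs)
    raise ValueError(f"unknown direction: {direction}")

def distill_loss(teacher, student, prefix_source, direction,
                 length, n_samples, rng):
    """Monte Carlo estimate of the summed per-token KL along prefixes
    sampled from `prefix_source` ("teacher" or "student")."""
    if prefix_source not in ("teacher", "student"):
        raise ValueError(f"unknown prefix source: {prefix_source}")
    sampler = teacher if prefix_source == "teacher" else student
    total = 0.0
    for _ in range(n_samples):
        prefix = ()
        for _ in range(length):
            # Decision 2: which KL direction to evaluate at this prefix.
            total += token_loss(teacher(prefix), student(prefix), direction)
            # Decision 1: which model extends the prefix.
            probs = sampler(prefix)
            token = rng.choices(range(len(probs)), weights=probs)[0]
            prefix += (token,)
    return total / n_samples

# Evaluate all four (prefix source, KL direction) combinations.
losses = {}
for source in ("teacher", "student"):
    for direction in ("forward", "reverse"):
        rng = random.Random(0)  # fixed seed so runs are repeatable
        losses[(source, direction)] = distill_loss(
            teacher, student, source, direction,
            length=5, n_samples=2000, rng=rng)
        print(f"{source:7s} prefixes, {direction} KL: "
              f"{losses[(source, direction)]:.4f}")
```

In this sketch, teacher prefixes with forward KL and student prefixes with reverse KL correspond to the two directions of the ordinary sequence-level KL, while the other two cells are the "decoupled" combinations. A real training loop would differentiate the loss with respect to the student's parameters, which this toy version does not attempt.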
Published as "Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation" (arXiv:2605.16826).