Do language models need complete answers to learn from teachers?
Yaocheng Zhang, Jiajun Chai, Yuqian Fu, Songjun Tu, Xiaohan Wang, Wei Lin, Guojun Yin, Qichao Zhang, Yuanheng Zhu, Dongbin Zhao
May 29, 2026
On-policy distillation trains a language model by having a teacher grade responses that the student itself generates. Standard approaches generate a full answer at every step, which is wasteful and leaves the teacher's feedback on the later part of the output unreliable. The authors show that complete rollouts are not necessary: truncating rollouts so that feedback covers only shorter sequences (TOPD) matches full-length training on math problems with 90% less compute, while gradually expanding the rollout length during training (POPD) accelerates learning by 3×.
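The two ideas in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the summary does not specify the schedule or the loss, so the linear schedule, the function names `rollout_length` and `truncated_mean_loss`, and the default lengths of 64 and 1024 tokens are all assumptions made for illustration.

```python
def rollout_length(step, total_steps, min_len=64, max_len=1024):
    """Hypothetical linear schedule for a progressively growing rollout cap.

    Assumption: the cap grows linearly from min_len to max_len over training.
    The actual schedule used by POPD is not given in the summary.
    """
    if total_steps <= 1:
        return max_len
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return int(round(min_len + frac * (max_len - min_len)))


def truncated_mean_loss(per_token_losses, max_len):
    """Average per-token teacher feedback over the first max_len tokens only.

    Assumption: per_token_losses is a plain list of floats, one per student
    token, standing in for whatever per-token signal the teacher provides.
    """
    kept = per_token_losses[:max_len]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Print the cap at the start, middle, and end of an 11-step run.
    for step in (0, 5, 10):
        print(step, rollout_length(step, total_steps=11))
```

In a real training loop the cap would more likely be passed to the generation call, so that tokens beyond it are never sampled. That is where a compute saving would come from; slicing a finished rollout, as done here, only illustrates which tokens receive feedback.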