Do language models need complete answers to learn from teachers?

Yaocheng Zhang, Jiajun Chai, Yuqian Fu, Songjun Tu, Xiaohan Wang, Wei Lin, Guojun Yin, Qichao Zhang, Yuanheng Zhu, Dongbin Zhao

On-policy distillation trains a language model by having a teacher grade responses that the student itself generates. Standard approaches generate a full answer at every training step, which is wasteful and prone to unreliable teacher feedback late in the output. The authors show that complete rollouts are not necessary: restricting feedback to shorter, truncated sequences (TOPD) matches full-length training on math problems with 90% less compute, while gradually expanding the rollout length during training (POPD) accelerates learning by 3×.
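
To make the two ideas concrete, here is a minimal sketch of how a rollout-length cap might be applied. It is not the authors' implementation: the function names, the linear ramp, and the example lengths are all assumptions chosen for illustration, and the summary above does not specify the actual schedule. A fixed cap corresponds to the truncated setting, and a cap that grows with the training step corresponds to the progressive setting.

```python
from typing import List


def rollout_length(step: int, total_steps: int, min_len: int, max_len: int) -> int:
    """Hypothetical progressive schedule: ramp the token cap linearly
    from min_len at the first step to max_len at the last step.
    The linear shape is an assumption, not taken from the paper."""
    if total_steps <= 1:
        return max_len
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return min_len + round(frac * (max_len - min_len))


def truncate(tokens: List[int], limit: int) -> List[int]:
    """Keep only the first `limit` tokens of a student rollout, so the
    teacher would grade only this prefix."""
    return tokens[:limit]


if __name__ == "__main__":
    # Illustrative numbers only (assumed, not from the paper).
    for step in (0, 50, 99):
        print(step, rollout_length(step, total_steps=100, min_len=128, max_len=1024))
```

In a real training loop the cap would typically be passed to the generation call as a maximum number of new tokens, so that the compute saving comes from generating fewer tokens rather than from discarding them afterwards.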