cs.CL

Do language models need complete answers to learn from teachers?

Yaocheng Zhang, Jiajun Chai, Yuqian Fu, Songjun Tu, Xiaohan Wang, Wei Lin, Guojun Yin, Qichao Zhang, Yuanheng Zhu, Dongbin Zhao

May 29, 2026

On-policy distillation trains a language model by having a teacher grade responses that the student itself generates. Standard approaches generate a full answer at every step, which is wasteful and prone to unreliable teacher feedback late in the output. The authors show that complete rollouts are not necessary: truncating rollouts so that feedback covers only shorter sequences (TOPD) matches full-length training on math problems with 90% less compute, while gradually expanding the rollout length during training (POPD) accelerates learning by 3×.
Published as Are Full Rollouts Necessary for On-Policy Distillation? arXiv:2605.31490
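
The two ideas in the summary, a fixed truncated rollout budget and a budget that grows over training, can be illustrated with a minimal sketch. This is not the authors' implementation: the linear schedule, the function names `rollout_budget` and `truncate_rollout`, and all numeric values are assumptions made for illustration, since the summary does not specify how the rollout length is expanded.

```python
def rollout_budget(step, total_steps, min_len, max_len):
    """Hypothetical linear schedule: grow the token budget from min_len to max_len.

    The linear shape is an assumption; the summary only says the rollout
    length is expanded gradually during training.
    """
    if total_steps <= 1:
        return max_len
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return int(round(min_len + frac * (max_len - min_len)))


def truncate_rollout(tokens, budget):
    """Keep only the first `budget` tokens of a student rollout.

    Only these tokens would be passed to the teacher for feedback.
    """
    return tokens[:budget]


if __name__ == "__main__":
    # Fixed truncation: every rollout is capped at the same budget.
    student_tokens = list(range(2000))  # stand-in for generated token ids
    print(len(truncate_rollout(student_tokens, 200)))

    # Progressive expansion: the budget grows as training proceeds.
    for step in (0, 5, 10):
        print(step, rollout_budget(step, total_steps=11, min_len=100, max_len=1100))
```

In this sketch the fixed-budget case corresponds to calling `truncate_rollout` with a constant, and the progressive case to recomputing the budget from `rollout_budget` at every training step.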