Do language models need strong teachers to learn better?
Taiming Lu, Zhuang Liu
May 22, 2026
Knowledge distillation typically assumes that stronger teachers produce better students. This work tests that assumption by pairing language models of various sizes and training stages as teachers and students. Surprisingly, even small, undertrained teachers improve larger students when the losses are properly balanced, and pushing teachers further sometimes hurts performance. The benefit shows up most clearly in generalization and downstream tasks rather than in raw perplexity, which calls into question whether distillation pretraining needs a strong teacher at all.
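To make "properly balanced losses" concrete, here is a minimal sketch of one common way to combine a next-token cross-entropy term with a teacher-matching KL term for a single token position. This is the standard textbook formulation, not necessarily the objective used in the paper; the function names and the `alpha` and `temperature` parameters are assumptions introduced for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def distillation_loss(student_logits, teacher_logits, target,
                      alpha=0.5, temperature=1.0):
    """Weighted sum of cross-entropy and KL(teacher || student) for one token.

    alpha (hypothetical knob) weights the teacher term; (1 - alpha) weights
    the ground-truth term. alpha=0 recovers plain language-model training.
    """
    # Cross-entropy of the student against the ground-truth token.
    ce = -log_softmax(student_logits)[target]

    # KL divergence from the temperature-softened teacher to the student.
    s_logp = log_softmax([x / temperature for x in student_logits])
    t_logp = log_softmax([x / temperature for x in teacher_logits])
    kl = sum(math.exp(tp) * (tp - sp) for tp, sp in zip(t_logp, s_logp))

    # The temperature**2 factor is the usual gradient-scale correction.
    return (1 - alpha) * ce + alpha * temperature ** 2 * kl
```

In this sketch, the balance the abstract refers to corresponds to the choice of `alpha`: a weak teacher with a large `alpha` pulls the student toward a poor distribution, while a smaller `alpha` lets the ground-truth term dominate.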