cs.LG

Do language models need strong teachers to learn better?

Taiming Lu, Zhuang Liu

May 22, 2026

Knowledge distillation typically assumes that stronger teachers produce better students. This work tests that assumption by pairing language models of various sizes and training stages as teachers and students. Surprisingly, even small, undertrained teachers improve larger students when the losses are properly balanced, and pushing teacher strength further sometimes hurts student performance. The benefit shows up most clearly in generalization and downstream tasks rather than in raw perplexity, which calls into question whether distillation during pretraining needs a strong teacher at all.
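To make "balancing the losses" concrete, here is a minimal sketch of a common distillation objective for a single token position: a weighted sum of the cross-entropy on the true next token and the KL divergence from the teacher's distribution to the student's. This is a generic textbook formulation, not necessarily the one used in the paper; the function names, the mixing weight `alpha`, and the `temperature` parameter are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, target,
                      alpha=0.5, temperature=1.0):
    """Hypothetical balanced loss: (1 - alpha) * CE + alpha * KL(teacher || student).

    alpha = 0 ignores the teacher entirely; alpha = 1 ignores the data label.
    """
    # Cross-entropy of the student against the ground-truth token index.
    ce = -math.log(softmax(student_logits)[target])

    # KL divergence from the teacher distribution to the student distribution,
    # both softened by the same temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s))
             for t, s in zip(p_teacher, p_student) if t > 0)

    return (1 - alpha) * ce + alpha * kl
```

In this sketch, `alpha` is the knob that trades off imitation of the teacher against fitting the data directly, which is the kind of balance the summary refers to.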
Published as "Strong Teacher Not Needed? On Distillation in LLM Pretraining" (arXiv:2605.23857).