Do language models need strong teachers to learn better?

Knowledge distillation typically assumes that stronger teachers produce better students. This work tests that assumption by pairing language models of various sizes and training stages as teachers and students. Surprisingly, even small, undertrained teachers improve larger students when the losses are properly balanced, and making the teacher stronger sometimes hurts student performance. The benefit shows up most clearly in generalization and downstream tasks rather than in raw perplexity, which calls into question whether distillation pretraining needs a strong teacher at all.
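
To make "properly balanced" losses concrete, here is a minimal sketch of one common way to combine a language-modeling loss with a distillation loss for a single token position. It is not taken from this work: the mixing weight `alpha`, the `temperature` parameter, the forward KL direction, and all function names are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities, with optional temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, target_index):
    """Language-modeling loss: negative log-probability of the true next token."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])


def kl_divergence(teacher_logits, student_logits, temperature=1.0):
    """Distillation loss: KL(teacher || student) over the vocabulary."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(student_logits, teacher_logits, target_index,
                      alpha=0.5, temperature=1.0):
    """Weighted sum of the two losses; alpha is a hypothetical balance knob."""
    ce = cross_entropy(student_logits, target_index)
    kd = kl_divergence(teacher_logits, student_logits, temperature)
    return (1.0 - alpha) * ce + alpha * kd


# Toy example with a 3-token vocabulary (values are made up).
student = [2.0, 0.5, -1.0]
teacher = [1.0, 1.0, 0.0]
loss = distillation_loss(student, teacher, target_index=0, alpha=0.3)
```

Under these assumptions, `alpha=0` reduces to ordinary language-model training and `alpha=1` makes the student purely imitate the teacher. Intermediate values let the ground-truth tokens limit the influence of a weak teacher.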