Can text teach vision models to see better?

Tobias Christian Nauen, Stanislav Frolov, Brian Bernhard Moser, Federico Raue, Ahmed Anwar, Andreas Dengel

TextTeacher injects semantic knowledge from language models into vision training by using image captions and a frozen text encoder to guide representation learning. The method adds a lightweight auxiliary loss during training but leaves the final model unchanged, improving ImageNet accuracy by up to 2.7% on ViT while outperforming standard knowledge distillation. The approach requires no expensive multimodal training and transfers well to downstream tasks.