Reusing old classification weights to align vision and language models
David Méndez, Roberto Confalonieri, Natalia Díaz Rodríguez
May 21, 2026
Vision-language models typically need expensive training on huge paired image-text datasets to connect the two modalities. This work recycles the classification heads of pretrained vision models (weights that are normally discarded) as semantic anchors for alignment. The anchors are used in two ways: directly, as a zero-shot alignment signal, and as data augmentation when mixed with real image-text pairs. Applied to standard post-hoc alignment methods, the approach improves cross-modal retrieval and zero-shot and few-shot classification across benchmarks.
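To make the idea of classification weights as anchors more concrete, here is a minimal sketch, not the authors' implementation. It assumes each row of a classifier head can be treated as a class prototype in the vision feature space and paired with a text embedding of the class name. A ridge-regularized linear map stands in for the post-hoc alignment method, which the summary does not specify. All data is synthetic, and every name (`fit_linear_alignment`, `head_weights`, `class_text_emb`) is hypothetical.

```python
import numpy as np

def fit_linear_alignment(anchors_vision, anchors_text, ridge=1e-3):
    """Ridge least-squares map W so that anchors_vision @ W ~ anchors_text."""
    d = anchors_vision.shape[1]
    gram = anchors_vision.T @ anchors_vision + ridge * np.eye(d)
    return np.linalg.solve(gram, anchors_vision.T @ anchors_text)

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
num_classes, d_vision, d_text = 50, 16, 12

# Assumption: synthetic stand-in for a classifier head, one weight row per class.
head_weights = rng.normal(size=(num_classes, d_vision))

# Assumption: synthetic stand-in for text embeddings of the class names.
true_map = rng.normal(size=(d_vision, d_text))
class_text_emb = head_weights @ true_map + 0.01 * rng.normal(size=(num_classes, d_text))

# Fit the alignment from the (head weight, class-name embedding) anchor pairs only.
W = fit_linear_alignment(head_weights, class_text_emb)

# Synthetic "image features": noisy copies of their class prototype.
labels = rng.integers(0, num_classes, size=200)
image_feats = head_weights[labels] + 0.05 * rng.normal(size=(200, d_vision))

# Zero-shot classification: project images into text space, pick the nearest class name.
scores = l2_normalize(image_feats @ W) @ l2_normalize(class_text_emb).T
pred = scores.argmax(axis=1)
accuracy = float((pred == labels).mean())
```

Under the same assumptions, the augmentation mode could be sketched by stacking the anchor rows on top of real image and text feature pairs before calling `fit_linear_alignment`, so both sources constrain the same map.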