Reusing old classification weights to align vision and language models

Vision-language models need expensive training on huge paired datasets to connect images and text. This work recycles the classification heads from pretrained vision models—weights normally discarded—as semantic anchors for alignment. The approach works two ways: directly as zero-shot alignment signals, and as data augmentation when mixed with real image-text pairs. Applied to standard post-hoc alignment methods, it improves cross-modal retrieval and zero/few-shot classification across benchmarks.