Train a selector once, reuse it on any vision-language model
Mingkang Dong, Hongyi Cai, Xiwen Lei, Jie Li, Tao Zhang, Muxin Pu
May 26, 2026
Multimodal instruction-tuning datasets are highly redundant, yet data selection typically has to be recomputed at significant cost whenever you switch models or datasets. OFA trains a lightweight selector once, using CLIP embeddings and pseudo-labeling, and then reuses it without retraining on new datasets or model sizes. Selecting just 15% of LLaVA-665K reaches 98.3% of full-data performance across 10 benchmarks; transferred to the unseen Vision-Flan-186K, the selector outperforms full-data training by 10.6%. The same subsets work for both 3B and 7B models with no retraining.
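To make the "score once, keep the top 15%" idea concrete, here is a minimal sketch of budgeted subset selection. It is an illustration under stated assumptions, not the paper's implementation: the random vectors stand in for precomputed CLIP embeddings, the linear-logistic `score` function is a hypothetical stand-in for the trained selector, and the names `score` and `select_top_fraction` are invented for this example.

```python
import math
import random


def score(embedding, weights, bias=0.0):
    """Hypothetical selector: a logistic score over a linear projection.

    The real selector architecture is not described in this summary.
    """
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def select_top_fraction(embeddings, weights, fraction=0.15):
    """Return the sorted indices of the highest-scoring `fraction` of samples."""
    k = max(1, round(len(embeddings) * fraction))
    ranked = sorted(
        range(len(embeddings)),
        key=lambda i: score(embeddings[i], weights),
        reverse=True,
    )
    return sorted(ranked[:k])


# Stand-in data: random vectors in place of real CLIP embeddings (assumption).
rng = random.Random(0)
dim = 8
embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]

# Stand-in for selector parameters that would be trained once and then frozen.
weights = [rng.gauss(0, 1) for _ in range(dim)]

subset = select_top_fraction(embeddings, weights, fraction=0.15)
print(len(subset), "of", len(embeddings), "samples kept")
```

Because the frozen scorer only sees embeddings, the same `weights` could in principle be applied to embeddings from a different dataset without retraining, which is the reuse property the summary describes.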