cs.CV

Train a selector once, reuse it on any vision-language model

Mingkang Dong, Hongyi Cai, Xiwen Lei, Jie Li, Tao Zhang, Muxin Pu

May 26, 2026

Multimodal instruction-tuning datasets are highly redundant, and pruning them usually means expensive recomputation whenever you switch models or datasets. OFA (Once-For-All) trains a lightweight selector once, using CLIP embeddings and pseudo-labeling, then reuses it without retraining on new datasets or model sizes. Selecting just 15% of LLaVA-665K reaches 98.3% of full-data performance across 10 benchmarks. When the selector is transferred to the unseen Vision-Flan-186K, training on its selection outperforms full-data training by 10.6%. The same subsets work for both 3B and 7B models with no retraining of the selector.
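The train-once, reuse-anywhere idea can be sketched in a few lines. This is only an illustrative toy, not the paper's implementation: random vectors stand in for CLIP embeddings, the pseudo-labels come from a made-up rule, and the selector is assumed to be a plain logistic-regression scorer. The function names `train_selector` and `select_top_fraction` are hypothetical.

```python
import numpy as np


def train_selector(emb, pseudo_labels, lr=0.5, epochs=200):
    """Fit a logistic-regression scorer on frozen embeddings (assumed selector form)."""
    w = np.zeros(emb.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(emb @ w + b)))  # predicted keep-probability
        grad = p - pseudo_labels                   # gradient of the log loss w.r.t. logits
        w -= lr * emb.T @ grad / len(emb)
        b -= lr * grad.mean()
    return w, b


def select_top_fraction(emb, selector, fraction=0.15):
    """Score every sample with the frozen selector and keep the top fraction."""
    w, b = selector
    scores = emb @ w + b
    k = max(1, int(round(fraction * len(emb))))
    return np.argsort(-scores)[:k]


rng = np.random.default_rng(0)

# Stand-in for CLIP embeddings of the source dataset (assumption: 32-dim toy vectors).
source_emb = rng.normal(size=(1000, 32))
# Hypothetical pseudo-labels: here "useful" just means a positive first coordinate.
pseudo = (source_emb[:, 0] > 0).astype(float)

# Train once on the source dataset.
selector = train_selector(source_emb, pseudo)

# Reuse on an unseen dataset without touching the selector's weights.
target_emb = rng.normal(size=(400, 32))
subset = select_top_fraction(target_emb, selector, fraction=0.15)
```

Because the selector only ever sees embeddings, nothing in it depends on the downstream model being tuned, which is why one selected subset could in principle serve models of different sizes.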
Published as "Once-For-All: A Train-Once and Select-Anytime Framework for Multimodal Instruction Tuning", arXiv:2605.26761