Train a selector once, reuse it on any vision-language model

Multimodal instruction-tuning datasets are highly redundant, yet data selection typically has to be recomputed at significant cost whenever you switch models or datasets. OFA trains a lightweight selector once, using CLIP embeddings and pseudo-labeling, then reuses it without retraining on new datasets or model sizes. Selecting just 15% of LLaVA-665K reaches 98.3% of full-data performance across 10 benchmarks; transferred to the unseen Vision-Flan-186K dataset, the selector outperforms full-data training by 10.6%. The same subsets work for both 3B and 7B models with zero retraining.
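
To make the train-once, reuse-everywhere workflow concrete, here is a toy sketch. It is not OFA's actual implementation: the summary above does not say what the selector looks like or how the pseudo-labels are produced. The sketch assumes a plain logistic-regression scorer, random vectors standing in for CLIP embeddings, and synthetic pseudo-labels. The names `train_selector` and `select_subset` are hypothetical.

```python
import numpy as np

def train_selector(embeddings, pseudo_labels, lr=0.5, steps=300):
    """Fit a logistic-regression scorer by gradient descent.

    Assumption: a hypothetical stand-in for the lightweight selector.
    """
    n, d = embeddings.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = embeddings @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - pseudo_labels          # gradient of the log loss w.r.t. logits
        w -= lr * (embeddings.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def select_subset(selector, embeddings, keep_fraction=0.15):
    """Return indices of the highest-scoring keep_fraction of samples."""
    w, b = selector
    scores = embeddings @ w + b
    k = max(1, int(round(keep_fraction * len(embeddings))))
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
dim = 32

# Assumption: random vectors stand in for CLIP embeddings of the source dataset.
source_emb = rng.normal(size=(2000, dim))
# Assumption: synthetic pseudo-labels (1 = keep), defined by a hidden direction.
direction = rng.normal(size=dim)
pseudo_labels = (source_emb @ direction > 0).astype(float)

# Train the selector once on the source dataset.
selector = train_selector(source_emb, pseudo_labels)

# Reuse it, without retraining, on embeddings from a different dataset.
target_emb = rng.normal(size=(1000, dim))
subset = select_subset(selector, target_emb, keep_fraction=0.15)
print(len(subset))  # 150
```

In this sketch the selector only ever sees embeddings, never the model being fine-tuned, which is what would let one selected subset serve several model sizes.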