cs.CV

Does that caption actually describe the image? A smarter way to pick training data

Hyejin Go, Semi Lee, Hyesong Choi

May 21, 2026

Vision-language models such as CLIP learn from image-caption pairs, but current data-filtering methods check only whether a pair broadly matches. They do not check whether the individual objects, attributes, and relationships named in the caption actually appear in the image. The authors propose Counterfactual Phrase Intervention (CPI): swap words in a caption, measure how much the image-text score changes, and use that change to identify which phrases the image truly supports. Trained on a subset half the size of CC3M, models selected this way outperform full-data training on compositional understanding benchmarks, and the approach works across different model variants.
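The swap-and-measure idea can be illustrated with a small Python sketch. This is not the paper's implementation: `toy_score` is a hypothetical stand-in for a CLIP-style image-text score (it only counts caption words found in a set of image tags), and the swap table, the mean-drop aggregation, and the `keep_fraction` ranking rule are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple


def toy_score(image_tags: Set[str], caption: str) -> float:
    """Hypothetical stand-in for an image-text score.

    Returns the fraction of caption words found among the image tags.
    A real pipeline would use a model such as CLIP here instead.
    """
    words = caption.lower().split()
    if not words:
        return 0.0
    return sum(w in image_tags for w in words) / len(words)


def swap_word(caption: str, old: str, new: str) -> str:
    """Build a counterfactual caption by replacing one word with another."""
    return " ".join(new if w == old else w for w in caption.split())


def phrase_support(
    image_tags: Set[str],
    caption: str,
    swaps: Dict[str, str],
    score_fn: Callable[[Set[str], str], float] = toy_score,
) -> Dict[str, float]:
    """Score drop caused by swapping each word; a larger drop suggests
    the image supports the original word more strongly."""
    base = score_fn(image_tags, caption)
    return {
        old: base - score_fn(image_tags, swap_word(caption, old, new))
        for old, new in swaps.items()
    }


def rank_pairs(
    pairs: List[Tuple[Set[str], str, Dict[str, str]]],
    keep_fraction: float = 0.5,
) -> List[str]:
    """Keep the captions with the highest mean score drop.

    Mean aggregation and a fixed keep fraction are assumptions of this sketch.
    """
    scored = []
    for image_tags, caption, swaps in pairs:
        drops = phrase_support(image_tags, caption, swaps)
        mean_drop = sum(drops.values()) / len(drops) if drops else 0.0
        scored.append((mean_drop, caption))
    scored.sort(key=lambda item: item[0], reverse=True)
    keep = max(1, int(len(scored) * keep_fraction))
    return [caption for _, caption in scored[:keep]]


if __name__ == "__main__":
    tags = {"black", "dog", "grass"}
    pairs = [
        (tags, "black dog on grass", {"black": "white", "dog": "cat", "grass": "sand"}),
        (tags, "red car on grass", {"red": "blue", "car": "bus", "grass": "sand"}),
    ]
    print(phrase_support(*pairs[0]))
    print(rank_pairs(pairs))
```

In this toy setup, swapping any grounded word in the first caption lowers the score, while swapping "red" or "car" in the second caption changes nothing because the image never supported those words. A broad match score alone would not separate these two cases as clearly, which is the gap the phrase-level check is meant to address.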
Published as "What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining" (arXiv:2605.22651)