Does that caption actually describe the image? A smarter way to pick training data

Vision-language models like CLIP learn from image-caption pairs, but current filtering methods check only whether a pair broadly matches. They miss whether the individual objects, attributes, and relationships named in the caption actually appear in the image. Researchers developed Counterfactual Phrase Intervention (CPI): by swapping words in a caption and measuring how much the image-text score changes, they identify which phrases the image truly supports. Models trained on a CPI-selected subset containing half of CC3M outperform models trained on the full dataset on compositional understanding benchmarks, and the result holds across different model variants.
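
To make the swap-and-measure idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the researchers' implementation: `phrase_support`, `toy_score`, `IMAGE_TAGS`, and the replacement words are all hypothetical names invented for this example, and `toy_score` is a word-overlap stand-in for a real image-text similarity score such as CLIP's. How CPI chooses replacements or combines per-phrase scores into a selection rule is not described here, so the sketch stops at the per-phrase score drop.

```python
from typing import Callable, Dict

def phrase_support(
    caption: str,
    swaps: Dict[str, str],
    score: Callable[[str], float],
) -> Dict[str, float]:
    """Swap each phrase for its counterfactual and record the score drop.

    A large drop suggests the image supports the original phrase;
    a drop near zero suggests the phrase is not grounded in the image.
    """
    base = score(caption)
    support = {}
    for phrase, replacement in swaps.items():
        # Replace only the first occurrence of the phrase.
        edited = caption.replace(phrase, replacement, 1)
        support[phrase] = base - score(edited)
    return support

# ASSUMPTION: a toy stand-in for an image-text scorer. The "image" is
# reduced to a set of tags, and the score is the fraction of caption
# words that match a tag. A real pipeline would score the caption
# against the actual image with a vision-language model.
IMAGE_TAGS = {"black", "dog", "grass"}

def toy_score(text: str) -> float:
    words = text.lower().split()
    return sum(w in IMAGE_TAGS for w in words) / len(words)

caption = "a black dog chasing a ball on the grass"
# Hypothetical counterfactual replacements, chosen by hand for the demo.
swaps = {"black": "white", "dog": "cat", "ball": "frisbee"}

support = phrase_support(caption, swaps, toy_score)
for phrase, drop in support.items():
    print(f"{phrase}: score drop {drop:.3f}")
```

In this toy setup, swapping "black" or "dog" lowers the score because both are in the image, while swapping "ball" changes nothing, which flags "ball" as a phrase the image does not support.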