Do vision models really understand object parts?

Olaf Dünkel, Basavaraj Sunagad, Haoran Wang, David T. Hoffmann, Christian Theobalt, Adam Kortylewski

SOCO benchmarks how well vision and vision-language models understand object parts by testing semantic correspondence, that is, matching keypoints between images, across 100 object categories and over 1M image pairs. Although foundation models encode semantic structure, they transfer poorly between related categories, and language-guided models outperform visual matching at part localization. Correspondence performance also predicts downstream tasks such as pose estimation and tracking better than ImageNet accuracy does.
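
To make the task concrete, here is a minimal sketch of how keypoint matching between two images might be scored. It is not the SOCO evaluation protocol, which this summary does not specify. It assumes nearest-neighbour matching of per-location feature descriptors by cosine similarity, scored with PCK (percentage of correct keypoints), a metric commonly used for semantic correspondence. The function names, the toy descriptors, and the threshold `alpha=0.1` are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def match_keypoints(src_desc, tgt_desc, tgt_xy):
    """For each source keypoint descriptor, return the (x, y) position of the
    most similar target descriptor (hypothetical nearest-neighbour matcher)."""
    preds = []
    for d in src_desc:
        best = max(range(len(tgt_desc)), key=lambda j: cosine(d, tgt_desc[j]))
        preds.append(tgt_xy[best])
    return preds


def pck(pred_xy, gt_xy, bbox_size, alpha=0.1):
    """Fraction of predictions within alpha * bbox_size of the ground truth.
    The threshold convention is an assumption, not taken from the source."""
    thresh = alpha * bbox_size
    correct = sum(1 for p, g in zip(pred_xy, gt_xy) if math.dist(p, g) <= thresh)
    return correct / len(gt_xy)


# Toy data (made up): two source keypoints, three candidate target locations.
src_desc = [[1.0, 0.0], [0.0, 1.0]]
tgt_desc = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
tgt_xy = [(10, 10), (50, 50), (30, 30)]
gt_xy = [(12, 10), (80, 80)]  # annotated target positions of the source keypoints

preds = match_keypoints(src_desc, tgt_desc, tgt_xy)
score = pck(preds, gt_xy, bbox_size=100, alpha=0.1)
print(preds, score)
```

In this toy case the first keypoint lands within the threshold and the second does not, so the score is one of two keypoints correct. In practice the descriptors would come from a model's feature map rather than hand-written vectors.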