Do vision models really understand object parts?
Olaf Dünkel, Basavaraj Sunagad, Haoran Wang, David T. Hoffmann, Christian Theobalt, Adam Kortylewski
May 29, 2026
SOCO benchmarks how well vision and vision-language models understand object parts by testing semantic correspondence, that is, matching keypoints across 100 object categories in more than 1M image pairs. Although foundation models encode semantic structure, they transfer poorly between related categories, and language-guided models outperform visual matching at part localization. Correspondence performance predicts downstream tasks such as pose estimation and tracking better than ImageNet accuracy does.
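To make the idea of keypoint matching concrete, here is a minimal sketch of one common way semantic correspondence is tested: nearest-neighbor matching of per-keypoint features by cosine similarity, scored by the share of predictions that land near the ground truth. This is an illustration only and not SOCO's evaluation protocol, which the summary does not describe. The function names, the toy features and coordinates, and the pixel threshold are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def match_keypoints(src_feats, tgt_feats):
    """For each source feature, return the index of the most similar target feature."""
    matches = []
    for f in src_feats:
        sims = [cosine(f, g) for g in tgt_feats]
        matches.append(max(range(len(sims)), key=sims.__getitem__))
    return matches

def fraction_correct(pred_xy, gt_xy, threshold):
    """Share of predicted points within `threshold` pixels of the ground truth."""
    hits = sum(1 for p, g in zip(pred_xy, gt_xy) if math.dist(p, g) <= threshold)
    return hits / len(gt_xy)

# Toy data (hypothetical): two source keypoints, three candidate target locations.
src_feats = [[1.0, 0.0], [0.0, 1.0]]
tgt_feats = [[0.1, 0.9], [0.9, 0.2], [-1.0, 0.0]]
tgt_xy = [(10, 10), (40, 12), (25, 30)]   # pixel position of each target feature
gt_xy = [(41, 12), (30, 10)]              # annotated true match for each source keypoint

matches = match_keypoints(src_feats, tgt_feats)
pred_xy = [tgt_xy[i] for i in matches]
score = fraction_correct(pred_xy, gt_xy, threshold=5.0)
```

In this toy case the first keypoint is matched within the threshold and the second is not, so `score` is 0.5. In practice the features would come from a model's dense feature map rather than being written by hand.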