cs.CV

Do vision models really understand object parts?

Olaf Dünkel, Basavaraj Sunagad, Haoran Wang, David T. Hoffmann, Christian Theobalt, Adam Kortylewski

May 29, 2026

SOCO benchmarks how well vision and vision-language models understand object parts by testing semantic correspondence: matching keypoints between images, across 100 object categories and more than 1M image pairs. Although foundation models encode semantic structure, they transfer poorly between related categories, and language-guided models outperform visual matching at part localization. Correspondence performance also predicts downstream tasks such as pose estimation and tracking better than ImageNet accuracy does.
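To make the keypoint-matching task concrete, here is a minimal sketch of one common way semantic correspondence is evaluated: nearest-neighbor matching of dense features by cosine similarity, scored with PCK (percentage of correct keypoints). This is an illustrative assumption, not necessarily the protocol SOCO uses; the function names `match_keypoints` and `pck`, the threshold `alpha=0.1`, and the random toy features are all hypothetical stand-ins for real model features and annotations.

```python
import numpy as np

def match_keypoints(src_feats, tgt_feats, src_kps):
    """For each source keypoint (row, col), return the target location
    whose feature has the highest cosine similarity (hypothetical helper)."""
    h, w, d = tgt_feats.shape
    # L2-normalise so that a dot product equals cosine similarity.
    src = src_feats / (np.linalg.norm(src_feats, axis=-1, keepdims=True) + 1e-8)
    tgt = tgt_feats / (np.linalg.norm(tgt_feats, axis=-1, keepdims=True) + 1e-8)
    queries = src[src_kps[:, 0], src_kps[:, 1]]   # (K, D) features at keypoints
    sims = queries @ tgt.reshape(-1, d).T         # (K, H*W) similarities
    best = sims.argmax(axis=1)                    # flat index of best match
    return np.stack(np.unravel_index(best, (h, w)), axis=1)  # (K, 2)

def pck(pred_kps, gt_kps, bbox_size, alpha=0.1):
    """Fraction of predictions within alpha * bbox_size of the ground truth."""
    dists = np.linalg.norm(pred_kps - gt_kps, axis=1)
    return float((dists <= alpha * bbox_size).mean())

# Toy data (assumption): the "target" feature map is the source map
# cyclically shifted by (2, 3), so the true correspondences are known.
rng = np.random.default_rng(0)
src_feats = rng.normal(size=(16, 16, 32))
tgt_feats = np.roll(src_feats, shift=(2, 3), axis=(0, 1))

src_kps = np.array([[1, 1], [5, 7], [10, 4]])
gt_kps = (src_kps + np.array([2, 3])) % 16

pred_kps = match_keypoints(src_feats, tgt_feats, src_kps)
score = pck(pred_kps, gt_kps, bbox_size=16)
print(pred_kps, score)
```

In practice the feature maps would come from a vision backbone and the keypoints from part annotations; the toy shift here only serves to check that the matching and scoring logic behave as intended.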
Published as "SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models" (arXiv:2605.31597)