cs.CV

Vision models can hide what they're actually looking at

Narges Babadi, Hadis Karimipour

May 15, 2026

This work exposes a fundamental vulnerability in the explanation mechanisms used to interpret vision-language models. The authors introduce X-Shift, an attack that perturbs patch-level visual features to steer explanation heatmaps toward semantically irrelevant regions while leaving predictions intact, so a model can return the right answer while its explanation points at the wrong evidence. Tested on ImageNet-1k, MS-COCO, and Flickr30K across multiple CLIP architectures, the attack consistently degrades explanation alignment under imperceptible perturbations. Conventional adversarial attacks cannot reproduce this effect even with much larger perturbation budgets, which suggests that explanation faithfulness is a vulnerability distinct from prediction robustness. The findings challenge the reliability of explanation heatmaps as indicators of trustworthiness in high-stakes applications.
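To make the idea of "same prediction, different heatmap" concrete, here is a minimal toy sketch in plain Python. It is not X-Shift and does not reproduce the paper's method or results. It assumes a hypothetical model that classifies by mean-pooling patch features and explains itself with per-patch similarity to a text embedding. All names (`predict`, `heatmap`, `shift_explanation`), the feature values, and the budget `eps` are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def predict(patches, class_embs):
    """Toy classifier (assumption): mean-pool patch features, pick the best class."""
    dim = len(patches[0])
    pooled = [sum(p[i] for p in patches) / len(patches) for i in range(dim)]
    return argmax([dot(pooled, c) for c in class_embs])

def heatmap(patches, text_emb):
    """Toy explanation (assumption): per-patch similarity to the text embedding."""
    return [dot(p, text_emb) for p in patches]

def shift_explanation(patches, text_emb, target, eps):
    """Push the target patch toward the text direction and pull the others back.

    The perturbations sum to zero across patches, so the mean-pooled feature,
    and therefore the toy prediction, is unchanged up to rounding.
    """
    norm = math.sqrt(dot(text_emb, text_emb))
    unit = [x / norm for x in text_emb]
    n = len(patches)
    perturbed = []
    for i, patch in enumerate(patches):
        scale = eps if i == target else -eps / (n - 1)
        perturbed.append([x + scale * u for x, u in zip(patch, unit)])
    return perturbed

# Hypothetical 2-D patch features: patch 0 is the object, the rest is background.
patches = [[0.5, 0.0], [0.1, 0.2], [0.4, 0.3], [0.1, 0.1]]
class_embs = [[1.0, 0.0], [0.0, 1.0]]

label = predict(patches, class_embs)
text_emb = class_embs[label]
adv = shift_explanation(patches, text_emb, target=2, eps=0.1)

print("prediction before/after:", label, predict(adv, class_embs))
print("heatmap peak before/after:",
      argmax(heatmap(patches, text_emb)),
      argmax(heatmap(adv, text_emb)))
```

In this toy setup the predicted class stays the same while the heatmap peak moves from the object patch to a background patch. Real CLIP explanations and the attack described in the paper are considerably more involved; the sketch only shows why prediction robustness and explanation faithfulness can come apart.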
Published as "Right Predictions, Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations" (arXiv:2605.16651)