Vision models can hide what they're actually looking at

This work exposes a fundamental vulnerability in the explanation mechanisms used to interpret vision-language models. The authors introduce X-Shift, an attack that perturbs patch-level visual features to push explanation heatmaps toward semantically irrelevant regions while leaving predictions intact. The model still gives the right answer, but its heatmap no longer points at the evidence a viewer would expect. Tested on ImageNet-1k, MS-COCO, and Flickr30K across multiple CLIP architectures, the attack consistently degrades explanation alignment under imperceptible perturbations. Conventional adversarial attacks cannot reproduce this effect even with much larger perturbation budgets, which suggests that explanation faithfulness is a vulnerability distinct from prediction robustness. The findings call into question the reliability of explanation heatmaps as indicators of trustworthiness in high-stakes applications.
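To make the core idea concrete, here is a minimal toy sketch in plain Python. It is not the X-Shift algorithm, whose objective and optimizer are not described in this summary. It assumes a hypothetical model that predicts from mean-pooled patch features and builds its heatmap from each patch's similarity to the predicted class embedding. All names (`predict`, `heatmap`, `shift_explanation`) and all numbers are invented for illustration, and the shift is deliberately large so the effect is visible, unlike the imperceptible perturbations reported in the work.

```python
# Toy illustration only: a hypothetical mean-pooling model, not X-Shift itself.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_pool(patches):
    """Average the patch features into one image-level feature."""
    n = len(patches)
    return [sum(p[d] for p in patches) / n for d in range(len(patches[0]))]

def predict(patches, class_embeddings):
    """Pick the class whose embedding best matches the pooled feature."""
    pooled = mean_pool(patches)
    scores = {name: dot(pooled, emb) for name, emb in class_embeddings.items()}
    return max(scores, key=scores.get)

def heatmap(patches, text_embedding):
    """Score each patch by its similarity to the class embedding."""
    return [dot(p, text_embedding) for p in patches]

def shift_explanation(patches, text_embedding, src, dst, eps):
    """Move eps * text_embedding from patch src to patch dst.

    The two changes cancel, so the pooled feature (and the prediction)
    stays the same while the per-patch scores change.
    """
    out = [list(p) for p in patches]
    for d, t in enumerate(text_embedding):
        out[src][d] -= eps * t
        out[dst][d] += eps * t
    return out

# Assumed toy data: patch 0 covers the object, patches 1-3 are background.
CLASS_EMBEDDINGS = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
PATCHES = [[2.0, 0.0], [0.5, 0.25], [0.25, 0.5], [0.25, 0.25]]

clean_label = predict(PATCHES, CLASS_EMBEDDINGS)
clean_map = heatmap(PATCHES, CLASS_EMBEDDINGS[clean_label])

attacked = shift_explanation(
    PATCHES, CLASS_EMBEDDINGS[clean_label], src=0, dst=3, eps=1.0
)
attacked_label = predict(attacked, CLASS_EMBEDDINGS)
attacked_map = heatmap(attacked, CLASS_EMBEDDINGS[attacked_label])

print(clean_label, clean_map)        # cat [2.0, 0.5, 0.25, 0.25]
print(attacked_label, attacked_map)  # cat [1.0, 0.5, 0.25, 1.25]
```

In this toy the label stays "cat" both times, but the hottest patch moves from the object (patch 0) to a background patch (patch 3). A real attack would presumably have to find such a perturbation by optimization under a tight budget; the sketch only shows why an unchanged prediction does not guarantee an unchanged explanation.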