How do vision-language models find images in long documents?

Aaron Branson Cigres Li, Zhaowei Wang, Yu Zhao, Yiming Du, Haobo Li, Xiyu Ren, Ginny Wong, Simon See, Lishu Luo, Haodong Duan, Pasquale Minervini, Yangqiu Song

Vision-language models process long documents that mix text and images, but how do they pinpoint the relevant visual evidence? Researchers identified specialized "retrieval heads" that locate both text and image clues; they make up just 4.4–10.2% of attention heads. Disabling the top 5% of these heads causes a catastrophic drop in performance on document understanding and slide QA. Surprisingly, the same heads can also be used directly for document ranking without retraining, improving retrieval recall by 7.7 points.
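
To make the idea of a "retrieval head" more concrete, here is a minimal sketch of one way heads could be scored and the top fraction selected. It is an illustration under stated assumptions, not the authors' procedure: the summary does not say how heads are scored, so the score used here (the share of a head's attention mass that lands on known evidence positions) is an assumption, and the names `evidence_attention_score`, `rank_heads`, and `top_fraction`, as well as the toy attention values, are hypothetical.

```python
from typing import Dict, List, Set, Tuple

HeadId = Tuple[int, int]  # (layer index, head index)


def evidence_attention_score(weights: List[float], evidence: Set[int]) -> float:
    """Share of one head's attention mass that falls on evidence positions.

    ASSUMPTION: this scoring rule is illustrative; the summary does not
    specify how retrieval heads are actually identified.
    """
    total = sum(weights)
    if total == 0:
        return 0.0
    on_evidence = sum(w for i, w in enumerate(weights) if i in evidence)
    return on_evidence / total


def rank_heads(
    attn: Dict[HeadId, List[float]], evidence: Set[int]
) -> List[Tuple[HeadId, float]]:
    """Sort heads from highest to lowest evidence-attention score."""
    scored = [(head, evidence_attention_score(w, evidence)) for head, w in attn.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def top_fraction(ranked: List[Tuple[HeadId, float]], fraction: float) -> List[HeadId]:
    """Return the ids of the top `fraction` of ranked heads (at least one head)."""
    k = max(1, round(len(ranked) * fraction))
    return [head for head, _ in ranked[:k]]


# Toy data: 4 heads attending over 5 context positions (text or image tokens).
# Positions 2 and 3 are assumed to hold the evidence needed for the answer.
toy_attention: Dict[HeadId, List[float]] = {
    (0, 0): [0.20, 0.20, 0.20, 0.20, 0.20],
    (0, 1): [0.05, 0.05, 0.60, 0.25, 0.05],
    (1, 0): [0.70, 0.10, 0.10, 0.05, 0.05],
    (1, 1): [0.10, 0.10, 0.30, 0.40, 0.10],
}
evidence_positions = {2, 3}

ranked_heads = rank_heads(toy_attention, evidence_positions)
selected_heads = top_fraction(ranked_heads, 0.5)  # 0.5 only because the toy set is tiny
print(selected_heads)
```

In this toy setup, the selected heads are the ones a masking experiment would disable, and their attention over candidate pages is what a ranking step would aggregate. A real analysis would average scores over many examples rather than a single attention row.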