cs.CV

How do vision-language models find images in long documents?

Aaron Branson Cigres Li, Zhaowei Wang, Yu Zhao, Yiming Du, Haobo Li, Xiyu Ren, Ginny Wong, Simon See, Lishu Luo, Haodong Duan, Pasquale Minervini, Yangqiu Song

May 26, 2026

Vision-language models process long documents that mix text and images, but how do they pinpoint the relevant visual evidence? The researchers identify specialized "retrieval heads" (just 4.4–10.2% of attention heads) that locate both text and image clues. Disabling the top 5% of these heads causes performance on document understanding and slide QA to collapse. Surprisingly, these heads can also be used directly for document ranking without retraining, improving retrieval recall by 7.7 points.
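To make the ranking idea concrete, here is a minimal sketch of how attention from a few selected heads could be turned into document scores. The summary does not spell out the paper's scoring rule, so everything here is an assumption for illustration: the function name `rank_documents`, the `(layer, head)` identifiers, the averaging over heads, and the premise that attention weights have already been extracted and averaged over the query tokens.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical identifier for an attention head: (layer index, head index).
Head = Tuple[int, int]


def rank_documents(
    attention: Dict[Head, Sequence[float]],
    retrieval_heads: Sequence[Head],
    doc_spans: Sequence[Tuple[int, int]],
) -> List[int]:
    """Return document indices sorted by attention mass, highest first.

    attention[h][t] is assumed to be the weight head h places on context
    position t, already averaged over the query tokens.
    doc_spans[i] is the half-open token range [start, end) of document i.
    """
    scores = []
    for start, end in doc_spans:
        # Sum the attention each selected head places on this document's tokens.
        total = 0.0
        for head in retrieval_heads:
            total += sum(attention[head][start:end])
        # Average over the selected heads so the score does not depend on how many are used.
        scores.append(total / len(retrieval_heads))
    return sorted(range(len(doc_spans)), key=lambda i: scores[i], reverse=True)


# Toy example: two documents of two tokens each, three heads, two of them selected.
toy_attention = {
    (0, 0): [0.1, 0.1, 0.6, 0.2],
    (1, 3): [0.05, 0.05, 0.7, 0.2],
    (2, 1): [0.9, 0.1, 0.0, 0.0],  # not selected, so it is ignored
}
ranking = rank_documents(toy_attention, [(0, 0), (1, 3)], [(0, 2), (2, 4)])
print(ranking)  # document 1 receives more attention than document 0
```

Summing over a span favors longer documents, so a real system would likely need some form of length normalization; this sketch leaves that out for brevity.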
Published as "Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models", arXiv:2605.27243