cs.CV

Why vision-language models ignore images and hallucinate objects

Meng Shen, Minghao Wu, Deepu Rajan

May 20, 2026

Large vision-language models (LVLMs) often describe objects that are not actually in the image, a problem called object hallucination. The authors find that most generated tokens depend only weakly on image information, suggesting that models learn to follow text instructions better than they extract visual details. They propose two fixes: reweighting the training loss to emphasize tokens that truly depend on the image, and filtering out training data with high hallucination risk. Both reduce false object descriptions while keeping response length unchanged and adding no inference cost.
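The reweighting idea can be illustrated with a minimal sketch. It assumes, hypothetically, that a token's image dependence is measured as the gap between its log-probability with and without the image, and that the loss weight grows linearly with that gap; the paper's actual measure and weighting scheme may differ. The function names and the `alpha` parameter are illustrative and not taken from the paper.

```python
def image_dependence(logp_with_image, logp_text_only):
    """Per-token gap between image-conditioned and text-only log-probabilities.

    Assumption: a large positive gap means the token relies on the image.
    """
    if len(logp_with_image) != len(logp_text_only):
        raise ValueError("both inputs must score the same tokens")
    return [a - b for a, b in zip(logp_with_image, logp_text_only)]


def token_weights(dependence, alpha=1.0):
    """Hypothetical weighting: 1 + alpha * max(gap, 0), rescaled to mean 1."""
    if not dependence:
        raise ValueError("need at least one token")
    raw = [1.0 + alpha * max(d, 0.0) for d in dependence]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


def weighted_nll(logp_with_image, weights):
    """Token-weighted negative log-likelihood, averaged over tokens."""
    if len(logp_with_image) != len(weights):
        raise ValueError("one weight per token is required")
    total = sum(w * lp for w, lp in zip(weights, logp_with_image))
    return -total / len(weights)


# Toy example: only the third token is much more likely given the image.
logp_img = [-0.1, -0.2, -1.5, -0.1]
logp_txt = [-0.1, -0.3, -4.0, -0.1]

weights = token_weights(image_dependence(logp_img, logp_txt))
loss = weighted_nll(logp_img, weights)
```

Because the weights are rescaled to average 1, the loss stays on the same scale as ordinary cross-entropy, and the weighting applies only during training, which is consistent with the claim of no added inference cost.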
Published as "Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens" (arXiv:2605.21300)