Why vision-language models ignore images and hallucinate objects

Large vision-language models often describe objects that are not actually in the image, a problem known as hallucination. The authors found that most generated tokens depend only minimally on image information, which suggests that models learn to follow text instructions better than they extract visual details. They propose two fixes: reweight training to emphasize tokens that truly depend on the image, and filter out training data with a high hallucination risk. Both reduce false object descriptions while keeping response length unchanged and adding zero inference cost.
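
To make the two fixes concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the summary does not say how image dependence or hallucination risk is measured, so the sketch assumes a token's dependence is the gain in log-probability when the model sees the image versus text alone. The function names, the `floor` parameter, the `max_risk` threshold, and all toy numbers are hypothetical.

```python
def visual_dependence(logp_with_image, logp_text_only):
    """Assumed proxy: per-token log-probability gain from seeing the image."""
    return [max(w - t, 0.0) for w, t in zip(logp_with_image, logp_text_only)]


def token_weights(dependence, floor=0.1):
    """Turn dependence scores into loss weights whose mean is 1.

    `floor` (hypothetical) keeps text-driven tokens from getting zero weight.
    """
    raw = [floor + d for d in dependence]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


def reweighted_nll(logp_with_image, weights):
    """Weighted mean negative log-likelihood over the target tokens."""
    total = sum(w * lp for w, lp in zip(weights, logp_with_image))
    return -total / len(weights)


def filter_samples(samples, risks, max_risk=0.5):
    """Keep samples whose (externally supplied) risk score is at most max_risk."""
    return [s for s, r in zip(samples, risks) if r <= max_risk]


# Toy per-token log-probabilities for a 4-token response.
# Only the third token is much more likely once the image is visible.
logp_img = [-0.2, -0.3, -1.5, -0.4]
logp_txt = [-0.25, -0.3, -4.0, -0.5]

dep = visual_dependence(logp_img, logp_txt)
weights = token_weights(dep)
loss = reweighted_nll(logp_img, weights)
plain_loss = -sum(logp_img) / len(logp_img)

# Hypothetical risk scores for three training samples.
kept = filter_samples(["a", "b", "c"], [0.1, 0.9, 0.4])
```

In this toy case the image-dependent third token receives the largest weight, so its error dominates the loss. Both steps act only on training data and the training objective, which is consistent with the claim that inference cost is unchanged.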