Why do image-judging AI models trust text over what they see?
Seojeong Park, Jiho Choi, Junyong Kang, Seonho Lee, Jaeyo Shin, Hyunjung Shim
June 1, 2026
Multimodal large language models (MLLMs) used as evaluators have a blind spot: when text and image conflict, they trust the textual narrative over what they actually see. The researchers built a dataset of minimally edited counterfactual responses to expose this "perceptual judgment bias," then trained judges with a structured reward framework that combines Group Relative Policy Optimization (GRPO) with ranking objectives. The reported results show substantial gains in perceptual accuracy and in alignment with human evaluation across multiple benchmarks.
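To make the training idea more concrete, here is a minimal sketch of how a ranking-style reward could feed into group-relative advantages, the normalization step used in group-relative policy optimization. It is an illustration under stated assumptions, not the paper's implementation: the function names, the 0/1 ranking reward, and the sample scores are all hypothetical, and the actual reward design in the paper may differ.

```python
from statistics import mean, pstdev


def ranking_reward(score_faithful: float, score_counterfactual: float) -> float:
    """Hypothetical pairwise term (an assumption, not the paper's reward).

    Returns 1.0 when the judge scores the visually faithful response
    strictly higher than its minimally edited counterfactual, else 0.0.
    """
    return 1.0 if score_faithful > score_counterfactual else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group.

    Each advantage is (reward - group mean) / (group std + eps), so a
    sample is credited only for doing better than its siblings.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up example: four judgments sampled for the same image.
# Each pair is (score given to faithful response, score given to counterfactual).
samples = [(8.0, 3.0), (6.0, 7.0), (9.0, 2.0), (5.0, 5.0)]

rewards = [ranking_reward(f, c) for f, c in samples]
advantages = group_relative_advantages(rewards)

for pair, r, a in zip(samples, rewards, advantages):
    print(pair, r, round(a, 3))
```

In this toy group, the two judgments that rank the faithful response above the counterfactual receive positive advantages and the other two receive negative ones, so a policy update would push the judge toward rankings grounded in the image.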