cs.CV

Why do image-judging AI models trust text over what they see?

Seojeong Park, Jiho Choi, Junyong Kang, Seonho Lee, Jaeyo Shin, Hyunjung Shim

June 1, 2026

Multimodal large language models used as evaluators have a blind spot: when a response's text conflicts with the image, they tend to side with the text rather than with what the image actually shows. To expose this "perceptual judgment bias," the authors built a dataset of counterfactual responses, each minimally edited so that it contradicts the visual evidence. They then trained judge models with a structured reward framework that combines group-relative policy optimization with ranking objectives. The authors report substantial gains in perceptual accuracy and in alignment with human evaluation across multiple benchmarks.
Published as "Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling" (arXiv:2606.02578).