Why do image-judging AI models trust text over what they see?

Multimodal large language models used as evaluators have a blind spot: when the text of a response conflicts with the image it describes, they tend to side with the text rather than with what the image shows. To expose this "perceptual judgment bias," researchers built a dataset of counterfactual responses, each minimally edited so that it contradicts the image. They then trained judges with a structured reward framework that combines group-relative policy optimization (GRPO) with ranking objectives. The trained judges show substantial gains in perceptual accuracy and closer alignment with human evaluations across multiple benchmarks.
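
To make the two training ingredients concrete, here is a minimal sketch in plain Python. It is not the researchers' implementation: the summary gives no formulas, so the code assumes the commonly used group-relative advantage (each sampled judgment's reward normalized by the mean and standard deviation of its group) and a standard pairwise logistic ranking loss that prefers the image-faithful response over its counterfactual edit. The function names, the example rewards, and the example scores are all hypothetical.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled judgments.

    Assumed form: A_i = (r_i - mean(r)) / (std(r) + eps).
    A judgment is rewarded only relative to its siblings in the group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def pairwise_ranking_loss(score_faithful, score_counterfactual):
    """Logistic ranking loss: -log(sigmoid(s_faithful - s_counterfactual)).

    Small when the judge scores the image-faithful response above the
    minimally edited counterfactual, large when the order is reversed.
    Written in a numerically stable form.
    """
    diff = score_faithful - score_counterfactual
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Hypothetical group of four sampled judgments for one image/response pair:
# reward 1.0 if the judge caught the visual contradiction, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Hypothetical judge scores for a faithful response and its counterfactual.
correct_order = pairwise_ranking_loss(2.0, 0.0)
wrong_order = pairwise_ranking_loss(0.0, 2.0)
```

In this sketch the judgments that caught the contradiction receive positive advantages and the others negative ones, while the ranking loss penalizes a judge that scores the counterfactual above the faithful response. How the two terms are weighted against each other is not stated in the summary, so no combined objective is shown.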