cs.CL

Teaching vision-language models to actually look at images before reasoning

Changyuan Tian, Zhicong Lu, Huaxing Liu, Xiang Wang, Shuai Li, Yu Chen, Wenqian Lv, Zichuan Lin, Juncheng Diao, Deheng Ye

May 21, 2026

Multimodal language models often reason unfaithfully: they perceive visual evidence correctly but then ignore or contradict it while reasoning. Faithful-MR1 addresses this in two stages. The first anchors perception as an explicit pre-reasoning task by supervising visual attention directly on image regions rather than on text descriptions. The second reinforces faithful use of that evidence through counterfactual image interventions, rewarding answers only when the model attends to causally relevant visual features. On Qwen2.5-VL-Instruct models, the method improves results on multimodal reasoning benchmarks while using substantially less training data than recent baselines.
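To make the counterfactual-intervention idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the summary does not specify the reward, the intervention, or how attention is measured. The names `mask_region`, `counterfactual_reward`, `answer_fn`, and `evidence_box` are hypothetical, and the sketch assumes that sensitivity of the answer to masking the evidence region can stand in as a proxy for "attending to causally relevant visual features".

```python
from typing import Callable, Sequence

# Toy types (assumptions for illustration only).
Image = Sequence[Sequence[float]]   # grayscale image as rows of pixel values
Box = tuple[int, int, int, int]     # (top, left, bottom, right), end-exclusive


def mask_region(image: Image, box: Box, fill: float = 0.0) -> list[list[float]]:
    """Return a copy of `image` with every pixel inside `box` set to `fill`."""
    top, left, bottom, right = box
    return [
        [fill if top <= r < bottom and left <= c < right else px
         for c, px in enumerate(row)]
        for r, row in enumerate(image)
    ]


def counterfactual_reward(
    answer_fn: Callable[[Image, str], str],  # hypothetical model wrapper
    image: Image,
    question: str,
    gold: str,
    evidence_box: Box,
) -> float:
    """Reward 1.0 only if the answer is correct on the original image AND
    stops being correct once the evidence region is masked out.

    This is an assumed proxy: a model that answers correctly without relying
    on the evidence region keeps its answer under masking and gets 0.0.
    """
    if answer_fn(image, question) != gold:
        return 0.0
    intervened = mask_region(image, evidence_box)
    return 1.0 if answer_fn(intervened, question) != gold else 0.0
```

Under these assumptions, a model that guesses the right answer from language priors alone earns no reward, because its answer does not change when the relevant pixels are removed. A real training setup would replace the toy `answer_fn` with a call to the vision-language model and would need a source of evidence regions, neither of which this summary describes.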
Published as "Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention", arXiv:2605.22072