Teaching vision-language models to actually look at images before reasoning

Changyuan Tian, Zhicong Lu, Huaxing Liu, Xiang Wang, Shuai Li, Yu Chen, Wenqian Lv, Zichuan Lin, Juncheng Diao, Deheng Ye

Multimodal language models often reason unfaithfully: they perceive visual evidence correctly but then ignore or contradict it while reasoning. Faithful-MR1 addresses this in two stages. The first anchors perception as an explicit pre-reasoning task by supervising visual attention directly on image regions rather than on text descriptions. The second reinforces faithful use of that evidence through counterfactual image interventions, rewarding an answer only when the model attends to causally relevant visual features. On Qwen2.5-VL-Instruct models, the method improves performance on multimodal reasoning benchmarks while using substantially less training data than recent baselines.
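
To make the counterfactual intervention idea concrete, here is a minimal sketch of one possible reading of it, not the authors' formulation. It assumes that "attending to causally relevant features" can be approximated by checking whether a correct answer changes once the relevant region is masked out. The names `mask_region`, `counterfactual_reward`, and `answer_fn`, the toy list-of-lists image format, and the binary 0/1 reward are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]          # toy grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]    # (top, left, bottom, right), end-exclusive


def mask_region(image: Image, box: Box, fill: float = 0.0) -> Image:
    """Return a copy of the image with the boxed region replaced by `fill`."""
    top, left, bottom, right = box
    return [
        [fill if top <= r < bottom and left <= c < right else px
         for c, px in enumerate(row)]
        for r, row in enumerate(image)
    ]


def counterfactual_reward(
    answer_fn: Callable[[Image, str], str],  # hypothetical stand-in for the model
    image: Image,
    question: str,
    gold: str,
    relevant_box: Box,
) -> float:
    """Assumed reward: 1.0 only if the answer is correct on the original image
    AND changes when the causally relevant region is masked out."""
    original = answer_fn(image, question)
    if original != gold:
        return 0.0  # an incorrect answer earns nothing
    intervened = answer_fn(mask_region(image, relevant_box), question)
    # An unchanged answer suggests the model never relied on the region.
    return 1.0 if intervened != original else 0.0


# Toy usage: one "model" reads the top-left pixel, the other ignores the image.
image = [[1.0, 0.2], [0.3, 0.4]]
looks_at_pixel = lambda img, q: "bright" if img[0][0] > 0.5 else "dark"
ignores_image = lambda img, q: "bright"

faithful_score = counterfactual_reward(
    looks_at_pixel, image, "Is the corner bright?", "bright", (0, 0, 1, 1))
shortcut_score = counterfactual_reward(
    ignores_image, image, "Is the corner bright?", "bright", (0, 0, 1, 1))
```

Under these assumptions, both toy models answer correctly on the original image, but only the one whose answer depends on the masked pixel is rewarded. A real training setup would presumably use a learned model, richer interventions than masking, and a reward integrated into a reinforcement learning objective, none of which this sketch attempts to reproduce.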