Teaching multimodal AI to spot fine details by learning from itself
Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun, Xianpei Han, Yaojie Lu
May 18, 2026
Multimodal language models often miss fine-grained visual details in full images, even though they answer the same fine-grained questions accurately when shown a crop of the relevant evidence. Vision-OPD addresses this gap with a teacher-student setup inside a single model: a crop-conditioned teacher sees the zoomed-in evidence, while a full-image student learns to mimic the teacher's reasoning from the complete image. The method minimizes the divergence between the teacher's and the student's token-level predictions and requires no external labels, reward models, or runtime tools. On fine-grained benchmarks, Vision-OPD models match or exceed much larger open-source and closed-source models.
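To make the token-level objective concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names are hypothetical, the choice of KL(teacher || student) as the divergence is an assumption (the summary says only that a token-level divergence is minimized), and the logits are toy values standing in for two forward passes of the same model, one conditioned on the crop and one on the full image.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def token_kl(teacher_logits: Sequence[float], student_logits: Sequence[float]) -> float:
    """KL(teacher || student) for a single token position.

    Assumption: the divergence direction is chosen for illustration only.
    """
    log_p = log_softmax(teacher_logits)  # crop-conditioned teacher
    log_q = log_softmax(student_logits)  # full-image student
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def sequence_divergence(
    teacher_seq: Sequence[Sequence[float]],
    student_seq: Sequence[Sequence[float]],
) -> float:
    """Mean per-token divergence over one response (hypothetical helper)."""
    if len(teacher_seq) != len(student_seq) or len(teacher_seq) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    per_token = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    # Toy logits over a 3-word vocabulary for a 2-token response.
    teacher = [[2.0, 0.5, 0.1], [0.2, 3.0, 0.1]]
    student = [[1.0, 1.0, 1.0], [0.2, 3.0, 0.1]]
    print(sequence_divergence(teacher, student))
```

The value is zero when the student's predictions already match the teacher's at every position and grows as they drift apart, so it can serve as a training signal computed entirely from the model's own outputs.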