cs.CV

Teaching multimodal AI to spot fine details by learning from itself

Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun, Xianpei Han, Yaojie Lu

May 18, 2026

Multimodal language models often miss fine-grained visual details when given a full image, even though they answer the same fine-grained questions accurately when shown a crop of the relevant evidence. Vision-OPD addresses this gap with a teacher-student setup inside a single model: a crop-conditioned teacher sees the zoomed-in evidence, while a full-image student learns to mimic the teacher's reasoning from the complete image. Training minimizes the divergence between the two sets of token-level predictions and requires no external labels, reward models, or runtime tools. On fine-grained benchmarks, Vision-OPD models match or exceed much larger open-source and closed-source models.
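To make the token-level objective concrete, here is a minimal sketch in plain Python, not the paper's implementation. It assumes the divergence is a per-token KL(student || teacher) averaged over one sampled response. The summary says only that a token-level divergence is minimized, so the direction of the KL, the averaging, and the helper names `softmax`, `token_kl`, and `sequence_divergence` are illustrative assumptions. In a real setup both sets of logits would come from the same model, once conditioned on the full image (student) and once on the crop (teacher).

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
    )


def sequence_divergence(student_logits, teacher_logits):
    """Mean per-token KL(student || teacher) over one response.

    Assumption: both inputs hold one logit vector per token position
    of the same student-sampled response.
    """
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must cover the same tokens")
    per_token = [
        token_kl(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)


# Toy data: 2 token positions, vocabulary of 3 (values are made up).
student = [[2.0, 1.0, 0.1], [0.5, 0.5, 0.5]]  # full-image view
teacher = [[2.0, 1.0, 0.1], [3.0, 0.2, 0.1]]  # crop-conditioned view

loss = sequence_divergence(student, teacher)
print(f"divergence: {loss:.4f}")
```

In this toy case the two views agree on the first token and disagree on the second, so the divergence is positive. It drops to zero only when the full-image predictions match the crop-conditioned ones at every position.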
Published as "Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation" (arXiv:2605.18740)