Teaching multimodal AI to spot fine details by learning from itself

Multimodal language models often miss fine-grained visual details when given a full image, even though they answer the same fine-grained questions accurately when shown a crop of the relevant evidence. Vision-OPD addresses this gap with a teacher-student setup inside a single model: a crop-conditioned teacher sees the zoomed-in evidence, while a full-image student learns to reproduce the teacher's reasoning from the complete image. Training minimizes the divergence between the teacher's and the student's token-level predictions and requires no external labels, reward models, or runtime tools. On fine-grained benchmarks, Vision-OPD models match or exceed much larger open-source and closed-source models.
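To make the token-level objective concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the Vision-OPD implementation: the summary does not say which divergence is used or in which direction, so the sketch assumes a forward KL divergence, KL(teacher || student), averaged over token positions. The function names and the toy logits are hypothetical. In practice both sets of logits would come from the same model, conditioned on the crop for the teacher and on the full image for the student.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def token_kl(teacher_logits: Sequence[float], student_logits: Sequence[float]) -> float:
    """KL(teacher || student) at a single token position.

    Assumption: forward KL is used here purely for illustration.
    """
    log_p = log_softmax(teacher_logits)  # crop-conditioned teacher
    log_q = log_softmax(student_logits)  # full-image student
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def sequence_divergence(
    teacher_seq: Sequence[Sequence[float]],
    student_seq: Sequence[Sequence[float]],
) -> float:
    """Mean per-token divergence over a response of equal length."""
    if len(teacher_seq) != len(student_seq) or len(teacher_seq) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    per_token = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token)


# Toy example: two token positions over a three-word vocabulary.
teacher = [[4.0, 0.5, 0.1], [0.2, 3.0, 0.3]]  # confident, having seen the crop
student = [[1.0, 0.9, 0.8], [0.5, 0.6, 0.4]]  # uncertain on the full image
loss = sequence_divergence(teacher, student)
```

The loss is zero when the student's per-token distributions equal the teacher's and grows as they diverge, so minimizing it pushes the full-image predictions toward the crop-conditioned ones. In a real training loop the teacher's logits would typically be treated as fixed targets, with gradients flowing only through the student; that detail is also an assumption here.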