cs.CV

Making image understanding and generation work together

Songsong Yu, Yuxin Chen, Ying Shan, Yanwei Li

May 18, 2026

Unified multimodal models that handle both vision understanding and generation typically train the two tasks separately, leaving them isolated in feature space. This work identifies image segmentation as an optimal generative proxy task for bridging this gap, and finds it more effective than low-level tasks. The authors introduce Semantic Generative Tuning (SGT), which uses segmentation during post-training to strengthen both vision-language understanding and generative layout fidelity. Mechanistic analysis shows that SGT improves feature separability and attention allocation between the visual and textual modalities. On standard benchmarks, the method consistently improves multimodal comprehension and image generation quality. Code is released.
Published as "Semantic Generative Tuning for Unified Multimodal Models", arXiv:2605.18714