Making image understanding and generation work together
Songsong Yu, Yuxin Chen, Ying Shan, Yanwei Li
May 18, 2026
Unified multimodal models that handle both visual understanding and image generation typically train the two tasks separately, leaving them isolated in feature space. This work identifies image segmentation as an optimal generative proxy task, more effective than low-level tasks, for bridging this gap. The authors introduce Semantic Generative Tuning (SGT), which uses segmentation during post-training to strengthen both vision-language understanding and generative layout fidelity. Mechanistic analysis shows that SGT improves feature separability and attention allocation between the visual and textual modalities. On standard benchmarks, the method consistently improves multimodal comprehension and image generation quality. The code has been released.