Making image understanding and generation work together

Unified multimodal models that handle both visual understanding and image generation typically train the two tasks separately, leaving them isolated in feature space. This work identifies image segmentation as an optimal generative proxy task for bridging this gap, and finds it more effective than low-level tasks. The authors introduce Semantic Generative Tuning (SGT), which uses segmentation during post-training to strengthen both vision-language understanding and the layout fidelity of generated images. Mechanistic analysis shows that SGT improves feature separability and the allocation of attention between visual and textual modalities. On standard benchmarks, the method consistently improves both multimodal comprehension and image generation quality. Code is released.