cs.CV

Why do multimodal models still need frozen image encoders?

Yuqing Wang, Zhijie Lin, Ceyuan Yang, Yang Zhao, Fei Xiao, Hao He, Qi Zhao, Zihan Ding, Fuyun Wang, Shuai Wang, Youliang Zhang, Haoqi Fan, Xihui Liu

May 29, 2026

Unified multimodal models typically rely on a frozen, separately trained VAE for image generation, a structural limitation. This work proposes Representation Forcing (RF), which teaches the model to predict visual representations as intermediate tokens before generating pixels. These tokens remain in context and guide diffusion within the same backbone, eliminating the need for an external generative latent space. According to the paper, pixel-space models with RF match state-of-the-art VAE-based systems on image generation while outperforming them on understanding tasks.
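The two-stage flow described above (predict representation tokens first, keep them in context, then denoise pixels conditioned on that context) can be illustrated with a toy sketch. This is not the paper's implementation: the function names, the token counts, and the arithmetic standing in for the backbone and the diffusion step are all hypothetical, chosen only to make the control flow visible.

```python
import random


def predict_representation_tokens(prompt_tokens, num_rep_tokens=4):
    # Hypothetical stand-in for the backbone predicting visual
    # representation tokens from the prompt (no real model involved).
    rng = random.Random(sum(prompt_tokens))
    return [rng.random() for _ in range(num_rep_tokens)]


def denoise_step(pixels, context, num_prompt_tokens, t):
    # Toy denoiser (an assumption, not the paper's diffusion step):
    # reads the representation tokens from the shared context and
    # moves each pixel a fraction 1/t of the way toward their mean.
    rep_tokens = context[num_prompt_tokens:]
    target = sum(rep_tokens) / len(rep_tokens)
    return [p + (target - p) / t for p in pixels]


def generate(prompt_tokens, num_pixels=8, steps=10, seed=0):
    # Stage 1: predict intermediate representation tokens.
    rep_tokens = predict_representation_tokens(prompt_tokens)
    # The representation tokens are appended to the context and stay there.
    context = list(prompt_tokens) + rep_tokens
    # Stage 2: start from noise in pixel space and denoise, conditioned
    # on the same context (no external VAE latent space in this sketch).
    rng = random.Random(seed)
    pixels = [rng.gauss(0.0, 1.0) for _ in range(num_pixels)]
    for t in range(steps, 0, -1):
        pixels = denoise_step(pixels, context, len(prompt_tokens), t)
    return rep_tokens, context, pixels


rep_tokens, context, pixels = generate([3, 1, 4])
print(len(rep_tokens), len(context), len(pixels))
```

In this toy version the final denoising step lands every pixel on the mean of the representation tokens, which only shows that the pixel stage is driven by tokens the same model produced earlier. A real system would replace both stand-in functions with a learned network.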
Published as "Representation Forcing for Bottleneck-Free Unified Multimodal Models," arXiv:2605.31604