Why do multimodal models still need frozen image encoders?

Yuqing Wang, Zhijie Lin, Ceyuan Yang, Yang Zhao, Fei Xiao, Hao He, Qi Zhao, Zihan Ding, Fuyun Wang, Shuai Wang, Youliang Zhang, Haoqi Fan, Xihui Liu

Unified multimodal models typically rely on a frozen, separately trained VAE for image generation, which is a structural limitation. This work proposes Representation Forcing (RF), which teaches the model to predict visual representations as intermediate tokens before generating pixels. These tokens remain in context to guide diffusion within the same backbone, eliminating the need for an external generative latent space. As a result, pixel-space models with RF match state-of-the-art VAE-based systems on image generation while outperforming them on understanding tasks.
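To make the ordering described above concrete, here is a minimal sketch of one way such a sequence could be laid out: text tokens first, then predicted representation tokens, then pixel tokens that are denoised with the earlier tokens still in context. The function names, the token counts, and the attention pattern are all assumptions made for illustration. In particular, the pattern used here (causal over text and representation tokens, bidirectional among pixel tokens) is not taken from the abstract, which does not specify the masking scheme.

```python
from typing import List


def build_layout(n_text: int, n_rep: int, n_pix: int) -> List[str]:
    """Label each position of a hypothetical sequence.

    Order assumed here: text prompt, then predicted visual
    representation tokens, then pixel tokens to be denoised.
    """
    return ["text"] * n_text + ["rep"] * n_rep + ["pix"] * n_pix


def build_mask(layout: List[str]) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption (not from the source): text and representation tokens
    attend causally, while pixel tokens attend to everything before them
    and to each other in both directions.
    """
    n = len(layout)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i
            both_pix = layout[i] == "pix" and layout[j] == "pix"
            mask[i][j] = causal or both_pix
    return mask


# Toy sizes chosen only for readability.
layout = build_layout(n_text=3, n_rep=2, n_pix=4)
mask = build_mask(layout)
```

In this sketch every pixel position can see the representation tokens, which is the sense in which they "remain in context" during diffusion, while the representation tokens themselves never look ahead at pixels.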