Using vision foundation models as better image tokenizers

Anlin Zheng, Qi Han, Xin Wen, Chuofan Ma, Lanxi Gong, Gang Yu, Xiangyu Zhang, Xiaojuan Qi

Building image tokenizers from frozen vision foundation models sounds backwards—why not train from scratch?—but it works. The team added region-adaptive quantization to remove spatial redundancy and a semantic reconstruction loss to preserve what the foundation model already learned. The result: VFMTok works in both discrete (autoregressive) and continuous (diffusion) generation, trains 3× faster, needs no classifier-free guidance for class-conditional synthesis, and hits state-of-the-art gFID scores (1.36 discrete, 1.25 continuous). They also show which self-supervised pretraining objectives matter most for turning any foundation model into a good tokenizer.