Using vision foundation models as better image tokenizers
Anlin Zheng, Qi Han, Xin Wen, Chuofan Ma, Lanxi Gong, Gang Yu, Xiangyu Zhang, Xiaojuan Qi
May 18, 2026
Building image tokenizers from frozen vision foundation models sounds backwards (why not train from scratch?), but it works. The authors add region-adaptive quantization to remove spatial redundancy and a semantic reconstruction loss to preserve what the foundation model has already learned. The resulting tokenizer, VFMTok, works for both discrete (autoregressive) and continuous (diffusion) generation, trains 3× faster, needs no classifier-free guidance for class-conditional synthesis, and reaches state-of-the-art gFID scores (1.36 discrete, 1.25 continuous). They also show which self-supervised pretraining objectives matter most for turning a foundation model into a good tokenizer.
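To make the idea of a semantic reconstruction loss concrete, here is a minimal sketch. It is not the paper's implementation. It assumes the loss is the mean cosine distance between features decoded from the tokens and the frozen foundation model's features at the same patch positions, which is one common way to write such an objective. The function names and the `semantic_weight` parameter are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_a * norm_b)


def semantic_reconstruction_loss(
    reconstructed: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
) -> float:
    """Mean (1 - cosine similarity) over patch positions.

    `reconstructed` holds features decoded from the tokens; `target` holds
    the frozen foundation model's features for the same patches.
    Assumed form of the loss, for illustration only.
    """
    if len(reconstructed) != len(target) or len(target) == 0:
        raise ValueError("need the same non-zero number of patches")
    distances = [
        1.0 - cosine_similarity(r, t) for r, t in zip(reconstructed, target)
    ]
    return sum(distances) / len(distances)


def total_loss(
    pixel_loss: float, semantic_loss: float, semantic_weight: float = 1.0
) -> float:
    """Hypothetical weighting of pixel and semantic reconstruction terms."""
    return pixel_loss + semantic_weight * semantic_loss
```

Under this assumed form, the semantic term is zero when the decoded features point in the same direction as the foundation model's features and grows as they diverge, so minimizing it alongside the pixel loss pushes the tokens to retain the pretrained semantics.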