cs.CV

Using vision foundation models as better image tokenizers

Anlin Zheng, Qi Han, Xin Wen, Chuofan Ma, Lanxi Gong, Gang Yu, Xiangyu Zhang, Xiaojuan Qi

May 18, 2026

Building image tokenizers on frozen vision foundation models sounds backwards (why not train from scratch?), but it works. The authors add region-adaptive quantization to remove spatial redundancy and a semantic reconstruction loss to preserve what the foundation model has already learned. The resulting tokenizer, VFMTok, works for both discrete (autoregressive) and continuous (diffusion) generation, trains 3× faster, needs no classifier-free guidance for class-conditional synthesis, and reaches state-of-the-art gFID scores (1.36 discrete, 1.25 continuous). The authors also show which self-supervised pretraining objectives matter most for turning a foundation model into a good tokenizer.
Published as "Vision Foundation Models as Generalist Tokenizers for Image Generation", arXiv:2605.18390.
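To make the two ingredients in the summary concrete, here is a minimal sketch of the general idea, not the authors' implementation. It assumes the semantic reconstruction loss is a mean cosine distance between the tokenizer's output features and the frozen foundation model's features, and it uses plain nearest-neighbour vector quantization as a stand-in for the paper's region-adaptive scheme. The function names `semantic_reconstruction_loss` and `quantize`, the shapes, and the random data are all hypothetical.

```python
import numpy as np


def semantic_reconstruction_loss(pred_feats, target_feats, eps=1e-8):
    """Mean (1 - cosine similarity) over per-token feature vectors.

    Assumed loss form for illustration only.
    Both inputs have shape (num_tokens, dim).
    """
    pred = pred_feats / (np.linalg.norm(pred_feats, axis=-1, keepdims=True) + eps)
    target = target_feats / (np.linalg.norm(target_feats, axis=-1, keepdims=True) + eps)
    cos = np.sum(pred * target, axis=-1)
    return float(np.mean(1.0 - cos))


def quantize(feats, codebook):
    """Generic nearest-neighbour vector quantization (not region-adaptive).

    Returns the index of the closest codebook entry for each feature,
    plus the corresponding codebook vectors.
    """
    # Squared Euclidean distances, shape (num_tokens, codebook_size).
    dists = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = dists.argmin(axis=1)
    return ids, codebook[ids]


rng = np.random.default_rng(0)
frozen_feats = rng.normal(size=(16, 32))  # stand-in for frozen foundation-model features
codebook = rng.normal(size=(64, 32))      # stand-in for a learned codebook

token_ids, quantized = quantize(frozen_feats, codebook)
loss = semantic_reconstruction_loss(quantized, frozen_feats)
print(token_ids.shape, loss)
```

In a real tokenizer the codebook and any decoder would be trained so that this loss, together with a pixel reconstruction loss, goes down; here the codebook is random, so the value only shows how the quantity is computed.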