Can models reason visually without showing their work?
Dongchen Lu, Zhimo Li, Mao Shu, Huo Cao
May 30, 2026
Vision-language models that "think with images" typically either call external visual tools, which is slow, or generate implicit visual states autoregressively, which is weak. DeepLatent introduces LatentFormer, which generates latent visual states in parallel from learnable 2D tokens grounded in the original image features, then refines them with reinforcement learning in embedding space. The work also contributes a new 180K-image dataset, and it matches tool-assisted methods on multiple benchmarks without explicit visual operations.
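To make "in parallel" concrete, here is a minimal sketch of one way a 2D grid of query tokens could be grounded in image features: a single cross-attention pass in which every grid token reads from the image features independently of the others. This is an illustration under stated assumptions, not LatentFormer's actual architecture. The function name `parallel_latent_states`, the grid size, the feature dimension, and the use of raw features as both keys and values (no learned projections) are all hypothetical, and the random vectors stand in for learned tokens and real image features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def parallel_latent_states(query_grid, image_feats):
    """One cross-attention pass from a 2D grid of query tokens to image features.

    Assumption: image features serve directly as keys and values. Each output
    depends only on its own query and the image features, never on another
    output, so all grid positions could be computed in one batched operation.
    The Python loops below are sequential only for readability.
    """
    dim = len(image_feats[0])
    scale = 1.0 / math.sqrt(dim)
    latents = []
    for row in query_grid:
        latent_row = []
        for query in row:
            weights = softmax([dot(query, feat) * scale for feat in image_feats])
            state = [
                sum(w * feat[k] for w, feat in zip(weights, image_feats))
                for k in range(dim)
            ]
            latent_row.append(state)
        latents.append(latent_row)
    return latents


# Hypothetical sizes; random values stand in for learned tokens and image features.
rng = random.Random(0)
DIM, GRID_H, GRID_W, NUM_PATCHES = 8, 2, 3, 5
image_feats = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
query_grid = [
    [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(GRID_W)]
    for _ in range(GRID_H)
]

latents = parallel_latent_states(query_grid, image_feats)
print(len(latents), len(latents[0]), len(latents[0][0]))
```

The contrast with autoregressive generation is that no latent state here is conditioned on a previously generated one, so the cost of producing the grid does not grow with a sequential decoding loop. The reinforcement-learning refinement in embedding space is not sketched, since the summary gives no details of it.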