cs.CV

Can models reason visually without showing their work?

Dongchen Lu, Zhimo Li, Mao Shu, Huo Cao

May 30, 2026

Vision-language models that "think with images" typically either call external visual tools, which is slow, or generate implicit visual states autoregressively, which is weak. DeepLatent introduces LatentFormer, which generates latent visual states in parallel from learnable 2D tokens grounded in the original image features, then refines those states with reinforcement learning in embedding space. The paper also contributes a new 180K-image dataset. The authors report that the approach matches tool-assisted methods on multiple benchmarks without explicit operations.
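To make the contrast with autoregressive generation concrete, here is a minimal sketch of producing latent states in parallel. It is not the paper's LatentFormer: it assumes that "grounding" can be modeled as single-head scaled dot-product cross-attention from each query token to the image patch features, and the names `make_query_grid` and `parallel_latent_states`, the grid size, and the random values standing in for learned parameters are all hypothetical. The reinforcement learning refinement is not shown.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def make_query_grid(height, width, dim, seed=0):
    """Hypothetical stand-in for learnable 2D tokens.

    Returns height * width query vectors in row-major order. In a trained
    model these would be learned parameters; here they are random.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(height * width)]


def parallel_latent_states(queries, image_features):
    """Compute one latent state per query token.

    Assumption: each query attends to the image patch features with scaled
    dot-product attention. No query reads another query's output, so the
    iterations of this loop are independent and could run in parallel,
    unlike autoregressive decoding where step t depends on step t - 1.
    """
    dim = len(image_features[0])
    scale = 1.0 / math.sqrt(dim)
    states = []
    for q in queries:
        weights = softmax([dot(q, f) * scale for f in image_features])
        # Each state is a weighted average of the image features.
        state = [sum(w * f[k] for w, f in zip(weights, image_features))
                 for k in range(dim)]
        states.append(state)
    return states


# Toy input: 16 image patches with 8-dimensional features (random placeholders).
rng = random.Random(1)
image_features = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
queries = make_query_grid(height=2, width=3, dim=8)
latent = parallel_latent_states(queries, image_features)
print(len(latent), len(latent[0]))  # prints "6 8"
```

Because each latent state in this sketch is a weighted average of the image features, it stays tied to what is in the image, and because the queries do not depend on one another, the cost of producing them does not grow with a sequential decoding chain.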
Published as "DeepLatent: Think with Images via Parallel Latent Visual Reasoning" (arXiv:2606.00562).