Can models reason visually without showing their work?

Vision-language models that "think with images" typically either use external visual tools (slow) or generate implicit visual states autoregressively (weak). DeepLatent introduces LatentFormer, which generates latent visual states in parallel using learnable 2D tokens grounded in original image features, then refines them via reinforcement learning in embedding space. Includes a new 180K-image dataset. Matches tool-assisted methods on multiple benchmarks without explicit operations.