cs.LG

Why vision models ignore their own visual thinking steps

André G. Viveiros, Nuno Gonçalves, André F. T. Martins, Matthias Lindemann

May 18, 2026

Vision-language models have been designed to perform chain-of-thought reasoning using latent visual tokens as intermediate "imagination" steps. This work shows that these tokens are largely ignored: replacing them with random "dummy" tokens does not hurt accuracy. The analysis identifies two bottlenecks. First, existing datasets do not require latent tokens to solve the tasks, so models learn to bypass them. Second, at inference time the generated latent tokens diverge from the oracle representations and collapse into a narrow region. When fine-tuned on diagnostic datasets where latent tokens causally support predictions, models do learn to use them. The findings point to two requirements for progress: datasets with genuinely informative intermediate steps, and training methods that encourage precise latent token prediction.
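The dummy-token ablation and the collapse observation described above can be illustrated with a small toy sketch. This is not the paper's code: the function names `ablation_gap` and `mean_pairwise_cosine`, the toy predictors, the synthetic data, and the use of mean pairwise cosine similarity as a collapse indicator are all assumptions made for illustration.

```python
import numpy as np

def ablation_gap(predict, inputs, latents, labels, rng):
    """Accuracy with the given latents minus accuracy with random dummy latents.

    A gap near zero suggests the predictor ignores its latent tokens.
    (Hypothetical helper, not taken from the paper.)
    """
    dummies = rng.standard_normal(latents.shape)
    acc_real = np.mean(predict(inputs, latents) == labels)
    acc_dummy = np.mean(predict(inputs, dummies) == labels)
    return float(acc_real - acc_dummy)

def mean_pairwise_cosine(latents):
    """Average cosine similarity over distinct pairs of rows.

    Values near 1 mean the vectors occupy a narrow region
    (one assumed way to quantify collapse).
    """
    unit = latents / np.linalg.norm(latents, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(latents)
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))

rng = np.random.default_rng(0)

# Toy task: the label is visible in the input AND encoded in the latents,
# so a model can solve it without ever reading the latents.
inputs = rng.standard_normal((200, 4))
labels = (inputs[:, 0] > 0).astype(int)
latents = np.where(labels[:, None] == 1, 1.0, -1.0) * np.ones((200, 8))

def ignores_latents(x, z):
    # Bypasses the latent tokens entirely.
    return (x[:, 0] > 0).astype(int)

def uses_latents(x, z):
    # Depends causally on the latent tokens.
    return (z.mean(axis=1) > 0).astype(int)

gap_ignore = ablation_gap(ignores_latents, inputs, latents, labels, rng)
gap_use = ablation_gap(uses_latents, inputs, latents, labels, rng)

# Collapsed latents cluster around one direction; spread latents do not.
collapsed = 1.0 + 0.01 * rng.standard_normal((200, 8))
spread = rng.standard_normal((200, 8))
sim_collapsed = mean_pairwise_cosine(collapsed)
sim_spread = mean_pairwise_cosine(spread)
```

In this toy setup the predictor that bypasses the latents shows no accuracy drop under the ablation, while the predictor that depends on them drops sharply, which mirrors the diagnostic logic of the summary above.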
Published as "What is Holding Back Latent Visual Reasoning?", arXiv:2605.18445