Why vision models ignore their own visual thinking steps

Some vision-language models are designed to perform chain-of-thought reasoning with latent visual tokens that serve as intermediate imagination steps. This work shows that these tokens are largely ignored: replacing them with random "dummy" tokens does not hurt accuracy. The analysis identifies two bottlenecks. First, existing datasets can be solved without the latent tokens, so models learn to bypass them. Second, at inference time the generated latent tokens diverge from the oracle representations and collapse into a narrow region. When models are fine-tuned on diagnostic datasets in which the latent tokens causally support the prediction, they do learn to use them. The findings point to two requirements for progress: datasets whose intermediate steps are genuinely informative, and training methods that encourage precise latent token prediction.
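
The dummy-token test can be illustrated with a small toy sketch. This is not the authors' code or evaluation setup: `ablation_gap`, `bypass_model`, and `latent_model` are hypothetical names, the two predictors are simple stand-ins for a real vision-language model, and the "oracle" latents are synthetic vectors that encode the answer by construction. The sketch only shows the logic of the comparison: measure accuracy with the real latents, measure it again with random ones, and look at the difference.

```python
import numpy as np


def ablation_gap(predict, inputs, latents, labels, rng):
    """Accuracy with the given latents minus accuracy with random dummy latents.

    A gap near zero suggests the predictor does not rely on the latents.
    """
    dummy = rng.standard_normal(latents.shape)  # random stand-in tokens
    acc_real = np.mean(predict(inputs, latents) == labels)
    acc_dummy = np.mean(predict(inputs, dummy) == labels)
    return float(acc_real - acc_dummy)


def bypass_model(inputs, latents):
    # Hypothetical predictor that solves the task from the input alone.
    return inputs[:, 0] > 0


def latent_model(inputs, latents):
    # Hypothetical predictor whose answer depends only on the latent tokens.
    return latents[:, 0] > 0


rng = np.random.default_rng(0)
inputs = rng.standard_normal((1000, 4))
labels = inputs[:, 0] > 0

# Toy "oracle" latents (an assumption of this sketch): the first
# dimension encodes the correct answer, the rest are zero.
oracle = np.zeros((1000, 4))
oracle[:, 0] = np.where(labels, 1.0, -1.0)

gap_bypass = ablation_gap(bypass_model, inputs, oracle, labels, rng)
gap_latent = ablation_gap(latent_model, inputs, oracle, labels, rng)
print(f"bypass model gap: {gap_bypass:.2f}")
print(f"latent model gap: {gap_latent:.2f}")
```

In this toy setup the bypass predictor's gap is exactly zero, because its output never depends on the latents, while the latent-dependent predictor loses roughly half its accuracy when the latents are randomized. A gap near zero on a real model would be consistent with the latent tokens being ignored, though it does not by itself explain why.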