cs.AI

One special token handles both visual reasoning steps and tool calls

Ziyu Guo, Rain Liu, Xinyan Chen, Pheng-Ann Heng

May 14, 2026

Visual reasoning systems typically choose among three options: generating intermediate images (expensive), calling external tools (slow because of context switching), or using learnable latent embeddings (hard to train and poor at generalizing). ATLAS sidesteps this tradeoff by introducing 'functional tokens': ordinary vocabulary tokens tied to internalized visual operations, which need no visual supervision and work with standard next-token prediction. Training uses a modified reinforcement learning algorithm, Latent-Anchored GRPO, which stabilizes learning by weighting functional tokens through an auxiliary objective that compensates for their sparsity. The framework requires no architectural changes and is compatible with standard supervised fine-tuning and RL pipelines, which makes it accessible to practitioners building multimodal reasoning systems.
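To make the sparsity point concrete, here is a minimal Python sketch of the general idea of up-weighting rare tokens inside a group-relative policy-gradient loss. It is not the paper's Latent-Anchored GRPO: the token name `<visual_op>`, the `functional_weight` parameter, and the plain (unclipped) surrogate loss are all assumptions made for illustration. Only the group-normalized advantage follows the standard GRPO recipe.

```python
import math

# Hypothetical functional token; the paper's actual token is not given here.
FUNCTIONAL_TOKEN = "<visual_op>"


def group_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: normalize rewards within a sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def weighted_token_loss(tokens, logprobs, advantage, functional_weight=2.0):
    """Simplified policy-gradient surrogate with extra weight on functional tokens.

    Assumption: sparsity is compensated by a constant per-token weight.
    The paper's auxiliary objective may take a different form.
    """
    weights = [
        functional_weight if tok == FUNCTIONAL_TOKEN else 1.0 for tok in tokens
    ]
    # Each token contributes -advantage * logprob, scaled by its weight.
    total = sum(w * (-advantage * lp) for w, lp in zip(weights, logprobs))
    # Normalize by the total weight so the loss scale stays comparable.
    return total / sum(weights)


if __name__ == "__main__":
    advantages = group_advantages([1.0, 0.0])
    tokens = ["look", FUNCTIONAL_TOKEN, "answer"]
    logprobs = [-0.5, -2.0, -0.7]
    print(weighted_token_loss(tokens, logprobs, advantages[0]))
```

Because functional tokens appear only a few times per sequence, an unweighted average lets ordinary tokens dominate the gradient; the weight is one simple way to rebalance that, under the assumptions stated above.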
Published as "ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both" (arXiv:2605.15198)