One special token handles both visual reasoning steps and tool calls
Ziyu Guo, Rain Liu, Xinyan Chen, Pheng-Ann Heng
May 14, 2026
Visual reasoning systems typically rely on one of three approaches: generating intermediate images (expensive), calling external tools (slow due to context switching), or using learnable latent embeddings (hard to train, with poor generalization). ATLAS sidesteps this tradeoff by introducing 'functional tokens', ordinary vocabulary tokens tied to internalized visual operations that need no visual supervision and work with standard next-token prediction. Training uses a modified reinforcement learning algorithm, Latent-Anchored GRPO, which stabilizes learning by weighting functional tokens with an auxiliary objective to compensate for their sparsity. The framework requires no architectural changes and is compatible with standard supervised fine-tuning and RL pipelines, making it accessible to practitioners building multimodal reasoning systems.
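To make the sparsity issue concrete, here is a minimal sketch of how rare tokens might be upweighted inside a GRPO-style token-level loss. It is not the paper's Latent-Anchored GRPO: the summary does not give the actual auxiliary objective, so a plain per-token weight stands in for it. The token names (`<zoom>`, `<crop>`), the `functional_weight` parameter, and both function names are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical functional tokens; the summary does not name the real ones.
FUNCTIONAL_TOKENS = {"<zoom>", "<crop>"}


def group_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All responses scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def weighted_token_loss(tokens, logprobs, advantage, functional_weight=2.0):
    """Policy-gradient surrogate for one response.

    Functional tokens get a larger weight (an assumed stand-in for the
    paper's auxiliary objective) so that their few positions are not
    drowned out by the many ordinary tokens around them.
    """
    if not tokens:
        return 0.0
    weights = [
        functional_weight if tok in FUNCTIONAL_TOKENS else 1.0
        for tok in tokens
    ]
    total = sum(w * lp * advantage for w, lp in zip(weights, logprobs))
    # Normalize by the weight mass so the loss scale stays comparable.
    return -total / sum(weights)


# Two sampled responses to the same prompt, one rewarded and one not.
advantages = group_advantages([1.0, 0.0])
loss = weighted_token_loss(["look", "<zoom>"], [-1.0, -2.0], advantages[0])
```

With `functional_weight=1.0` the function reduces to an ordinary length-normalized policy-gradient term, which makes the effect of the extra weight easy to compare.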