One special token handles both visual reasoning steps and tool calls

Visual reasoning systems typically pick one of three approaches: generating intermediate images (expensive), calling external tools (slow because of context switching), or using learnable latent embeddings (hard to train and poor at generalizing). ATLAS sidesteps this tradeoff by introducing 'functional tokens': ordinary vocabulary tokens tied to internalized visual operations, which need no visual supervision and work with standard next-token prediction. Training uses a modified reinforcement learning algorithm, Latent-Anchored GRPO, which stabilizes learning by weighting functional tokens through an auxiliary objective that compensates for their sparsity. The framework requires no architectural changes and is compatible with standard supervised fine-tuning and RL pipelines, so practitioners building multimodal reasoning systems can adopt it without modifying their models.
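To make the sparsity problem concrete, here is a minimal sketch of what up-weighting rare tokens in a group-relative policy loss could look like. It is not the Latent-Anchored GRPO objective, whose exact form the summary above does not give. The token ids, the `functional_weight` hyperparameter, and both function names are hypothetical. The sketch also leaves out the importance ratio, clipping, and KL term that a full GRPO implementation would include.

```python
from statistics import mean, pstdev

# Hypothetical ids standing in for functional tokens; the real vocabulary
# entries are not given in the source.
FUNCTIONAL_TOKEN_IDS = {50001, 50002}


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def weighted_policy_loss(token_ids, token_logprobs, advantage,
                         functional_weight=4.0):
    """Weighted mean of per-token policy-gradient terms for one sequence.

    Functional-token positions get `functional_weight` (an illustrative
    value, not from the source); all other positions get weight 1.0.
    """
    if not token_ids or len(token_ids) != len(token_logprobs):
        raise ValueError("token_ids and token_logprobs must be non-empty and aligned")
    weights = [functional_weight if t in FUNCTIONAL_TOKEN_IDS else 1.0
               for t in token_ids]
    total = sum(w * (-advantage * lp) for w, lp in zip(weights, token_logprobs))
    return total / sum(weights)


if __name__ == "__main__":
    # Four sampled responses to one prompt, scored 1.0 (correct) or 0.0.
    advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
    # One response containing a single functional token among ordinary ones.
    loss = weighted_policy_loss([11, 50001, 42], [-0.5, -2.0, -0.3], advantages[0])
    print(advantages, loss)
```

Because functional tokens occupy only a few positions per response, an unweighted mean would let ordinary tokens dominate the gradient. The per-position weight is one simple way to counteract that, under the assumptions stated above.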