cs.CV

Why autonomous cars don't need to think out loud

Weicheng Zheng, Yixin Huang, Qiao Sun, Derun Li, Hang Zhao

May 20, 2026

Driving vision-language-action models (VLAs) typically use natural language reasoning as an intermediate step, but generating and parsing long chains of thought is slow, and training on them requires expensive annotations. DriveMA instead uses concise one-step meta-actions (such as "accelerate" or "prepare_turn") derived automatically from expert driving data. Combined with reinforcement learning that jointly optimizes action correctness and trajectory quality, the approach reaches state-of-the-art results on Waymo End-to-End Driving with a 2B-parameter model. The payoff: simpler instructions that are faster to infer, easier for compact models to learn, and more reliable than reasoning chains, without sacrificing driving performance.
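To make the two ideas concrete, here is a minimal sketch of how a meta-action might be derived from an expert trajectory segment and how a joint reward might combine action correctness with trajectory quality. This summary does not give the paper's actual labeling rules or reward, so everything below is an assumption for illustration: the function names, the thresholds, the extra labels "decelerate" and "keep_speed", the exponential trajectory term, and the weighting are all hypothetical.

```python
import math


def derive_meta_action(speeds, headings, accel_thresh=0.5, turn_thresh=0.15):
    """Label an expert segment with a one-step meta-action (hypothetical rules).

    speeds:   speeds in m/s at consecutive timesteps
    headings: headings in radians at the same timesteps
    Thresholds are illustrative assumptions, not values from the paper.
    """
    speed_change = speeds[-1] - speeds[0]
    heading_change = headings[-1] - headings[0]
    # Wrap the heading change into [-pi, pi] so crossing +/-pi is handled.
    heading_change = math.atan2(math.sin(heading_change), math.cos(heading_change))

    if abs(heading_change) > turn_thresh:
        return "prepare_turn"
    if speed_change > accel_thresh:
        return "accelerate"
    if speed_change < -accel_thresh:
        return "decelerate"  # hypothetical label
    return "keep_speed"      # hypothetical label


def joint_reward(pred_action, expert_action, pred_traj, expert_traj,
                 w_action=0.5, scale=2.0):
    """Combine action correctness and trajectory quality (hypothetical form).

    pred_traj / expert_traj: equal-length lists of (x, y) waypoints in meters.
    """
    action_term = 1.0 if pred_action == expert_action else 0.0
    # Average displacement error between predicted and expert waypoints.
    ade = sum(math.dist(p, e) for p, e in zip(pred_traj, expert_traj)) / len(expert_traj)
    # Map the error into (0, 1]; a perfect trajectory scores 1.
    traj_term = math.exp(-ade / scale)
    return w_action * action_term + (1.0 - w_action) * traj_term
```

For example, a segment whose speed rises from 5.0 to 6.2 m/s with nearly constant heading would be labeled "accelerate" under these assumed thresholds. A prediction with the right meta-action and a trajectory matching the expert's would receive the maximum reward of 1.0.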
Published as "DriveMA: Rethinking Language Interfaces in Driving VLAs with One-Step Meta-Actions" (arXiv:2605.21273)