cs.AI

Why decoding boxes one token at a time is slower than it needs to be

Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao, Guilin Liu, Jan Kautz, Lei Zhang, Zhiding Yu

May 26, 2026

Vision-language models typically generate bounding boxes one token at a time. This sequential decoding is a throughput bottleneck, and because a box's coordinates are predicted in separate steps, it also breaks the geometric structure of the box. LocateAnything instead decodes entire boxes in parallel as atomic units, improving both speed and accuracy. Paired with a 138-million-sample dataset, the model achieves higher throughput and better high-IoU localization across multiple benchmarks.
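To make the two quantities in this summary concrete, here is a minimal Python sketch. It is not the paper's implementation. It shows the standard intersection-over-union (IoU) measure behind "high-IoU localization", and a rough count of decoding steps under two illustrative assumptions: boxes are given in corner format (x1, y1, x2, y2), and token-by-token decoding spends a fixed, hypothetical `tokens_per_box` steps on each box while atomic decoding spends one. The function names are made up for illustration.

```python
from typing import Tuple

# Assumed corner format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def sequential_steps(num_boxes: int, tokens_per_box: int) -> int:
    """Steps when every box token needs its own decoding step.

    tokens_per_box is a hypothetical serialization length, not a
    figure taken from the paper.
    """
    return num_boxes * tokens_per_box


def atomic_steps(num_boxes: int) -> int:
    """Steps when each box is emitted as a single unit (illustrative assumption)."""
    return num_boxes


if __name__ == "__main__":
    print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
    print(sequential_steps(num_boxes=20, tokens_per_box=4))
    print(atomic_steps(num_boxes=20))
```

Under these assumptions the step count shrinks by a factor of `tokens_per_box`; the actual speedup reported in the paper depends on details this sketch does not model.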

Published as "LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding", arXiv:2605.27365
