Why decoding boxes one token at a time is slower than it needs to be
Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao, Guilin Liu, Jan Kautz, Lei Zhang, Zhiding Yu
May 26, 2026
Vision-language models typically generate bounding boxes one token at a time. This sequential decoding is a speed bottleneck, and it also breaks the geometric structure of a box by spreading it across separate decoding steps. LocateAnything instead decodes entire boxes in parallel as atomic units, improving both speed and accuracy. Paired with a 138-million-sample dataset, it achieves higher throughput and better high-IoU localization across multiple benchmarks.
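To make the two claims concrete, here is a minimal sketch in Python. It is illustrative only and is not taken from the paper: the `(x1, y1, x2, y2)` box format, the function names `iou` and `decoding_steps`, and the figure of four coordinate tokens per box are all assumptions. The first function computes intersection over union (IoU), the overlap score behind "high-IoU localization"; the second counts sequential decoding steps under the simplifying assumption that an atomic box costs one step.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def decoding_steps(num_boxes, tokens_per_box=4):
    """Sequential steps for token-by-token vs. box-at-a-time decoding.

    Assumption: each box is four coordinate tokens, and an atomic box
    costs a single step. The real model's cost may differ.
    """
    return num_boxes * tokens_per_box, num_boxes


truth = (0, 0, 10, 10)
loose = (1, 1, 11, 11)   # shifted by one unit in both axes
tight = (0, 0, 10, 9.8)  # nearly exact

# A loose box can clear a 0.5 IoU threshold yet fail a strict 0.9 one.
print(iou(truth, loose) >= 0.5, iou(truth, loose) >= 0.9)
print(iou(truth, tight) >= 0.9)
print(decoding_steps(5))
```

A one-unit shift on a 10-by-10 box already drops IoU to about 0.68, which is why small coordinate errors matter most at strict thresholds.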