cs.RO

How to speed up 3D reconstruction by picking the right image tokens?

Shuhong Zheng, Michael Oechsle, Erik Sandström, Marie-Julie Rakotosaona, Federico Tombari, Igor Gilitschenski

May 22, 2026

Visual geometry transformers reconstruct 3D scenes from multiple images, but their global attention makes them slow down as the token sequence length grows. The authors propose a two-stage token selection scheme: first, identify which image frames matter most using diversity-based scoring; second, prune redundant tokens within those frames based on attention entropy patterns. On 500-image scenes, this delivers an 85% speedup without sacrificing accuracy, and in some cases improves it, making multi-view 3D reconstruction practical for larger datasets.
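To make the two stages concrete, here is a minimal sketch in plain Python. It is not the authors' implementation, and the summary does not specify the exact scoring rules. Greedy farthest-point selection stands in for "diversity-based scoring", and keeping the tokens with the lowest attention entropy is an assumed pruning rule (the paper may use a different criterion or direction). The function names and the `keep_ratio` parameter are hypothetical.

```python
import math


def select_diverse_frames(frame_features, k):
    """Stage 1 (assumed): greedy farthest-point selection over per-frame features."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    n = len(frame_features)
    selected = [0]  # assumption: seed the selection with the first frame
    # Distance from every frame to its nearest already-selected frame.
    min_dist = [dist(f, frame_features[0]) for f in frame_features]
    while len(selected) < min(k, n):
        candidates = [i for i in range(n) if i not in selected]
        # Pick the frame that is farthest from everything chosen so far.
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(d, dist(f, frame_features[nxt]))
                    for d, f in zip(min_dist, frame_features)]
    return sorted(selected)


def attention_entropy(weights):
    """Shannon entropy (in nats) of one token's normalized attention weights."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


def prune_tokens(attn_rows, keep_ratio):
    """Stage 2 (assumed): keep the tokens whose attention is most peaked."""
    entropies = [attention_entropy(row) for row in attn_rows]
    n_keep = max(1, math.ceil(keep_ratio * len(attn_rows)))
    order = sorted(range(len(attn_rows)), key=lambda i: entropies[i])
    return sorted(order[:n_keep])


# Toy data: five frames with 2-D features, four tokens with attention rows.
frames = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
attn = [
    [0.25, 0.25, 0.25, 0.25],
    [0.97, 0.01, 0.01, 0.01],
    [0.40, 0.30, 0.20, 0.10],
    [0.70, 0.10, 0.10, 0.10],
]
kept_frames = select_diverse_frames(frames, k=3)
kept_tokens = prune_tokens(attn, keep_ratio=0.5)
```

In this toy setup the near-duplicate frames (indices 1 and 3) are skipped in favor of frames that are far apart in feature space, and the tokens with near-uniform attention are the ones pruned. In a real pipeline, the frame features and attention rows would come from the transformer itself.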
Published as "Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers", arXiv:2605.23892