How to speed up 3D reconstruction by picking the right image tokens?

Shuhong Zheng, Michael Oechsle, Erik Sandström, Marie-Julie Rakotosaona, Federico Tombari, Igor Gilitschenski

Visual geometry transformers reconstruct 3D scenes from multiple images, but they slow down as the sequence length grows because of global attention. The authors propose a two-stage token selection scheme: first, identify which image frames matter most using diversity-based scoring; second, prune redundant tokens within those frames based on attention entropy patterns. On 500-image scenes, this delivers an 85% speedup without sacrificing accuracy, and in some cases improves it, which makes multi-view 3D reconstruction practical for larger datasets.
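
To make the two stages concrete, here is a minimal sketch in plain Python. It is not the authors' implementation, and the summary above does not give the exact scoring rules, so every detail is an assumption. Stage one is approximated by greedy farthest-point selection over hypothetical per-frame feature vectors using cosine distance. Stage two keeps the tokens whose attention distribution has the lowest entropy, which assumes that diffuse (high-entropy) tokens are the redundant ones; the actual criterion may differ or even point the other way. The names select_diverse_frames, prune_tokens, and keep_ratio are illustrative only.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def select_diverse_frames(frame_features, num_keep):
    """Stage 1 (assumed): greedy farthest-point frame selection.

    Starts from frame 0 and repeatedly adds the frame that is farthest
    from everything selected so far. Returns sorted frame indices.
    """
    n = len(frame_features)
    num_keep = min(num_keep, n)
    if num_keep == 0:
        return []
    selected = [0]
    # Distance from each frame to its nearest already-selected frame.
    min_dist = [cosine_distance(f, frame_features[0]) for f in frame_features]
    while len(selected) < num_keep:
        candidates = [i for i in range(n) if i not in selected]
        best = max(candidates, key=lambda i: min_dist[i])
        selected.append(best)
        for i in range(n):
            d = cosine_distance(frame_features[i], frame_features[best])
            min_dist[i] = min(min_dist[i], d)
    return sorted(selected)


def attention_entropy(weights):
    """Shannon entropy (in nats) of one row of non-negative attention weights."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("attention weights must have a positive sum")
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def prune_tokens(attention_rows, keep_ratio):
    """Stage 2 (assumed): keep the lowest-entropy fraction of tokens.

    attention_rows[i] is the attention distribution of token i.
    Returns the sorted indices of the tokens that are kept.
    """
    num_keep = max(1, math.ceil(keep_ratio * len(attention_rows)))
    entropies = [attention_entropy(row) for row in attention_rows]
    order = sorted(range(len(attention_rows)), key=lambda i: entropies[i])
    return sorted(order[:num_keep])


# Toy inputs: four frames with 2D features, four tokens with attention rows.
frames = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [1.0, 0.01]]
kept_frames = select_diverse_frames(frames, num_keep=2)

attention = [
    [0.25, 0.25, 0.25, 0.25],  # uniform, highest entropy
    [0.97, 0.01, 0.01, 0.01],  # sharply peaked, lowest entropy
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
kept_tokens = prune_tokens(attention, keep_ratio=0.5)
```

In this toy run the frame selection skips the two near-duplicates of frame 0 and picks the one frame that points in a different direction, and the pruning step drops the uniform and near-uniform attention rows. A real pipeline would take the frame features and attention maps from the transformer itself and would run the pruning only on the frames that survive stage one.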