Scaling reconstruction models with efficient architecture and massive data

Jianyuan Wang, Minghao Chen, Shangzhan Zhang, Nikita Karaev, Johannes Schönberger, Patrick Labatut, Piotr Bojanowski, David Novotny, Andrea Vedaldi, Christian Rupprecht

VGGT-Ω improves on feed-forward 3D reconstruction models by demonstrating that reconstruction quality scales predictably with model and data size. The key innovation is an efficient architecture that replaces global attention with register-based information aggregation, reducing training memory to 30% of its predecessor's while enabling training on 15× more labeled data and unlabeled video. Architectural simplifications include a single dense prediction head with multi-task supervision and the removal of expensive high-resolution convolutions. On the Sintel benchmark, the model achieves a 77% improvement in camera estimation accuracy over prior work. The learned representations also transfer to vision-language-action models, suggesting that reconstruction serves as a scalable proxy task for spatial understanding. Code and project details are available.
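
To give a rough sense of why replacing global attention with register-based aggregation can save resources, the sketch below counts pairwise attention scores under two layouts. It is a back-of-the-envelope illustration, not the model's actual design. The two-step scheme (attention within each frame over its patch tokens plus a few registers, followed by attention among the registers of all frames), the function names, and the example sizes are all assumptions made for illustration. The count covers attention scores only and is not meant to reproduce the 30% training-memory figure reported above.

```python
def global_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Pairwise scores when every token attends to every token in every frame."""
    total_tokens = num_frames * tokens_per_frame
    return total_tokens * total_tokens


def register_attention_pairs(
    num_frames: int, tokens_per_frame: int, num_registers: int
) -> int:
    """Pairwise scores for a hypothetical two-step register scheme.

    Assumed step 1: inside each frame, patch tokens and that frame's
    registers attend to one another.
    Assumed step 2: the registers of all frames attend to one another,
    which is the only place information crosses frame boundaries.
    """
    per_frame = (tokens_per_frame + num_registers) ** 2
    cross_frame = (num_frames * num_registers) ** 2
    return num_frames * per_frame + cross_frame


if __name__ == "__main__":
    # Illustrative sizes only; they are not taken from the paper.
    frames, tokens, registers = 100, 1000, 4
    dense = global_attention_pairs(frames, tokens)
    sparse = register_attention_pairs(frames, tokens, registers)
    print(f"global attention pairs:   {dense:,}")
    print(f"register attention pairs: {sparse:,}")
    print(f"reduction factor:         {dense / sparse:.1f}x")
```

Under these assumed sizes, the global layout grows quadratically in the total token count, while the register layout grows linearly in the number of frames plus a small quadratic term in the total number of registers.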