cs.RO

Why better 3D perception doesn't always improve robot navigation

Ziyi Xia, Chaoran Xiong, Litao Wei, Xinhao Hu, Ling Pei

May 14, 2026

Vision-language navigation (VLN) systems pair vision-language models (VLMs) for 3D scene understanding with large language models (LLMs) for reasoning. Current perception models, however, optimize for pixel-level accuracy, while robots need real-time efficiency. This paper quantifies how 3D perception capability affects navigation success on standard benchmarks and proposes success-rate upper bounds for two subsystems: semantic topological mapping and spatial coordinate-based reactive control. Experiments with state-of-the-art perception models reveal a saturation point beyond which perception improvements yield minimal navigation gains. The work suggests shifting focus from pixel-level precision to navigation-critical features such as accurate bounding boxes and task-relevant vocabulary.
Published as "Exploring Bottlenecks in VLM-LLM Navigation: How 3D Scene Understanding Capability Impacts Zero-Shot VLN" (arXiv:2605.14801).