Why better 3D perception doesn't always improve robot navigation

Vision-language navigation systems pair vision-language models (VLMs) for 3D scene understanding with large language models (LLMs) for reasoning. Current perception models, however, optimize for pixel-level accuracy, whereas robots need real-time efficiency. This paper quantifies how 3D perception capability affects navigation success on standard benchmarks and proposes success-rate upper bounds for two subsystems: semantic topological mapping and spatial coordinate-based reactive control. Experiments with state-of-the-art perception models reveal a saturation point beyond which further perception improvements yield minimal navigation gains. The results suggest shifting focus from pixel-level precision to navigation-critical features such as accurate bounding boxes and task-relevant vocabulary.