Teaching robots to navigate by understanding themselves

Wenxuan Guo, Xiuwei Xu, Yichen Liu, Xiangyu Li, Hang Yin, Huangxing Chen, Wenzhao Zheng, Jianjiang Feng, Jie Zhou, Jiwen Lu

Vision-language navigation asks robots to follow instructions while moving through real environments. Most prior methods either rely on opaque end-to-end learning or build explicit 3D maps, which require extra sensors. AwareVLN adds explicit self-awareness, meaning the agent reasons about where it is and how close it is to completing the task, while remaining fully end-to-end. A structural reasoning module learns this spatial and task awareness from data alone. On Habitat benchmarks, AwareVLN significantly outperforms existing vision-language navigation methods.