Teaching drones to follow directions by predicting what comes next

Baining Zhao, Jiacheng Xu, Weicheng Feng, Xin Zhang, Zhaolu Wang, Haoyang Wang, Shilong Ji, Ziyou Wang, Jianjie Fang, Zhiheng Zheng, Weichen Zhang, Yu Shang, Wei Wu, Chen Gao, Xinlei Chen, Yong Li

Aerial vision-language navigation requires drones to follow natural-language instructions in 3D environments through closed-loop perception and action. WorldVLN frames this as prediction-driven control: the agent anticipates how the world will evolve, then acts accordingly. Rather than generating full video sequences, it uses a latent autoregressive backbone to predict short-horizon world-state transitions and decodes them directly into waypoint actions, feeding newly observed information back into the loop after each action. Training proceeds in two stages: the first grounds video priors in navigation dynamics, and the second applies Action-aware GRPO, a novel reinforcement learning method for autoregressive world-action models, to optimize actions according to the consequences of their rollouts. On public outdoor and indoor benchmarks, WorldVLN outperforms existing baselines by more than 12% in success rate, with larger gains on harder cases. Code and demos are released.
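
The closed-loop cycle described above (observe, predict a short-horizon latent transition, decode a waypoint, act, observe again) can be illustrated with a toy Python sketch. This is not WorldVLN's implementation: `ToyEnv`, `predict_transition`, `decode_waypoint`, and `navigate` are hypothetical names, the "latent" is a plain 3-vector, and the observation is the offset to a known goal rather than camera imagery and a language instruction. Only the control flow is meant to mirror the abstract.

```python
from dataclasses import dataclass
from typing import Tuple

Vec = Tuple[float, float, float]


@dataclass
class ToyEnv:
    """Hypothetical stand-in for a simulator: a point drone in free 3D space."""
    position: Vec
    goal: Vec

    def observe(self) -> Vec:
        # Toy observation: offset from drone to goal (a real agent would see images).
        return tuple(g - p for g, p in zip(self.goal, self.position))

    def step(self, waypoint: Vec) -> None:
        # Move by a waypoint expressed relative to the current position.
        self.position = tuple(p + w for p, w in zip(self.position, waypoint))


def predict_transition(latent: Vec, observation: Vec) -> Vec:
    """Placeholder for the latent autoregressive backbone (assumption).

    Blends the previous latent with the newest observation to produce the
    next short-horizon latent state; no video frames are generated.
    """
    return tuple(0.5 * l + 0.5 * o for l, o in zip(latent, observation))


def decode_waypoint(latent: Vec, max_step: float = 1.0) -> Vec:
    """Placeholder action decoder (assumption): clip the latent to a bounded move."""
    return tuple(max(-max_step, min(max_step, x)) for x in latent)


def navigate(env: ToyEnv, max_steps: int = 50, tol: float = 0.1) -> int:
    """Run the predict-then-act loop; return the number of actions taken."""
    latent: Vec = (0.0, 0.0, 0.0)
    for t in range(max_steps):
        obs = env.observe()                       # new information enters the loop
        if max(abs(x) for x in obs) < tol:        # close enough to the goal: stop
            return t
        latent = predict_transition(latent, obs)  # anticipate the next world state
        env.step(decode_waypoint(latent))         # act on the decoded waypoint
    return max_steps


if __name__ == "__main__":
    env = ToyEnv(position=(0.0, 0.0, 0.0), goal=(5.0, -3.0, 2.0))
    steps = navigate(env)
    print(f"stopped after {steps} actions at {env.position}")
```

The point of the sketch is that each action is decoded from a predicted latent state rather than from a rendered future video, and that every new observation is folded back into the latent before the next prediction.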