Teaching self-driving cars to predict geometry and meaning

Jiawei Xu, Zhizhou Zhong, Zhijian Shu, Mingkai Jia, Mingxiao Li, Jia-Wang Bian, Qian Zhang, Kaicheng Zhang, Jin Xie, Jian Yang, Wei Yin

Autonomous driving systems typically require expensive manual annotations for trajectory planning. EponaV2 instead trains a perception-free world model to forecast richer future representations: not just images, but also 3D geometry and semantic maps, mimicking how humans anticipate road scenes. This multi-modal prediction task improves scene understanding and planning accuracy. The approach also adopts a flow-matching variant of group relative policy optimization (GRPO), a technique inspired by large language model training. On three NAVSIM benchmarks, EponaV2 outperforms other perception-free models without requiring trajectory supervision, reaching state-of-the-art performance with significant gains.
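
To make the "group relative" idea concrete, here is a minimal sketch of the advantage normalization commonly used in group relative policy optimization: several candidates are sampled for the same input, each is scored, and every score is standardized against its own group. This is an illustration under stated assumptions, not EponaV2's implementation. The function name, the `eps` parameter, and the example scores are hypothetical, and the summary above does not say how the flow-matching variant computes rewards or applies the advantages.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and std of its own group.

    `rewards` holds one scalar score per sampled candidate (for example,
    one per candidate trajectory for the same scene). Hypothetical helper.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    # eps keeps the division finite when every candidate scores the same.
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up scores for four candidates sampled from the same scene.
scores = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(scores)
```

Candidates that score above the group mean receive positive advantages and are reinforced; those below receive negative ones. Because the baseline comes from the group itself, no separately learned value function is needed.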