Why self-driving models ignore what they see—and how to fix it

Driving models trained to predict vehicle trajectories develop a blind spot: they rely almost entirely on text commands and ego status and barely use visual features, despite being vision-language models. The authors show that this is a structural problem rather than a training problem: viewed as inverse kinematics, trajectory prediction needs both the current and the future visual state to constrain the solution. They redesign the model to predict future visual scenes alongside trajectories and to decouple trajectory decoding from ego-status shortcuts. The result: a 0.5B-parameter model matches much larger 7B-8B baselines on the NAVSIM-v2 and nuScenes benchmarks, with the largest gains in dynamic situations such as turns.
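
To make the inverse-kinematics argument concrete, here is a minimal toy sketch. It is not the authors' model: the unicycle kinematics, the 0.5 s time step, and the names `step` and `inverse` are all assumptions chosen for illustration. It shows that one current state is compatible with different maneuvers, while the pair of current and future states pins the action down.

```python
import math

# Hypothetical toy setup: a unicycle model stands in for the vehicle.
# state = (x, y, heading), action = (speed, yaw_rate).

def step(state, action, dt=0.5):
    """Forward kinematics: apply (speed, yaw_rate) for dt seconds (Euler step)."""
    x, y, heading = state
    speed, yaw_rate = action
    return (
        x + speed * math.cos(heading) * dt,
        y + speed * math.sin(heading) * dt,
        heading + yaw_rate * dt,
    )

def inverse(state, next_state, dt=0.5):
    """Inverse kinematics: recover the action from the current AND future state."""
    x, y, heading = state
    nx, ny, nheading = next_state
    dx, dy = nx - x, ny - y
    # Project the displacement onto the current heading to get the signed speed.
    speed = (dx * math.cos(heading) + dy * math.sin(heading)) / dt
    yaw_rate = (nheading - heading) / dt
    return (speed, yaw_rate)

current = (0.0, 0.0, 0.0)
go_straight = (4.0, 0.0)
turn_left = (4.0, 0.6)

# The same current state leads to different futures under different actions,
# so the current state alone cannot tell the two maneuvers apart.
future_a = step(current, go_straight)
future_b = step(current, turn_left)

# Once the future state is known, the action is determined.
recovered_a = inverse(current, future_a)
recovered_b = inverse(current, future_b)
```

In this toy, `recovered_a` and `recovered_b` reproduce the two actions that generated each future. The analogy to the summary above is loose: a model that cannot see or predict the future scene has to fall back on other cues to choose between the candidates, which is the role text commands and ego status play in the shortcut described.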