Teaching self-driving cars to think ahead and follow instructions

Current end-to-end autonomous driving models hit a wall: they can only imitate human behavior. CoPhy adds two missing pieces. The first is a cognitive layer that understands traffic semantics and intent; it is distilled from a vision-language model, so it adds no inference cost. The second is a forward-looking world model that predicts how the car's actions will unfold. The driving policy is trained with dual rewards: a physical safety reward computed from simulated rollouts, and a cognitive alignment reward derived from language. Results on NAVSIM benchmarks show state-of-the-art performance, along with explicit safety guarantees and the ability to follow natural-language instructions.
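To make the dual-reward idea concrete, here is a minimal sketch of how a physical and a cognitive signal could be combined into one training reward. This is an illustration under stated assumptions, not CoPhy's actual method. The constant-velocity `rollout` stands in for a learned world model. The 0/1 collision check stands in for the physical safety reward. Cosine similarity between a hypothetical intent vector and an instruction embedding stands in for cognitive alignment. All names (`rollout`, `physical_safety_reward`, `cognitive_alignment_reward`, `dual_reward`, `RewardWeights`, `safe_radius`) and the weighted-sum combination are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights: the source does not say how the two rewards are balanced.
    physical: float = 1.0
    cognitive: float = 0.5


def rollout(start, velocities, dt=0.5):
    """Toy stand-in for a learned world model: integrate per-step (vx, vy)
    velocity commands from a start position and return the predicted path."""
    x, y = start
    path = []
    for vx, vy in velocities:
        x += vx * dt
        y += vy * dt
        path.append((x, y))
    return path


def physical_safety_reward(ego_path, obstacle_path, safe_radius=2.0):
    """Hypothetical safety signal: 1.0 if the predicted ego path stays at least
    safe_radius from the predicted obstacle at every step, else 0.0."""
    for (ex, ey), (ox, oy) in zip(ego_path, obstacle_path):
        if math.hypot(ex - ox, ey - oy) < safe_radius:
            return 0.0
    return 1.0


def cognitive_alignment_reward(intent, instruction):
    """Hypothetical alignment signal: cosine similarity between the policy's
    intent vector and an instruction embedding (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(intent, instruction))
    norm = math.sqrt(sum(a * a for a in intent)) * math.sqrt(sum(b * b for b in instruction))
    return dot / norm if norm else 0.0


def dual_reward(ego_path, obstacle_path, intent, instruction, weights=RewardWeights()):
    """Weighted sum of the physical and cognitive rewards (assumed combination)."""
    return (weights.physical * physical_safety_reward(ego_path, obstacle_path)
            + weights.cognitive * cognitive_alignment_reward(intent, instruction))


if __name__ == "__main__":
    ego = rollout((0.0, 0.0), [(2.0, 0.0)] * 4)           # drive straight ahead
    parked = rollout((10.0, 3.0), [(0.0, 0.0)] * 4)       # stationary car, well clear
    blocking = rollout((3.0, 0.5), [(0.0, 0.0)] * 4)      # stationary car in the lane
    go_straight = [1.0, 0.0]                              # toy "intent" and "instruction" vectors
    print(dual_reward(ego, parked, go_straight, go_straight))    # safe and aligned
    print(dual_reward(ego, blocking, go_straight, go_straight))  # aligned but unsafe
```

In this toy setup, an action that follows the instruction but collides keeps only the cognitive term. The physical term therefore acts as a safety check that language alignment alone cannot earn back. A real system would use a learned world model and richer, continuous reward shapes.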