Do AI doctors actually think like doctors? A new benchmark reveals they don't

Jiazhen Pan, Weixiang Shen, Jun Li, Julian Canisius, Felix Bitzer, Paula Roßmüller, Jiancheng Yang, Virginie Kreutzinger, Daniel Rueckert, Benedikt Wiestler

Most medical AI benchmarks hand over all information upfront and score only the final diagnosis, missing how doctors actually work: gathering evidence iteratively, updating their thinking, and stopping when confident enough. DDX-TRACE tests vision-language models on 211 neuroradiology cases where they must request imaging, interpret the results, and build a probabilistic diagnosis over multiple turns. State-of-the-art models often reach the right diagnosis by luck rather than sound reasoning, misinterpret the images they request, or gather evidence inefficiently. The benchmark exposes failures that final-diagnosis-only metrics hide.
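To make the multi-turn setup concrete, here is a minimal toy sketch of an iterative diagnostic loop: request one study at a time, update a probability distribution over candidate diagnoses, and stop once confident enough. It is not the DDX-TRACE implementation. The function names, the placeholder diagnoses (`dx_a`, `dx_b`, `dx_c`), the study names, the likelihood values, the Bayesian update rule, and the 0.8 stopping threshold are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Belief = Dict[str, float]


def normalize(scores: Belief) -> Belief:
    """Rescale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {dx: s / total for dx, s in scores.items()}


def run_case(
    prior: Belief,
    likelihoods: Dict[str, Belief],
    request_order: List[str],
    stop_threshold: float = 0.8,  # hypothetical confidence cutoff
) -> List[Tuple[str, Belief]]:
    """Request studies in order, update the belief after each one,
    and stop as soon as the top diagnosis reaches the threshold.
    Returns the per-turn trajectory, not just the final answer."""
    belief = normalize(prior)
    trajectory: List[Tuple[str, Belief]] = []
    for study in request_order:
        # Toy Bayesian update: posterior is proportional to prior * likelihood.
        lik = likelihoods[study]
        belief = normalize({dx: belief[dx] * lik[dx] for dx in belief})
        trajectory.append((study, dict(belief)))
        if max(belief.values()) >= stop_threshold:
            break
    return trajectory


# Made-up numbers: how strongly each study's finding supports each diagnosis.
likelihoods = {
    "study_1": {"dx_a": 0.6, "dx_b": 0.3, "dx_c": 0.1},
    "study_2": {"dx_a": 0.9, "dx_b": 0.1, "dx_c": 0.1},
    "study_3": {"dx_a": 0.5, "dx_b": 0.5, "dx_c": 0.5},  # uninformative
}
prior = {"dx_a": 1.0, "dx_b": 1.0, "dx_c": 1.0}

trajectory = run_case(prior, likelihoods, ["study_1", "study_2", "study_3"])
for turn, (study, belief) in enumerate(trajectory, start=1):
    top = max(belief, key=belief.get)
    print(f"turn {turn}: requested {study}, top={top} p={belief[top]:.2f}")
```

Keeping the whole trajectory rather than only the last belief is what lets an evaluator ask the questions raised above: whether confidence rose for the right reasons, and whether a requested study was informative or, like the hypothetical `study_3`, wasted.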