Why do LLMs fail at medical triage? Format, not knowledge.

David Fraile Navarro, Berardino Como, Jialei Sheng, Soundariya Ananthan, Shlomo Berkovsky

Patient triage benchmarks show that consumer LLMs under-triage cases when forced to give multiple-choice answers, yet the same models perform better when answering in free text. Using sparse autoencoders to inspect Gemma and Qwen, the researchers found that medical features activate normally in both formats but go silent at the decision token in the multiple-choice format. Three independent methods confirm that the failure lies in output formatting, not clinical understanding; most errors are off-by-one acuity picks rather than knowledge gaps.