Why do LLMs fail at medical triage? Format, not knowledge.
David Fraile Navarro, Berardino Como, Jialei Sheng, Soundariya Ananthan, Shlomo Berkovsky
May 28, 2026
Patient triage benchmarks show that consumer LLMs under-triage cases when forced into multiple-choice answers, yet the same models perform better when answering in free text. Using sparse autoencoders to inspect Gemma and Qwen, the researchers found that medical features activate normally in both formats but go silent at the decision token in the multiple-choice format. Three independent methods confirm that the failure lies in output formatting, not clinical understanding; most errors are off-by-one acuity picks, not knowledge gaps.