cs.CL

Why do LLMs fail at medical triage? Format, not knowledge.

David Fraile Navarro, Berardino Como, Jialei Sheng, Soundariya Ananthan, Shlomo Berkovsky

May 28, 2026

On patient triage benchmarks, consumer LLMs under-triage cases when forced to pick a multiple-choice answer, yet the same models perform better when answering in free text. Using sparse autoencoders to inspect Gemma and Qwen, the authors found that medical features activate normally in both formats but go silent at the decision token in the multiple-choice format. Three independent methods confirm that the failure sits in output formatting, not clinical understanding; most errors are off-by-one acuity picks, not knowledge gaps.
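To make "off-by-one acuity pick" concrete, here is a minimal sketch of how a prediction could be scored against a gold label on an ordinal acuity scale. The four labels, the function names, and the example pairs are all hypothetical illustrations; they are not the paper's answer options, data, or evaluation code.

```python
from collections import Counter

# Hypothetical ordinal acuity scale, lowest to highest urgency.
# These labels are illustrative assumptions, not the paper's answer options.
ACUITY_LEVELS = ["self-care", "routine", "urgent", "emergency"]
RANK = {label: i for i, label in enumerate(ACUITY_LEVELS)}


def classify_error(gold: str, predicted: str) -> str:
    """Label a prediction by direction and size of its acuity error."""
    diff = RANK[predicted] - RANK[gold]
    if diff == 0:
        return "correct"
    # A prediction below the gold level is under-triage, above is over-triage.
    direction = "under" if diff < 0 else "over"
    size = "off-by-one" if abs(diff) == 1 else "larger"
    return f"{direction}-triage ({size})"


def summarize(pairs):
    """Count error types over (gold, predicted) pairs."""
    return Counter(classify_error(gold, pred) for gold, pred in pairs)


# Made-up (gold, predicted) pairs, for illustration only.
example_pairs = [
    ("emergency", "urgent"),
    ("urgent", "urgent"),
    ("urgent", "routine"),
    ("emergency", "self-care"),
    ("routine", "urgent"),
]
summary = summarize(example_pairs)
```

Separating the direction of an error from its size keeps an adjacent-level miss distinct from a miss of several levels, which is the distinction the summary draws between off-by-one picks and knowledge gaps.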
Published as "Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate", arXiv:2605.29889