cs.CL

Why do AI judges give inflated grades to medical AI without clear rubrics?

Sangwon Baek, Kyu Yeon Hur, Kyunga Kim

June 2, 2026

When large language models (LLMs) grade other AI systems on clinical decisions, here diabetes treatment planning, scoring quality depends heavily on whether the raters receive a detailed rubric. Without rubrics, LLM raters bunch scores in a narrow "good" band (74–78 points), which hides flaws. With patient-specific rubrics, scores spread wider and reveal which AI systems actually perform better, with discrimination between systems sometimes amplified 5-fold. This matters because clinical AI evaluation now routinely outsources grading to LLMs.
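To make the idea of "amplified discrimination" concrete, here is a minimal sketch. It assumes discrimination is measured as the range between per-system mean scores, which is only one plausible definition; the paper may use a different metric. The function name `score_spread`, the system names, and all scores are hypothetical and chosen for illustration, not taken from the paper.

```python
from statistics import mean


def score_spread(scores_by_system):
    """Return the range (max - min) of per-system mean scores.

    Assumption: this range stands in for "discrimination"; the paper
    may define discrimination differently.
    """
    means = [mean(scores) for scores in scores_by_system.values()]
    return max(means) - min(means)


# Made-up scores for illustration only; NOT data from the paper.
no_rubric = {"system_a": [76, 77, 78], "system_b": [74, 75, 76]}
with_rubric = {"system_a": [80, 82, 84], "system_b": [70, 72, 74]}

spread_without = score_spread(no_rubric)   # scores bunched together
spread_with = score_spread(with_rubric)    # scores spread apart

# Ratio of the two spreads: how much wider the gap between systems
# becomes once a rubric is supplied.
amplification = spread_with / spread_without
print(f"spread without rubric: {spread_without}")
print(f"spread with rubric:    {spread_with}")
print(f"amplification:         {amplification:.1f}x")
```

With these invented numbers the two systems differ by 2 points without a rubric and by 10 points with one, so the ratio comes out to 5. A real analysis would also need to account for rater variance, not just the gap between means.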
Published as "AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making" (arXiv:2606.03198)