Computation and Language (cs.CL)
Why do AI judges give inflated grades to medical AI without clear rubrics?
Sangwon Baek, Kyu Yeon Hur, Kyunga Kim
June 2, 2026
When large language models (LLMs) grade other AI systems on clinical decisions, in this case diabetes treatment planning, scoring quality depends heavily on whether the raters receive a detailed rubric. Without rubrics, LLM raters bunch scores in a narrow "good" band (74–78 points), which hides flaws. With patient-specific rubrics, scores spread more widely and reveal which AI systems actually perform better, sometimes amplifying the raters' ability to discriminate between systems 5-fold. This matters because clinical AI evaluation now routinely outsources grading to LLMs.
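One way to make the compression effect concrete is to compare how far apart the per-system scores sit under each grading condition. The sketch below is illustrative only: the function names `score_spread` and `amplification` are hypothetical, the scores are made-up numbers chosen to mirror the 74–78 band and the 5-fold figure mentioned above, and the range ratio is an assumed stand-in for whatever discrimination measure the paper actually uses.

```python
from statistics import pstdev


def score_spread(scores):
    """Return (range, population standard deviation) of per-system scores."""
    return max(scores) - min(scores), pstdev(scores)


def amplification(no_rubric, with_rubric):
    """Ratio of score ranges: how much wider the rubric-based scores spread.

    Assumption: range ratio is used here as a simple proxy for
    discrimination; the paper may define it differently.
    """
    base_range, _ = score_spread(no_rubric)
    rubric_range, _ = score_spread(with_rubric)
    if base_range == 0:
        raise ValueError("no-rubric scores are identical; ratio is undefined")
    return rubric_range / base_range


# Hypothetical mean scores for four AI systems (NOT data from the paper).
no_rubric = [74, 75, 76, 78]    # bunched in the narrow "good" band
with_rubric = [55, 62, 68, 75]  # spread out once a rubric is supplied

print("range without rubric:", score_spread(no_rubric)[0])
print("range with rubric:", score_spread(with_rubric)[0])
print("amplification:", amplification(no_rubric, with_rubric))
```

With these invented numbers the rubric-based range is five times the no-rubric range. A real analysis would use the paper's own scores and would likely prefer a statistic that accounts for rater noise rather than a raw range.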