Why do AI judges give inflated grades to medical AI without clear rubrics?

When large language models (LLMs) grade other AI systems on clinical decisions (diabetes treatment planning), scoring quality depends heavily on whether the LLM raters receive a detailed rubric. Without rubrics, the raters bunch scores in a narrow "good" band (74–78 points), which hides flaws. With patient-specific rubrics, scores spread wider and reveal which AI systems actually perform better, in some cases amplifying the raters' ability to discriminate between systems as much as 5-fold. This matters because clinical AI evaluation now routinely outsources grading to LLMs.
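To make the amplification claim concrete, here is a minimal sketch of one plausible way to quantify it: compare the gap between systems' mean scores with and without a rubric. The function name `system_gap`, the system labels, and all score values are hypothetical and invented purely for illustration. They are not data from the study, and the study may define discrimination differently.

```python
from statistics import mean


def system_gap(scores_by_system):
    """Gap between the highest and lowest mean score across systems."""
    means = [mean(scores) for scores in scores_by_system.values()]
    return max(means) - min(means)


# Hypothetical scores, made up for illustration only (not study data).
# Without a rubric, both systems land in the narrow 74-78 band.
no_rubric = {
    "system_a": [76, 77, 78],
    "system_b": [74, 75, 76],
}
# With a patient-specific rubric, the same systems separate more clearly.
with_rubric = {
    "system_a": [80, 85, 90],
    "system_b": [70, 75, 80],
}

gap_without = system_gap(no_rubric)    # difference in means without a rubric
gap_with = system_gap(with_rubric)     # difference in means with a rubric
amplification = gap_with / gap_without

print(f"gap without rubric: {gap_without}")
print(f"gap with rubric:    {gap_with}")
print(f"amplification:      {amplification:.1f}x")
```

With these invented numbers, a 2-point gap between systems grows to a 10-point gap, which is the kind of 5-fold widening described above. A real analysis would also need to account for rater variance and sample size before reading a gap as a genuine performance difference.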