Can AI automatically write grading rubrics for training better language models?

Xiaoyuan Li, Keqin Bao, Moxin Li, Yubo Ma, Yichang Zhang, Wenjie Wang, Fuli Feng, Dayiheng Liu

Training language models on open-ended tasks such as healthcare advice or creative writing requires nuanced evaluation, not just right/wrong answers. ARES automates this by converting raw documents into question-answer pairs and then generating a custom rubric for each question that scores responses across multiple dimensions. The system validates its own generated data (checking that questions are self-contained and that answers are faithful to the source document) and conditions generation on domain and persona to boost diversity. On benchmarks spanning healthcare, instruction-following, and other domains, rubric-based RL trained with ARES outperforms standard fine-tuning and binary-reward methods.
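
To make the contrast with binary rewards concrete, here is a minimal sketch of how a per-question rubric could be collapsed into a scalar reward for RL. It is an illustration under stated assumptions, not the ARES implementation: the summary above does not describe the rubric format, the weighting scheme, or the judge, so `Criterion`, `rubric_reward`, and `keyword_judge` are hypothetical names, and the keyword matcher is only a toy stand-in for an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Criterion:
    """One rubric dimension (hypothetical structure, not from the paper)."""
    name: str
    description: str
    weight: float
    keywords: Tuple[str, ...]  # used only by the toy judge below


def rubric_reward(
    response: str,
    rubric: List[Criterion],
    judge: Callable[[str, Criterion], float],
) -> float:
    """Weighted mean of per-criterion scores, each clipped to [0, 1]."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    weighted = 0.0
    for criterion in rubric:
        score = min(1.0, max(0.0, judge(response, criterion)))
        weighted += criterion.weight * score
    return weighted / total_weight


def keyword_judge(response: str, criterion: Criterion) -> float:
    """Toy stand-in for an LLM judge: fraction of keywords present."""
    if not criterion.keywords:
        return 0.0
    text = response.lower()
    hits = sum(1 for kw in criterion.keywords if kw in text)
    return hits / len(criterion.keywords)


# Assumed example rubric for a healthcare-style question.
rubric = [
    Criterion("safety", "Advises consulting a clinician", 2.0, ("doctor",)),
    Criterion("specificity", "Mentions dose and timing", 1.0, ("dose", "timing")),
]
response = "Ask your doctor about the right dose."

reward = rubric_reward(response, rubric, keyword_judge)
print(round(reward, 3))  # prints 0.833
```

A binary reward would score this response as either 0 or 1. The rubric-based score gives partial credit (full marks on the safety criterion, half on specificity), which is the kind of graded signal that multi-dimensional rubrics are meant to provide.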