Can AI learn what each user actually wants before judging responses?

Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yuxin Chen, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Yoko Yamakata, Tat-Seng Chua

Evaluating whether an LLM's output matches an individual user's preferences remains an open problem: existing judges and metrics ignore long-term interaction patterns. PARL learns personalized scoring rubrics directly from user histories through reinforcement learning, then validates them against the user's own choices. Tested on real text generation tasks, it captures stable stylistic preferences and generalizes across users and domains. The code is released.
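
To make the idea of validating a rubric against a user's own choices concrete, here is a minimal sketch under stated assumptions; it is not PARL's implementation. The names `Criterion`, `rubric_score`, and `agreement_reward` are hypothetical, the rubric is hand-written rather than learned, and the reward is assumed to be the fraction of held-out preference pairs in which the rubric ranks the user's chosen response above the rejected one.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, not taken from the paper)."""
    name: str
    weight: float
    check: Callable[[str], float]  # returns a score in [0, 1]


def rubric_score(rubric: List[Criterion], response: str) -> float:
    """Weighted average of the criterion scores for one response."""
    total = sum(c.weight for c in rubric)
    if total == 0:
        return 0.0
    return sum(c.weight * c.check(response) for c in rubric) / total


def agreement_reward(rubric: List[Criterion],
                     pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (chosen, rejected) pairs where the rubric prefers `chosen`.

    Assumption: a signal of this kind could serve as an RL reward for
    whatever model proposes the rubric.
    """
    if not pairs:
        return 0.0
    hits = sum(
        1 for chosen, rejected in pairs
        if rubric_score(rubric, chosen) > rubric_score(rubric, rejected)
    )
    return hits / len(pairs)


# Toy rubric for a user who seems to like short, calm replies (illustrative).
rubric = [
    Criterion("concise", 2.0, lambda t: 1.0 if len(t.split()) < 12 else 0.0),
    Criterion("no_exclamations", 1.0, lambda t: 0.0 if "!" in t else 1.0),
]

# Held-out choices from the same user: (chosen, rejected). Made-up data.
pairs = [
    ("Meeting moved to 3pm.",
     "Great news!!! The meeting has been moved to 3pm, so please update "
     "your calendar accordingly!"),
    ("Thanks, will do.", "Thanks!"),
    ("Sure! Sounds good.", "Sure, that sounds good to me."),
]

reward = agreement_reward(rubric, pairs)
print(f"agreement with user's choices: {reward:.2f}")
```

In this toy setup the rubric agrees with the user on two of the three pairs. The third pair, where the user chose a reply containing an exclamation mark, shows the kind of mismatch that a validation step against the user's choices would surface.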