cs.CL

How we learned to measure whether AI writing actually works

Ehud Reiter

May 22, 2026

Evaluation of natural language generation (NLG) has moved from informal analysis in 1990 to rigorous experimental methods today, driven by AI's shift from linguistics to machine learning. Recent techniques such as LLM-as-Judge automate quality assessment, but the field faces new demands: measuring real-world impact, understanding failure modes through qualitative analysis, and ensuring safety as millions of people use these systems. The next era will prioritize practical outcomes over benchmark metrics.
Published as "NLG Evaluation: Past, Present, Future", arXiv:2605.23715