How we learned to measure whether AI writing actually works

Natural language generation (NLG) evaluation has moved from informal analysis in 1990 to rigorous experimental methods today, driven by AI's shift from linguistics to machine learning. Recent techniques such as LLM-as-Judge automate quality assessment, but the field faces new demands: measuring real-world impact, understanding failure modes through qualitative analysis, and ensuring safety as millions of people use these systems. The next era will prioritize practical outcomes over benchmark metrics.