How good are AI models at building interactive websites?
Haoyue Yang, Zhangxiao Shen, Fan Ding, Hangting Lou, Yifeng Kou, Haoqing Yu, Jingyao Li, Zhengfan Wu, Siqi Bao, Jing Liu, Hua Wu
May 28, 2026
Building functional websites remains a frontier task for large language models, but evaluating them fairly is hard: human judges don't scale, and automated checkers miss the reasoning involved. Cookie-Bench introduces a 1,000-query benchmark across 11 web domains and a three-stage evaluation framework in which AI agents interact with the generated sites, gathering screenshots, video, and audio, before issuing holistic verdicts. The method aligns with expert human ratings while showing that all 13 tested frontier LLMs have substantial room for improvement.