How good are AI models at building interactive websites?

Haoyue Yang, Zhangxiao Shen, Fan Ding, Hangting Lou, Yifeng Kou, Haoqing Yu, Jingyao Li, Zhengfan Wu, Siqi Bao, Jing Liu, Hua Wu

Building functional websites remains a frontier task for large language models, but evaluating the generated sites fairly is hard: human judges don't scale, and automated checkers miss the reasoning involved. Cookie-Bench pairs a 1,000-query benchmark spanning 11 web domains with a three-stage evaluation framework in which AI agents interact with the generated sites, gathering screenshots, video, and audio, before issuing holistic verdicts. The method aligns with expert human ratings while showing that all 13 tested frontier LLMs have substantial room for improvement.