cs.AI

How good are AI models at building interactive websites?

Haoyue Yang, Zhangxiao Shen, Fan Ding, Hangting Lou, Yifeng Kou, Haoqing Yu, Jingyao Li, Zhengfan Wu, Siqi Bao, Jing Liu, Hua Wu

May 28, 2026

Building functional websites remains a frontier task for large language models, but evaluating their output fairly is hard: human judges don't scale, and automated checkers miss the reasoning involved. Cookie-Bench introduces a 1,000-query benchmark across 11 web domains and a three-stage evaluation framework that watches AI agents interact with generated sites (gathering screenshots, video, and audio) before issuing holistic verdicts. The method aligns with expert human ratings while showing that all 13 tested frontier LLMs have substantial room for improvement.
Published as "Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation", arXiv:2605.30000