Can an AI agent build benchmarks faster than humans?

Shiyun Xiong, Dongming Wu, Peiwen Sun, Yuang Ai, Bokang Yang, Wencheng Han, Xiao-Hui Li, Xiangyu Yue

Building benchmarks for LLMs is tedious, and benchmarks saturate quickly as models improve. Benchmark Agent automates the entire pipeline, from task design through annotation to quality checks, and has generated 15 diverse benchmarks spanning text, multimodal, and domain reasoning. Human evaluation and consistency checks confirm that the system produces high-quality samples with minimal human input, while exposing blind spots in current models on specialized reasoning tasks.