Can an AI agent build benchmarks faster than humans?
Shiyun Xiong, Dongming Wu, Peiwen Sun, Yuang Ai, Bokang Yang, Wencheng Han, Xiao-Hui Li, Xiangyu Yue
June 4, 2026
Building benchmarks for LLMs is tedious, and existing benchmarks saturate quickly as models improve. Benchmark Agent automates the entire pipeline, from task design through annotation to quality checks, and has generated 15 diverse benchmarks spanning text, multimodal, and domain-specific reasoning. Human evaluation and consistency checks confirm that the system produces high-quality samples with minimal human input, while also exposing blind spots in current models on specialized reasoning tasks.