Diagnosing why AI agents fail at scale

Akshay Manglik, Apaar Shanker, Kaustubh Deshpande, Jason Qin, Yash Maurya, Veronica Chatrath, Vijay S. Kalmath, Levi Lentz, Yuan Xue

Debugging LLM agents today means manually inspecting a handful of failed runs. Insights Generator instead treats the entire corpus of execution traces as diagnostic data, automatically proposing and testing hypotheses to uncover systematic behavioral patterns across failures. The system produces natural-language reports with supporting evidence, which experts found more useful than manual analysis. Implementing insights from these reports improved scaffold performance by 30.4 percentage points and yielded consistent gains on coding tasks.