← Back to Machine Learning
cs.LG

Diagnosing why AI agents fail at scale

Akshay Manglik, Apaar Shanker, Kaustubh Deshpande, Jason Qin, Yash Maurya, Veronica Chatrath, Vijay S. Kalmath, Levi Lentz, Yuan, Xue

May 20, 2026

Debugging LLM agents today means manually inspecting a handful of failed runs. Insights Generator instead treats the entire corpus of execution traces as diagnostic data, automatically proposing and testing hypotheses to uncover systematic behavioral patterns across failures. The system produced natural-language reports with supporting evidence that experts found more useful than manual analysis, and implementing insights from these reports improved scaffold performance by 30.4 percentage points and showed consistent gains on coding tasks.
Published as Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents arXiv:2605.21347
Read the original paper →