Diagnosing why AI agents fail at scale
Akshay Manglik, Apaar Shanker, Kaustubh Deshpande, Jason Qin, Yash Maurya, Veronica Chatrath, Vijay S. Kalmath, Levi Lentz, Yuan Xue
May 20, 2026
Debugging LLM agents today means manually inspecting a handful of failed runs. Insights Generator instead treats the entire corpus of execution traces as diagnostic data, automatically proposing and testing hypotheses to uncover systematic behavioral patterns across failures. The system produced natural-language reports with supporting evidence that experts found more useful than manual analysis, and implementing insights from these reports improved scaffold performance by 30.4 percentage points and showed consistent gains on coding tasks.
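To make the idea of testing a hypothesis against a whole corpus of traces concrete, here is a minimal Python sketch. It is not the paper's implementation: the `Trace` record, the `evaluate_hypothesis` helper, the "repeated action" predicate, and the four toy traces are all hypothetical, chosen only for illustration. The sketch compares the failure rate of traces that match a candidate behavioral pattern with the failure rate of those that do not.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trace:
    """Hypothetical trace record: an ordered list of action names plus an outcome."""
    steps: List[str]
    failed: bool


def failure_rate(traces: List[Trace]) -> float:
    """Fraction of traces that failed; 0.0 for an empty list."""
    if not traces:
        return 0.0
    return sum(t.failed for t in traces) / len(traces)


def evaluate_hypothesis(
    traces: List[Trace], predicate: Callable[[Trace], bool]
) -> Dict[str, float]:
    """Compare failure rates for traces that match a pattern and those that do not."""
    matching = [t for t in traces if predicate(t)]
    rest = [t for t in traces if not predicate(t)]
    rate_matching = failure_rate(matching)
    rate_rest = failure_rate(rest)
    return {
        "support": len(matching),          # number of traces showing the pattern
        "rate_matching": rate_matching,    # failure rate when the pattern is present
        "rate_rest": rate_rest,            # failure rate when it is absent
        "gap": rate_matching - rate_rest,  # positive gap: pattern co-occurs with failure
    }


def repeats_action_three_times(trace: Trace) -> bool:
    """Illustrative hypothesis: the agent repeats one action three times in a row."""
    s = trace.steps
    return any(s[i] == s[i + 1] == s[i + 2] for i in range(len(s) - 2))


# Toy corpus, invented for this example only.
corpus = [
    Trace(["search", "edit", "edit", "edit"], failed=True),
    Trace(["search", "edit", "test"], failed=False),
    Trace(["edit", "edit", "edit", "test"], failed=True),
    Trace(["search", "test"], failed=False),
]

result = evaluate_hypothesis(corpus, repeats_action_three_times)
print(result)
```

A large positive `gap` with enough `support` suggests the pattern is worth a closer look. A real analysis would also need a significance test and a check for confounders, since co-occurrence with failure does not establish cause.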