← Back to Artificial Intelligence
cs.AI

How to catch agent failures before they work reliably?

Marisa Ferrara Boston, Glen Hanson, Effi Georgala, JD Hudgens, Heather Frase

June 1, 2026

Early agentic systems fail due to structural integration problems that mask task-level errors, making traditional error detection useless. Researchers developed a three-dimensional monitoring approach (quality/suitability/efficiency across within-run/cross-run/structural scopes) using variance patterns to triage failures by severity. On 220 synthetic runs, within-run monitors caught deterministic stage defects, cross-run monitors exposed stochastic integration issues, and structural monitors detected gaps with perfect consistency—while injected task errors remained invisible. The method routes 97% of findings to automated systems, leaving only 2% for human review.
Published as Monitoring Agentic Systems Before They're Reliable arXiv:2606.02494
Read the original paper →