← Back to Artificial Intelligence cs.AI
How to catch agent failures before they work reliably?
Marisa Ferrara Boston, Glen Hanson, Effi Georgala, JD Hudgens, Heather Frase
June 1, 2026
Early agentic systems fail due to structural integration problems that mask task-level errors, making traditional error detection useless. Researchers developed a three-dimensional monitoring approach (quality/suitability/efficiency across within-run/cross-run/structural scopes) using variance patterns to triage failures by severity. On 220 synthetic runs, within-run monitors caught deterministic stage defects, cross-run monitors exposed stochastic integration issues, and structural monitors detected gaps with perfect consistency—while injected task errors remained invisible. The method routes 97% of findings to automated systems, leaving only 2% for human review.
Read the original paper →