cs.CL

Debugging multi-agent AI workflows through intelligent bottleneck detection

Kazuki Kawamura, Satoshi Waki, Kei Tateno

May 18, 2026

Multi-agent workflows, systems with multiple specialized AI agents, often outperform single-prompt approaches but are hard to debug because intermediate outputs can fail silently. PROTEA addresses this with a test-driven interface that executes a workflow, scores the output of each step against configurable rubrics, and visualizes bottlenecks on the workflow graph. Its key innovation is backward node evaluation: working from the final answer and the workflow structure, it infers what each agent's output should have been and flags nodes whose actual output diverges from that inference. Developers can then edit prompts and immediately see how the changes propagate through the system. Across two production workflows, the reported gains are a 19.6 percentage-point increase in accuracy and a Hit@5 improvement from 0.30 to 0.38. A user study with six experienced developers confirmed the tool's value for localizing failures and for iterative refinement.
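To make the bottleneck-flagging step concrete, here is a minimal sketch of comparing each node's actual output with an expected output and flagging the weakest nodes. It is not PROTEA's implementation: the names `NodeResult`, `token_overlap`, and `find_bottlenecks` are hypothetical, the expected outputs are supplied by hand rather than inferred from the final answer, and a toy token-overlap score stands in for a configurable rubric.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class NodeResult:
    """One workflow node (hypothetical structure, for illustration only)."""
    name: str
    actual: str    # what the agent produced
    expected: str  # what it should have produced; assumed given here


def token_overlap(actual: str, expected: str) -> float:
    """Toy stand-in for a rubric: Jaccard overlap of lowercase tokens, in [0, 1]."""
    a, e = set(actual.lower().split()), set(expected.lower().split())
    if not a and not e:
        return 1.0
    return len(a & e) / len(a | e)


def find_bottlenecks(
    results: List[NodeResult],
    scorer: Callable[[str, str], float] = token_overlap,
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return (node name, score) for nodes scoring below threshold, worst first."""
    scored = [(r.name, scorer(r.actual, r.expected)) for r in results]
    flagged = [(name, score) for name, score in scored if score < threshold]
    return sorted(flagged, key=lambda pair: pair[1])


# Made-up example data: the retriever drifted off topic, the summarizer did not.
results = [
    NodeResult("retriever", "paris travel guide", "eiffel tower opening hours"),
    NodeResult("summarizer", "the tower opens at 9am", "the tower opens at 9am daily"),
]
bottlenecks = find_bottlenecks(results)
print(bottlenecks)  # [('retriever', 0.0)]
```

In this made-up example only the retriever is flagged, since its output shares no tokens with the expected one, while the summarizer scores 5/6 and passes the threshold. In a real setting the scorer would presumably be a task-specific rubric, and the threshold would need tuning per node.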
Published as "PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows", arXiv:2605.18032.