What agent leaderboards don't tell you about LLM capabilities
Parsa Mazaheri, Kasra Mazaheri
May 19, 2026
Today's LLM agents control files, browsers, and code, but benchmarks measure different things (task success, tool validity, safety, robustness), so their results cannot be compared directly. This work proposes a unified framework with taxonomies for agent behavior and failure modes. Its key finding: stripping explicit labels from prompts causes all models to collapse to 54–62% accuracy, suggesting that much of their measured capability comes from supervised cues rather than genuine reasoning.