Mapping the blind spots in LLM attack benchmarks
Karthik Raghu Iyer, Yazdan Jamshidi, Nicholas Bray, Alexey A. Shvets
May 14, 2026
Current LLM attack benchmarks are fragmented and incomplete. This work constructs a 4×6 matrix, grounded in STRIDE threat modeling, from a 507-leaf taxonomy of attacks extracted from 932 recent security papers. An audit of six public benchmarks, including HarmBench, InjecAgent, and AgentDojo, shows that they occupy non-overlapping cells and leave significant coverage gaps. Entire threat categories (Service Disruption, Model Internals) lack standardized evaluation, even though published attacks in these areas achieve 46× token amplification and 96% success rates. The analysis also uncovers naming fragmentation across 2,521 unique attack groups, with a single attack referenced under as many as 29 different names. The taxonomy, attack corpus, and coverage mappings are released as extensible artifacts for tracking future benchmark progress.
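The coverage audit described above can be pictured as a set computation over matrix cells. The following is a minimal sketch only, not the authors' tooling: the benchmark names (`bench_a`, `bench_b`, `bench_c`), their cell assignments, and the helper functions are hypothetical placeholders, and the abstract does not say what the 4 rows and 6 columns represent, so cells are addressed by index.

```python
from itertools import product

# Matrix shape as stated in the abstract; the axis meanings are not given there.
ROWS, COLS = 4, 6

# Hypothetical placeholder mapping: benchmark name -> set of (row, col) cells
# it evaluates. These are NOT the real coverage results from the paper.
coverage = {
    "bench_a": {(0, 0), (0, 1)},
    "bench_b": {(1, 2)},
    "bench_c": {(2, 3), (2, 4)},
}


def uncovered_cells(coverage, rows=ROWS, cols=COLS):
    """Return the sorted list of cells that no benchmark evaluates."""
    all_cells = set(product(range(rows), range(cols)))
    covered = set().union(*coverage.values())
    return sorted(all_cells - covered)


def overlapping_cells(coverage):
    """Return cells evaluated by more than one benchmark, with their names."""
    seen = {}
    for name, cells in coverage.items():
        for cell in cells:
            seen.setdefault(cell, []).append(name)
    return {cell: names for cell, names in seen.items() if len(names) > 1}


gaps = uncovered_cells(coverage)
print(f"{len(gaps)} of {ROWS * COLS} cells have no benchmark")
print("overlapping cells:", overlapping_cells(coverage))
```

With real coverage mappings substituted for the placeholders, an empty result from `overlapping_cells` would correspond to benchmarks occupying non-overlapping cells, and the list from `uncovered_cells` would enumerate the gaps.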