cs.CL

Mapping the blind spots in LLM attack benchmarks

Karthik Raghu Iyer, Yazdan Jamshidi, Nicholas Bray, Alexey A. Shvets

May 14, 2026

Current LLM attack benchmarks are fragmented and incomplete. This work builds a 507-leaf taxonomy of attacks extracted from 932 recent security papers and uses it to construct a 4×6 matrix grounded in STRIDE threat modeling. An audit of six public benchmarks (HarmBench, InjecAgent, AgentDojo, and others) shows that they occupy non-overlapping cells of this matrix and leave significant coverage gaps. Entire threat categories (Service Disruption, Model Internals) lack standardized evaluation, even though published attacks in these areas achieve 46× token amplification and 96% success rates. The analysis also uncovers naming fragmentation across 2,521 unique attack groups, with a single attack appearing under as many as 29 different names. The taxonomy, attack corpus, and coverage mappings are released as extensible artifacts for tracking future benchmark progress.
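The coverage audit described above can be pictured as set arithmetic over matrix cells. The following is a minimal sketch under stated assumptions, not the authors' implementation: the row and column labels, the `audit` helper, and the benchmark names `bench_a` to `bench_c` are all hypothetical placeholders, since this summary gives only the matrix shape (4×6) and not its labels or the actual benchmark-to-cell mappings.

```python
from itertools import combinations, product

# Hypothetical placeholder labels: the summary states the matrix shape (4x6)
# but not the row or column names.
ROWS = [f"row_{i}" for i in range(4)]
COLS = [f"col_{j}" for j in range(6)]
ALL_CELLS = set(product(ROWS, COLS))


def audit(coverage):
    """Audit a mapping of benchmark name -> set of (row, col) cells.

    Returns the cells no benchmark covers and, for each pair of
    benchmarks that share cells, the shared cells.
    """
    covered = set().union(*coverage.values())
    unknown = covered - ALL_CELLS
    if unknown:
        raise ValueError(f"cells outside the matrix: {sorted(unknown)}")
    uncovered = ALL_CELLS - covered
    overlaps = {
        (a, b): coverage[a] & coverage[b]
        for a, b in combinations(sorted(coverage), 2)
        if coverage[a] & coverage[b]
    }
    return uncovered, overlaps


# Made-up example data for illustration only, not the paper's mapping.
example = {
    "bench_a": {("row_0", "col_0"), ("row_0", "col_1")},
    "bench_b": {("row_1", "col_2")},
    "bench_c": {("row_2", "col_3"), ("row_2", "col_4")},
}
uncovered, overlaps = audit(example)
coverage_ratio = 1 - len(uncovered) / len(ALL_CELLS)
```

In this toy input the benchmarks share no cells, so `overlaps` is empty, and most of the matrix is left in `uncovered`; that is the same shape of finding the audit reports, though the real numbers come from the released coverage mappings.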
Published as "Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks", arXiv:2605.15118