Finding how to break each AI model differently, at scale?
Subhadip Mitra
May 30, 2026
Red-teaming LLMs is stuck: manual effort doesn't scale, automated attacks collapse into repetitive patterns, and gradient-based methods produce nonsense token strings. This work uses MAP-Elites to evolve interpretable semantic attack strategies, rather than token gibberish, across a space of behavioral dimensions. Testing four models shows that they break differently: GPT-4o-mini and Gemini both fall to ROT13 combined with framing tricks (0.8 success), while Claude stays ambiguous across all attack types (0.4 max). The code is released.
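To make the MAP-Elites idea concrete, here is a minimal sketch of the archive loop in Python. It is not the paper's implementation. The descriptor dimensions (`ENCODINGS`, `FRAMINGS`), the `intensity` gene, and the `toy_score` function are all hypothetical stand-ins chosen for illustration; a real setup would replace `toy_score` with a judge that scores an actual model response.

```python
import codecs
import random

# Hypothetical behavioral dimensions; the paper's actual dimensions may differ.
ENCODINGS = ["plain", "rot13"]
FRAMINGS = ["direct", "roleplay", "hypothetical"]
MAX_INTENSITY = 4  # extra gene that is NOT part of the behavior descriptor


def render(strategy, request):
    """Turn a strategy into a prompt string (illustrative template only)."""
    encoding, framing, _intensity = strategy
    text = codecs.encode(request, "rot13") if encoding == "rot13" else request
    return f"[{framing}] {text}"


def descriptor(strategy):
    """Map a strategy to its archive cell: (encoding, framing)."""
    return (strategy[0], strategy[1])


def toy_score(strategy):
    """Placeholder fitness in [0, 1]; stands in for a real success judge."""
    encoding, framing, intensity = strategy
    points = 4 * (encoding == "rot13") + 2 * (framing != "direct") + intensity
    return points / 10


def random_strategy(rng):
    return (rng.choice(ENCODINGS), rng.choice(FRAMINGS),
            rng.randint(0, MAX_INTENSITY))


def mutate(strategy, rng):
    """Resample exactly one gene of the parent strategy."""
    genes = list(strategy)
    i = rng.randrange(3)
    if i == 0:
        genes[0] = rng.choice(ENCODINGS)
    elif i == 1:
        genes[1] = rng.choice(FRAMINGS)
    else:
        genes[2] = rng.randint(0, MAX_INTENSITY)
    return tuple(genes)


def map_elites(score_fn, iterations=2000, seed=0):
    """Keep the best-scoring strategy (the elite) in each descriptor cell."""
    rng = random.Random(seed)
    archive = {}  # cell -> (score, strategy)
    for _ in range(iterations):
        if archive and rng.random() < 0.8:
            # Exploit: mutate a randomly chosen elite.
            _, parent = rng.choice(list(archive.values()))
            child = mutate(parent, rng)
        else:
            # Explore: sample a fresh strategy.
            child = random_strategy(rng)
        cell, score = descriptor(child), score_fn(child)
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
    return archive


if __name__ == "__main__":
    for cell, (score, strategy) in sorted(map_elites(toy_score).items()):
        print(cell, score, strategy)
```

The design point is that the archive keeps one elite per cell instead of a single global best, so lower-scoring but behaviorally distinct strategies survive. That is what lets a search of this kind report a per-model map of which strategy types work, rather than collapsing onto one repetitive pattern.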