cs.LG

Finding how to break each AI model differently, at scale?

Subhadip Mitra

May 30, 2026

Red-teaming of LLMs has stalled: manual effort does not scale, automated attacks collapse into repetitive patterns, and gradient-based methods produce unreadable token strings. This work uses MAP-Elites to evolve interpretable semantic attack strategies, rather than token gibberish, across a space of behavioral dimensions. Tests on four models show that they break in different ways: GPT-4o-mini and Gemini both fall to ROT13 encoding combined with framing tricks (0.8 success), while Claude stays ambiguous across all attack types (0.4 max). The code is released.
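To make the MAP-Elites idea concrete, here is a minimal sketch of a quality-diversity loop over attack strategies. It is not the paper's implementation. The behavioral dimensions (`ENCODINGS`, `FRAMINGS`), the `intensity` parameter, the prompt prefixes, and `toy_fitness` are all hypothetical stand-ins chosen for illustration; a real run would send each rendered prompt to a model and score the reply with a judge.

```python
import codecs
import random

# Hypothetical behavioral dimensions (assumption; the paper's axes are not listed here).
ENCODINGS = ["plain", "rot13"]
FRAMINGS = ["direct", "roleplay", "hypothetical"]
PREFIXES = {"direct": "", "roleplay": "You are an actor. ", "hypothetical": "Hypothetically, "}


def render(strategy, request):
    """Turn an (encoding, framing, intensity) strategy into a prompt string."""
    encoding, framing, _intensity = strategy
    body = codecs.encode(request, "rot_13") if encoding == "rot13" else request
    return PREFIXES[framing] + body


def descriptor(strategy):
    """Map a strategy to its archive cell: the (encoding, framing) pair."""
    return (strategy[0], strategy[1])


def toy_fitness(strategy):
    """Stand-in scorer (assumption): replaces querying a model and judging its reply."""
    encoding, framing, intensity = strategy
    return 0.5 * (encoding == "rot13") + 0.3 * (framing != "direct") + 0.2 * intensity


def random_strategy(rng):
    return (rng.choice(ENCODINGS), rng.choice(FRAMINGS), rng.random())


def mutate(strategy, rng):
    """Change one component of the strategy at random."""
    encoding, framing, intensity = strategy
    roll = rng.random()
    if roll < 1 / 3:
        encoding = rng.choice(ENCODINGS)
    elif roll < 2 / 3:
        framing = rng.choice(FRAMINGS)
    else:
        # Perturb intensity and clamp it to [0, 1].
        intensity = min(1.0, max(0.0, intensity + rng.gauss(0, 0.1)))
    return (encoding, framing, intensity)


def map_elites(iterations=500, seed=0):
    """Keep the best-scoring strategy (the elite) found so far in each cell."""
    rng = random.Random(seed)
    archive = {}  # cell -> (fitness, strategy)
    for _ in range(iterations):
        if archive:
            parent = rng.choice(list(archive.values()))[1]
            child = mutate(parent, rng)
        else:
            child = random_strategy(rng)
        cell, score = descriptor(child), toy_fitness(child)
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
    return archive


if __name__ == "__main__":
    for cell, (score, strategy) in sorted(map_elites().items()):
        print(cell, round(score, 2), repr(render(strategy, "example request")))
```

The archive holds one elite per cell instead of a single global best, which is what keeps the search from collapsing into one repetitive attack pattern: a weaker strategy survives as long as it is the best in its own cell.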
Published as "Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety" (arXiv:2606.00801).