cs.LG

Why newer AI models sometimes get less safe before improving again

Subhadip Mitra

May 30, 2026

Safety in Google's Gemma models dips between generations rather than climbing steadily. Researchers ran automated red-teaming against four versions (7B–31B) and found Gemma 3 dramatically more vulnerable to attacks than Gemma 2 or Gemma 4, particularly on misinformation (99% vs 29%) and copyright issues. Attacks evolved against one generation transfer less effectively to Gemma 4, which suggests its safety improvements are more robust. Static safety benchmarks miss these patterns entirely.
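One way to picture a cross-generational comparison like this is as a table of attack success rates, indexed by the model an attack was evolved against and the model it was then tried on. The sketch below is illustrative only and is not the paper's code: the function names, the record format, the `gen_a`/`gen_b`/`gen_c` labels, and all numbers are hypothetical placeholders rather than reported results.

```python
from collections import defaultdict


def attack_success_rate(outcomes):
    """Fraction of attack attempts judged successful (True)."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def transfer_matrix(records):
    """Build {source_model: {target_model: success_rate}}.

    Each record is (source_model, target_model, succeeded), where
    source_model is the model the attack was evolved against and
    target_model is the model it was replayed on (assumed format).
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for source, target, succeeded in records:
        buckets[source][target].append(bool(succeeded))
    return {
        source: {t: attack_success_rate(o) for t, o in targets.items()}
        for source, targets in buckets.items()
    }


def is_monotonic_non_increasing(values):
    """True if each value is no larger than the one before it."""
    return all(a >= b for a, b in zip(values, values[1:]))


# Made-up toy records, NOT data from the paper.
records = [
    ("gen_a", "gen_a", True), ("gen_a", "gen_a", True),
    ("gen_a", "gen_b", True), ("gen_a", "gen_b", False),
    ("gen_a", "gen_c", False), ("gen_a", "gen_c", False),
]
matrix = transfer_matrix(records)

# Hypothetical per-generation vulnerability, oldest to newest: a rise
# followed by a fall is what "non-monotonic" means here.
toy_vulnerability = [0.25, 0.75, 0.5]
steadily_safer = is_monotonic_non_increasing(toy_vulnerability)
```

In this toy setup, a diagonal entry measures how well attacks work on the model they were evolved against, and off-diagonal entries measure transfer. A static benchmark would correspond to a single fixed set of prompts, which is why it could miss a pattern that only shows up when attacks are adapted per generation.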
Published as "Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs", arXiv:2606.00813.