Why newer AI models sometimes get less safe before improving again
Subhadip Mitra
May 30, 2026
Google's Gemma models show a surprising dip in safety between generations rather than a steady climb. Using automated red-teaming across four versions (7B–31B parameters), researchers found Gemma 3 dramatically more vulnerable to attacks than Gemma 2 or Gemma 4, particularly on misinformation (99% vs 29%) and copyright issues. Attacks evolved against one generation transfer less effectively to Gemma 4, suggesting that its safety improvements are more robust. Static safety benchmarks miss these patterns entirely.