Why newer AI models sometimes get less safe before improving again

Google's Gemma models do not get steadily safer from one generation to the next; safety dips before it recovers. Researchers ran automated red-teaming against four versions (7B–31B) and found Gemma 3 dramatically more vulnerable to attacks than Gemma 2 or Gemma 4, particularly on misinformation (99% vs 29%) and copyright issues. Attacks evolved against one generation also transfer less effectively to Gemma 4, suggesting that its safety improvements are more robust. Static safety benchmarks miss these patterns entirely.
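
To make "transfer" concrete, here is a minimal sketch of how a cross-generation comparison of this kind could be scored. It is not the researchers' pipeline: the function names, the record layout, and the placeholder model labels (model_a, model_b) are all assumptions for illustration, and the toy records are made up rather than taken from the study.

```python
from collections import defaultdict


def attack_success_rate(outcomes):
    """Return the fraction of attack attempts judged successful."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(outcomes) / len(outcomes)


def transfer_matrix(records):
    """Group attempts by (source model, target model) and score each group.

    records: iterable of (source, target, succeeded) tuples, where `source`
    is the model the attack was evolved against and `target` is the model
    it was replayed on. This layout is hypothetical.
    """
    buckets = defaultdict(list)
    for source, target, succeeded in records:
        buckets[(source, target)].append(bool(succeeded))
    return {pair: attack_success_rate(hits) for pair, hits in buckets.items()}


# Made-up toy records, NOT results from the study.
toy_records = [
    ("model_a", "model_a", True),
    ("model_a", "model_a", True),
    ("model_a", "model_b", True),
    ("model_a", "model_b", False),
]

rates = transfer_matrix(toy_records)
for (source, target), rate in sorted(rates.items()):
    print(f"evolved on {source}, replayed on {target}: {rate:.0%}")
```

Under this framing, an attack "transfers" poorly when the rate for a (source, target) pair with different models is well below the rate where source and target are the same. A static benchmark reports only one fixed set of prompts per model, so it has no equivalent of the off-diagonal entries.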