Making diffusion models safer without paired training data

Komal Kumar, Ankan Deria, Abhishek Basu, Fahad Shamshad, Hisham Cholakkal, Karthik Nandakumar

Existing methods for removing unsafe content from diffusion models require costly paired datasets that match unsafe text with safe images. SafeDiffusion-R1 uses Group Relative Policy Optimization (GRPO) with a steering reward mechanism that exploits CLIP embedding geometry to push text representations toward safety without fine-tuning specialized reward models. The online approach learns from diverse prompts, including explicit unsafe content, and so avoids the catastrophic forgetting common in offline methods. On Stable Diffusion v1.4, it cuts the rate of inappropriate content from 48.9% to 18.07%, reduces nudity detections from 646 to 15, and improves compositional generation quality (GenEval: 42.08% → 47.83%), with gains generalizing across seven harm categories. Code is released.
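
To make the training signal concrete, here is a minimal sketch of how a group-relative advantage can be computed from an embedding-based reward. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not give the form of the steering reward, so `steering_reward` (a difference of cosine similarities to hypothetical `safe_dir` and `unsafe_dir` vectors) is an assumption, and the 2-D vectors stand in for real CLIP embeddings. The advantage step follows the usual GRPO recipe of normalizing each reward by the mean and standard deviation of its sampling group; population standard deviation is used here for simplicity.

```python
import math
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def steering_reward(embedding, safe_dir, unsafe_dir):
    """Hypothetical reward (an assumption, not taken from the paper):
    higher when the embedding aligns with the safe direction and
    lower when it aligns with the unsafe direction."""
    return cosine(embedding, safe_dir) - cosine(embedding, unsafe_dir)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward against the
    mean and (population) standard deviation of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy stand-ins for CLIP embeddings of one group of sampled generations.
safe_dir = [1.0, 0.0]
unsafe_dir = [0.0, 1.0]
group = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]

rewards = [steering_reward(e, safe_dir, unsafe_dir) for e in group]
advantages = group_relative_advantages(rewards)

for emb, r, adv in zip(group, rewards, advantages):
    print(f"embedding={emb} reward={r:+.3f} advantage={adv:+.3f}")
```

Because advantages are computed relative to the group, no separate learned value function or reward model is needed in this sketch: samples closer to the safe direction than their group's average receive a positive advantage, and the rest receive a negative one.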