Can LLMs be safer by learning from unsafe data?

Maryam Hashemzadeh, Jerry Huang, Minseon Kim, Marc-Alexandre Côté, Sarath Chandar

Standard LLM safety training filters out harmful data, which leaves models prone to answering benign safety-related questions with blanket refusals. SafeMoE reverses this approach: it trains domain-specific experts on toxic corpora, then uses a lightweight routing network to dynamically control when and how these unsafe experts contribute. The result is a 20% relative improvement on safety benchmarks and more informative responses on sensitive topics, with generalization to unseen domains.
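
To make the routing idea concrete, here is a minimal sketch of a gated mixture in which a small linear router decides how much each expert contributes on top of a base model. It is an illustration under stated assumptions, not the SafeMoE implementation: the names `route` and `mix`, the softmax gate, the additive combination, and the toy vectors are all hypothetical and do not come from the paper.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Model = Callable[[Vector], Vector]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(hidden: Vector, router_weights: Sequence[Vector]) -> List[float]:
    """Hypothetical lightweight router: one linear score per expert, then softmax."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    return softmax(scores)


def mix(
    hidden: Vector,
    base: Model,
    experts: Sequence[Model],
    router_weights: Sequence[Vector],
) -> Tuple[Vector, List[float]]:
    """Add gate-weighted expert outputs to the base output (assumed combination rule)."""
    gates = route(hidden, router_weights)
    out = list(base(hidden))
    for gate, expert in zip(gates, experts):
        for i, value in enumerate(expert(hidden)):
            out[i] += gate * value
    return out, gates


# Toy usage: two stand-in "experts" and an untrained (all-zero) router.
hidden_state = [1.0, 0.0]
toy_experts = [lambda h: [2.0, 2.0], lambda h: [4.0, 4.0]]
toy_router = [[0.0, 0.0], [0.0, 0.0]]
mixed, gates = mix(hidden_state, lambda h: list(h), toy_experts, toy_router)
```

With an all-zero router the gates are uniform, so each toy expert contributes equally; a trained router would instead shift weight toward or away from an expert depending on the input.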