cs.LG

Can LLMs be safer by learning from unsafe data?

Maryam Hashemzadeh, Jerry Huang, Minseon Kim, Marc-Alexandre Côté, Sarath Chandar

May 30, 2026

Standard LLM safety training filters out harmful data, which leaves models prone to answering benign safety questions with blanket refusals. SafeMoE reverses this: it trains domain-specific experts on toxic corpora, then uses a lightweight routing network to control dynamically when and how these unsafe experts contribute. The result is a 20% relative improvement on safety benchmarks, more informative responses on sensitive topics, and generalization to unseen domains.
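The routing idea can be illustrated with a minimal sketch. This is not the paper's implementation: the summary does not specify how the router or the experts are built, so the logistic gate, the convex blend, and every name below (`gate_value`, `mix`, the feature and weight vectors) are assumptions chosen only to show how a small gate could scale an unsafe expert's contribution.

```python
import math


def gate_value(features, weights, bias):
    """Hypothetical lightweight router: a logistic gate in (0, 1).

    `features` stands in for some representation of the prompt;
    `weights` and `bias` would be learned. All are illustrative.
    """
    score = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def mix(base_out, expert_out, gate):
    """Blend the base model's output with the unsafe expert's output.

    gate = 0 ignores the expert entirely; gate = 1 uses only the expert.
    A convex combination is assumed here for simplicity.
    """
    return [(1.0 - gate) * b + gate * e for b, e in zip(base_out, expert_out)]


# Toy vectors standing in for hidden states or logits.
base = [1.0, 2.0]
expert = [3.0, 6.0]

g = gate_value(features=[0.0, 0.0], weights=[0.5, -0.5], bias=0.0)
blended = mix(base, expert, g)
```

With a gate near zero the output stays close to the base model, so the expert's knowledge of unsafe content influences the response only when the router decides it is useful.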
Published as "Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing", arXiv:2606.00686.