cs.LG

A unified approach to safe LLM unlearning via energy-based deletion

Puning Yang, Junchi Yu, Qizhou Wang, Philip Torr, Bo Han, Xiuying Chen

May 16, 2026

Existing LLM safety approaches either delete knowledge, which risks incomplete erasure and token-sequence bias, or refuse harmful outputs, which leaves the underlying knowledge intact. This work introduces Distinguishable Deletion, which constrains response distributions in latent space rather than targeting specific tokens, so that one method supports both complete knowledge removal and principled refusal. An energy index quantifies how much of the targeted knowledge is present and how well unlearned content is separated from retained content. The method is implemented as Energy-based Unlearning Alignment (EUA), applied during both training and inference. The authors report significant improvements over prior unlearning methods in their experiments, and the code is released.
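To make the idea of an energy index more concrete, here is a minimal sketch. It does not reproduce the paper's definition, which this summary does not state. As a stand-in it assumes the free-energy score commonly used with energy-based models, E = -T * logsumexp(logits / T), and a simple mean-difference gap between two sets of scores. The function names `energy_score` and `separation_gap`, the `temperature` parameter, and the toy logits are all hypothetical.

```python
import math


def energy_score(logits, temperature=1.0):
    """Free-energy score E = -T * logsumexp(logits / T) for one logit vector.

    Assumption: this is the generic energy score from energy-based models,
    not necessarily the energy index defined in the paper.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse


def separation_gap(forget_energies, retain_energies):
    """Mean energy on forget-set responses minus mean energy on retain-set responses."""
    mean_forget = sum(forget_energies) / len(forget_energies)
    mean_retain = sum(retain_energies) / len(retain_energies)
    return mean_forget - mean_retain


# Toy logits (made up for illustration): a peaked prediction and a flat one.
confident = [8.0, 0.5, 0.1, -1.0]
diffuse = [0.2, 0.1, 0.0, -0.1]

# Under this score, the peaked prediction has the lower (more negative) energy.
gap = separation_gap([energy_score(diffuse)], [energy_score(confident)])
print(f"confident={energy_score(confident):.3f} diffuse={energy_score(diffuse):.3f} gap={gap:.3f}")
```

Under this assumed score, a larger gap means the two sets of responses are easier to tell apart by energy alone, which is the kind of separation between unlearned and retained content the summary describes.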
Published as "Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning", arXiv:2605.16776.