A unified approach to safe LLM unlearning via energy-based deletion

Existing LLM safety approaches either delete knowledge, which risks incomplete erasure and bias toward specific token sequences, or refuse harmful outputs, which leaves the underlying knowledge intact. This work introduces Distinguishable Deletion, which constrains response distributions in latent space rather than targeting specific tokens, enabling both complete knowledge removal and principled refusal. The method uses an energy index to quantify both the presence of knowledge and the separation between unlearned and retained content, and is implemented via Energy-based Unlearning Alignment (EUA), applied during training and inference. Experiments show significant improvements over prior unlearning methods. Code is released.
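
To make the idea of an energy index concrete, the following is a minimal sketch and not the released implementation. The abstract does not define the index, so the sketch assumes the free-energy score common in energy-based models, E = -T * logsumexp(logits / T), computed over a response's output logits. The function names `energy_index`, `separation_gap`, and `should_refuse`, the temperature parameter, the threshold, and the convention that higher energy indicates absent knowledge are all illustrative assumptions.

```python
import math


def energy_index(logits, temperature=1.0):
    """Assumed free-energy score: -T * logsumexp(logits / T).

    Peaked (confident) logits give a low energy; flat logits give a
    higher energy. This definition is an assumption, not the paper's.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse


def separation_gap(forget_logits, retain_logits, temperature=1.0):
    """Mean energy on forget-set responses minus mean energy on retain-set ones.

    A larger positive gap means the two sets are easier to tell apart
    under the assumed convention (forgotten content has higher energy).
    """
    e_forget = [energy_index(l, temperature) for l in forget_logits]
    e_retain = [energy_index(l, temperature) for l in retain_logits]
    return sum(e_forget) / len(e_forget) - sum(e_retain) / len(e_retain)


def should_refuse(logits, threshold, temperature=1.0):
    """Hypothetical inference-time rule: refuse when energy exceeds a threshold."""
    return energy_index(logits, temperature) > threshold


if __name__ == "__main__":
    flat = [0.0, 0.0, 0.0]     # stands in for an unlearned (uncertain) response
    peaked = [10.0, 0.0, 0.0]  # stands in for a retained (confident) response
    print(energy_index(flat), energy_index(peaked))
    print(separation_gap([flat], [peaked]))
    print(should_refuse(flat, threshold=-5.0), should_refuse(peaked, threshold=-5.0))
```

Under these assumptions, a training objective would push forget-set energies up and retain-set energies down, and the same score could gate refusal at inference. How EUA actually combines the two stages is not specified in the abstract.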