An unlearning method that survives model quantization
Saisab Sadhu, Pratinav Seth, Vinay Kumar Sankarapu
May 14, 2026
Language models are routinely quantized after training, yet unlearning benchmarks test only full-precision models. The paper shows that 4-bit post-training quantization systematically reverses forgetting across all existing methods: their gradient updates are 47–828× smaller than the NF4 quantization bin width, so the edits are spread too thinly across parameters to survive compression. MANSU addresses this in three steps. It uses causal circuit attribution to find the minimal subgraph encoding the target knowledge, applies null-space projection within that subgraph under a Fisher-based constraint that protects retained knowledge, and enforces a per-parameter magnitude floor that guarantees updates cross quantization bin boundaries. The paper also introduces Circuit Attribution Divergence (CAD), a metric that distinguishes structural erasure from mere behavioral suppression, a distinction no prior metric captures. According to the paper, MANSU is the first method to simultaneously achieve meaningful forgetting, preservation of retained knowledge, a non-positive post-quantization accuracy gap, and structural erasure across multiple model families and hazard benchmarks.
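The bin-width argument can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a uniform rounding grid as a stand-in for NF4 (whose real levels are non-uniform), and the names `quantize`, `apply_floor`, and `STEP` are hypothetical. It only shows why an update far below the bin width disappears after quantization, while an update raised to a magnitude floor does not.

```python
def quantize(w, step):
    """Round w to the nearest multiple of step (uniform stand-in for NF4)."""
    return round(w / step) * step


def apply_floor(delta, floor):
    """Raise |delta| to at least `floor`, keeping its sign; zero stays zero."""
    if delta == 0.0:
        return 0.0
    sign = 1.0 if delta > 0 else -1.0
    return sign * max(abs(delta), floor)


STEP = 0.125          # hypothetical bin width, chosen for illustration
w = 0.25              # a weight sitting exactly on a grid level
tiny = -STEP / 100    # an unlearning update 100x smaller than the bin width

# Without a floor, the updated weight rounds back to the original level,
# so the edit is erased by quantization.
survives_plain = quantize(w + tiny, STEP) != quantize(w, STEP)

# With the update raised to one full bin width, the quantized value moves.
floored = apply_floor(tiny, STEP)
survives_floored = quantize(w + floored, STEP) != quantize(w, STEP)

print(survives_plain, survives_floored)
```

Setting the floor to one full bin width is an assumption made here for simplicity; the summary does not say how MANSU chooses its per-parameter floor.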