cs.CL

Can removing one hidden attack from an LLM wipe out others unintentionally planted?

Lisa Bouger, Théo Lasnier, Philippe Loubet Moundi, Yannick Teglia, Djamé Seddah

June 2, 2026

Backdoor attacks embed hidden triggers in LLMs that force specific outputs, a serious security risk when defenders do not know which triggers exist. This work shows that training a model to forget one trigger also suppresses other, unknown backdoors: a cross-backdoor transfer effect that could be exploited defensively. The authors quantify this effect with a new metric, Cross Activation Shift Distance, and propose that defenders could deliberately inject and then remove benign backdoors to preemptively suppress adversarial ones.
Published as "Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs", arXiv:2606.03785