Can removing one hidden attack from an LLM wipe out others unintentionally planted?

Lisa Bouger, Théo Lasnier, Philippe Loubet Moundi, Yannick Teglia, Djamé Seddah

Backdoor attacks embed hidden triggers in LLMs that force specific outputs when activated, a serious security flaw when defenders do not know which triggers exist. This work shows that training a model to forget one known trigger also suppresses other, unknown backdoors: a cross-backdoor transfer effect that could be exploited defensively. The authors quantify the effect with a new metric, Cross Activation Shift Distance, and propose that defenders deliberately inject and then remove benign backdoors to preemptively suppress adversarial ones.