cs.CL

How backdoor attacks hijack language models through hidden shortcuts

Francis Kulumba, Wissam Antoun, Théo Lasnier, Benoît Sagot, Djamé Seddah

May 18, 2026

Backdoor attacks on language models work like sleeper agents: a hidden trigger phrase hijacks the model's output without obvious signs. By dissecting an 8B-parameter model, researchers traced exactly how a Latin trigger redirects English text to French. Attention heads compress the trigger tokens, the signal travels through a subspace orthogonal to normal language processing, and the final layer converts it to French logits. They found a critical bottleneck, a single position where blocking the signal defeats the attack, but also discovered that traditional defenses that scan for language patterns would completely miss this trigger.
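
To make the idea of blocking a signal that lives in its own subspace concrete, here is a minimal sketch of directional ablation: removing the component of a hidden-state vector that lies along one direction while leaving the rest untouched. This is an illustration only, not the paper's procedure. The function names, the 3-dimensional vectors, and the assumption that the trigger signal occupies a single known direction are all hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in v]


def ablate_direction(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`.

    The result is orthogonal to `direction`; every component of
    `hidden` outside that direction is left unchanged.
    """
    unit = normalize(direction)
    coeff = dot(hidden, unit)
    return [h - coeff * u for h, u in zip(hidden, unit)]


# Toy values, invented for illustration (not taken from the paper).
trigger_direction = [0.0, 0.0, 1.0]  # hypothetical "trigger" axis
hidden_state = [0.5, -1.2, 3.0]      # hypothetical activation at one position

cleaned = ablate_direction(hidden_state, trigger_direction)
```

In this toy case the third coordinate carries the hypothetical trigger signal, so ablation zeroes it and keeps the first two coordinates as they were. In a real model the direction would have to be estimated from activations, and the intervention would be applied at a specific layer and token position.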
Published as "Language-Switching Triggers Take a Latent Detour Through Language Models", arXiv:2605.18646