How backdoor attacks hijack language models through hidden shortcuts

Backdoor attacks on language models work like sleeper agents: a hidden trigger phrase hijacks the model's output without obvious signs. By dissecting an 8B-parameter model, researchers traced exactly how a Latin trigger redirects English text to French. Attention heads compress the trigger tokens, the resulting signal travels through a subspace orthogonal to the one used for normal language processing, and the final layer converts it into French logits. They found a critical bottleneck, a single position where blocking the signal defeats the attack, but also discovered that traditional defenses that scan for language patterns would completely miss this trigger.
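
To make the "orthogonal subspace" and "blocking the signal" ideas concrete, here is a toy sketch of removing one direction from a hidden-state vector by orthogonal projection. Everything in it is an assumption made for illustration: the 4-dimensional vectors, the single trigger direction, and the names `ablate_direction`, `language_state`, and `trigger_direction` are hypothetical and not taken from the study, whose actual subspace may span more than one direction.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [a / norm for a in v]


def ablate_direction(hidden, direction):
    """Return `hidden` with its component along `direction` removed."""
    unit = normalize(direction)
    coeff = dot(hidden, unit)  # how much of the signal is present
    return [h - coeff * u for h, u in zip(hidden, unit)]


# Toy data (made up): the trigger direction is orthogonal to the
# ordinary language state, so it does not disturb normal processing.
language_state = [1.0, 2.0, 0.0, 0.0]
trigger_direction = [0.0, 0.0, 3.0, 4.0]

# A "poisoned" hidden state carries the ordinary content plus the signal.
poisoned = [l + 0.5 * t for l, t in zip(language_state, trigger_direction)]

# Blocking the signal at one position: project the trigger direction out.
cleaned = ablate_direction(poisoned, trigger_direction)
```

Because the two directions are orthogonal in this toy setup, the ablation strips the trigger signal while leaving the ordinary content unchanged. In a real model the direction would have to be estimated from activations, and the intervention would be applied at the bottleneck position rather than to a hand-written vector.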