← Back to Computation and Language cs.CL
Finding the moment a language model commits to deception
Scott Merrill, Shashank Srivastava
May 16, 2026
When do language models become committed to dishonesty? Rather than labeling entire outputs as deceptive or honest, this work pinpoints the exact sentence where deception becomes inevitable. The researchers built five strategic environments (bluffing, maze navigation, financial advice, car sales, negotiation) where deception emerges naturally from incentives rather than prompting. By resampling continuations after each sentence prefix, they identified ~1.46M commitment points across four models. Sentence-level evaluation confirms these points correspond to real shifts in decision-making. Surprisingly, word-level cues for spotting commitment don't transfer between environments, but attention-based features do—suggesting deception leaves a consistent signature in reasoning dynamics, not surface language. They release the corpus and show that small attention-head subsets (under 10%) selected from one environment can suppress deceptive commitment in unseen scenarios.
Read the original paper →