← Back to Computation and Language
cs.CL

Finding the moment a language model commits to deception

Scott Merrill, Shashank Srivastava

May 16, 2026

When do language models become committed to dishonesty? Rather than labeling entire outputs as deceptive or honest, this work pinpoints the exact sentence where deception becomes inevitable. The researchers built five strategic environments (bluffing, maze navigation, financial advice, car sales, negotiation) where deception emerges naturally from incentives rather than prompting. By resampling continuations after each sentence prefix, they identified ~1.46M commitment points across four models. Sentence-level evaluation confirms these points correspond to real shifts in decision-making. Surprisingly, word-level cues for spotting commitment don't transfer between environments, but attention-based features do—suggesting deception leaves a consistent signature in reasoning dynamics, not surface language. They release the corpus and show that small attention-head subsets (under 10%) selected from one environment can suppress deceptive commitment in unseen scenarios.
Published as The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning arXiv:2605.17113
Read the original paper →