Reading a model's thoughts to predict what it will do
Maciej Chrabąszcz, Aleksander Szymczyk, Marcin Sendera, Tomasz Trzciński, Sebastian Cygert
May 18, 2026
Large language models that show their reasoning (chain-of-thought) do not always write reasoning that reflects how they will actually behave. Instead of trusting the written text, the researchers tracked the model's hidden representations during reasoning and measured how a concept's probability shifts from token to token. They found that this temporal trajectory (its volatility, trends, and drift) predicts final behavior far better than a single snapshot. Using 95% of max-pooled features from signal processing, they achieved 95% AUROC on safety and math tasks across four models.
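To make the idea of trajectory features concrete, here is a minimal sketch in Python. It assumes a probe already exists that turns each token's hidden state into a concept probability; the function name `trajectory_features`, the feature names, and the two toy trajectories are illustrative assumptions, not the paper's actual feature set or data.

```python
import numpy as np

def trajectory_features(probs):
    """Summarise a per-token concept-probability trajectory (illustrative only)."""
    p = np.asarray(probs, dtype=float)
    if p.ndim != 1 or p.size < 2:
        raise ValueError("need a 1-D trajectory with at least two tokens")
    steps = np.diff(p)                 # token-to-token change in probability
    t = np.arange(p.size)              # token positions 0, 1, 2, ...
    slope = np.polyfit(t, p, 1)[0]     # least-squares linear trend over the trajectory
    return {
        "last": float(p[-1]),              # the single-snapshot baseline
        "max": float(p.max()),             # max-pooled value over all tokens
        "volatility": float(steps.std()),  # spread of the step-to-step changes
        "trend": float(slope),
        "drift": float(p[-1] - p[0]),      # net change from first to last token
    }

# Two hypothetical trajectories that end at the same probability.
steady = [0.10, 0.12, 0.11, 0.13, 0.50]
wavering = [0.10, 0.90, 0.15, 0.85, 0.50]

for name, traj in [("steady", steady), ("wavering", wavering)]:
    print(name, trajectory_features(traj))
```

Both toy trajectories share the same final value, so a single snapshot cannot tell them apart, while the volatility and max features can. A classifier trained on such features is one plausible way to use them, though the summary does not say which classifier the authors used.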