Reading a model's thoughts to predict what it will do

Maciej Chrabąszcz, Aleksander Szymczyk, Marcin Sendera, Tomasz Trzciński, Sebastian Cygert

Large language models that write out their reasoning (chain-of-thought) do not always describe faithfully how they will go on to behave. Instead of trusting the written reasoning, the researchers tracked the model's hidden representations during reasoning and measured how the probability of a concept shifts from token to token. They found that the temporal trajectory of that probability (its volatility, trends, and drift) predicts final behavior far better than a single snapshot does. With 95% of the max-pooled features drawn from signal processing, they reached 95% AUROC on safety and math tasks across four models.
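To make the idea of a temporal trajectory concrete, here is a minimal sketch of how one might summarize a per-token concept-probability sequence. It assumes a probe has already produced one probability per reasoning token. The function name `trajectory_features` and the specific feature definitions are illustrative assumptions, not the authors' feature set.

```python
from statistics import mean, pstdev


def trajectory_features(probs):
    """Summarize a per-token concept-probability trajectory (hypothetical helper)."""
    if len(probs) < 2:
        raise ValueError("need at least two token positions")
    n = len(probs)

    # Token-to-token changes in the concept probability.
    steps = [b - a for a, b in zip(probs, probs[1:])]

    # Ordinary least-squares slope of probability against token index.
    t_mean = (n - 1) / 2
    p_mean = mean(probs)
    cov = sum((t - t_mean) * (p - p_mean) for t, p in enumerate(probs))
    var = sum((t - t_mean) ** 2 for t in range(n))

    return {
        "max": max(probs),             # max-pooling over the sequence
        "mean": p_mean,                # average level
        "volatility": pstdev(steps),   # spread of the step-to-step changes
        "trend": cov / var,            # linear slope per token
        "drift": probs[-1] - probs[0], # net change from first to last token
    }


features = trajectory_features([0.10, 0.15, 0.40, 0.35, 0.80])
print(features)
```

A single-snapshot baseline would use only one of these values, such as the probability at the last token. The trajectory view keeps the shape of the whole sequence, and the resulting features could then be fed to any standard classifier.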