Understanding how reasoning models think by watching their entropy patterns

Junyao Yang, Chen Qian, Kun Wang, Linfeng Zhang, Quanshi Zhang, Yong Liu, Dongrui Liu

Large reasoning models excel at step-by-step problem solving, but training them efficiently remains difficult because current methods rely on costly external verifiers. This work identifies entropy-gradient inversion, a geometric pattern in which token entropy correlates inversely with logit gradients, as a fingerprint of reasoning capability. Building on this observation, the authors develop CorR-PO, a policy optimization method that embeds the inversion pattern into reward regularization during training. Experiments across multiple reasoning benchmarks show that the approach consistently outperforms baselines while reducing dependence on external verification.