Understanding how reasoning models think by watching their entropy patterns
Junyao Yang, Chen Qian, Kun Wang, Linfeng Zhang, Quanshi Zhang, Yong Liu, Dongrui Liu
May 18, 2026
Large reasoning models excel at step-by-step problem solving, but training them efficiently remains difficult because current methods rely on costly external verifiers. This work identifies entropy-gradient inversion—a geometric pattern where token entropy inversely correlates with logit gradients—as a fingerprint of reasoning capability. The authors use this insight to develop CorR-PO, a policy optimization method that embeds this inversion pattern into reward regularization during training. Experiments across multiple reasoning benchmarks show the approach consistently outperforms baselines while reducing dependence on external verification.