How to revive a collapsed AI reasoning policy without external tweaks?
Xuewei Yang, Jiachen Yu, Jie Wu, Shaoning Sun, Junjie Wang, Yujiu Yang
May 30, 2026
When reinforcement learning is used to train language models on reasoning tasks, the policy eventually collapses into overconfident, repetitive outputs that eliminate the learning signal. TS-OPSD addresses this by treating the model as its own teacher: it applies high-temperature smoothing to the model's logits, then distills the resulting softer distribution back into the parameters. On Qwen 4B and 8B models, this lightweight reheating yields stronger checkpoints for continued RL than either naive temperature adjustment or standard resumption, while preserving reasoning ability and intermediate representations.
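The reheating idea can be illustrated on a single next-token distribution. The sketch below is not the paper's implementation: the summary does not state the loss, so it assumes a KL divergence from a temperature-smoothed teacher to the unsmoothed student, and the function names, the temperature of 2.0, and the example logits are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def self_distillation_loss(logits, teacher_temperature=2.0):
    """Assumed objective: pull the T=1 student toward its own softened output.

    The teacher is the same logits passed through a high-temperature softmax
    and treated as a fixed target; the student is the ordinary softmax.
    """
    teacher = softmax(logits, teacher_temperature)
    student = softmax(logits, 1.0)
    return kl_divergence(teacher, student)

# Hypothetical logits for a collapsed policy: one token dominates.
collapsed_logits = [8.0, 1.0, 0.5, 0.0]
student = softmax(collapsed_logits)
teacher = softmax(collapsed_logits, temperature=2.0)

print("student entropy:", entropy(student))
print("teacher entropy:", entropy(teacher))
print("distillation loss:", self_distillation_loss(collapsed_logits))
```

The teacher has higher entropy than the student, so minimizing this loss with respect to the model parameters would move probability mass back onto the suppressed tokens. In an actual training setup the teacher logits would come from a frozen copy of the checkpoint and the loss would be averaged over sampled token positions.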