cs.LG

How to revive a collapsed AI reasoning policy without external tweaks?

Xuewei Yang, Jiachen Yu, Jie Wu, Shaoning Sun, Junjie Wang, Yujiu Yang

May 30, 2026

When reinforcement learning trains language models on reasoning tasks, policies eventually collapse into overconfident, repetitive outputs that starve training of learning signal. TS-OPSD addresses this by treating the model as its own teacher: it applies high-temperature smoothing to the model's logits, then distills the softer distribution back into the parameters. On Qwen 4B and 8B models, this lightweight reheating yields stronger checkpoints for continued RL than naive temperature adjustment or standard resumption, while preserving reasoning ability and intermediate representations.
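The reheating idea can be illustrated with a toy sketch. It is not the paper's implementation: it works on a single hand-picked logit vector rather than a language model with on-policy rollouts, and the function names (`softmax`, `reheat`), the temperature of 2.0, the learning rate, and the choice of forward KL as the distillation loss are all assumptions made for illustration. The teacher is the frozen, temperature-smoothed copy of the collapsed distribution, and the student logits are updated by gradient descent until their temperature-1 softmax approaches it.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def reheat(logits, temperature=2.0, lr=1.0, steps=2000):
    """Distill the temperature-smoothed distribution back into the logits.

    Hypothetical helper: the teacher is softmax(logits / temperature),
    held fixed; the student starts from the same logits and minimizes
    KL(teacher || softmax(student)) by plain gradient descent.
    """
    teacher = softmax(logits, temperature)  # frozen, softer target
    student = list(logits)
    for _ in range(steps):
        q = softmax(student)
        # Gradient of KL(teacher || softmax(student)) w.r.t. the student
        # logits is (q - teacher), the usual cross-entropy gradient.
        student = [z - lr * (qi - ti) for z, qi, ti in zip(student, q, teacher)]
    return student, teacher


# A "collapsed" policy: nearly all probability mass on one action.
collapsed_logits = [8.0, 1.0, 0.5, 0.0]
reheated_logits, teacher = reheat(collapsed_logits)

entropy_before = entropy(softmax(collapsed_logits))
entropy_after = entropy(softmax(reheated_logits))
kl_before = kl_divergence(teacher, softmax(collapsed_logits))
kl_after = kl_divergence(teacher, softmax(reheated_logits))
```

In this sketch the smoothing ends up stored in the logits themselves, so the reheated policy is sampled at temperature 1 with no inference-time adjustment. That is the contrast the summary draws with naive temperature adjustment, though how the paper realizes it at model scale is not shown here.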
Published as "Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning", arXiv:2606.00755.