cs.CL

Why does reinforcement learning work so well for language models?

Arip Asadulaev, Daniil Ognev, Karim Salta, Martin Takac

May 20, 2026

Most successful RL methods for fine-tuning language models skip the critic network entirely yet still work well, and it has been unclear why. The authors proved that critic-free methods implicitly compute value gradients through their backward passes, meaning the model learns to optimize rewards without explicitly modeling them. They also showed that this effect holds for real transformer policies, with error bounds tied to randomness and policy uncertainty. The theory gives a practical criterion for where along a training trajectory RL will yield the biggest improvements.
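To make "critic-free" concrete, here is a minimal sketch on a toy three-action softmax policy. It is not the paper's construction or proof. It only shows a score-function gradient estimate that uses the mean sampled reward as its baseline instead of a learned critic, and compares it with the exact gradient of expected reward. The function names, logits, and rewards are all hypothetical choices for illustration.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def exact_gradient(logits, rewards):
    """Exact gradient of J = sum_a pi(a) * r(a) with respect to the logits."""
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]

def critic_free_gradient(logits, rewards, num_samples, rng):
    """Score-function estimate with a sample-mean baseline (no critic network)."""
    probs = softmax(logits)
    actions = rng.choices(range(len(logits)), weights=probs, k=num_samples)
    sampled = [rewards[a] for a in actions]
    baseline = sum(sampled) / num_samples  # stands in for a learned value estimate
    grad = [0.0] * len(logits)
    for a, r in zip(actions, sampled):
        advantage = r - baseline
        for i, p in enumerate(probs):
            # d log pi(a) / d logit_i = 1[i == a] - pi(i)
            grad[i] += advantage * ((1.0 if i == a else 0.0) - p)
    return [g / num_samples for g in grad]

# Hypothetical toy problem: three actions with fixed rewards.
logits = [0.0, 0.5, -0.5]
rewards = [1.0, 0.0, 0.5]
rng = random.Random(0)

exact = exact_gradient(logits, rewards)
estimate = critic_free_gradient(logits, rewards, num_samples=20000, rng=rng)
max_error = max(abs(e - g) for e, g in zip(exact, estimate))
print("exact:   ", exact)
print("estimate:", estimate)
print("max abs error:", max_error)
```

With enough samples the estimate should track the exact gradient closely, even though no value function is ever fit. The sample-mean baseline is one simple choice assumed here; the methods the paper analyzes may use a different baseline.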
Published as "Value-Gradient Hypothesis of RL for LLMs", arXiv:2605.21654