Why does reinforcement learning work so well for language models?

Most successful RL methods for fine-tuning language models skip the critic network entirely, yet they still work well, and why has remained unclear. The researchers proved that critic-free methods implicitly compute value gradients in their backward passes, so the model learns to optimize rewards without explicitly modeling them. They also showed that this effect holds for real transformer policies, with error bounds tied to randomness and policy uncertainty. The theory gives a practical criterion for when, along a training trajectory, RL will produce the largest improvements.
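
As a toy illustration of the critic-free idea (not the researchers' construction or proof), the sketch below assumes a single-step softmax policy over a few actions with fixed rewards. It compares the exact gradient of the expected reward, the "value" in this simplified setting, with a REINFORCE-style estimate that uses a leave-one-out baseline instead of a learned critic. The function names, the bandit setup, and the choice of baseline are all assumptions made for the example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_value_gradient(logits, rewards):
    """Gradient of V(theta) = sum_a pi(a) * r(a) for a softmax policy.

    For softmax logits, dV/dtheta_j = pi_j * (r_j - V).
    """
    pi = softmax(logits)
    value = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - value) for p, r in zip(pi, rewards)]


def critic_free_gradient(logits, rewards, num_samples, rng):
    """REINFORCE estimate with a leave-one-out baseline (no learned critic).

    Hypothetical toy setup: each sample is one action drawn from the policy,
    and its baseline is the mean reward of the other samples in the batch.
    """
    pi = softmax(logits)
    actions = rng.choices(range(len(logits)), weights=pi, k=num_samples)
    sampled = [rewards[a] for a in actions]
    total = sum(sampled)
    grad = [0.0] * len(logits)
    for a, r in zip(actions, sampled):
        baseline = (total - r) / (num_samples - 1)  # excludes this sample
        advantage = r - baseline
        for j in range(len(logits)):
            # d log pi(a) / d theta_j = 1[a == j] - pi_j
            score = (1.0 if a == j else 0.0) - pi[j]
            grad[j] += advantage * score / num_samples
    return grad


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.2, -0.1, 0.5]   # illustrative values only
    rewards = [1.0, 0.0, 0.5]   # illustrative values only
    print("exact   :", exact_value_gradient(logits, rewards))
    print("estimate:", critic_free_gradient(logits, rewards, 100_000, rng))
```

Because each sample's baseline does not depend on that sample's own action, the estimate is unbiased in this toy setting, so the two vectors should agree up to sampling noise. This shows only that a critic-free estimator can target the value gradient in a simple case; it says nothing about the transformer-specific error bounds described above.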