Training reasoning models efficiently with difficulty-matched questions

Ru Zhang, Renda Li, Ziyu Ma, Weijie Qiu, Chongyang Tao, Yong Wang, Xiangxiang Chu

Improving LLM reasoning with reinforcement learning (RL) is hindered by two problems: appropriately difficult training samples are scarce, and existing samples become too easy as the model improves. D²Evo addresses both by jointly training a Questioner, which generates training problems, and a Solver, which answers them, with the Questioner calibrated to the Solver's current capability. The method mines medium-difficulty anchors from the Solver's mistakes and generates diverse questions at matching difficulty levels. On mathematical reasoning benchmarks, D²Evo outperforms existing generation-based approaches while using fewer than 2K real samples, and it generalizes to broader reasoning tasks. Code availability is not mentioned.
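
The anchor-mining idea can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say how difficulty is measured, so the sketch assumes it is estimated as the Solver's empirical pass rate over repeated attempts, and that "medium" means a pass rate inside a chosen band. The names pass_rate and mine_medium_anchors, the thresholds low and high, and the sample count n_samples are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical type: a Solver is any callable mapping a question to an answer string.
Solver = Callable[[str], str]


def pass_rate(solver: Solver, question: str, answer: str, n_samples: int = 8) -> float:
    """Estimate difficulty as the fraction of sampled attempts that match the reference answer."""
    correct = sum(solver(question) == answer for _ in range(n_samples))
    return correct / n_samples


def mine_medium_anchors(
    solver: Solver,
    dataset: List[Tuple[str, str]],
    low: float = 0.25,   # assumed lower bound of the "medium" band
    high: float = 0.75,  # assumed upper bound of the "medium" band
    n_samples: int = 8,
) -> List[Tuple[str, float]]:
    """Keep questions the Solver gets wrong sometimes, but not always.

    Returns (question, pass_rate) pairs so that a question generator could be
    asked to produce new questions at a matching difficulty level.
    """
    anchors = []
    for question, answer in dataset:
        rate = pass_rate(solver, question, answer, n_samples)
        if low <= rate <= high:
            anchors.append((question, rate))
    return anchors
```

Under these assumptions, questions that are always solved or never solved are discarded, and the retained pass rates could serve as difficulty targets for the Questioner. The band would need to be re-estimated as the Solver improves, since a fixed anchor set drifts toward the easy end.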