Can LLMs reason better by exploring multiple paths first?

Zhiyu Cao, Kaixin Wu, Mingjie Zhong, Peifeng Li, Xiaobo Li, Can Ye, Qiaoming Zhu

Today's LLMs reason well but typically commit to a single solution path. HDPO (Hint-Guided Diversified Policy Optimization) flips this: it trains models to first sketch multiple candidate approaches as "hints," then select and execute the most promising one. Training proceeds in two stages: a structured-reasoning cold start, followed by reinforcement learning that rewards both diverse exploration and reliable selection. On reasoning benchmarks, the method boosts accuracy while visibly increasing solution diversity, much as humans mentally test alternatives before deciding.
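
To make the idea of rewarding both diverse exploration and reliable selection concrete, here is a toy sketch of what such a reward could look like. It is not HDPO's actual reward, which this summary does not specify. The names `jaccard_distance`, `diversity_score`, and `toy_reward`, the word-overlap diversity measure, the exact-match correctness check, and the `diversity_weight` parameter are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Sequence


def jaccard_distance(a: str, b: str) -> float:
    """Return 1 - |A & B| / |A | B| over the lowercase word sets of a and b."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0  # two empty hints are treated as identical
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def diversity_score(hints: Sequence[str]) -> float:
    """Mean pairwise Jaccard distance between hints (0.0 for fewer than two)."""
    pairs = list(combinations(hints, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def toy_reward(
    hints: Sequence[str],
    final_answer: str,
    reference: str,
    diversity_weight: float = 0.5,  # hypothetical trade-off coefficient
) -> float:
    """Illustrative reward: exact-match correctness plus a weighted diversity bonus."""
    correctness = 1.0 if final_answer.strip() == reference.strip() else 0.0
    return correctness + diversity_weight * diversity_score(hints)


if __name__ == "__main__":
    hints = ["try induction on n", "use a generating function"]
    print(toy_reward(hints, final_answer="42", reference="42"))
```

In this sketch, a rollout with near-identical hints earns only the correctness term, while one that proposes genuinely different approaches and still lands on the right answer earns a bonus. A real system would likely need a semantic diversity measure instead of word overlap, and a separate signal for whether the chosen hint was the one that led to the answer.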