Parallel reasoning with tournament-style ranking beats single-chain thinking
Shang Zhou, Wenhao Chai, Kaiyuan Liu, Huanzhi Mao, Qiuyang Mang, Jingbo Shang
May 14, 2026
Most test-time compute methods scale depth by making a single reasoning trace longer. OpenDeepThink scales breadth instead: it maintains a population of candidate solutions that compete in pairwise, LLM-judged comparisons, and the outcomes are aggregated into a global Bradley-Terry ranking. Each round, the bottom quarter of the population is dropped and the top three-quarters are mutated using natural-language critiques from the comparisons. Beyond a +405 Codeforces Elo gain on Gemini, the method transfers across model sizes without retuning. On the multi-domain HLE benchmark, gains concentrate in objectively verifiable domains and reverse on subjective ones, a useful signal about where pairwise judging is reliable. The authors also release CF-73, a curated set of 73 expert-annotated Codeforces problems with 99% agreement against official verdicts.
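To make the ranking step concrete, here is a minimal sketch of how pairwise verdicts could be turned into a Bradley-Terry ranking and a three-quarters survivor set. It is not the authors' implementation. The function names `bradley_terry` and `select_survivors`, the iterative minorization-maximization update, and the small pseudo-count `eps` are all assumptions chosen for illustration. LLM judging and critique-based mutation are not modelled.

```python
from collections import defaultdict


def bradley_terry(n_items, comparisons, iters=200, eps=1e-3):
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs.

    Uses the classic iterative update p_i = W_i / sum_j n_ij / (p_i + p_j).
    `eps` is an assumed pseudo-count that keeps winless items positive.
    """
    wins = [0.0] * n_items
    games = defaultdict(int)  # unordered pair -> number of comparisons
    for winner, loser in comparisons:
        wins[winner] += 1.0
        games[(min(winner, loser), max(winner, loser))] += 1

    strengths = [1.0 / n_items] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (a, b), n in games.items():
            share = n / (strengths[a] + strengths[b])
            denom[a] += share
            denom[b] += share
        # Items that were never compared keep their current strength.
        updated = [
            (wins[i] + eps) / denom[i] if denom[i] > 0 else strengths[i]
            for i in range(n_items)
        ]
        total = sum(updated)
        strengths = [s / total for s in updated]  # normalize to sum to 1
    return strengths


def select_survivors(strengths, keep_fraction=0.75):
    """Return candidate indices ranked best-first, and the kept prefix."""
    ranking = sorted(range(len(strengths)), key=lambda i: -strengths[i])
    keep = max(1, int(len(ranking) * keep_fraction))
    return ranking, ranking[:keep]


# Toy round: four candidates, one judged comparison per pair.
verdicts = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
strengths = bradley_terry(4, verdicts)
ranking, survivors = select_survivors(strengths)
```

In this toy round the survivors would be the candidates passed on to mutation. A real system would also need a rule for choosing which pairs to compare each round, which this sketch leaves out.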