cs.AI

Parallel reasoning with tournament-style ranking beats single-chain thinking

Shang Zhou, Wenhao Chai, Kaiyuan Liu, Huanzhi Mao, Qiuyang Mang, Jingbo Shang

May 14, 2026

Most test-time compute methods scale depth by making one reasoning trace longer. OpenDeepThink scales breadth instead: it maintains a population of candidate solutions that compete in pairwise LLM-judged comparisons, and the outcomes are aggregated into a global Bradley-Terry ranking. Each round, the bottom quarter of the population is dropped and the top three-quarters are mutated using natural-language critiques produced by the comparisons. The method gains +405 Codeforces Elo on Gemini and transfers across model sizes without retuning. On the multi-domain HLE benchmark, gains concentrate in objectively verifiable domains and reverse on subjective ones, a useful signal about where pairwise judging is reliable. The authors also release CF-73, a curated set of 73 expert-annotated Codeforces problems with 99% agreement against official verdicts.
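To make the ranking step concrete, here is a minimal sketch of how pairwise verdicts could be turned into a global Bradley-Terry ranking and a keep/drop split. Under the Bradley-Terry model, candidate i beats candidate j with probability p_i / (p_i + p_j), where p_i is a positive strength. The fitting procedure (the standard MM update), the small `prior` regularizer, and the names `bradley_terry` and `select_survivors` are assumptions made for illustration, not the authors' implementation. The LLM judge and the critique-driven mutation are outside the sketch.

```python
from collections import defaultdict


def bradley_terry(n_items, comparisons, iters=200, prior=0.1):
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs.

    Uses the classic MM (minorization-maximization) update. `prior` adds
    a fractional win and loss against a virtual opponent of strength 1,
    which keeps every strength finite and positive even for a candidate
    that won or lost all its comparisons (an assumption of this sketch).
    """
    wins = [0.0] * n_items
    games = defaultdict(int)  # number of comparisons per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1
        games[(min(winner, loser), max(winner, loser))] += 1

    p = [1.0] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), n in games.items():
            share = n / (p[i] + p[j])
            denom[i] += share
            denom[j] += share
        # Virtual opponent: `prior` wins and `prior` losses, so 2 * prior games.
        p = [
            (wins[i] + prior) / (denom[i] + 2 * prior / (p[i] + 1.0))
            for i in range(n_items)
        ]
    return p


def select_survivors(strengths, keep_fraction=0.75):
    """Rank candidates by strength; return (kept, dropped) index lists."""
    order = sorted(range(len(strengths)), key=lambda i: strengths[i], reverse=True)
    n_keep = max(1, int(len(order) * keep_fraction))
    return order[:n_keep], order[n_keep:]


if __name__ == "__main__":
    # Four hypothetical candidates; each tuple is (winner, loser) as
    # decided by some judge.
    verdicts = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    strengths = bradley_terry(4, verdicts)
    kept, dropped = select_survivors(strengths)
    print("kept:", kept, "dropped:", dropped)
```

In a full round, the `kept` candidates would be passed to a mutation step together with the critiques from their comparisons, and `dropped` would be discarded. Fitting a global model rather than counting raw wins lets a sparse set of comparisons still produce a consistent ordering.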
Published as "OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation" (arXiv:2605.15177)