Parallel reasoning with tournament-style ranking beats single-chain thinking

Most test-time compute methods scale depth by making one reasoning trace longer. OpenDeepThink scales breadth instead: it maintains a population of candidate solutions that compete in pairwise LLM-judged comparisons, and those comparisons are aggregated into a global Bradley-Terry ranking. Each round, the bottom quarter of the population is dropped and the top three-quarters are mutated using natural-language critiques from the comparisons. Beyond the +405 Codeforces Elo gain on Gemini, the method transfers across model sizes without retuning. On the multi-domain HLE benchmark, gains concentrate in objectively verifiable domains and reverse on subjective ones, a useful signal about where pairwise judging is reliable. The authors also release CF-73, a curated set of 73 expert-annotated Codeforces problems with 99% agreement against official verdicts.
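
The rank-and-select step of a round can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the summary does not say how the Bradley-Terry model is fitted, so the sketch assumes the standard iterative (minorization-maximization) update, and the names `bradley_terry`, `next_round`, and the `prior` smoothing parameter are hypothetical. The LLM judge and the critique-driven mutation are left out; comparisons are taken as given `(winner, loser)` pairs.

```python
from collections import defaultdict


def bradley_terry(n_items, comparisons, iters=200, prior=0.1):
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs.

    Assumption of this sketch: `prior` pseudo-wins are added to both sides
    of every compared pair so a candidate with no real wins keeps a
    positive strength.
    """
    wins = [0.0] * n_items
    games = defaultdict(float)  # comparisons per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1.0
        games[(min(winner, loser), max(winner, loser))] += 1.0
    for a, b in list(games):
        wins[a] += prior
        wins[b] += prior
        games[(a, b)] += 2 * prior

    strengths = [1.0] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (a, b), n in games.items():
            shared = n / (strengths[a] + strengths[b])
            denom[a] += shared
            denom[b] += shared
        # Candidates that were never compared keep their current strength.
        updated = [
            wins[i] / denom[i] if denom[i] > 0 else strengths[i]
            for i in range(n_items)
        ]
        # Rescale so the strengths average to 1 (they are only defined up to scale).
        total = sum(updated)
        strengths = [s * n_items / total for s in updated]
    return strengths


def next_round(strengths, keep_fraction=0.75):
    """Split candidate indices into survivors (top fraction) and dropped."""
    ranked = sorted(range(len(strengths)), key=lambda i: strengths[i], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep], ranked[n_keep:]


# Four candidates, one judged comparison per pair.
judged = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
strengths = bradley_terry(4, judged)
survivors, dropped = next_round(strengths)
```

In a full loop, each index in `survivors` would be rewritten using the critiques it received, and the population would be compared again in the next round. Fitting a global model rather than counting raw wins matters when not every pair is compared, because a win against a strong candidate then counts for more than a win against a weak one.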