cs.LG

Training AI to explore more options helps it find better answers

Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld, Akarsh Kumar, Mehul Damani, Sebastian Risi, Omar Khattab, Zhang-Wei Hong, Pulkit Agrawal

May 21, 2026

Current language models trained with a single scalar reward tend to converge on one narrow solution, which hurts performance when test-time search algorithms such as AlphaEvolve need to explore many candidates. Vector Policy Optimization (VPO) trains models to anticipate multiple reward functions and produce diverse solutions by treating rewards as vectors (for example, per-test-case correctness in code generation) rather than scalars. Across code generation and other tasks, VPO matches or beats standard RL baselines on pass@k metrics and, when paired with evolutionary search, solves problems that the baseline methods cannot solve at all.
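
To make the scalar-versus-vector distinction concrete, here is a minimal Python sketch. It is not the VPO training algorithm, which this summary does not specify. It only illustrates how a per-test-case reward vector keeps information that a single scalar discards, and how pass@k is commonly estimated from n samples. The helper names (`reward_vector`, `scalar_reward`, `pass_at_k`) and the four-test suite are hypothetical, chosen for illustration.

```python
from math import comb


def reward_vector(passed):
    """Per-test-case correctness as a 0/1 vector, one entry per test (illustrative)."""
    return [1.0 if p else 0.0 for p in passed]


def scalar_reward(vec):
    """Collapse the vector to one number: the fraction of tests passed."""
    return sum(vec) / len(vec)


def pass_at_k(n, c, k):
    """Commonly used unbiased pass@k estimate from n samples, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Two hypothetical candidates that pass different halves of a 4-test suite.
a = reward_vector([True, True, False, False])
b = reward_vector([False, False, True, True])

# Same scalar reward, yet the vectors show the candidates solve different tests.
same_scalar = scalar_reward(a) == scalar_reward(b)
distinct_vectors = a != b
```

Under a scalar reward the two candidates are indistinguishable, while the vectors show they cover complementary test cases, which is the kind of diversity a downstream search procedure could exploit.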
Published as "Vector Policy Optimization: Training for Diversity Improves Test-Time Search", arXiv:2605.22817