cs.LG

Why training on multiple answers per question improves language models

Hasan Amin, Kian Ahrabian, Ming Yin, Rajiv Khanna

May 30, 2026

Language models are typically fine-tuned with one response per prompt, even though many questions have multiple correct answers. This creates a "mode lottery" in which the model learns an incomplete view of valid outputs. The authors show that keeping multiple responses per prompt reduces prediction uncertainty about the output distribution, but only when prompts are already scarce. They prove that randomly selecting K responses is unbiased, warn that reward-based selection causes mode collapse, and show on new benchmarks that multi-response training improves generalization most in high-diversity, low-redundancy regimes.
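The contrast between random and reward-based selection can be illustrated with a toy sketch. This is not the authors' code or proof: the pool of responses, the mode labels "A" and "B", the reward values, and the helper names `sample_k_uniform`, `select_top_k`, and `mode_a_fraction` are all hypothetical, chosen only to show that uniform subsampling preserves mode proportions on average while top-K-by-reward selection can drop a valid mode entirely.

```python
import random

# Hypothetical pool for one prompt: (mode_label, reward) pairs.
# Both modes are assumed to be correct answers; mode "A" simply scores higher.
POOL = [("A", 0.9)] * 6 + [("B", 0.8)] * 4


def sample_k_uniform(responses, k, rng):
    """Pick k responses uniformly at random, without replacement."""
    if k >= len(responses):
        return list(responses)
    return rng.sample(responses, k)


def select_top_k(responses, k):
    """Pick the k highest-reward responses (reward-based selection)."""
    return sorted(responses, key=lambda r: r[1], reverse=True)[:k]


def mode_a_fraction(selected):
    """Fraction of the selected responses that belong to mode 'A'."""
    return sum(1 for label, _ in selected if label == "A") / len(selected)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
K = 4
TRIALS = 2000

# Average share of mode "A" across many uniform draws of K responses.
uniform_avg = sum(
    mode_a_fraction(sample_k_uniform(POOL, K, rng)) for _ in range(TRIALS)
) / TRIALS

# Share of mode "A" in the pool itself, and modes kept by top-K selection.
pool_share = mode_a_fraction(POOL)
top_k_modes = {label for label, _ in select_top_k(POOL, K)}

print(f"pool share of A:          {pool_share:.2f}")
print(f"uniform-K average share:  {uniform_avg:.2f}")
print(f"modes kept by top-K:      {sorted(top_k_modes)}")
```

In this toy setting the uniform draws track the pool's mode proportions on average, whereas top-K selection keeps only the higher-reward mode, which is the kind of collapse the summary warns about.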
Published as "Escaping the Mode Lottery: Multi-Response Training Improves Language Model Generalization" (arXiv:2606.00544).