Why training on multiple answers per question improves language models

Language models are typically fine-tuned on one response per prompt, even though many questions have multiple correct answers. This creates a "mode lottery" in which the model learns an incomplete view of the valid outputs. The authors show that keeping multiple responses per prompt reduces prediction uncertainty about the output distribution, but only when prompts are already scarce. They prove that random selection of K responses is unbiased, warn that reward-based selection causes mode collapse, and validate on new benchmarks that multi-response training improves generalization most in high-diversity, low-redundancy regimes.
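
The contrast between random and reward-based selection can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the authors' proof or experimental setup: the candidate pool, the `toy_reward` scorer, and the helper names `random_k` and `top_k_by_reward` are all hypothetical. It checks that uniformly sampling K responses keeps each answer mode at roughly its share of the pool on average, while keeping the top K by reward retains a single mode.

```python
import random
from collections import Counter


def random_k(responses, k, rng):
    """Keep k responses chosen uniformly at random, without replacement."""
    return rng.sample(responses, min(k, len(responses)))


def top_k_by_reward(responses, k, reward):
    """Keep the k highest-scoring responses under a (hypothetical) reward."""
    return sorted(responses, key=reward, reverse=True)[:k]


# Toy pool for one prompt: six candidate answers falling into three valid modes.
pool = ["A", "A", "A", "B", "B", "C"]

# Hypothetical reward that mildly prefers mode A over B and C.
toy_reward = {"A": 1.0, "B": 0.9, "C": 0.8}.get

rng = random.Random(0)
trials = 20000
k = 2

# Average mode frequencies among the kept responses under random selection.
counts = Counter()
for _ in range(trials):
    counts.update(random_k(pool, k, rng))
random_freq = {mode: counts[mode] / (k * trials) for mode in "ABC"}

# Reward-based selection is deterministic here, so one call is enough.
top_kept = top_k_by_reward(pool, k, toy_reward)

print("pool shares:      A=0.50 B=0.33 C=0.17")
print("random selection:", {m: round(f, 2) for m, f in random_freq.items()})
print("top-k by reward: ", top_kept)
```

In this toy setting the random subsample tracks the pool's mode proportions in expectation, whereas the reward-based subset drops modes B and C entirely, even though the reward gap between modes is small.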