Can language models truly predict what consumers will say?

This paper introduces ConsumerSimBench, a benchmark that tests whether large language models can accurately simulate how real consumers respond to topics and marketing decisions. The benchmark is built from 1,553 Chinese social-media topics, and each evaluation is decomposed into concrete yes/no judgments rather than a single holistic score, which yields 92.1% inter-judge agreement. Across 13 frontier models, Gemini-3.1-Pro covers only 47.8% of real reaction criteria, and GPT-5.2 and Claude-4.6 score lower still despite strong results on standard technical benchmarks. A multi-agent pipeline raised one model's coverage from 32.9% to 37.6%, a gain of 4.7 percentage points, but the results still point to a fundamental gap: models excel at technical tasks yet struggle to predict what matters to real consumers in high-context social discourse.
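To make the two headline metrics concrete, here is a minimal sketch of how coverage and inter-judge agreement could be computed from yes/no judgments. It assumes coverage is the fraction of criteria judged as met and agreement is the fraction of criteria on which two judges give the same answer; the paper's exact definitions may differ. The function names and the toy data are hypothetical and are not taken from the benchmark.

```python
from typing import Sequence


def coverage(judgments: Sequence[bool]) -> float:
    """Fraction of criteria judged 'yes' (assumed definition of coverage)."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)


def agreement(judge_a: Sequence[bool], judge_b: Sequence[bool]) -> float:
    """Fraction of criteria on which two judges give the same yes/no answer
    (assumed definition; simple percent agreement, not chance-corrected)."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("judgment lists must be non-empty and equal in length")
    matches = sum(a == b for a, b in zip(judge_a, judge_b))
    return matches / len(judge_a)


# Toy data (hypothetical): two judges scoring one model response
# against eight criteria derived from real consumer reactions.
judge_a = [True, False, True, False, False, True, False, True]
judge_b = [True, False, True, False, True, True, False, True]

cov = coverage(judge_a)            # 4 of 8 criteria met
agr = agreement(judge_a, judge_b)  # judges differ on 1 of 8 criteria

# Percentage-point gain reported for the multi-agent pipeline.
gain_pp = round(37.6 - 32.9, 1)

print(f"coverage={cov:.1%} agreement={agr:.1%} gain={gain_pp} pp")
```

Binary judgments keep both quantities to simple counting, which is one plausible reason a decomposed rubric is easier for judges to agree on than a holistic score.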