Why search engines fail at real research questions

Sixiong Xie, Zhuofan Shi, Haiyang Shen, Jiuzheng Wang, Siqi Zhong, Mugeng Liu, Chongyang Pan, Peilun Jia, Baoqing Sun, Xiang Jing, Yun Ma

Current benchmarks can't distinguish between today's best language models on research tasks. DeepWeb-Bench requires models to search the web, reconcile conflicting sources, and chain reasoning across multiple steps, three requirements that expose real weaknesses. Tests of nine frontier models show that retrieval isn't the main problem (only 12–14% of failures); instead, models stumble at deriving conclusions from evidence and at recognizing when they're wrong (more than 70% of errors). Strong models make different mistakes than weak ones: the best leave their reasoning incomplete, while weaker models confidently fabricate details. The benchmark is publicly available, along with provenance records and evaluation code.