Why search engines fail at real research questions
Sixiong Xie, Zhuofan Shi, Haiyang Shen, Jiuzheng Wang, Siqi Zhong, Mugeng Liu, Chongyang Pan, Peilun Jia, Baoqing Sun, Xiang Jing, Yun Ma
May 20, 2026
Current benchmarks can't distinguish between today's best language models on research tasks. DeepWeb-Bench demands that models search the web, reconcile conflicting sources, and chain reasoning across multiple steps, three requirements that expose real weaknesses. Testing nine frontier models shows that retrieval is rarely the problem (only 12–14% of failures); instead, models stumble at deriving conclusions from evidence and recognizing when they are wrong (70%+ of errors). Strong models make different mistakes than weak ones: the best leave their reasoning incomplete, while weaker models confidently fabricate details. The benchmark is public, along with provenance records and evaluation code.