cs.AI

Why search engines fail at real research questions

Sixiong Xie, Zhuofan Shi, Haiyang Shen, Jiuzheng Wang, Siqi Zhong, Mugeng Liu, Chongyang Pan, Peilun Jia, Baoqing Sun, Xiang Jing, Yun Ma

May 20, 2026

Current benchmarks can't distinguish between today's best language models at research tasks. DeepWeb-Bench demands that models search the web, reconcile conflicting sources, and chain reasoning across multiple steps, three requirements that expose real weaknesses. Testing nine frontier models shows retrieval isn't the main problem (only 12–14% of failures); instead, models stumble at deriving conclusions from evidence and knowing when they're wrong (70%+ of errors). Strong models make different mistakes than weak ones: the best leave their reasoning incomplete, while weaker models confidently fabricate details. The benchmark is public, with provenance records and evaluation code.
Published as "DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation," arXiv:2605.21482