DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation
DeepWeb-Bench is a new benchmark designed to evaluate the deep research capabilities of language models. It emphasizes extensive evidence collection, cross-source reconciliation, and long-horizon reasoning. The benchmark aims to provide clearer insight into model performance and weaknesses than existing evaluation methods.
- DeepWeb-Bench introduces a more challenging evaluation framework for frontier language models.
- The benchmark focuses on four capability families: Retrieval, Derivation, Reasoning, and Calibration.
- Evaluation results show that derivation and calibration failures account for over 70% of model errors.
Opening excerpt (first ~120 words)
Computer Science > Artificial Intelligence
arXiv:2605.21482 (cs), submitted on 20 May 2026
Authors: Sixiong Xie, Zhuofan Shi, Haiyang Shen, Jiuzheng Wang, Siqi Zhong, Mugeng Liu, Chongyang Pan, Peilun Jia, Baoqing Sun, Xiang Jing, Yun Ma
Abstract: Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.