Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
A new benchmark evaluates deep research agents (DRAs) on their ability to produce structured analytical deliverables. The study assessed three frontier DRAs on a set of 42 prompts authored by subject matter experts. Acceptance rates were low for all three agents, and each agent showed distinct strengths and weaknesses.
- The benchmark targets the structured analytical deliverables typical of management consulting work.
- Three frontier agents were evaluated: Claude Opus 4.6, OpenAI o3-deep-research, and Google Gemini 3.1 Pro.
- Acceptance rates were low for all three agents: Gemini at 21.4%, o3 at 9.5%, and Claude also at 9.5%.
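As a quick sanity check on the figures above, the percentages can be converted back into prompt counts. This is a minimal sketch that assumes each acceptance rate is computed per prompt over the full set of 42 prompts; the summary does not state the denominator, so the implied counts are an inference rather than reported results. The names `TOTAL_PROMPTS`, `implied_count`, and `acceptance_rate` are hypothetical and used only for illustration.

```python
# Assumption: each acceptance rate is (accepted prompts / 42 prompts).
# The summary reports only the percentages, not the underlying counts.
TOTAL_PROMPTS = 42

# Acceptance rates as reported in the summary, in percent.
reported = {
    "Gemini 3.1 Pro": 21.4,
    "o3-deep-research": 9.5,
    "Claude Opus 4.6": 9.5,
}


def implied_count(rate_percent, total=TOTAL_PROMPTS):
    """Nearest whole number of accepted prompts for a reported rate."""
    return round(rate_percent / 100 * total)


def acceptance_rate(accepted, total=TOTAL_PROMPTS):
    """Acceptance rate in percent, rounded to one decimal place."""
    return round(100 * accepted / total, 1)


# Convert each reported rate to an implied count, then back to a rate
# to confirm that the round trip reproduces the reported figure.
implied = {agent: implied_count(rate) for agent, rate in reported.items()}
round_trip = {agent: acceptance_rate(n) for agent, n in implied.items()}

for agent, n in implied.items():
    print(f"{agent}: {n}/{TOTAL_PROMPTS} accepted = {round_trip[agent]}%")
```

Under that assumption, the reported rates are consistent with whole-number counts out of 42, which is why the two lower rates coincide exactly.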
Opening excerpt (first ~120 words)
arXiv:2605.17554 (cs.AI), submitted on 17 May 2026
Authors: Tanmay Asthana, Aman Saksena, Divyansh Sahu
Abstract: Frontier deep research agents (DRAs) plan a research task, synthesize across documents, and return a structured deliverable on demand. They are being deployed in enterprise workflows faster than they are being evaluated.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.