DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
DeepResearch Bench aims to comprehensively evaluate the capabilities of Deep Research Agents.
Code | Website | Paper | Eval Dataset | Total models: 19 | Last Update: 15 July 2025
RACE judge model: gemini-2.5-pro | Fact-checking model: gemini-2.5-flash
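Rank | model | overall | comp. | insight | inst. | read. | c.acc. | eff.c. | category
10 | 🚀 gemini-2.5-pro-deepresearch | 48.92 | 48.45 | 43.73 | 49.29 | 49.77 | 75.01 | 165.34 | Deep Research Agent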

📊 Column Descriptions

  • Rank: Model ranking based on overall score
  • model: Model name (🚀 = Deep Research Agent)
  • overall: Overall Score (weighted average of the comp., insight, inst., and read. scores)
  • comp.: Comprehensiveness - How thorough and complete the research is
  • insight: Insight Quality - Depth and value of analysis
  • inst.: Instruction Following - Adherence to user instructions
  • read.: Readability - Clarity and organization of content
  • c.acc.: Citation Accuracy - Percentage of citations that correctly support their statements
  • eff.c.: Effective Citations - Average number of citations per task that support their statements
  • category: Model category
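
The overall column is described above as a weighted average. The sketch below shows that arithmetic on the four quality scores of the gemini-2.5-pro-deepresearch row. It is an illustration under stated assumptions, not the benchmark's scoring code: the function name `weighted_overall` and the equal weights in `ASSUMED_WEIGHTS` are hypothetical, since this page does not list the weights that were actually used.

```python
# Quality scores copied from the gemini-2.5-pro-deepresearch row.
ROW_SCORES = {
    "comp": 48.45,     # Comprehensiveness
    "insight": 43.73,  # Insight Quality
    "inst": 49.29,     # Instruction Following
    "read": 49.77,     # Readability
}

# ASSUMPTION: equal weights, chosen only for illustration.
# The page does not state the real weights.
ASSUMED_WEIGHTS = {"comp": 1.0, "insight": 1.0, "inst": 1.0, "read": 1.0}


def weighted_overall(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Return the weighted average of `scores`, normalised by the weight total."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"no score for: {sorted(missing)}")
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


if __name__ == "__main__":
    print(round(weighted_overall(ROW_SCORES, ASSUMED_WEIGHTS), 2))
```

With equal weights this is just the plain mean of the four scores, so it is not expected to reproduce the published overall value of 48.92. A different weighting would shift the result toward the more heavily weighted dimensions.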