DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
DeepResearch Bench comprehensively evaluates the capabilities of Deep Research Agents.
Total models: 21 | Last update: 02 August 2025
RACE judge model: gemini-2.5-pro | Fact-checking model: gemini-2.5-flash
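Rank: 10 | overall: 49.71 | comp.: 49.51 | insight: 49.45 | inst.: 50.12 | read.: 47.22 | c.acc.: 75.01 | eff.c.: 165.34 | category: Deep Research Agent | license_type: Proprietary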

📊 Column Descriptions

  • Rank: Model ranking based on overall score
  • model: Model name (🚀 = Deep Research Agent)
  • overall: Overall Score (weighted average of the four quality metrics: comp., insight, inst., read.)
  • comp.: Comprehensiveness - How thorough and complete the research is
  • insight: Insight Quality - Depth and value of analysis
  • inst.: Instruction Following - Adherence to user instructions
  • read.: Readability - Clarity and organization of content
  • c.acc.: Citation Accuracy - Correctness of references
  • eff.c.: Effective Citations - Relevance and quality of sources
  • category: Model category
  • license_type: The software license type of the model/service
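
The overall column is described as a weighted average, and Rank follows from it. The sketch below shows one way such a score and ranking could be computed. It assumes that only the four quality scores (comp., insight, inst., read.) enter the average and that they are weighted equally; the benchmark's actual weights are not given here. The names `Row`, `WEIGHTS`, `overall`, and `rank` are hypothetical, and the scores in the usage example are made-up illustration values, not leaderboard results.

```python
from dataclasses import dataclass

# Assumption: equal weights over the four quality columns.
# The real benchmark weights are not stated in this page.
WEIGHTS = {"comp": 0.25, "insight": 0.25, "inst": 0.25, "read": 0.25}


@dataclass
class Row:
    """One leaderboard entry (hypothetical structure)."""
    model: str
    comp: float
    insight: float
    inst: float
    read: float


def overall(row, weights=WEIGHTS):
    """Weighted average of the quality columns, normalized by the weight sum."""
    total_weight = sum(weights.values())
    weighted_sum = sum(getattr(row, name) * w for name, w in weights.items())
    return weighted_sum / total_weight


def rank(rows):
    """Return (rank, model, overall) tuples, best overall score first."""
    ordered = sorted(rows, key=overall, reverse=True)
    return [(i + 1, r.model, round(overall(r), 2)) for i, r in enumerate(ordered)]


if __name__ == "__main__":
    # Made-up scores for illustration only.
    rows = [
        Row("agent-a", 40.0, 40.0, 40.0, 40.0),
        Row("agent-b", 50.0, 50.0, 50.0, 50.0),
    ]
    for position, model, score in rank(rows):
        print(position, model, score)
```

Normalizing by the weight sum keeps the result on the same scale as the inputs even if the weights do not add up to 1.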
