DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
This benchmark comprehensively evaluates the capabilities of Deep Research Agents.
Total models: 23 | Last Update: 07 November 2025
RACE judge model: gemini-2.5-pro | Fact-checking model: gemini-2.5-flash
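Rank: 1 🥇 | overall: 49.71 | comp.: 50.06 | insight: 50.76 | inst.: 51.31 | read.: 49.72 | c.acc.: 32.94 | eff.c.: 165.34 | category: Deep Research Agent | license_type: Proprietary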

📊 Column Descriptions

  • Rank: Model ranking based on overall score
  • model: Model name (🚀 = Deep Research Agent)
  • overall: Overall Score (weighted average of all metrics)
  • comp.: Comprehensiveness - How thorough and complete the research is
  • insight: Insight Quality - Depth and value of analysis
  • inst.: Instruction Following - Adherence to user instructions
  • read.: Readability - Clarity and organization of content
  • c.acc.: Citation Accuracy - Correctness of references
  • eff.c.: Effective Citations - Relevance and quality of sources
  • category: Model category
  • license_type: The software license type of the model/service
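
The "weighted average" behind the overall score can be illustrated with a minimal sketch. It assumes the average is taken over the four quality dimensions (comp., insight, inst., read.) and uses equal weights purely as a placeholder; the benchmark's actual weights are not given here, and the function name `weighted_overall` is hypothetical.

```python
def weighted_overall(scores, weights):
    """Return the weighted average of `scores`, with weights keyed by metric name."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Quality scores listed for the top-ranked entry.
scores = {"comp": 50.06, "insight": 50.76, "inst": 51.31, "read": 49.72}

# Hypothetical equal weights (an assumption, not the benchmark's real weighting).
weights = {name: 0.25 for name in scores}

overall = weighted_overall(scores, weights)
print(f"overall (equal weights): {overall:.2f}")
```

With equal weights the result is about 50.46 rather than the listed 49.71, which suggests the leaderboard's weights or aggregation differ from this placeholder.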
