Can Deep Research Agents Retrieve and Organize? Evaluating the Synthesis Gap with Expert Taxonomies
Ming Zhang, Jiabao Zhuang, Wenqing Jing, Kexin Tan, Ziyu Kong, Jingyi Deng, Yujiong Shen, Yuhui Wang, Zhenghao Xiang, Qiyuan Peng, Yuhang Zhao, Ning Luo, Renzhe Zheng, Jiahui Lin, Mingqi Wu, Long Ma, Shihan Dou, Maxm Pan, Tao Gui, Qi Zhang, Xuanjing Huang
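AI summary: This paper introduces TaxoBench, a benchmark that evaluates how well Deep Research Agents retrieve and organize papers, and finds bottlenecks in both capability and alignment.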
Deep Research Agents increasingly automate survey generation, yet whether they match human experts at retrieving essential papers and organizing them into expert-like taxonomies remains unclear. Existing benchmarks emphasize writing quality or citation correctness, while standard clustering metrics ignore hierarchical structure. We introduce TaxoBench, a benchmark of 72 highly cited LLM surveys with expert-authored taxonomy trees and 3,815 papers mapped to paper categories. TaxoBench evaluates (1) retrieval via Recall/Precision/F1, and (2) organization at the leaf level (paper-to-category assignment) and at the hierarchy level via two new metrics: Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path). Two modes are supported: Deep Research (topic-only, end-to-end) and Bottom-Up (expert paper set provided, organization-only). To distinguish disagreement with a single expert reference from genuine model failure, we explicitly partition findings into capability-based (reference-free) and alignment-based (reference-dependent) groups. Evaluating 7 Deep Research Agents and 12 frontier LLMs reveals a dual bottleneck. On the capability side, the best agent retrieves only 20.92% of expert-cited papers, and 1,000 model taxonomies show 75.9% sibling overlap, 51.2% MECE violations, and 83.4% structural imbalance, all detectable without any reference. On the alignment side, all 12 LLMs converge to a Sem-Path of 28-29%, well below the 47-58% achieved by three independent human-annotator groups on the same paper sets. Our benchmark is publicly available at https://github.com/KongLongGeFDU/TaxoBench.
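To make the retrieval side of the evaluation concrete, the following is a minimal sketch of set-based Recall/Precision/F1 over paper identifiers. It is not taken from the TaxoBench repository: the function name `retrieval_scores`, the identifier strings, and the assumption that retrieved and expert-cited papers are matched by exact identifier are all illustrative.

```python
def retrieval_scores(retrieved, expert_cited):
    """Return (precision, recall, f1) for two collections of paper IDs.

    Hypothetical helper: assumes papers are matched by exact identifier,
    which may differ from how the benchmark resolves paper matches.
    """
    retrieved = set(retrieved)
    expert_cited = set(expert_cited)
    hits = len(retrieved & expert_cited)  # papers found in both sets

    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(expert_cited) if expert_cited else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative IDs only: 2 of the 4 retrieved papers are among 5 expert-cited ones.
agent_papers = ["p1", "p2", "p3", "p4"]
survey_papers = ["p1", "p2", "p5", "p6", "p7"]
precision, recall, f1 = retrieval_scores(agent_papers, survey_papers)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under this reading, the reported 20.92% figure corresponds to the recall term: the share of expert-cited papers that the agent's retrieved set contains.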