Traditional statistical representations outperform generative AI in identifying expert peer reviewers
Vicente Amado Olivo, Tereza Jerabkova, Jakub Klencki, John Carpenter, Mario Malički, Ferdinando Patat, Louis-Gregory Strolger, Wolfgang Kerzendorf
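AI summary: By comparing statistical and AI-driven methods, this study finds that traditional statistical representations outperform generative AI at identifying domain experts, and it underscores the importance of fine-grained vocabulary for distinguishing subfield experts.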
Details
The exponential growth of scientific submissions has strained the peer review system. Despite the rapidly expanding global pool of researchers, this unprecedented scale has rendered the previous approach of manual expert identification unfeasible. Therefore, institutions have naturally turned to Large Language Models (LLMs) to automate intricate processes like expert reviewer identification. However, the reliability of these new models in accurately identifying domain experts lacks rigorous evaluation. We conduct a comprehensive empirical evaluation of statistical and AI-driven expertise identification methodologies to benchmark their reliability and limitations. Framing expert identification as an information retrieval problem, we utilize the distributed peer review system of a major international astronomical observatory, where proposal authorship serves as our proxy ground truth for domain expertise. Evaluating six retrieval methodologies utilized across observatories and computer science conferences, we demonstrate that traditional statistical representations outperform generative AI. Specifically, Term Frequency-Inverse Document Frequency successfully identified a labeled expert within the top 25 recommendations 79.5% of the time, compared to 51.5% for GPT-4o mini. Our results highlight that distinguishing subfield expertise requires fine-grained vocabulary, which is obscured by the semantic smoothing in generative methods. By establishing a rigorous evaluation framework for automated peer review, we demonstrate that transparent and reproducible statistical representations still outperform computationally expensive LLMs in specialized scientific tasks.
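The retrieval framing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the reviewer profiles, proposal texts, whitespace tokenizer, smoothed IDF formula, and all function names (`tfidf_vectors`, `rank_reviewers`, `hit_at_k`) are assumptions chosen for illustration. It ranks reviewers by TF-IDF cosine similarity to each proposal and reports how often a labeled expert appears in the top k recommendations, the kind of metric the abstract reports for k = 25.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real pipeline would do more.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumed variant) so ubiquitous terms keep a small weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_reviewers(proposal_vec, reviewer_vecs):
    """Return reviewer names sorted by decreasing similarity to the proposal."""
    return sorted(
        reviewer_vecs,
        key=lambda name: cosine(proposal_vec, reviewer_vecs[name]),
        reverse=True,
    )


def hit_at_k(proposals, reviewers, k):
    """Fraction of proposals with at least one labeled expert in the top k."""
    names = list(reviewers)
    texts = [reviewers[n] for n in names] + [text for text, _ in proposals]
    vectors = tfidf_vectors(texts)
    reviewer_vecs = dict(zip(names, vectors[: len(names)]))
    proposal_vecs = vectors[len(names):]
    hits = 0
    for vec, (_, experts) in zip(proposal_vecs, proposals):
        top_k = rank_reviewers(vec, reviewer_vecs)[:k]
        if experts & set(top_k):
            hits += 1
    return hits / len(proposals)


# Hypothetical toy data: reviewer profiles built from their own past text,
# and proposals labeled with the reviewers treated as ground-truth experts.
reviewers = {
    "reviewer_a": "exoplanet transit photometry atmosphere spectroscopy",
    "reviewer_b": "galaxy cluster weak lensing dark matter halo",
    "reviewer_c": "supernova light curve progenitor radiative transfer",
}
proposals = [
    ("transit spectroscopy of a hot exoplanet atmosphere", {"reviewer_a"}),
    ("weak lensing mass map of a galaxy cluster", {"reviewer_b"}),
]

score = hit_at_k(proposals, reviewers, k=1)
print(f"hit@1 = {score:.2f}")
```

Because the weights attach to exact terms, rare subfield words such as "lensing" or "transit" dominate the ranking in this sketch, which is consistent with the abstract's point that fine-grained vocabulary separates subfield experts.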