On the Limits of LLM-as-Judge for Scientific Novelty Assessment
Soumitra Sinhahajari, Navonil Majumder, Soujanya Poria
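AI summary: By constructing the RQ-Bench benchmark, this paper finds that LLM judges produce a novelty mirage for model-generated research questions while human experts reach the opposite conclusion, revealing a reliability problem in using LLMs to assess scientific novelty.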
Details
LLMs are increasingly used to generate and judge scientific ideas. This makes novelty evaluation a central problem. Full idea evaluation is difficult because it often requires judging a method, its feasibility, and its empirical promise. We therefore study a cleaner upstream object: the research question (RQ). RQ generation is a prerequisite for scientific ideation, and RQs can be compared against questions pursued in real papers. We introduce RQ-Bench, a benchmark built from recent arXiv papers. For each paper, we reconstruct author-anchored RQs from its cited background, gaps, and contributions. These RQs are not the only valid questions for the same background. They are author-anchored reference points for testing novelty judgments. We evaluate model-generated RQs with standalone LLM judging, comparative LLM judging, and human expert evaluation. LLM judges consistently rate model-generated RQs as highly novel, producing a novelty mirage; in comparative evaluations, this preference becomes even stronger. Domain experts, however, reach the opposite conclusion and prefer the author-anchored reference questions. We further find that many generated RQs are narrow or source-bound, a dimension that LLM judges often miss unless explicitly tested. Overall, the contradictory novelty evaluations between LLM judges and human experts raise a serious concern about the reliability of using LLMs to assess the scientific novelty of research questions.