ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence
Rui Meng, Bhavana Dalvi Mishra, Jiefeng Chen, Chun-Liang Li, Palash Goyal, Mihir Parmar, Yiwen Song, Yale Song, Rajarishi Sinha, Parthasarathy Ranganathan, Burak Gokturk, Jinsung Yoon, Tomas Pfister
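AI summary: Proposes the Chain-of-Evidence framework and the autonomous research system ScientistOne, which address verifiability failures through traceability and match or exceed human expert performance on multiple tasks.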
Comments: Project website: https://scientist-one.github.io/
Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks -- score verification, specification violation, reference verification, and method-code alignment -- apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.
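To make the traceability requirement concrete, the following is a minimal sketch of how a claim-to-evidence link could be represented and audited. It is not the ScientistOne or CoE Audit implementation, which the abstract does not describe at this level. The `Claim` class, the `audit` function, the claim kinds, and the evidence IDs are all hypothetical names chosen for illustration. The sketch only checks that each claim's evidence pointer resolves; a real audit would presumably also verify the evidence itself, for example by re-running a score or looking up a reference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    """A statement in a manuscript plus a pointer to its supporting evidence."""
    text: str
    kind: str                    # hypothetical labels: "reference", "score", "method"
    evidence_id: Optional[str]   # key into the evidence store; None if never linked


def audit(claims, evidence_store):
    """Return per-kind (passed, total) counts of claims whose evidence resolves."""
    counts = {}
    for claim in claims:
        passed, total = counts.get(claim.kind, (0, 0))
        traceable = (
            claim.evidence_id is not None
            and claim.evidence_id in evidence_store
        )
        counts[claim.kind] = (passed + int(traceable), total + 1)
    return counts


# Toy evidence store: hypothetical IDs mapped to the artifacts they point at.
evidence_store = {
    "bib:0001": "resolved bibliographic record",
    "run:0001": "logged evaluation output",
}

claims = [
    Claim("Prior work studies X.", "reference", "bib:0001"),
    Claim("Prior work studies Y.", "reference", "bib:9999"),  # no such record
    Claim("Our method scores S on the test split.", "score", "run:0001"),
    Claim("We apply augmentation Z.", "method", None),        # never linked
]

report = audit(claims, evidence_store)
for kind, (passed, total) in report.items():
    print(f"{kind}: {passed}/{total} traceable")
```

Reporting each check as a passed/total pair mirrors the way the abstract states its results (for example 0/337 or 14/15), which keeps the denominator visible instead of hiding it behind a percentage.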