Abstract
Autonomous research systems increasingly make the scientific workflow executable: agents can propose ideas, run code, inspect results, and draft papers. But executable workflows do not by themselves produce research judgment. We analyze where current systems lose trial experience: weak evidence becomes prose, pilot signals become broad claims, memory remains textual, and recurring process failures do not change later behavior. We introduce Sibyl-AutoResearch, a self-evolving AutoResearch framework built around Scientific Trial-and-Error Harnesses. A harness lets agents run bounded trials, preserve positive and negative outcomes, and route lessons into later planning, validation, claim scope, scheduling, critique, writing, and harness repair. We formalize this through two auditable conversion units: trial-to-behavior conversion, which links trial signals to later research actions, and trial-to-harness-behavior conversion, which links recurring process failures to system updates. We implement the framework in SIBYL, a file-backed autonomous research system that exposes the state, roles, memory, gates, and artifact traces needed to inspect these conversion paths. A retrospective audit identifies eight high-confidence conversion events, with a median latency of one iteration and a maximum latency of three iterations. A recovered-failure registry further shows how five naturally occurring failure classes, including duplicate results, stale numbers, and unsupported statistics, were blocked, downgraded, or routed into later repair. These traces do not establish a comparative performance claim; they show that the proposed conversion units are recoverable from realistic autonomous-research workspaces. The SIBYL framework and system are available at https://github.com/Sibyl-Research-Team/AutoResearch-SibylSystem.
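As a rough illustration of what an auditable conversion unit might look like as a record, the sketch below models a conversion event and computes latency in iterations. The names `ConversionEvent` and `summarize_latency`, the field layout, and the toy events are all assumptions made for illustration; they are not taken from the SIBYL codebase and do not reproduce the audited events.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class ConversionEvent:
    """Hypothetical record linking a trial signal to a later action."""
    kind: str              # "trial-to-behavior" or "trial-to-harness-behavior"
    trial_iteration: int   # iteration in which the trial signal was observed
    action_iteration: int  # iteration in which behavior actually changed
    signal: str            # what the trial showed
    action: str            # what changed as a result

    def __post_init__(self):
        # A conversion cannot precede the trial that motivated it.
        if self.action_iteration < self.trial_iteration:
            raise ValueError("action_iteration must not precede trial_iteration")

    @property
    def latency(self) -> int:
        """Number of iterations between the signal and the changed behavior."""
        return self.action_iteration - self.trial_iteration


def summarize_latency(events):
    """Return (median latency, maximum latency) over a non-empty event list."""
    latencies = [e.latency for e in events]
    if not latencies:
        raise ValueError("no events to summarize")
    return median(latencies), max(latencies)


# Toy events, invented for this sketch only (not data from the audit).
toy_events = [
    ConversionEvent("trial-to-behavior", 1, 2,
                    "pilot signal too weak", "claim scope narrowed"),
    ConversionEvent("trial-to-behavior", 2, 4,
                    "negative result preserved", "plan dropped a variant"),
    ConversionEvent("trial-to-harness-behavior", 1, 5,
                    "duplicate results recurred", "validation gate added"),
]

median_latency, max_latency = summarize_latency(toy_events)
print(median_latency, max_latency)
```

Keeping the trial iteration and the action iteration on the same record is what makes latency checkable after the fact: an auditor can recompute it from the trace rather than trusting a narrative summary.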