DeployBench: Benchmarking LLM Agents for Research Artifact Deployment
Yuanli Wang, Yaoyao Qian, Yue Zhang, Hanhan Zhou, Jindan Huang, Tianfu Fu, Qiuyang Mang, Huanzhi Mao, Wenhao Chai, Wendong Fan, Liqiang Jing
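AI summary: Proposes DeployBench, a benchmark of 51 cross-domain research-artifact deployment tasks that evaluates how LLM agents perform at environment setup, and finds that the main cause of failure is agents stopping too early with insufficient verification.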
LLM agents have made rapid progress on software engineering and ML research tasks, but these advances often assume access to a working execution environment. For research artifacts released alongside published papers, setting up such an environment on a fresh machine remains a major bottleneck. Existing environment-setup benchmarks do not cover the full scope of research artifact deployment, which involves multi-language toolchains, system-level dependencies beyond containers (e.g., GPU/CUDA and kernel configurations), and legacy artifact compatibility. We introduce DeployBench, a multi-domain benchmark of 51 research-artifact deployment tasks spanning AI/ML, computer systems, and scientific computing that covers all of these dimensions. Each task is verified by a hidden pipeline that executes the paper's designated experiment and checks its outputs. Evaluating four state-of-the-art LLMs with OpenHands yields pass rates ranging from 7.8% to 51.0%. Failures are dominated by a completion-judgment problem: 97 of 154 failures are agent-terminated self-stops, in which the agent's pre-finish checks validate a different or weaker target than the paper-specific task requires. DeployBench highlights the gap between current agents and autonomous deployment, and offers a realistic testbed for scientific research agents.
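The reported figures can be cross-checked with a little arithmetic. The sketch below assumes that each of the four models is scored once on all 51 tasks (204 runs in total), which the abstract does not state explicitly; the constant and function names are illustrative only.

```python
# Sanity check of the reported numbers.
# ASSUMPTION (not stated in the abstract): every model is run once on all 51 tasks.
TASKS = 51
MODELS = 4
FAILURES = 154      # total failed runs reported
SELF_STOPS = 97     # failures that were agent-terminated self-stops


def pass_count(rate_percent, tasks=TASKS):
    """Return the whole number of passed tasks closest to a reported percentage."""
    return round(rate_percent / 100 * tasks)


lowest = pass_count(7.8)             # weakest model
highest = pass_count(51.0)           # strongest model
total_runs = TASKS * MODELS          # runs under the assumption above
total_passes = total_runs - FAILURES
self_stop_share = SELF_STOPS / FAILURES

print(f"passes: lowest={lowest}, highest={highest}")
print(f"runs={total_runs}, passes across all models={total_passes}")
print(f"self-stop share of failures={self_stop_share:.1%}")
```

Under that assumption, the extreme pass rates correspond to 4 and 26 solved tasks out of 51, and self-stops account for roughly 63% of all failures.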