Probe-and-Refine Tuning of Repository Guidance for Coding Agents
Asa Shepard, Jeannie Albrecht
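AI summary: Proposes probe-and-refine tuning, which uses synthetic bug-fix probes to iteratively diagnose and patch repository guidance files. On SWE-bench Verified with the Qwen3.5-35B-A3B model it reaches a 33.0% resolve rate, compared with 28.3% for the static knowledge base and 25.5% for the unguided baseline.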
Details
LLM-based coding agents need higher-level operational knowledge about a repository (which files house which subsystems, how to run the test suite, which workflows have historically led to wrong fixes) that does not exist in the code itself. Engineers typically maintain \texttt{AGENTS.md} files to supply this context as instructions for coding agents, but whether these files help is contested: recent studies disagree on whether LLM-generated guidance improves or harms agent performance. In this paper we show that how the guidance is produced is the decisive variable, and introduce \emph{probe-and-refine tuning}: a procedure that uses synthetic bug-fix probes to iteratively diagnose and patch a repository's guidance file through single-shot LLM calls, with no agent loop or tool use during tuning. On SWE-bench Verified across four independent trials with Qwen3.5-35B-A3B at 200 steps, probe-and-refine achieves a mean resolve rate of 33.0\,\%, vs.\ 28.3\,\% for the static knowledge base used to initialize it and 25.5\,\% for an unguided baseline ($p < 0.001$ for both probe-and-refine contrasts). The improvement comes from coverage rather than precision: refined guidance produces evaluable patches for 14.5 percentage points (pp) more instances while per-patch precision remains statistically constant ($\sim$59\,\%, $p = 0.119$), showing that improved guidance helps agents reach the correct file rather than improving the quality of the changes they make. Further, a step-budget experiment shows that guidance is what lets the agent use a larger step budget productively, and a cross-model experiment with NVIDIA-Nemotron-3-Nano-30B-A3B finds that the tuning loop degrades when the model cannot generate sufficiently diagnostic output, though per-patch precision remains constant even then.
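The coverage-versus-precision distinction in the abstract can be made concrete with a small sketch. It assumes that resolve rate factors as coverage (the share of instances with an evaluable patch) times per-patch precision (the share of evaluable patches that resolve the issue); the abstract does not spell out these definitions, so this reading is an assumption. The `RunCounts` type and the counts below are hypothetical toy values, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunCounts:
    """Hypothetical tallies for one evaluation run (not data from the paper)."""
    instances: int  # benchmark instances attempted
    evaluable: int  # instances where the agent produced an evaluable patch
    resolved: int   # evaluable patches that resolved the issue


def coverage(c: RunCounts) -> float:
    # Share of instances for which an evaluable patch exists.
    return c.evaluable / c.instances


def precision(c: RunCounts) -> float:
    # Share of evaluable patches that resolve the issue.
    return c.resolved / c.evaluable if c.evaluable else 0.0


def resolve_rate(c: RunCounts) -> float:
    # Share of all instances resolved; equals coverage * precision.
    return c.resolved / c.instances


# Toy counts chosen so that precision is identical and only coverage differs.
unguided = RunCounts(instances=50, evaluable=20, resolved=12)
guided = RunCounts(instances=50, evaluable=30, resolved=18)

for name, run in (("unguided", unguided), ("guided", guided)):
    print(f"{name}: coverage={coverage(run):.2f} "
          f"precision={precision(run):.2f} resolve_rate={resolve_rate(run):.2f}")
```

In this toy case both runs have the same per-patch precision, so the whole gain in resolve rate comes from the higher coverage, which is the pattern the abstract reports.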