Code LLMs / AI Programming - arXivDaily Topic

2606.20512 2026-06-19 cs.SE cs.LG New submission 90%

Probe-and-Refine Tuning of Repository Guidance for Coding Agents

Asa Shepard, Jeannie Albrecht

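Affiliation: Williams College

Topic match (software agents): proposes probe-and-refine tuning of repository guidance for coding agents

AI summary: Proposes probe-and-refine tuning, which uses synthetic bug-fix probes to iteratively diagnose and patch a repository's guidance file. On SWE-bench Verified with Qwen3.5-35B-A3B it reaches a 33.0% resolve rate, ahead of the static knowledge base (28.3%) and the unguided baseline (25.5%).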

Abstract

LLM-based coding agents need higher-level operational knowledge about a repository (which files house which subsystems, how to run the test suite, which workflows have historically led to wrong fixes) that does not exist in the code itself. Engineers typically maintain AGENTS.md files to supply this context as instructions for coding agents, but whether they help is contested: recent studies disagree on whether LLM-generated guidance improves or harms agent performance. In this paper we show that how the guidance is produced is the decisive variable, and introduce probe-and-refine tuning: a procedure that uses synthetic bug-fix probes to iteratively diagnose and patch a repository's guidance file through single-shot LLM calls, with no agent loop or tool use during tuning. On SWE-bench Verified across four independent trials with Qwen3.5-35B-A3B at 200 steps, probe-and-refine achieves 33.0% mean resolve rate vs. 28.3% for the static knowledge base used to initialize it and 25.5% for an unguided baseline (p < 0.001 for both probe-and-refine contrasts). The improvement comes from coverage rather than precision: refined guidance produces evaluable patches for 14.5 percentage points (pp) more instances while per-patch precision remains statistically constant (~59%, p = 0.119), showing that improved guidance helps agents reach the correct file rather than improving the quality of the changes they make. Further, a step-budget experiment shows that guidance is what lets the agent use a larger step budget productively, and a cross-model experiment with NVIDIA-Nemotron-3-Nano-30B-A3B finds that the tuning loop degrades when the model cannot generate sufficiently diagnostic output, though per-patch precision remains constant even then.
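The coverage-versus-precision reading in the abstract rests on a simple decomposition: the resolve rate is the share of instances that yield an evaluable patch, multiplied by the share of those patches that actually resolve the issue. The sketch below illustrates that arithmetic. The function name and all counts are hypothetical, chosen only to show precision staying flat while coverage grows; they are not figures from the paper.

```python
def resolve_rate(n_instances: int, n_patches: int, n_resolved: int) -> dict:
    """Split a resolve rate into coverage and per-patch precision."""
    coverage = n_patches / n_instances   # instances that produced an evaluable patch
    precision = n_resolved / n_patches   # evaluable patches that resolve the issue
    return {
        "coverage": coverage,
        "precision": precision,
        "resolve": coverage * precision,  # equals n_resolved / n_instances
    }


# Hypothetical counts (assumption): precision is 60% in both runs,
# only the number of evaluable patches changes.
baseline = resolve_rate(n_instances=500, n_patches=200, n_resolved=120)
refined = resolve_rate(n_instances=500, n_patches=275, n_resolved=165)
```

In this made-up example coverage rises from 40% to 55% at an unchanged 60% precision, so the resolve rate moves from roughly 24% to roughly 33% without any single patch getting better.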


2606.20243 2026-06-19 cs.SE cs.MA New submission 90%

Phoenix: Safe GitHub Issue Resolution via Multi-Agent LLMs

Kipngeno Koech, Muhammad Adam, Baimam Boukar Jean Jacques, Joao Barros

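Topic match (software agents): multi-agent LLM system for resolving GitHub issues

AI summary: Proposes Phoenix, a multi-agent LLM system with six specialized agents and seven layered safety controls. It resolves 75% of a SWE-bench Lite slice and reaches 100% correctness preservation in a pilot on real issues.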

Abstract

We present Phoenix, a multi-agent LLM system that resolves GitHub issues from triage through pull-request creation, combining seven layered safety controls with a baseline-aware test evaluation strategy. Phoenix decomposes the work across six specialized agents: planner, reproducer, coder, tester, failure analyst, and pull request (PR) agent, all coordinated by a label-based GitHub webhook state machine. Every change is checked against a baseline test run before a pull request is opened. On a 24-instance slice of SWE-bench Lite, run on the production webhook path, Phoenix oracle-resolves 75% of instances with no pass-to-pass regressions on successful runs; this curated slice is not directly comparable to full-split leaderboard results, and we discuss the limits of the comparison. A complementary pilot on 42 real issues across 14 repositories yields 100% correctness preservation (CP; mean 122s on the hard tier). Manual inspection shows that about half of the resulting pull requests are well-targeted fixes. The other half place code at incorrect paths, a planner localization limitation we are addressing with retrieval. We also report the deployment failure modes (WAF filtering, token expiry, permission boundaries, flaky CI) that motivated each safety mechanism.
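The idea of checking every change against a baseline test run can be sketched as a comparison of two test-result maps. This is an illustrative reading of "baseline-aware" evaluation, not Phoenix's implementation: the function names, the result format (test name mapped to pass/fail), and the gating rule are all assumptions.

```python
def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Tests that passed at baseline but fail (or are missing) after the change."""
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    )


def safe_to_open_pr(
    baseline: dict[str, bool],
    candidate: dict[str, bool],
    target_tests: list[str],
) -> bool:
    """Hypothetical gate: target tests pass and no pass-to-pass test broke."""
    targets_pass = all(candidate.get(test, False) for test in target_tests)
    return targets_pass and not regressions(baseline, candidate)
```

Tests that were already failing before the change are ignored by `regressions`, so a pre-existing broken or flaky test does not block a fix; only pass-to-pass breakage does.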


2606.19380 2026-06-19 cs.SE cs.LG New submission 90%

AgentArmor: A Framework, Evaluation, & Mitigation of Coding Agent Failures

Kenneth Ge, Andre Assis

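Affiliations: Anthropic Fellows Program; Constellation

Topic match (software agents): studies the failure modes of coding agents and proposes a mitigation framework.

AI summary: Proposes AgentArmor, which uses an extended system prompt, a command classifier, a "3 strikes" policy, and related mechanisms to mitigate coding-agent failures caused by underspecification, capability errors, and harness errors, significantly improving safety.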

Abstract

Software engineering and deployment are increasingly being delegated to AI coding agents. The scale of their adoption is surfacing rare, but highly destructive, failure modes. In this paper, we study these failure modes as stemming from three distinct mechanisms: underspecification, where default model behavior is unsafe; capability errors, where the safe action is available but the model does not adhere to it due to bias or capability limitations; and agent harness errors, where the model fails to execute the safe action through the harness. We evaluate these across 8 different evaluations, each inspired by real-life deployment failures, totaling 20 coding environments and 59 synthetic transcript templates. Based on this evaluation, we propose AgentArmor, an agent harness modification, to mitigate these errors. By adding an extended system prompt, a separate command classifier, a "3 strikes" policy, deterministic guardrails, and tools for the agent to edit its own context, we show that AgentArmor is safer across a statistically significant number of samples. Thus, we suggest concrete mitigations for current coding agents and a design philosophy for future agent harness features.
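The abstract names a command classifier and a "3 strikes" policy without describing them, so the following is only one plausible reading: a deterministic pattern check stands in for the classifier, and the harness halts the session after three blocked commands. The patterns, class name, and return values are all hypothetical and are not taken from AgentArmor.

```python
import re
from dataclasses import dataclass

# Assumption: a tiny regex list stands in for the paper's command classifier.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\brm\s+-rf\b"),
    re.compile(r"\bgit\s+push\s+.*--force\b"),
    re.compile(r"\bdrop\s+table\b"),
]


def is_destructive(command: str) -> bool:
    """Return True if the command matches any known-destructive pattern."""
    lowered = command.lower()
    return any(pattern.search(lowered) for pattern in DESTRUCTIVE_PATTERNS)


@dataclass
class StrikeGuard:
    """Blocks destructive commands and halts after max_strikes violations."""
    max_strikes: int = 3
    strikes: int = 0

    def check(self, command: str) -> str:
        if self.strikes >= self.max_strikes:
            return "halted"  # session already stopped; nothing runs
        if is_destructive(command):
            self.strikes += 1
            return "halted" if self.strikes >= self.max_strikes else "blocked"
        return "allowed"
```

A harness would call `StrikeGuard.check` before executing each shell command and refuse to run anything once it returns `"halted"`.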


2606.14066 2026-06-19 cs.SE New submission 90%

FastContext: Training Efficient Repository Explorer for Coding Agents

Shaoqiu Zhang, Maoquan Wang, Yuling Shi, Yuhang Wang, Xiaodong Gu, Yongqiang Yao, Tori Gong, Sheng Chen, Rao Fu, Anisha Agarwal, Spandan Grag, Gabriel Ryan, Colin Merkel, Yufan Huang, Shengyu Fu

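Topic match (software agents): repository explorer for coding agents

AI summary: Proposes FastContext, a dedicated exploration subagent that separates repository exploration from problem solving through parallel tool calls and focused context generation. On SWE-bench-style tasks it raises resolution rates by up to 5.5% and cuts coding-agent token consumption by up to 60%.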

Comments 34 pages, 7 figures

Abstract

Large Language Model (LLM) coding agents have achieved strong results on software engineering tasks, yet repository exploration remains a major bottleneck: locating relevant code consumes substantial token budget and pollutes the agent's context with irrelevant snippets. In most agents, the same model explores the repository and solves the task, leaving exploratory reads and searches in the solver's history. We present FastContext, a dedicated exploration subagent that separates repository exploration from solving. Invoked on demand, FastContext issues parallel tool calls and returns concise file paths and line ranges as focused context. FastContext is powered by specialized exploration models spanning 4B to 30B parameters. We bootstrap them from strong reference-model trajectories and refine them with task-grounded rewards for broad first-turn search, multi-turn evidence gathering, and precise citation generation. Across SWE-bench Multilingual, SWE-bench Pro, and SWE-QA, integrating FastContext into Mini-SWE-Agent improves end-to-end resolution rates up to 5.5% while reducing coding-agent token consumption up to 60%, with marginal overhead. These results show that repository exploration can be separated from solving and handled effectively by specialized models. Code and data: https://github.com/microsoft/fastcontext
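The exploration pattern described above (several searches issued in parallel, with only file paths and line ranges handed back to the solver) can be sketched with a thread pool over an in-memory repository. This is a toy illustration, not FastContext's code: the `grep` and `explore` helpers, the substring matching, and the choice to merge hits into a single range per file are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def grep(files: dict[str, str], needle: str) -> list[tuple[str, int]]:
    """Return (path, 1-based line number) for every line containing needle."""
    return [
        (path, lineno)
        for path, text in files.items()
        for lineno, line in enumerate(text.splitlines(), start=1)
        if needle in line
    ]


def explore(files: dict[str, str], queries: list[str]) -> dict[str, tuple[int, int]]:
    """Run all queries in parallel and return one (first, last) line range per file."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda query: grep(files, query), queries))

    ranges: dict[str, tuple[int, int]] = {}
    for hits in results:
        for path, lineno in hits:
            low, high = ranges.get(path, (lineno, lineno))
            ranges[path] = (min(low, lineno), max(high, lineno))
    return ranges
```

The solver only ever sees the compact `ranges` mapping, which is the point of the separation: the raw search output stays out of its context.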


2606.19616 2026-06-19 cs.SE cs.AI cs.MA New submission 80%

Before the Pull Request: Mining Multi-Agent Coordination

Dipankar Sarkar

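Affiliation: Arizona State University

Topic match (software agents): proposes grite, a coordination substrate that reduces conflicts among multiple coding agents.

AI summary: Targets the coordination gap among autonomous coding agents around pull requests. Proposes grite, a git-based coordination substrate whose event log reduces duplicate and conflicting work, raises throughput, and makes several failure modes automatically recoverable from the log.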

Comments 9 pages, 2 tables. LNCS format. Code, dataset, and mining toolkit: https://github.com/neul-labs/grite

Abstract

Autonomous coding agents now open millions of pull requests, yet large-scale studies find their PRs are produced faster but accepted less often - a coordination and trust gap that pull-request-level telemetry cannot explain. We argue the missing signal lives before the PR, in how concurrent agents claim, divide, and collide over shared work. We study this process through grite, our open-source coordination substrate that needs no central server and stores its records inside git itself, so its append-only, signed event log captures the coordination process directly. We show that (i) this shared substrate reduces duplicate and conflicting work at bounded overhead - the share of work that merely re-does a teammate's task falls from 78% to 0% while useful throughput more than triples; (ii) every agent's copy of the log converges to the same state with no write silently dropped, where a file-based tracker loses concurrent writes; and (iii) the log is a mineable artefact from which concrete failure modes - conflicting edits, lock starvation, redundant rediscovery, race-to-close - are automatically recoverable with provenance, several invisible in pull-request history. We release the dataset, harness, and mining toolkit.
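Two properties claimed above, replicas converging to the same state and duplicate work being avoided through claims, can be illustrated with an append-only event set merged by union. This is a generic sketch and not grite's data model: the `Event` fields, the union merge, and the earliest-claim-wins rule are assumptions, and signing and git storage are omitted entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Event:
    """One immutable log entry; field order defines the deterministic sort order."""
    timestamp: int
    agent: str
    kind: str  # "claim" or "close" (hypothetical event kinds)
    task: str


def merge(*logs: frozenset[Event]) -> frozenset[Event]:
    """Union of append-only logs: commutative, so merge order does not matter."""
    return frozenset().union(*logs)


def owners(log: frozenset[Event]) -> dict[str, str]:
    """Assign each task to the agent with the earliest claim in sorted order."""
    result: dict[str, str] = {}
    for event in sorted(log):
        if event.kind == "claim":
            result.setdefault(event.task, event.agent)
    return result
```

Because every replica applies the same deterministic rule to the same merged set, two agents that both claimed a task agree on who owns it once their logs have been exchanged, and the later claimant can pick up other work.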


2606.20487 2026-06-19 cs.CL New submission 70%

Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems

Shu Yao, Yuhua Luo, Qian Long, Jingru Fan, Zhuoyuan Yu, Yuheng Wang, Lin Wu, Yufan Dang, Huatao Li, Chen Qian

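Affiliations: School of Artificial Intelligence, Shanghai Jiao Tong University; Shanghai Innovation Institute; Southeast University; Tsinghua University

Topic match (software agents): covers API-CLI-GUI execution and failure recovery

AI summary: Proposes H-RePlan, a hierarchical replanning framework that unifies API-CLI-GUI execution and uses a cross-layer failure abstraction to separate device-local strategy recovery from global replanning. On the HeraBench benchmark it substantially improves cross-device task completion and instruction adherence.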

Abstract

Real-world computer-use tasks often span multiple applications and devices, requiring agents to coordinate heterogeneous environments under dynamic runtime failures. Existing multi-device agent systems support task decomposition and cross-device assignment, but recovery remains largely coarse-grained: when execution fails, they typically retry the same strategy, reassign the subtask, or revise the global plan, without systematically modeling the device-local strategy space. This limits their ability to distinguish failures that can be repaired within the current device from those that require cross-device replanning. We propose H-RePlan, a hierarchical replanning framework for multi-device agents with unified API-CLI-GUI execution. H-RePlan equips each device with interchangeable execution strategies and separates device-local strategy recovery from orchestrator-level global replanning through a compact cross-layer failure abstraction. To evaluate this capability, we introduce HeraBench, a fault-injected benchmark that constructs cross-device workflows over Linux and Android devices and injects strategy- and device-level failures. Experiments show that H-RePlan substantially outperforms single-strategy and coarse-grained multi-device baselines, achieving higher completion, instruction adherence, and perfect-pass rates while reducing the token cost required for reliable end-to-end success. These results demonstrate that scope-aware hierarchical recovery is essential for robust multi-device agent execution.
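The two-level recovery described above can be sketched as a nested loop: exhaust the interchangeable strategies on the current device first, and escalate to the orchestrator only when all of them fail. This is a simplified illustration, not H-RePlan itself: the function names are hypothetical, strategies are plain callables, and "global replanning" is reduced to trying the next device.

```python
from typing import Callable, Optional

Strategy = Callable[[str], bool]  # returns True when the subtask succeeded


def run_on_device(subtask: str, strategies: dict[str, Strategy]) -> Optional[str]:
    """Device-local recovery: try each strategy (e.g. API, CLI, GUI) in order."""
    for name, strategy in strategies.items():
        try:
            if strategy(subtask):
                return name
        except Exception:
            continue  # treat a crash as a strategy-level failure and move on
    return None  # every local strategy failed: a device-level failure


def run_with_recovery(
    subtask: str, devices: dict[str, dict[str, Strategy]]
) -> Optional[tuple[str, str]]:
    """Escalate to another device only after local recovery is exhausted."""
    for device, strategies in devices.items():
        used = run_on_device(subtask, strategies)
        if used is not None:
            return device, used
    return None
```

A failed API call on a Linux device is first retried through that device's CLI strategy; the subtask moves to another device only if no local strategy works, which keeps the global plan untouched for failures that are repairable in place.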
