arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2606.20512 2026-06-19 cs.SE cs.LG 新提交

Probe-and-Refine Tuning of Repository Guidance for Coding Agents

代码代理的仓库指导的探测与精炼调优

Asa Shepard, Jeannie Albrecht

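AI summary: Proposes probe-and-refine tuning, which uses synthetic bug-fix probes to iteratively diagnose and patch a repository's guidance file. On SWE-bench Verified with Qwen3.5-35B-A3B it reaches a 33.0% resolve rate, ahead of the static knowledge base (28.3%) and the unguided baseline (25.5%).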

详情
AI中文摘要

基于LLM的代码代理需要关于仓库的更高级操作知识(哪些文件包含哪些子系统、如何运行测试套件、哪些工作流历史上导致了错误的修复),这些知识并不存在于代码本身。工程师通常维护AGENTS.md文件,将这些上下文作为指令提供给代码代理,但它们是否有帮助存在争议:最近的研究对LLM生成的指导是改善还是损害代理性能存在分歧。在本文中,我们展示了指导的产生方式才是决定性变量,并引入了探测与精炼调优:一种通过合成bug修复探测来迭代诊断和修补仓库指导文件的过程,使用单次LLM调用,在调优期间没有代理循环或工具使用。在SWE-bench Verified上,使用Qwen3.5-35B-A3B进行200步的四个独立试验中,探测与精炼实现了33.0%的平均解决率,而用于初始化的静态知识库为28.3%,无指导基线为25.5%(探测与精炼的两项对比均为p < 0.001)。改进来自覆盖率而非精确度:精炼后的指导使生成可评估补丁的实例多出14.5个百分点(pp),而每个补丁的精确度在统计上保持不变(约59%,p = 0.119),表明改进的指导帮助代理到达正确的文件,而不是提高它们所做更改的质量。此外,一个步骤预算实验表明,指导让代理能够更有效地利用更大的步骤预算,而一个跨模型实验(使用NVIDIA-Nemotron-3-Nano-30B-A3B)发现,当模型无法生成足够诊断性的输出时,调优循环会退化,尽管即使在这种情况下每个补丁的精确度仍然保持不变。

英文摘要

LLM-based coding agents need higher-level operational knowledge about a repository (which files house which subsystems, how to run the test suite, which workflows have historically led to wrong fixes) that does not exist in the code itself. Engineers typically maintain AGENTS.md files to supply this context as instructions for coding agents, but whether they help is contested: recent studies disagree on whether LLM-generated guidance improves or harms agent performance. In this paper we show that how the guidance is produced is the decisive variable, and introduce probe-and-refine tuning: a procedure that uses synthetic bug-fix probes to iteratively diagnose and patch a repository's guidance file through single-shot LLM calls, with no agent loop or tool use during tuning. On SWE-bench Verified across four independent trials with Qwen3.5-35B-A3B at 200 steps, probe-and-refine achieves a 33.0% mean resolve rate vs. 28.3% for the static knowledge base used to initialize it and 25.5% for an unguided baseline (p < 0.001 for both probe-and-refine contrasts). The improvement comes from coverage rather than precision: refined guidance produces evaluable patches for 14.5 percentage points (pp) more instances while per-patch precision remains statistically constant (~59%, p = 0.119), showing that improved guidance helps agents reach the correct file rather than improving the quality of the changes they make. Further, a step-budget experiment shows that guidance is what lets the agent use a larger step budget productively, and a cross-model experiment with NVIDIA-Nemotron-3-Nano-30B-A3B finds that the tuning loop degrades when the model cannot generate sufficiently diagnostic output, though per-patch precision remains constant even then.

2606.20373 2026-06-19 cs.SE cs.AI 新提交

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

AutoPass:基于证据的LLM智能体用于编译器性能调优

Zepeng Li, Jie Ren, Zhanyong Tang, Jie Zheng, Zheng Wang

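AI summary: Proposes AutoPass, a multi-agent framework that queries compiler-internal state and the intermediate representation and uses runtime feedback to iteratively refine compiler options. It improves performance without training, achieving speedups of 1.043x on x86-64 and 1.117x on ARM64.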

详情
AI中文摘要

大型语言模型(LLM)在代码编译任务中展现出潜力,但由于复杂的微架构效应和噪声运行时测量,将其应用于运行时性能调优较为困难。我们提出AutoPass,一个用于编译器性能调优的多智能体框架,它利用编译器和运行时证据来指导LLM生成的优化决策。与先前的自动调优方案将编译器视为黑盒不同,AutoPass向LLM开放编译器,使其能够查询编译器内部的优化状态并分析中间表示以编排编译器选项。搜索过程利用测量的运行时反馈迭代地优化配置,以诊断性能回退并指导延迟改进的编辑。AutoPass在仅推理、无需训练的环境下运行,无需离线训练或任务特定的微调,因此可轻松应用于新的基准测试和平台。我们在LLVM编译器上实现AutoPass,并在服务器级x86-64和嵌入式ARM64系统上进行评估。AutoPass优于专家调优的启发式方法和经典自动调优方法,在x86-64和ARM64上相对于LLVM -O3分别实现了1.043倍和1.117倍的几何平均加速。

英文摘要

Large Language Models (LLMs) show promise for code compilation tasks, but applying them to runtime performance tuning is difficult due to complex microarchitectural effects and noisy runtime measurements. We present AutoPass, a multi-agent framework for compiler performance tuning that uses compiler and runtime evidence to guide LLM-generated optimization decisions. Rather than treating the compiler as a black box like prior auto-tuning schemes, AutoPass opens up the compiler to the LLM, enabling it to query compiler-internal optimization states and analyze the intermediate representation to orchestrate compiler options. The search process iteratively refines optimization configurations using measured runtime feedback to diagnose regressions and guide latency-improving edits. AutoPass operates in an inference-only, training-free setting and requires no offline training or task-specific fine-tuning, making it readily applicable to new benchmarks and platforms. We implement AutoPass on the LLVM compiler and evaluate it on server-grade x86-64 and embedded ARM64 systems. AutoPass outperforms expert-tuned heuristics and classical autotuning methods, achieving geometric-mean speedups of 1.043x and 1.117x over LLVM -O3 on x86-64 and ARM64, respectively.
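The abstract reports geometric-mean speedups over LLVM -O3. As a minimal sketch of how such a figure is usually computed (the runtimes below are made-up illustrative values, not data from the paper, and the function name is hypothetical):

```python
import math

def geometric_mean_speedup(baseline_times, tuned_times):
    """Geometric mean of per-benchmark speedups (baseline time / tuned time)."""
    if len(baseline_times) != len(tuned_times) or not baseline_times:
        raise ValueError("need two equal-length, non-empty lists")
    # Averaging in log space is equivalent to taking the n-th root of the product.
    log_sum = sum(math.log(b / t) for b, t in zip(baseline_times, tuned_times))
    return math.exp(log_sum / len(baseline_times))

# Hypothetical runtimes in seconds: per-benchmark speedups are 2.0, 1.0 and 0.5.
o3_times = [2.0, 4.0, 1.0]
tuned_times = [1.0, 4.0, 2.0]
speedup = geometric_mean_speedup(o3_times, tuned_times)
print(f"geometric-mean speedup: {speedup:.3f}x")
```

The geometric mean is the conventional choice for ratios because a 2x speedup and a 2x slowdown cancel out, which an arithmetic mean would not show.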

2606.20324 2026-06-19 cs.SE cs.LG 新提交

A Model-Driven Approach for Developing Families of Reinforcement Learning Environments

一种模型驱动的方法用于开发强化学习环境族

Xiaoran Liu, Istvan David

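AI summary: Proposes a model-driven approach that uses a hybrid genetic algorithm and model transformations to automatically generate families of reinforcement learning training environments, addressing the time-consuming and error-prone manual development of such families, and validates it in a wildfire mitigation scenario.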

详情
AI中文摘要

虚拟训练环境是软件密集型系统,强化学习(RL)智能体在其中学习、适应并展示有意义的行为。虚拟训练环境为在现实环境中训练智能体提供了一种安全且成本效益高的替代方案。然而,为了收敛,大多数现实的RL问题需要在多个相似但略有不同的环境中进行训练——即环境变体族。环境族的典型开发过程是一项劳动密集型且容易出错的手动工作,难以扩展。为了缓解这些问题,本文提出了一种模型驱动的方法来开发RL训练环境族。为了获得环境族,我们开发了一种方法和原型工具。在我们的方法中,一种混合遗传算法——基于种群的全局搜索和启发式局部搜索的结合——生成环境族。变异和约束被表达为模型转换,并通过最先进的模型转换引擎操作化为搜索过程。我们在野火缓解场景和课程学习(一种依赖于环境族的特定学习范式)中展示了我们方法的有效性。

英文摘要

Virtual training environments are software-intensive systems in which reinforcement learning (RL) agents learn, adapt, and demonstrate meaningful behavior. Virtual training environments offer a safe and cost-efficient alternative to training agents in real-world settings. However, to converge, most realistic RL problems require training in multiple, mostly similar but slightly different environments - i.e., families of environment variants. The typical development process of environment families is a labor-intensive and error-prone manual endeavor that does not scale well. To alleviate these issues, in this paper, we propose a model-driven approach for developing families of RL training environments. To obtain the family of environments, we develop an approach and prototype tool. In our approach, a hybrid genetic algorithm - a combination of population-based global search and heuristic local search - generates environment families. Mutations and constraints are expressed as model transformations and are operationalized into a search process by a state-of-the-art model transformation engine. We demonstrate the soundness of our approach in a wildfire mitigation scenario and curriculum learning - a particular learning paradigm that relies on environment families.
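To make the idea of a hybrid genetic algorithm concrete, here is a toy sketch that combines population-based search (crossover, mutation, selection) with a greedy local step. Everything in it is an assumption for illustration: the `wind` and `dryness` parameters, the `difficulty` measure, and the plain Python functions standing in for the model transformations the paper uses.

```python
import random

KEYS = ("wind", "dryness")  # hypothetical environment parameters

def difficulty(env):
    """Hypothetical difficulty measure of an environment variant."""
    return env["wind"] + 2 * env["dryness"]

def satisfies_constraints(env):
    """Stand-in for a constraint: every parameter stays within 0..10."""
    return all(0 <= env[k] <= 10 for k in KEYS)

def mutate(env, rng):
    """Stand-in for a mutation: nudge one parameter, rejecting invalid results."""
    child = dict(env)
    child[rng.choice(KEYS)] += rng.choice((-1, 1))
    return child if satisfies_constraints(child) else env

def crossover(a, b):
    """Combine one parameter from each parent."""
    return {"wind": a["wind"], "dryness": b["dryness"]}

def local_search(env, target):
    """Heuristic local step: move to the valid neighbour closest to the target."""
    neighbours = [dict(env, **{k: env[k] + d}) for k in KEYS for d in (-1, 1)]
    valid = [n for n in neighbours if satisfies_constraints(n)] + [env]
    return min(valid, key=lambda e: abs(difficulty(e) - target))

def evolve(target, generations=30, size=8, seed=0):
    """Search for one environment whose difficulty is close to the target."""
    rng = random.Random(seed)
    population = [{k: rng.randint(0, 10) for k in KEYS} for _ in range(size)]
    for _ in range(generations):
        children = []
        for _ in range(size):
            child = crossover(rng.choice(population), rng.choice(population))
            children.append(local_search(mutate(child, rng), target))
        # Elitist selection: keep the candidates closest to the target.
        population = sorted(population + children,
                            key=lambda e: abs(difficulty(e) - target))[:size]
    return population[0]

# A small family: one variant per target difficulty level.
family = [evolve(target) for target in (5, 15, 25)]
print(family)
```

A curriculum-learning setup could then order such a family by increasing difficulty.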

2606.20295 2026-06-19 cs.SE cs.CL 新提交

Token-Operations-Oriented Inference Optimization Techniques for Large Models

面向令牌操作的大模型推理优化技术

Shiguo Lian, Kai Wang, Zhaoxiang Liu, Wen Liu, Minjie Hua, Yutong Liu, Jiangze Yan, Xin Wang, Cong Wang, Yilin Zhang, Yi Shen, Jieyun Huang, Fang Zhao, Huanlin Gao, Ping Chen, Xinyu Yang, Kaikai Zhao, Yao Zhao, Xinggang Wang, Huishuai Zhang, Dongyan Zhao, Junping Du, Tao Chen, Xiang Gao, Qinghuai Ma

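AI summary: Proposes a four-layer technical architecture (multi-model fusion, model optimization, compute-model fusion, and compute-network-model fusion), systematically surveys the key technologies and industry status at each layer, and aims to reduce token cost, improve service efficiency, and ensure supply stability, driving large model services from callable to operable.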

Comments 62 pages, 36 figures

详情
AI中文摘要

大模型推理优化是支撑大模型服务可扩展、低成本、高稳定运行的关键基础。本文以面向令牌的推理优化技术为核心,首次提出由多模型融合、模型优化、计算-模型融合、计算-网络-模型融合组成的四层技术架构,系统梳理了这四层的关键技术和产业现状,并分析了相关技术在实际业务场景中的应用价值。本文为降低令牌生产成本、提高令牌服务效率、保障令牌供应稳定性、推动大模型服务从可调用到可运营的转变提供了实用的技术路径。

英文摘要

Large model inference optimization serves as a key foundation for supporting the scalable, low-cost, and highly stable operation of large model services. Centered on token-oriented inference optimization technology, this paper proposes for the first time a four-layer technical architecture consisting of Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion. It systematically reviews the key technologies and current industry status across these four levels and analyzes the application value of related technologies in real-world business scenarios. This paper provides a practical technical path for reducing token production costs, improving token service efficiency, ensuring the stability of token supply, and driving the transition of large model services from being merely callable to being operable.

2606.20243 2026-06-19 cs.SE cs.MA 新提交

Phoenix: Safe GitHub Issue Resolution via Multi-Agent LLMs

Phoenix: 通过多智能体LLM实现安全的GitHub问题解决

Kipngeno Koech, Muhammad Adam, Baimam Boukar Jean Jacques, Joao Barros

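AI summary: Proposes Phoenix, a multi-agent LLM system with six specialized agents and seven layers of safety controls. It reaches a 75% resolve rate on a subset of SWE-bench Lite and 100% correctness preservation on real issues.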

详情
AI中文摘要

我们提出Phoenix,一个多智能体LLM系统,能够从分类到拉取请求创建解决GitHub问题,结合了七层安全控制与基线感知测试评估策略。Phoenix将工作分解给六个专业智能体:规划器、复现器、编码器、测试器、故障分析器和拉取请求(PR)智能体,所有智能体由基于标签的GitHub webhook状态机协调。在打开拉取请求之前,每次更改都会与基线测试运行进行对比。在SWE-bench Lite的24个实例子集上,在生产webhook路径上运行,Phoenix oracle解决了75%的实例,且成功运行中没有出现通过到通过的回归;这个精心挑选的子集不能直接与完整分割排行榜结果比较,我们讨论了比较的局限性。在14个仓库的42个真实问题上的补充试点实现了100%的正确性保持(CP;硬级别平均122秒)。人工检查显示,大约一半的拉取请求是定位良好的修复。另一半将代码放置在错误路径上,这是规划器定位的局限性,我们正在通过检索来解决。我们还报告了部署失败模式(WAF过滤、令牌过期、权限边界、不稳定的CI),这些模式促使了每种安全机制的引入。

英文摘要

We present Phoenix, a multi-agent LLM system that resolves GitHub issues from triage through pull-request creation, combining seven layered safety controls with a baseline-aware test evaluation strategy. Phoenix decomposes the work across six specialized agents (planner, reproducer, coder, tester, failure analyst, and pull-request (PR) agent), all coordinated by a label-based GitHub webhook state machine. Every change is checked against a baseline test run before a pull request is opened. On a 24-instance slice of SWE-bench Lite run on the production webhook path, Phoenix oracle-resolves 75% of instances with no pass-to-pass regressions on successful runs; this curated slice is not directly comparable to full-split leaderboard results, and we discuss the limits of the comparison. A complementary pilot on 42 real issues across 14 repositories yields 100% correctness preservation (CP; mean 122s on the hard tier). Manual inspection shows that about half of the resulting pull requests are well-targeted fixes. The other half place code at incorrect paths, a planner localization limitation we are addressing with retrieval. We also report the deployment failure modes (WAF filtering, token expiry, permission boundaries, flaky CI) that motivated each safety mechanism.
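The baseline-aware check described above can be illustrated with a short sketch. It is not Phoenix's implementation: the function name, the result format (test name mapped to pass/fail), and the gating rule are assumptions chosen to show the idea of comparing a candidate change against a baseline test run.

```python
def evaluate_against_baseline(baseline, candidate):
    """Compare per-test outcomes (True = pass) before and after a change.

    A test that passed in the baseline but fails, or is missing, in the
    candidate run counts as a pass-to-pass regression. Tests that already
    failed in the baseline do not block the change.
    """
    regressions = sorted(
        name for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    )
    fixed = sorted(
        name for name, passed in candidate.items()
        if passed and not baseline.get(name, False)
    )
    return {
        "regressions": regressions,
        "fixed": fixed,
        "safe_to_open_pr": not regressions,
    }

# Hypothetical test results before and after a proposed fix.
baseline_run = {"test_parse": True, "test_issue_123": False, "test_io": True}
candidate_run = {"test_parse": True, "test_issue_123": True, "test_io": True}
report = evaluate_against_baseline(baseline_run, candidate_run)
print(report)
```

Comparing against a baseline rather than demanding an all-green run keeps pre-existing or flaky failures from being blamed on the change.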

2606.20230 2026-06-19 cs.SE 新提交

SysML Modeling of Digital Twins for Renewable Energy Communities

可再生能源社区数字孪生的SysML建模

Mohammad Samadi, Luís Miguel Pinho, Andrey Sadovykh, Gabriela Lucas

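AI summary: To address heterogeneity challenges in engineering digital twins of renewable energy communities, proposes a SysML-based MBSE workflow that models a device taxonomy and a community organizational view, and outlines how the SAREF4ENER ontology could close the remaining semantic gaps.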

Comments Presented at the Workshop on Digital Twin Experiences and Model-Based Testing Methods, 12 June 2026, Västerås, Sweden, co-located with the 30th Ada-Europe International Conference on Reliable Software Technologies (AEiC 2026)

详情
AI中文摘要

可再生能源社区(REC)正成为本地和全球共享可再生能源发电、存储和灵活负载的关键组织模型。由于涉及设备、合同和运行时数据的异构性,REC数字孪生的工程变得困难。在本文中,我们朝着REC数字孪生的基于模型的系统工程(MBSE)工作流迈出了第一步。从经过工业验证的REC领域模型出发,我们使用开源Modelio工具在SysML中重新表达了一个代表性的房屋子集,生成了两个块定义图——一个设备分类和一个社区组织视图。然后,我们讨论了普通SysML留下的四个语义鸿沟,并概述了如何将SAREF4ENER本体作为参考包导入以弥合这些鸿沟。将SysML与基于SAREF的智能能源数字孪生语义相结合在很大程度上仍未探索,我们将本文定位为沿着这条线的第一步。

英文摘要

Renewable Energy Communities (RECs) are emerging as a key organizational model for local and global sharing of renewable generation, storage, and flexible loads. Engineering Digital Twins of RECs is made difficult by the heterogeneity of devices, contracts, and runtime data involved. In this paper, we take a first step toward a Model-Based Systems Engineering (MBSE) workflow for REC Digital Twins. Starting from an industrially validated REC domain model, we re-express a representative house subset in SysML using the open-source Modelio tool, yielding two Block Definition Diagrams: a device taxonomy and a community organizational view. We then discuss four semantic gaps that plain SysML leaves open and sketch how the SAREF4ENER ontology could be imported as a reference package to close them. Combining SysML with SAREF-based semantics for smart-energy Digital Twins remains largely unexplored, and we position this paper as a first step along that line.

2606.20173 2026-06-19 cs.SE 新提交

Qiskit Code Migration with LLMs

使用大语言模型进行Qiskit代码迁移

Jose Manuel Suarez, Luis Mariano Bibbo, Joaquin Bogado, Alenandro Fernandez

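AI summary: To address code maintenance problems caused by version evolution of quantum software development kits, proposes a hybrid approach combining LLMs with retrieval-augmented generation (RAG). An automatically generated taxonomy of migration scenarios guides the models in migrating Qiskit code across versions, reducing hallucinations and improving the quality of migration suggestions.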

详情
AI中文摘要

量子开发套件(QDK)的快速演进引入了一种特定形式的技术债务,损害了代码可维护性并阻碍了软件复用。在量子软件工程(QSE)这一专业领域,高质量训练数据的稀缺和新兴框架的高波动性加剧了这一挑战,常导致通用大语言模型(LLM)产生不可靠或幻觉结果。本文提出一种将LLM与检索增强生成(RAG)相结合的混合方法,用于自动化Qiskit代码的跨版本迁移。所提方法通过利用自动生成的迁移场景分类体系作为结构化、版本特定的知识源来指导模型,从而提升迁移建议的精度和可靠性。该方法通过一个自动化、可扩展的工作流实现,评估了不同检索方案(无约束和限制性)下的LLM(Google Gemini Flash-2.5和OpenAI Gpt-oss-20b)。结果表明,基于分类体系的RAG架构,特别是在限制性方案下,显著减少了幻觉并提高了描述质量,其中Google Gemini Flash-2.5在检测复杂重构场景方面表现出更优性能。这些发现证实了这种以数据为中心的方法在促进技术独立性、提供缓解API过时问题的鲁棒智能助手方面的潜力,从而确保量子算法在快速变化的生态系统中的长期可用性,并降低量子软件工程(QSE)的学习曲线。

英文摘要

The rapid evolution of Quantum Development Kits (QDKs) introduces a specific form of technical debt that compromises code maintainability and hinders software reuse. In the specialized domain of Quantum Software Engineering (QSE), this challenge is intensified by the scarcity of high-quality training data and the high volatility of emerging frameworks, which often lead general-purpose Large Language Models (LLMs) to produce unreliable or hallucinated results. This paper proposes a hybrid approach integrating LLMs with Retrieval-Augmented Generation (RAG) to automate the migration of Qiskit code across versions. The proposed methodology enhances the precision and reliability of migration suggestions by leveraging an automatically generated taxonomy of migration scenarios as the structured, version-specific knowledge source to guide the models. The approach is implemented through an automated, extensible workflow evaluating LLMs (Google Gemini Flash-2.5 and OpenAI Gpt-oss-20b) under different retrieval schemes (unconstrained and restrictive). Results demonstrate that the taxonomy-based RAG architecture, particularly under the restrictive scheme, significantly reduces hallucinations and improves descriptive quality, with Google Gemini Flash-2.5 showing superior performance in detecting complex refactoring scenarios. These findings confirm the potential of this data-centric methodology to foster technological independence and provide robust, intelligent assistants that mitigate API obsolescence, ensuring the long-term availability of quantum algorithms within a rapidly shifting ecosystem and flattening the learning curve within Quantum Software Engineering (QSE).

2606.20158 2026-06-19 cs.SE 新提交

N-Version Programming with Coding Agents

使用编码代理的N版本编程

Javier Ron, Benoit Baudry, Martin Monperrus

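AI summary: Revisits N-version programming in the setting of contemporary AI coding agents. Using the Knight-Leveson experiment, it evaluates how diversity across agent systems, models, and implementation languages affects failure modes. Common-mode failures are found, but majority-voting three-version units substantially reduce failure counts, supporting the engineering usefulness of the strategy.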

详情
AI中文摘要

本文在当代AI编码代理背景下重新审视N版本编程这一经典概念。通过重访开创性的Knight-Leveson实验,我们研究了代理系统、模型和实现语言之间的多样性是否会产生多样化的故障模式。使用Knight-Leveson的发射拦截器程序规范,我们在共享的预言机和100万个随机测试输入的测试集上评估了48个代理生成的实现。结果显示,与Knight-Leveson的发现一致,存在大量的共模故障。进一步分析表明,许多这些同时发生的故障可以追溯到规范中特别困难或模糊的地方。我们还证明了编码代理的多样性带来了实际效益:在多数投票的三版本单元中,平均故障数从单版本的387.44下降到三版本的130.99,并且有11,844个N版本单元表现出零观测故障。我们的原始结果是迄今为止最强的证据,表明使用编码代理的N版本编程是一种有用的工程策略。

英文摘要

This paper revisits the classical concept of N-version programming in the setting of contemporary AI coding agents. Revisiting the seminal Knight-Leveson experiment, we study whether diversity across agent systems, models, and implementation languages creates diverse failure modes. Using Knight and Leveson's Launch Interceptor Program specification, we evaluate 48 agent-generated implementations on a shared oracle and a campaign of 1,000,000 randomized test inputs. The results show substantial common-mode failure, in line with the findings of Knight and Leveson. Further analysis shows that many of those co-occurring failures can be traced to places where the specification is particularly hard or ambiguous. We also demonstrate that diversity from coding agents provides practical benefit: across majority-voting three-version units, the mean failure count drops from 387.44 for single versions to 130.99 for triples, and 11,844 N-version units exhibit zero observed failures. Our original results are the strongest evidence to date that N-version programming with coding agents is a useful engineering strategy.

2606.20129 2026-06-19 cs.SE 新提交

Learning Critical Testing Literacy Through Puzzles: an Experience Report

通过谜题学习关键测试素养:经验报告

Niels Doorn, Bart Th. Knaack, Tanja E. J. Vos, Beatriz Marín

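AI summary: Reports experience from 13 workshops that used puzzles to teach critical testing literacy (CTL), finds that learning comes from the full sequence of solving, debriefing, and reflecting rather than from the puzzles alone, and describes an open-source web application with built-in analytics.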

详情
AI中文摘要

在本文中,我们报告了使用谜题学习CTL的工作坊经验和收获。背景:软件测试重要但难以教授。我们引入了一个基于谜题的学习活动知识体系来教授CTL,该体系基于关键测试者认知模型,形成了P4TEST教学框架。我们与学生、测试人员、教师和小学生共举办了13次工作坊,评估基于谜题的关键测试素养教学。经验:在11次工作坊中,我们采用半结构化方法,变化谜题、材料和时长。在另外两次工作坊中,我们引入了工作手册和出声思考环节,以收集更多关于学习体验的数据。观察:参与者普遍认为自己在解谜时进行实验。学生倾向于收敛于解决方案,而专业人员继续探索。情绪在行为中可见,但难以通过书面反思单独浮现。出声思考环节揭示了即时推理;书面反思引发了更多元认知反思。主题“意义建构/行动中反思”捕捉了参与者如何构建问题、应对死胡同和转变策略。反思:谜题本身并非干预手段;解谜、汇报和反思的完整序列才是。更刻意地设计这一序列是未来的工作。我们还开发了一个带有内置分析功能的开源网络应用程序,用于定制工作坊。

英文摘要

In this paper, we report our experiences and takeaways from workshops using puzzles to learn critical testing literacy (CTL). Background: Software testing is important yet difficult to teach. We introduced a body of knowledge (BoK) of puzzle-based learning activities to teach CTL, based on a model of the critical tester's cognition, leading to the pedagogical framework P4TEST. We conducted thirteen workshops with students, testers, teachers, and primary school pupils to assess puzzle-based teaching of CTL. Experience: Across eleven workshops, we used a semi-structured approach, varying puzzles, materials, and timing. In two additional workshops, we introduced workbooks and think-aloud sessions to gather more data on the learning experience. Observations: Participants consistently perceived themselves as experimenting while solving puzzles. Students tended to converge on solutions, while professionals continued exploring. Emotions were visible in behaviour but hard to surface through written reflection alone. Think-aloud sessions revealed immediate reasoning; written reflections elicited more meta-cognitive reflection. The theme "Sensemaking / reflection-in-action" captured how participants framed problems, navigated dead ends, and shifted strategies. Reflections: Puzzles are not the intervention: the entire sequence of solving, debriefing, and reflecting is. Designing that sequence more deliberately is the work ahead. We also developed an open-source web application with built-in analytics to customise workshops.

2606.20128 2026-06-19 cs.SE cs.DC cs.LG 新提交

The Correctness Illusion in LLM-Generated GPU Kernels

LLM生成的GPU内核中的正确性错觉

Dipankar Sarkar

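AI summary: Using a high-precision CPU reference and op-schema-aware fuzzing, shows that the fixed-shape allclose checks in existing benchmarks fail to detect LLM-style transcription bugs, and proposes and validates a new protocol.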

Comments 10 pages, 2 figures, LNCS format. Companion papers to follow on arXiv next week; IDs will be added in a v2 replace

详情
AI中文摘要

针对LLM生成的GPU内核的基准测试(KernelBench、TritonBench、GEAK)通过固定形状、小样本的allclose风格检查来评分正确性。不同基准测试的输入数量不同。每个内核的形状、数据类型和容差是固定的。我们凭经验测试了该oracle。我们构建了一个包含24个Triton和CPU替代内核(15个正确对照和9个带有记录转录错误的LLM风格错误变体)的受控语料库,并在操作模式感知的种子模糊测试下,使用高精度(fp64)CPU参考和每个(操作,数据类型)的绝对容差重新评估。种子oracle标记了9个错误内核中的9个,并通过了15个正确对照中的15个,对照的精度成本为零。我们将语料库扩展到26个操作(添加一个flash-attention对),并在五类GPU(RTX 3060、A10、L40S、A100 SXM4、H100 NVL)上重新运行相同的协议。所有五个GPU的判定结果相同:10个错觉中的10个被捕获,16个对照中的16个干净。语料库结果涉及LLM风格的转录错误,这些错误被单形状allclose oracle认证为正确,而不涉及任何特定部署的LLM的错误率。每个标记的失败都从存储的种子逐字节重放。

英文摘要

Benchmarks for LLM-generated GPU kernels (KernelBench, TritonBench, GEAK) score correctness through fixed-shape, small-sample allclose-style checks. The number of inputs varies between benchmarks. The shape, dtype, and tolerance are fixed for each kernel. We test that oracle empirically. We construct a controlled corpus of 24 Triton and CPU stand-in kernels (15 correct controls and 9 LLM-style buggy variants seeded with documented transcription errors) and re-evaluate it under op-schema-aware seeded fuzzing with a high-precision (fp64) CPU reference and per-(op, dtype) absolute tolerances. The seeded oracle flags 9 of 9 buggy kernels and passes 15 of 15 correct controls, at zero precision cost on controls. We extend the corpus to 26 ops (adding a flash-attention pair) and re-run the same protocol on five GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL). The verdicts are identical across all five GPUs: 10 of 10 illusions caught and 16 of 16 controls clean. The corpus result is about LLM-style transcription bugs that the allclose-on-one-shape oracle certifies as correct, not about the bug rate of any specific deployed LLM. Every flagged failure replays byte-for-byte from a stored seed.
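The gap between a fixed-shape check and seeded fuzzing can be shown with a CPU-only toy. This is a sketch under stated assumptions, not the paper's protocol: the "kernel" is a plain Python blocked reduction, `math.fsum` stands in for the fp64 reference, and the block size, tolerance, and shape range are hypothetical.

```python
import math
import random

BLOCK = 4  # hypothetical tile size of the blocked reduction

def reference_sum(xs):
    """High-precision reference (stand-in for an fp64 CPU reference)."""
    return math.fsum(xs)

def buggy_sum(xs):
    """Blocked reduction with a transcription bug: the partial tail block is dropped."""
    total = 0.0
    for start in range(0, len(xs) - len(xs) % BLOCK, BLOCK):
        total += sum(xs[start:start + BLOCK])
    return total

def fixed_shape_check(kernel, n=8, seed=0, atol=1e-9):
    """Allclose-style check on a single shape, as a fixed-shape oracle would do."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return abs(kernel(xs) - reference_sum(xs)) <= atol

def seeded_fuzz(kernel, trials=50, seed=0, atol=1e-9):
    """Vary the shape per trial; return the first failing (trial_seed, n) or None."""
    for trial in range(trials):
        trial_seed = seed * 1_000_003 + trial
        rng = random.Random(trial_seed)
        n = rng.randint(1, 64)
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        if abs(kernel(xs) - reference_sum(xs)) > atol:
            return trial_seed, n  # enough information to replay the failure
    return None

print("fixed-shape check passes:", fixed_shape_check(buggy_sum))
print("fuzzing finds:", seeded_fuzz(buggy_sum))
```

The fixed shape (n = 8) is a multiple of the block size, so the dropped tail never shows up there. Because each trial derives its inputs from a stored seed, a reported failure can be regenerated exactly.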

2606.20023 2026-06-19 cs.SE cs.AI cs.CL 新提交

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

当较低权限足够时:探究LLM代理中的过度权限工具选择

Kaiyue Yang, Yuyan Bu, Jingwei Yi, Yuchi Wang, Biyu Zhou, Juntao Dai, Songlin Hu, Yaodong Yang

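AI summary: Targeting the safety problem of LLM agents preferring high-privilege tools, proposes the ToolPrivBench evaluation framework, finds that over-privileged selection is common among mainstream agents and amplified by transient failures, and designs a privilege-aware post-training defense that effectively reduces unnecessary high-privilege tool use.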

Comments code: https://github.com/AISafetyHub/agent-tool-selection-bias

详情
AI中文摘要

随着LLM代理越来越多地自主选择工具,它们在具有不同权限的工具之间的选择变得与安全相关。然而,先前的工具选择研究侧重于安全无关的元数据偏好,使得权限敏感的选择未被充分探索。为填补这一空白,我们研究了过度权限工具选择,即代理在存在足够低权限替代方案时仍选择或升级到更高权限工具。我们引入ToolPrivBench来评估代理是否在存在足够低权限替代方案时仍选择更高权限工具,同时衡量初始选择和瞬态工具故障后的升级。在八个领域和五种重复风险模式中,我们发现过度权限工具选择在主流LLM代理中很常见,并且被瞬态故障进一步放大。我们进一步发现,通用安全对齐不能可靠地迁移到最小权限工具选择,而提示级控制在瞬态故障下仅提供有限的缓解。因此,我们引入了一种权限感知的后训练防御,教导代理偏好足够低权限的工具,仅在必要时升级。我们的缓解实验表明,这种防御在保持通用能力的同时,显著减少了不必要的高权限工具使用。

英文摘要

As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant. However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving privilege-sensitive choices underexplored. To address this gap, we study over-privileged tool selection, in which an agent selects or escalates to a higher-privilege tool despite a sufficient lower-privilege alternative. We introduce ToolPrivBench to evaluate whether agents choose higher-privilege tools despite sufficient lower-privilege alternatives, measuring both initial selection and escalation after transient tool failures. Across eight domains and five recurring risk patterns, we find that over-privileged tool selection is common among mainstream LLM agents and is further amplified by transient failures. We further find that general safety alignment does not reliably transfer to least-privilege tool choice, while prompt-level controls provide only limited mitigation under transient failures. We therefore introduce a privilege-aware post-training defense that teaches agents to prefer sufficient lower-privilege tools and escalate only when necessary. Our mitigation experiments show that this defense substantially reduces unnecessary high-privilege tool use while preserving general capabilities.
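The least-privilege behaviour that the benchmark measures against can be written down as a simple rule. The sketch below is a rule-based illustration only; the paper's defense is a post-training method, and the `Tool` type, privilege scale, and tool names here are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    privilege: int           # higher means more powerful (hypothetical scale)
    capabilities: frozenset

def sufficient_tools(tools, required):
    """Tools whose capabilities cover the task, lowest privilege first."""
    return sorted((t for t in tools if required <= t.capabilities),
                  key=lambda t: t.privilege)

def select_tool(tools, required):
    """Initial selection: the lowest-privilege tool that is sufficient."""
    candidates = sufficient_tools(tools, required)
    if not candidates:
        raise LookupError("no tool covers the required capabilities")
    return candidates[0]

def tool_after_failure(tools, required, failed, transient):
    """Retry the same tool on a transient failure; escalate only otherwise."""
    if transient:
        return failed
    higher = [t for t in sufficient_tools(tools, required)
              if t.privilege > failed.privilege]
    if not higher:
        raise LookupError("no higher-privilege tool is available")
    return higher[0]

TOOLS = [
    Tool("read_file", 1, frozenset({"read"})),
    Tool("edit_file", 2, frozenset({"read", "write"})),
    Tool("root_shell", 3, frozenset({"read", "write", "exec"})),
]
need = frozenset({"read"})
first = select_tool(TOOLS, need)
retry = tool_after_failure(TOOLS, need, first, transient=True)
print(first.name, retry.name)  # prints read_file read_file
```

An over-privileged agent, in these terms, is one that picks `root_shell` for a read-only task or jumps to it after a timeout that a retry would have resolved.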

2606.19992 2026-06-19 cs.SE cs.AI 新提交

Beyond Static Endpoints: Tool Programs as an Interface for Flexible Agentic Web Services

超越静态端点:工具程序作为灵活智能体网络服务的接口

Mugeng Liu, Shuoqi Li, Yixuan Zhang, Yun Ma

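AI summary: Proposes ToolPro, which represents tool intent as an executable program and, through constraint-guided construction, effect-aware replay, and a policy decision, reduces latency by up to 53.4% and traffic by up to 96.1% on MCP services.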

Comments Accepted by ICML 2026

详情
AI中文摘要

在智能体网络时代,基于LLM的智能体越来越多地将网络服务作为工具调用,然而大多数接口仍然是静态端点,难以表达包含循环、条件、连接和重试的长周期工作流。我们提出ToolPro,它将智能体的工具意图表示为一个可执行工具程序,该程序紧凑地编码了多步服务交互并带有显式效应类型。ToolPro结合了约束引导的程序构建、用于精确一次状态修改调用的效应感知重放,以及一个基于配置文件的策略,该策略决定何时程序执行优于逐步调用。我们在具有WebAssembly沙箱的MCP风格服务上实例化ToolPro,并在现实应用的各种工作流上进行了评估。ToolPro将端到端延迟降低了高达53.4%,客户端流量减少了高达96.1%,在网络延迟和工作流复杂度更高时收益更大。

英文摘要

In the agentic web era, LLM-based agents increasingly invoke web services as tools, yet most interfaces remain static endpoints that poorly express long-horizon workflows with loops, conditionals, joins, and retries. We present ToolPro, which represents an agent's tool intent as an executable tool program that compactly encodes multi-step service interactions with explicit effect types. ToolPro combines constraint-guided program construction, effect-aware replay for exactly-once state-modifying calls, and a profile-driven policy that decides when program execution outperforms stepwise calling. We instantiate ToolPro over MCP-style services with WebAssembly sandboxing and evaluate it on diverse workflows of real-world applications. ToolPro reduces end-to-end latency by up to 53.4% and client-side traffic by up to 96.1%, with larger gains under higher network latency and workflow complexity.
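One way to picture effect-aware replay is a log that records the result of each completed state-modifying call, so that re-running a program after a failure reuses the recorded result instead of repeating the write. The sketch below is an in-memory illustration with invented names (`ReplayExecutor`, the `"read"`/`"write"` effect labels, the order example); it is not ToolPro's mechanism, and a real system would also need the log to be durable.

```python
class ReplayExecutor:
    """Re-runs a tool program without repeating writes that already completed."""

    def __init__(self):
        self.log = {}  # call_id -> recorded result of a completed write

    def call(self, call_id, effect, fn, *args):
        if effect == "write" and call_id in self.log:
            return self.log[call_id]      # already applied: reuse the result
        result = fn(*args)
        if effect == "write":
            self.log[call_id] = result    # record only after the call succeeds
        return result

# Hypothetical services: one state-modifying call and one flaky read.
orders = []
status_attempts = {"count": 0}

def create_order(item):
    orders.append(item)
    return len(orders)

def fetch_status(order_id):
    status_attempts["count"] += 1
    if status_attempts["count"] == 1:
        raise TimeoutError("transient failure")
    return f"order {order_id}: confirmed"

def program(executor):
    order_id = executor.call("step-1", "write", create_order, "widget")
    return executor.call("step-2", "read", fetch_status, order_id)

executor = ReplayExecutor()
result = None
for attempt in range(2):
    try:
        result = program(executor)
        break
    except TimeoutError:
        continue  # replay the whole program; step-1 is served from the log

print(result, orders)
```

Reads are simply re-executed on replay, which is why the effect label on each call matters: only calls marked as writes need the exactly-once treatment.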

2606.19988 2026-06-19 cs.SE 新提交

Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning

基于大语言模型的仓库级Solidity代码生成:从提示到微调

Shi Chen, Rongcun Wang, Yuan Tian, Xiaoyuan Xie, Wei Song, Rubing Huang

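AI summary: Proposes the SolidityBench benchmark and the SolidityScore metric, evaluates several LLM approaches to repository-level Solidity code generation, and finds supervised fine-tuning the most effective.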

Comments 33 pages

详情
AI中文摘要

大语言模型(LLMs)在通用代码生成方面表现出强大的能力,但其在专业软件领域的有效性仍未得到充分探索。Solidity智能合约代表了一个高风险领域,生成的代码必须满足严格的语言级、安全性和软件工程约束。现有的基准和指标对于仓库级Solidity生成仍然不足,其中模型必须从自然语言需求中合成完整的合约。为了解决这一差距,我们引入了SolidityBench,一个包含5,470个仓库级Solidity智能合约及其自然语言描述的基准。我们还提出了SolidityScore,一种基于Solidity的语义度量,强调领域关键结构,如安全修饰符、合约声明和Solidity特定关键词。使用该基准,我们评估了代表性的代码LLM,包括Qwen2.5-Coder、DeepSeek-Coder和CodeLlama,涵盖零样本提示、思维链推理、上下文学习、检索增强生成和监督微调。结果表明,通用模型在仓库级Solidity生成中表现出系统性的结构缺陷。在非参数方法中,检索增强生成表现最佳,而上下文学习在超过两个示例后因上下文饱和而性能下降。监督微调通过将Solidity特定约束内化到模型参数中实现了最大的改进。总体而言,我们的研究为仓库级Solidity代码生成提供了全面的基准,并表明高质量领域数据结合监督微调是提高LLM生成智能合约可靠性的最有效策略。

英文摘要

Large Language Models (LLMs) have shown strong capabilities in general-purpose code generation, but their effectiveness in specialized software domains remains underexplored. Solidity smart contracts represent a high-stakes domain where generated code must satisfy strict language-level, security, and software-engineering constraints. Existing benchmarks and metrics remain insufficient for repository-level Solidity generation, where models must synthesize complete contracts from natural language requirements. To address this gap, we introduce SolidityBench, a benchmark of 5,470 repository-level Solidity smart contracts paired with natural language descriptions. We also propose SolidityScore, a Solidity-aware semantic metric that emphasizes domain-critical constructs such as security modifiers, contract declarations, and Solidity-specific keywords. Using this benchmark, we evaluate representative code LLMs, including Qwen2.5-Coder, DeepSeek-Coder, and CodeLlama, across zero-shot prompting, Chain-of-Thought reasoning, in-context learning, retrieval-augmented generation, and supervised fine-tuning. The results show that general-purpose models exhibit systematic structural deficiencies in repository-level Solidity generation. Among non-parametric methods, retrieval-augmented generation performs best, while in-context learning degrades beyond two examples due to context saturation. Supervised fine-tuning achieves the largest improvement by internalizing Solidity-specific constraints into model parameters. Overall, our study provides a comprehensive benchmark for repository-level Solidity code generation and shows that high-quality domain data combined with supervised fine-tuning is the most effective strategy for improving the reliability of LLM-generated smart contracts.

2606.19830 2026-06-19 cs.SE cs.CL 新提交

JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

JAMER:专业游戏引擎上的项目级代码框架数据集与基准测试

Jianwen Sun, Chuanhao Li, Zizhen Li, Yukang Feng, Fanrui Zhang, Yifei Huang, Yu Dai, Kaipeng Zhang

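AI summary: Presents JamSet and JamBench, the first project-level code framework dataset and benchmark built on a professional game engine. A deterministic verification pipeline distills 8,133 verified projects from over 240,000 repositories, and an evaluation of 9 frontier models shows capability dropping sharply as project scale grows.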

详情
AI中文摘要

当前AI驱动的游戏开发在资产生成、游戏设计和基于Web的游戏编码方面取得了实质性进展,但由于缺乏大规模数据集和确定性评估方法,专业游戏引擎上的项目级代码工程仍然很大程度上未被探索。我们提出了JamSet和JamBench,这是首个基于专业游戏引擎的项目级游戏代码框架数据集和基准。我们的关键洞察是,Game Jam竞赛(开发者在严格时间限制下构建完整游戏的社区活动)产生了数千个适合此目的的开源项目。基于Godot引擎的文本格式和无头执行模式,我们设计了一个从文件完整性到运行时行为收集的确定性验证流程,从超过24万个仓库中提炼出8133个已验证项目。其中,300个手动验证的项目构成JamBench;其余构成JamSet。JamBench定义了主题驱动的生成和代码补全任务,通过结合编译通过率、结构完整性得分(SCS)和行为对齐得分(BAS)的流水线进行评估。对9个前沿模型的评估揭示了随着项目规模增加的能力悬崖,运行时通过率从小型项目的80.4%下降到大型项目的5.7%(Task2a)。代码代理提高了编译率,但在运行时行为质量上没有带来提升,表明瓶颈在于架构设计而非语法正确性。实验验证了JamSet作为有效训练数据。所有数据和代码均已公开。

英文摘要

Current AI-driven game development has made substantial progress in asset generation, gameplay design, and web-based game coding, yet project-level code engineering on professional game engines remains largely unexplored due to the absence of large-scale datasets and deterministic evaluation methods. We present JamSet and JamBench, the first project-level game code framework dataset and benchmark built on a professional game engine. Our key insight is that Game Jam competitions, community events where developers build complete games under tight time constraints, yield thousands of open-source projects suitable for this purpose. Building on the Godot engine's text-based format and headless execution mode, we design a deterministic verification pipeline from file integrity to runtime behavior collection, distilling 8,133 verified projects from over 240,000 repositories. Of these, 300 manually verified projects form JamBench; the rest constitute JamSet. JamBench defines theme-driven generation and code completion tasks, evaluated through a pipeline combining compilation pass rates, Structural Completeness Score (SCS), and Behavioral Alignment Score (BAS). Evaluation of 9 frontier models reveals a capability cliff as project scale increases, with runtime pass rates dropping from 80.4% on small projects to 5.7% on large ones (Task2a). Code Agents improve compilation rates yet yield no gains in runtime behavioral quality, indicating that the bottleneck lies in architectural design rather than syntactic correctness. Experiments validate JamSet as effective training data. All data and code are publicly available.

2606.19814 2026-06-19 cs.SE 新提交

CoRaCommit: A VS Code Extension for Commit Message Generation with Exemplar Retrieval

CoRaCommit: 一种基于范例检索的提交消息生成的 VS Code 扩展

Chaoran Cai, Bo Xiong, Chong Wang, Lulu He, Peng Liang

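AI summary: Presents CoRaCommit, a VS Code extension that retrieves similar commit exemplars as prompt context, invokes multiple LLMs in parallel to generate candidate messages, and dynamically recommends LLMs based on user feedback. It outperforms existing extensions on the ApacheCM dataset.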

Comments 17 pages, 6 images, 3 tables, Manuscript submitted to a Journal (2026)

详情
AI中文摘要

提交消息是描述代码变更意图的关键文本制品,在版本控制、代码审查和历史追踪中扮演重要角色。然而,实践中提交消息主要由人工编写,耗时且常导致质量不一致和表达不统一。现有的用于提交消息生成的 VS Code 扩展通常直接基于代码差异调用大语言模型,而不利用相似提交范例作为参考,且很少支持用户反馈驱动的大语言模型推荐。为解决这些局限,本文提出 CoRaCommit,一种 VS Code 扩展,通过检索相似提交范例作为提示上下文、并行调用多个大语言模型进行候选提交消息比较,并基于用户反馈动态推荐大语言模型,从而增强提交消息生成。在 ApacheCM 数据集的 945 个提交上的实验结果表明,CoRaCommit 在 BLEU、CIDEr、METEOR 和 ROUGE-L 指标上优于现有 VS Code 扩展,证明了检索增强上下文对提交消息生成的有效性。

英文摘要

Commit messages are essential textual artifacts that describe the intent behind code changes, and play a critical role in version control, code review, and historical tracking. However, in practice, commit messages are primarily authored manually, which is time-consuming and often results in inconsistent quality and non-uniform expression. Existing VS Code extensions for commit message generation typically directly invoke large language models based on the code diff, without leveraging similar commit exemplars as references, and rarely support user feedback-driven LLM recommendation. To address these limitations, this paper presents CoRaCommit, a VS Code extension that enhances commit message generation by retrieving similar commit exemplars as prompt context, invoking multiple LLMs in parallel for candidate commit message comparison, and dynamically recommending LLMs based on user feedback. Experimental results on 945 commits from the ApacheCM dataset show that CoRaCommit outperforms existing VS Code extensions across BLEU, CIDEr, METEOR, and ROUGE-L metrics, demonstrating the effectiveness of retrieval-augmented context for commit message generation.
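Exemplar retrieval for prompt construction can be sketched in a few lines. The abstract does not say how CoRaCommit measures similarity, so the Jaccard overlap of identifier tokens below is a stand-in, and the function names, sample history, and prompt wording are all hypothetical.

```python
import re

def tokens(text):
    """Lower-cased identifier-like tokens of a diff or message."""
    return set(re.findall(r"[A-Za-z_]\w*", text.lower()))

def jaccard(a, b):
    """Set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve_exemplars(diff, history, k=2):
    """history holds (diff, message) pairs; return the k most similar ones."""
    query = tokens(diff)
    ranked = sorted(history,
                    key=lambda pair: jaccard(query, tokens(pair[0])),
                    reverse=True)
    return ranked[:k]

def build_prompt(diff, exemplars):
    """Place retrieved (diff, message) pairs ahead of the new diff."""
    parts = ["Write a commit message for the final diff."]
    for old_diff, message in exemplars:
        parts.append(f"Diff:\n{old_diff}\nMessage: {message}")
    parts.append(f"Diff:\n{diff}\nMessage:")
    return "\n\n".join(parts)

# Hypothetical commit history.
history = [
    ("fix null check in parser.parse", "Fix NPE in parser"),
    ("update readme badges", "docs: refresh badges"),
    ("add null guard to lexer.parse", "Guard against null input in lexer"),
]
new_diff = "fix null check in parser.parse_line"
best = retrieve_exemplars(new_diff, history, k=1)
prompt = build_prompt(new_diff, best)
print(best[0][1])  # prints Fix NPE in parser
```

The resulting prompt could then be sent to several models in parallel so their candidate messages can be compared side by side.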

2606.19799 2026-06-19 cs.SE cs.LG 新提交

The Hidden Environmental Cost of Poor Coding Practices in TensorFlow and Keras Applications: A Study on Resource Leaks and Carbon Emissions

TensorFlow和Keras应用中不良编码实践的隐藏环境成本:资源泄漏与碳排放研究

Bashar Abdallah, Gustavo Santos, Rola Al Bataineh, Alain Abran, Mohammad Hamdaqa

AI总结 研究TensorFlow/Keras中两种资源泄漏气味(IMR和UTR)对能耗和碳排放的影响,实验表明两者分别增加约32%和46%的电力消耗,证明资源泄漏显著降低ML能效并增加环境负担。

详情
AI中文摘要

效率和可持续性是机器学习(ML)应用开发和部署中的关键考量。在影响可持续性的因素中,ML代码中的资源泄漏可能引入隐藏的低效率,从而增加能源消耗和CO2排放。尽管如此,量化其环境影响的实证证据仍然有限。这篇新兴结果论文对两种常见的资源泄漏气味,即不当模型重用(IMR)和未释放张量引用(UTR),及其对TensorFlow和Keras工作负载中能源消耗和CO2排放的影响进行了初步实证研究。通过执行相同的训练任务,并与无气味基线进行比较,对每种气味进行了受控实验。我们的初步结果表明,两种气味都持续增加了估计的用电量和碳排放。IMR和UTR分别使电力消耗增加约32%和46%,CO2排放也成比例增加。配对统计检验表明这些差异是系统性的且具有统计显著性,提供了初步的实证证据,表明资源泄漏气味可能降低ML的能效和环境可持续性。这些发现表明,资源泄漏气味对软件质量和可持续性构成可衡量的风险,强调了将资源生命周期管理和能效考虑纳入ML开发的重要性。

英文摘要

Efficiency and sustainability are critical considerations in the development and deployment of machine learning (ML) applications. Among the factors influencing sustainability, resource leaks in ML code can introduce hidden inefficiencies that elevate energy consumption and CO2 emissions. Despite this, empirical evidence quantifying their environmental impact remains limited. This emerging results paper presents an initial empirical investigation of two common resource-leak smells, namely Improper Model Reuse (IMR) and Unreleased Tensor References (UTR), and their impact on energy consumption and CO2 emissions in TensorFlow and Keras workloads. Controlled experiments were conducted for each smell by executing identical training tasks while comparing against a smell-free baseline. Our preliminary results show that both smells consistently increase estimated electricity usage and carbon emissions. IMR and UTR increased electricity consumption by approximately 32% and 46%, respectively, with proportional increases in CO2 emissions. Paired statistical tests indicate that these differences are systematic and statistically significant, providing initial empirical evidence that resource-leak smells may degrade ML energy efficiency and environmental sustainability. These findings suggest that resource-leak smells pose measurable risks to both software quality and sustainability, emphasizing the importance of integrating resource-lifecycle management and energy-efficiency considerations into ML development.

2606.19795 2026-06-19 cs.SE cs.AI 新提交

Agentic Electronic Design Automation: A Handoff Perspective

代理式电子设计自动化:一种交接视角

Jiawei Liu, Peiyi Han, Yuntao Lu, Su Zheng, Fengyu Yan, Bei Yu

AI总结 本文从交接有效性角度出发,将EDA流程中的代理系统分为三类,并提出五层代理通信协议,以解决多阶段、多工具间的状态传递和验证问题。

详情
AI中文摘要

电子设计自动化(EDA)本质上是多阶段且交接密集的。设计工件、流程脚本和工程决策在最终实现、签核或发布之前,跨越工具、会话和组织边界。每次传递都携带显式和隐式需求,这些需求可能无法被阶段局部检查完全捕获。基于LLM的代理现在直接调用EDA工具,将检索到的知识嵌入可执行脚本,并在会话和阶段之间传递状态。一旦它们的输出影响下游工程决策,传递的对象必须满足交接合同并符合其下一个消费者的假设。本综述引入交接有效性作为其组织原则。当传递的对象满足消费者的接受条件,并携带足够的上下文、证据和来源以供下游使用时,交接是有效的。我们回顾了82个系统,并将它们分为三个边界类别。阶段边界系统在单个EDA阶段或有界验证任务内建立有效性。流程边界系统在工具、调用和会话之间保持连贯的工作流状态。组织边界系统在知识和权限边界之间维护源基础、来源、范围及可接受性。对于每个类别,我们分析交接合同、交接对象、协调机制和开放问题。这些分析激发了一个五层EDA代理通信协议(EACP),涵盖代理发现、代理消息、工具调用、工作流编排以及安全和IP协议。我们旨在为可信的代理式EDA提供通用词汇和研究议程。

英文摘要

Electronic design automation (EDA) is inherently multi-stage and handoff-heavy. Design artifacts, flow scripts, and engineering decisions cross tool, session, and organizational boundaries before final implementation, signoff, or release. Each transfer carries explicit and implicit requirements that may not be fully captured by stage-local checks. LLM-based agents now invoke EDA tools directly, embed retrieved knowledge in executable scripts, and hand off state across sessions and stages. Once their outputs condition downstream engineering decisions, the transferred object must satisfy a handoff contract and meet the assumptions of its next consumer. This survey introduces handoff validity as its organizing principle. A handoff is valid when the transferred object satisfies the consumer's acceptance conditions and carries sufficient context, evidence, and provenance for downstream use. We review 82 systems and classify them into three boundary classes. Stage-Bound systems establish validity within a single EDA stage or bounded verification task. Flow-Bound systems preserve coherent workflow state across tools, invocations, and sessions. Organization-Bound systems maintain source grounding, provenance, scope, and admissibility across knowledge and authority boundaries. For each class, we analyze handoff contracts, handoff objects, coordination mechanisms, and open questions. These analyses motivate a five-layer EDA agent communication protocol (EACP), covering the agent discovery, agent message, tool invocation, workflow orchestration, and security and IP protocols. We aim to provide a common vocabulary and research agenda for trustworthy agentic EDA.

2606.19725 2026-06-19 cs.SE cs.AI cs.MA 新提交

Library-Aware Doubles and Iterative Repair for Large Language Model-Generated Unit Tests in OpenSIL Firmware

面向OpenSIL固件中大语言模型生成的单元测试的库感知测试替身与迭代修复

Ma Toan Bach, Yuchi Zheng, Haingo Razafindranto, Tanvir Alam, Aric Leather, Ranveer Sandhu, Jitesh Arora

AI总结 针对OpenSIL固件单元测试因构建约束易失败的问题,提出LLM引导的多智能体自动化测试生成与迭代修复流程,在76个函数中73个生成可编译测试,行覆盖率达98.8%。

Comments 20 pages, 10 figures

详情
AI中文摘要

验证底层C固件中的变更成本高昂,因为单元测试(UT)在严格的构建约束下非常脆弱,缺失的头文件、未解析的符号和依赖不匹配经常阻止编译和链接。本研究为AMD维护的开源硅初始化库(openSIL)固件代码库引入了一种自动化的UT编写工作流程,通过大语言模型(LLM)引导的多智能体管道减少手动工作。该工作流程结合了测试框架的自动生成、库感知的桩、模拟和伪造的创建或重用,以及由构建日志和行覆盖率反馈驱动的迭代编译-分派修复循环。我们使用编译成功率、修复迭代次数、分派成功率和行覆盖率评估该方法,并以时间、成本和令牌使用量作为次要指标。在76个被测函数中,该工作流程为73个函数生成了可编译的UT。在没有行覆盖率指导或检索增强的配置下,平均行覆盖率达到73.9%。在两种配置下评估的48个函数子集中,仅使用行覆盖率指导时平均行覆盖率达到98.8%,与向量数据库检索结合时达到94.7%。结果表明,自动生成和修复管道可以显著提高受限固件环境中UT创建的效率和覆盖率,同时减少手动调试工作量。

英文摘要

Validating changes in low-level C firmware is expensive because unit tests (UTs) are fragile under strict build constraints, where missing headers, unresolved symbols, and dependency mismatches frequently prevent compilation and linking. This study introduces an automated UT authoring workflow for the Open-Source Silicon Initialization Library (openSIL) firmware codebase maintained by Advanced Micro Devices (AMD) that reduces manual effort through a large language model (LLM) guided multi-agent pipeline. The workflow combines automated generation of test scaffolds, library-aware creation or reuse of stubs, mocks, and fakes, and an iterative compile-dispatch repair loop driven by build logs and line-coverage feedback. We evaluate the approach using compilation success, repair iterations, dispatch success, and line coverage, with time, cost, and token usage as secondary measures. Across 76 functions under test, the workflow generated compilable UTs for 73 functions. In a configuration without line coverage guidance or retrieval augmentation, mean line coverage reached 73.9%. On a 48-function subset evaluated under both configurations, mean line coverage reached 98.8% with line-coverage guidance alone and reached 94.7% when combined with vector-database retrieval. Results show that automated generation-and-repair pipelines can substantially improve UT creation efficiency and coverage for constrained firmware environments while reducing manual debugging effort.

2606.19644 2026-06-19 cs.SE 新提交

Prompt Quality and Pull Request Outcomes: A Stage-Based Empirical Study of LLM-Assisted Development

提示质量与拉取请求结果:基于阶段的LLM辅助开发实证研究

Richard Sserunjogi, Daniel Ogenrwot, John Businge

AI总结 通过分析265个开发者与ChatGPT的交互,研究提示结构(上下文、具体性、验证)对LLM辅助开发中代码生成、采纳和集成深度的影响,发现不同维度在不同阶段有不同作用。

Comments 48 pages, 2 figures

详情
AI中文摘要

大型语言模型(LLM)驱动的工具(如ChatGPT)越来越多地用于协作软件工程工作流,但提示结构如何影响下游拉取请求(PR)结果尚不清楚。先前的研究主要考察对话帮助性、生产力或粗粒度的采用指标,对提示结构在协作集成行为中的作用理解不足。我们分析了来自开源拉取请求中自我承认的ChatGPT使用的265个手动验证的开发者-ChatGPT交互。基于先前关于开发者面向工件和提示工程的研究,我们使用三个维度操作化提示结构:上下文、具体性和验证。我们首先评估LLM辅助注释是否能可靠地再现人类对提示结构的判断,发现在不同维度和工作流上下文中存在显著差异。具体性与人类判断的一致性最稳定;上下文被LLM系统性地低估;验证仍然难以一致评估,这促使采用人类-LLM混合注释策略。使用这个经过验证的框架,我们然后检查提示结构如何影响AI辅助PR工作流中的可操作代码生成、代码采纳和集成深度。具体性和上下文与可操作代码生成关联最强;验证成为代码采纳的主要预测因子;集成深度与上下文关联最强。总体而言,我们的发现表明,提示特征在AI辅助软件工程工作流中表现出不同的、阶段依赖的影响,通过上下文基础、任务具体性和可评估性线索影响下游采纳和集成。

英文摘要

Large language model (LLM)-powered tools such as ChatGPT are increasingly used in collaborative software engineering workflows, yet little is known about how prompt structure influences downstream pull request (PR) outcomes. Prior studies primarily examine conversational helpfulness, productivity, or coarse-grained adoption metrics, leaving the role of prompt structure in collaborative integration behavior insufficiently understood. We analyze 265 manually validated developer-ChatGPT interactions derived from self-admitted ChatGPT usage in open-source pull requests. Building on prior research on developer-facing artifacts and prompt engineering, we operationalize prompt structure using three dimensions: Context, Specificity, and Verification. We first evaluate whether LLM-assisted annotation can reliably reproduce human judgments of prompt structure, finding substantial variation across dimensions and workflow contexts. Specificity shows the most stable agreement with human judgments; Context is systematically under-scored by the LLM; and Verification remains difficult to assess consistently, motivating a hybrid human-LLM annotation strategy. Using this validated framework, we then examine how prompt structure influences actionable code generation, code adoption, and integration depth across AI-assisted PR workflows. Specificity and Context are most strongly associated with actionable code generation; Verification emerges as the primary predictor of code adoption; and integration depth is most strongly associated with Context. Overall, our findings show that prompt characteristics exert distinct, stage-dependent effects across AI-assisted software engineering workflows, influencing downstream adoption and integration through contextual grounding, task specificity, and evaluability cues.

2606.19616 2026-06-19 cs.SE cs.AI cs.MA 新提交

Before the Pull Request: Mining Multi-Agent Coordination

在拉取请求之前:挖掘多智能体协调

Dipankar Sarkar

AI总结 针对自主编码智能体在拉取请求中协调不足的问题,提出基于git的协调基板grite,通过事件日志减少重复和冲突工作,提升吞吐量,并自动恢复多种故障模式。

Comments 9 pages, 2 tables. LNCS format. Code, dataset, and mining toolkit: https://github.com/neul-labs/grite

详情
AI中文摘要

自主编码智能体现在可以开启数百万个拉取请求,然而大规模研究发现,它们的拉取请求虽然生成更快,但被接受的频率却更低——这是一个拉取请求级别的遥测无法解释的协调与信任差距。我们认为缺失的信号存在于拉取请求之前,即并发智能体如何声明、划分和碰撞共享工作。我们通过grite(我们的开源协调基板)来研究这一过程,它不需要中央服务器,并将其记录存储在git本身内部,因此其仅追加的、签名的事件日志直接捕获了协调过程。我们证明:(i) 这种共享基板以有限的开销减少了重复和冲突工作——仅重复队友任务的工作份额从78%降至0%,而有效吞吐量增加了三倍以上;(ii) 每个智能体的日志副本收敛到相同状态,没有写入被静默丢弃,而基于文件的跟踪器会丢失并发写入;(iii) 该日志是一个可挖掘的工件,从中可以自动恢复具体的故障模式——冲突编辑、锁饥饿、冗余发现、竞态关闭——并带有来源信息,其中一些在拉取请求历史中是不可见的。我们发布了数据集、测试平台和挖掘工具包。

英文摘要

Autonomous coding agents now open millions of pull requests, yet large-scale studies find their PRs are produced faster but accepted less often - a coordination and trust gap that pull-request-level telemetry cannot explain. We argue the missing signal lives before the PR, in how concurrent agents claim, divide, and collide over shared work. We study this process through grite, our open-source coordination substrate that needs no central server and stores its records inside git itself, so its append-only, signed event log captures the coordination process directly. We show that (i) this shared substrate reduces duplicate and conflicting work at bounded overhead - the share of work that merely re-does a teammate's task falls from 78% to 0% while useful throughput more than triples; (ii) every agent's copy of the log converges to the same state with no write silently dropped, where a file-based tracker loses concurrent writes; and (iii) the log is a mineable artefact from which concrete failure modes - conflicting edits, lock starvation, redundant rediscovery, race-to-close - are automatically recoverable with provenance, several invisible in pull-request history. We release the dataset, harness, and mining toolkit.
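The convergence claim in (ii) can be illustrated with a minimal sketch. This is not grite's actual data model or API: the `Event` fields, the logical-clock ordering key, and the "earliest claim wins" rule are all assumptions chosen for illustration. The idea is only that merging append-only logs by set union yields the same state regardless of merge order, whereas overwriting a shared file would keep just one writer's view.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Event:
    # Hypothetical event shape; grite's real log format may differ.
    lamport: int   # logical clock used as the primary ordering key
    agent: str
    action: str    # e.g. "claim"
    task: str


def merge(log_a, log_b):
    """Union of two append-only logs: commutative, associative, idempotent."""
    return frozenset(log_a) | frozenset(log_b)


def replay(log):
    """Derive task ownership by replaying events in a deterministic total order."""
    owner = {}
    for ev in sorted(log):
        if ev.action == "claim":
            owner.setdefault(ev.task, ev.agent)  # earliest claim wins (assumed rule)
    return owner


# Two agents write concurrently to their own replicas.
log_a = {Event(1, "agent-a", "claim", "T1")}
log_b = {Event(1, "agent-b", "claim", "T1"), Event(2, "agent-b", "claim", "T2")}

# Both merge orders produce the same log and therefore the same derived state.
print(replay(merge(log_a, log_b)) == replay(merge(log_b, log_a)))
```

Because no event is ever overwritten, the conflicting claims on `T1` both remain in the merged log, which is what makes collisions recoverable after the fact.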

2606.19613 2026-06-19 cs.SE cs.AI 新提交

StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

StaminaBench: 对编码智能体进行100轮交互的压力测试

Vlad Sobal, Shuo Yang, Yuting Zhang, Wei Xia, Stefano Soatto

AI总结 提出StaminaBench基准,通过100轮连续变更请求测试编码智能体的耐力,发现所有模型在5-6轮内失败,而测试反馈和重试机制可将通过轮数提升12倍。

详情
AI中文摘要

我们引入了StaminaBench,一个衡量编码智能体耐力的基准:它们在失败前能处理多少连续交互轮次(变更请求)。与流行的任务解决率指标不同,这更贴近真实的氛围编程(vibe coding)场景,其中会话会持续数十甚至数百轮。在StaminaBench中,智能体实现一个REST API服务器,并在可调数量的程序生成的后续变更请求(实验中为100个)上进行修改,导致代码库最多达6000行。测试完全以编程方式生成,无需LLM参与,确保可重复性和可靠性;变更序列来自硬编码或LLM驱动的采样器,两者都受限于结构化动作空间以确保变更有效。智能体和服务器在隔离环境中运行,并通过HTTP与基准通信,使测试完全黑盒且与语言无关。我们评估了六个智能体框架与七个开源LLM在20个场景(每个100轮)上的表现,发现:(1)所有测试模型在5-6轮内失败,确认了缺乏充分测试的氛围编程式开发会产生错误;(2)将测试反馈传递给智能体并允许重试,可将通过轮数提升最多12倍;(3)良好的框架是强性能所必需的:更强的模型在其最佳和最差框架之间表现出高达6倍的差距,而较弱的模型在任何框架下都失败。我们发布了基准和生成的任务,以促进对多轮编码智能体行为的进一步研究。基准代码和数据:github.com/amazon-science/StaminaBench。

英文摘要

We introduce StaminaBench, a benchmark that measures the stamina of coding agents: how many consecutive interaction turns (change requests) they can handle before failing. Unlike the prevailing fraction-of-tasks-solved metric, this matches real vibe-coding where sessions run dozens or hundreds of turns. In StaminaBench, agents implement a REST API server and modify it across a tunable number of procedurally generated follow-up change requests - 100 in our experiments, resulting in codebases of up to 6,000 lines. Tests are generated fully programmatically without LLM involvement, ensuring reproducibility and reliability; change sequences are drawn from either a hardcoded or LLM-driven sampler, both constrained to a structured action space to ensure changes are valid. The agent and the server run in an isolated environment and communicate with the benchmark through HTTP, making testing fully black-box and language-agnostic. We evaluate six agent harnesses paired with seven open-source LLMs across 20 scenarios of 100 turns each and find that: (1) all the tested models fail within 5-6 turns, confirming that vibe-coding-style programming without thorough testing produces bugs; (2) passing test feedback back to the agent and allowing it to retry improves passed turn count by up to 12x; and (3) a good harness is required for strong performance: stronger models exhibit up to a 6x gap between their best and worst harness, while weaker models fail with any harness. We release the benchmark and the generated tasks to enable further research into multi-turn coding agent behavior. Benchmark code and data: github.com/amazon-science/StaminaBench.
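As a rough illustration of the stamina idea (consecutive turns passed before the first failure, optionally with retries), here is a minimal sketch. The function name, the per-turn attempt lists, and the `max_attempts` parameter are assumptions for illustration and are not taken from the benchmark's code; the exact scoring rule in StaminaBench may differ.

```python
def stamina(turns, max_attempts=1):
    """Count consecutive passed turns before the first failed turn.

    `turns` is a list where each item holds the pass/fail outcomes of
    successive attempts at that turn; only the first `max_attempts`
    attempts are considered.
    """
    passed = 0
    for attempts in turns:
        if not any(attempts[:max_attempts]):
            break  # the session ends at the first turn that never passes
        passed += 1
    return passed


# Hypothetical session: turn 3 passes only on its second attempt.
session = [[True], [True], [False, True], [True], [False, False]]
print(stamina(session, max_attempts=1))
print(stamina(session, max_attempts=2))
```

Allowing a retry lets the hypothetical session survive turn 3, which mirrors the abstract's point that feeding test results back to the agent extends how long a session lasts.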

2606.19605 2026-06-19 cs.SE cs.AI 新提交

FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines

FAPO:多步骤LLM流水线的全自动提示优化

Paul Kassianik, Baturay Saglam, Huaibo Zhao, Blaine Nelson, Supriti Vijay, Aman Priyanshu, Amin Karbasi

AI总结 提出FAPO框架,通过自动诊断流水线瓶颈并迭代优化提示或链结构,在18个模型-基准比较中15次优于基线GEPA,平均提升14.1个百分点。

详情
AI中文摘要

多步骤LLM流水线因检索、推理和格式化步骤间的交互而失败,因此仅提示优化可能遗漏链中的瓶颈。我们提出FAPO(全自动提示优化),一个让Claude Code在标准化代码库内优化LLM流水线的框架。FAPO评估流水线、检查中间步骤、诊断失败、提出范围变更,并重复验证变体以针对评分函数进行优化。它首先尝试提示编辑,仅当提示优化似乎不足时,在归因识别出结构瓶颈的情况下,在允许范围内更改链结构。在六个基准和三个任务模型上,FAPO在18个模型-基准比较中的15个中击败了基线GEPA。在11个模型-基准比较中,FAPO以不重叠的均值±试验标准差范围获胜,平均FAPO-GEPA增益为+14.1个百分点。在六个HoVer和IFBench比较中,当提示优先搜索升级为结构变更时,FAPO在所有六个中获胜,平均增益为+33.8个百分点。FAPO还提高了安全任务的性能:在CTIBench-RCM(一个安全CVE到CWE任务)上,仅提示的FAPO在GPT-5上提升了+4.0个百分点的测试准确率,在Foundation-Sec-8B-Instruct上提升了+7.1个百分点,在Foundation-Sec-8B-Reasoning上提升了+2.0个百分点。这些结果使FAPO成为通用和安全任务的最先进流水线优化技术。

英文摘要

Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants repeatedly to optimize against a score function. It first tries prompt edits and, only when prompt optimization appears insufficient, changes chain structure within the permitted scope when attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the baseline GEPA in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean $\pm$ trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security CVE-to-CWE task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.

2606.19409 2026-06-19 cs.SE cs.PL 新提交

OpenRath: Session-Centered Runtime State for Agent Systems

OpenRath: 面向会话的代理系统运行时状态

Fukang Wen, Zhijie Wang, Ruilin Xu

AI总结 针对代理系统运行时状态碎片化问题,提出以Session为核心的一等运行时抽象,支持分支、检查、重放、后端感知和组合,使fork、merge和replay成为显式运行时操作。

详情
AI中文摘要

现代代理系统常常遭受碎片化的运行时状态:对话记录、工具效果、内存事件、工作区放置、分支来源和重放证据被分别记录,难以检查或重现。OpenRath通过一个类似PyTorch的编程模型来解决这个问题,适用于多代理、多会话系统。这里的类比涉及中心一等运行时抽象的角色,而非张量计算。其核心抽象是Session,即在代理和工作流之间传递的运行时值。Session是可分支、可检查、可重放、后端感知且可组合的。它记录对话片段、沙箱放置、谱系元数据、令牌使用、待处理工作和工具证据,同时定义内存交互进入运行时记录的位置。由于此状态由程序执行中使用的同一值携带,fork、merge和replay成为显式的运行时操作,而非从外部痕迹重建的状态。OpenRath进一步定义了Sandbox、Tool、Agent、Memory、Workflow和Selector,其中Selector将控制流转化为运行时路由的决策。本报告介绍了编程模型、架构、审计里程碑和证据协议。其主张仅限于受控的运行时属性,而广泛的定量比较、实时提供者质量、可选后端可用性和内存质量留待后续评估。核心论点是Session为代理系统提供了一个一等运行时值,用于可审计的组合。

英文摘要

Modern agent systems often suffer from fragmented runtime state: transcripts, tool effects, memory events, workspace placement, branch provenance, and replay evidence are recorded separately and become difficult to inspect or reproduce. OpenRath addresses this issue with a PyTorch-like programming model for multi-agent, multi-session systems. The analogy concerns the role of a central first-class runtime abstraction, not tensor computation. Its core abstraction is Session, the runtime value passed between agents and workflows. A Session is branchable, inspectable, replayable, backend-aware, and composable. It records conversation chunks, sandbox placement, lineage metadata, token usage, pending work, and tool evidence, while defining where memory interactions enter the runtime record. Since this state is carried by the same value used in program execution, fork, merge, and replay become explicit runtime operations rather than states reconstructed from external traces. OpenRath further defines Sandbox, Tool, Agent, Memory, Workflow, and Selector, with Selector turning control flow into runtime-routed decisions. This report presents the programming model, architecture, audited milestones, and evidence protocol. Its claims are limited to controlled runtime properties, while broad quantitative comparisons, live-provider quality, optional-backend availability, and memory quality are left for follow-on evaluation. The central thesis is that Session provides agent systems with a first-class runtime value for auditable composition.

2606.19407 2026-06-19 cs.SE cs.AI 新提交

JustDiag!: A Diagnostic Justification Engine for Accountable Root Cause Analysis

JustDiag!:用于可问责根本原因分析的诊断论证引擎

Tingzhu Bi, Xinrui Jiang, Xun Zhang, Pengcheng Su, Congjie He, Jinglin Li, Ping Wang, Meng Ma

AI总结 提出JustDiag诊断论证引擎,通过维护显式的过程状态(证据、发现、竞争假设、冲突和下一步检查)来支持可问责的根本原因分析,在66个真实事件上评估显示其优于仅提供流畅最终答案的方法。

详情
AI中文摘要

大型语言模型可以生成流畅的根本原因分析,但仅凭流畅的最终答案不足以证明高风险操作中的可问责性。在实际事件响应中,工程师需要知道哪些证据支持诊断,考虑了哪些替代方案,哪里存在矛盾,以及系统是解决了问题还是保留了不确定性。我们通过JustDiag填补了这一空白,这是一个用于RCA的诊断论证引擎,它维护了关于证据、发现、竞争假设、冲突和下一步检查的显式过程状态。我们使用两层协议在66个真实事件上评估了该系统,该协议分别对最终答案质量和过程质量进行评分。与没有诊断论证的匹配对照组相比,JustDiag获得了更强的结果和过程分数,同时由于更校准的非闭合性而接受了略低的终端完成率。这些结果表明,可问责的RCA需要显式的诊断论证工件和过程感知评估,而不仅仅是流畅的最终答案。

英文摘要

Large language models can produce fluent root cause analyses, but fluent final answers alone are insufficient evidence for accountability in high-stakes operations. In real incident response, engineers need to know what evidence supported a diagnosis, which alternatives were considered, where contradictions remained, and whether the system resolved the case or preserved uncertainty. We address this gap with JustDiag, a diagnostic justification engine for RCA that maintains an explicit process state over evidence, findings, competing hypotheses, conflicts, and next checks. We evaluated the system on 66 real-world incidents using a two-layer protocol that separately scores final-answer quality and process quality. Relative to a matched control without diagnostic justification, JustDiag achieved stronger outcome and process scores, while accepting slightly lower terminal completion due to more calibrated non-closure. These results suggest that accountable RCA requires explicit diagnostic justification artifacts and process-aware evaluation, not only fluent final answers.

2606.19395 2026-06-19 cs.SE 新提交

DevOps and General Developers: Insights from Stack Overflow's 2023 Survey

DevOps 与普通开发者:来自 Stack Overflow 2023 年调查的见解

Hasan Abdulla, Fatema AlJazeeri, Fawzi AlBalooshi, Jaflah Al-Ammary

AI总结 通过分析 Stack Overflow 2023 年调查数据,比较 DevOps 专家与普通开发者在工具、技术、方法论和人口统计上的差异,发现两者角色互补,工具偏好无显著差异。

Comments 17 pages, 11 tables, research paper based on the 2023 Stack Overflow Developer Survey data analysis

详情
AI中文摘要

目的:调查 DevOps 专家和普通软件开发者在当前软件开发环境中不同的角色,考察他们在工具、技术、方法论和人口统计方面的不同使用情况。此外,区分这两个专业群体在该领域的独特贡献和挑战。设计/方法论/方法:研究采用定量方法分析 Stack Overflow 2023 年开发者调查数据。重点比较 DevOps 专家和普通开发者在技术偏好、人口统计信息和专业经验方面的差异,突出关键趋势和差异。数据分析使用 Python 的 Pandas 库进行。发现:研究表明,DevOps 专家和普通开发者在工具和技术偏好上没有显著差异,突出了他们的互补角色。DevOps 专家和普通开发者都使用 Docker 和 Kubernetes 等工具,强调效率和自动化。而普通开发者根据不同的角色需求使用多样化的工具,人口统计趋势显示普通开发者更年轻,DevOps 专业人员处于职业生涯中期。这一年龄范围反映了 DevOps 经验的增长,两个群体都在适应技术行业不断发展的远程和混合工作模式。实际意义:这项研究提供了对软件开发中动态角色的视角,强调了 DevOps 日益增长的重要性。它是学术和行业专业人士了解软件开发角色不断演变的宝贵资源。原创性/价值:这项研究填补了现有文献中关于软件开发角色动态演变的重要空白。

英文摘要

Purpose: To investigate the distinct roles of DevOps specialists and general software developers by examining how the two groups differ in their tools, technologies, methodologies, and demographics in the current software development environment, and to differentiate their unique contributions and challenges in the field. Design/Methodology/Approach: The research uses a quantitative approach to analyze data from the Stack Overflow 2023 Developer Survey. It focuses on a comparative analysis of technological preferences, demographic information, and professional experiences between DevOps specialists and general developers, highlighting key trends and differences. The data analysis was conducted using Python's Pandas library. Findings: The research indicates no significant difference in tool and technology preferences between DevOps specialists and general software developers, highlighting their complementary roles. Both groups use tools like Docker and Kubernetes, emphasizing efficiency and automation, while general developers employ diverse tools for various role demands. Demographic trends show that general developers are younger and DevOps professionals are mid-career. This age range reflects growing experience in DevOps, and both groups are adapting to remote and hybrid work models in the evolving tech industry. Practical Implications: This research offers perspectives on the dynamic roles within software development, emphasizing the growing importance of DevOps. It is a valuable resource for academic and industry professionals to understand the evolving dynamics in software development roles. Originality/Value: This research fills a significant gap in the existing literature regarding the evolving dynamics of software development roles.

2606.19390 2026-06-19 cs.SE cs.AI 新提交

Execution-bound advisory automation for agentic AI: a reproducible AIBOM-driven CSAF-VEX framework

面向代理式AI的执行绑定公告自动化:一种可复现的AIBOM驱动的CSAF-VEX框架

Petar Radanliev, Omar Santos, Carsten Maple, Kay Atefi

AI总结 提出一种协议驱动框架,通过绑定SBOM和AIBOM工件与确定性环境捕获及结构化运行时遥测,结合静态与运行时证据生成CSAF VEX公告,经密码签名和确定性重放验证,在合成自主AI工作负载上评估。

Journal ref Execution-bound advisory automation for agentic AI: a reproducible AIBOM-driven CSAF-VEX framework. Front Artif Intell 9 (May 2026), 1826384

详情
AI中文摘要

提出一种协议驱动框架,将SBOM和AIBOM工件绑定到确定性环境捕获和结构化运行时遥测。利用声明的工件、观察到的激活条件和强制执行的策略计算可利用性。从静态和运行时证据生成CSAF VEX公告,经密码签名并通过确定性重放验证。评估使用约10000个组件条目,涵盖50到5000个组件的合成自主AI工作负载,并整合OSV、GitHub Advisory、KEV和EPSS数据集。

英文摘要

A protocol-driven framework is presented that binds SBOM and AIBOM artefacts to deterministic environment capture and structured runtime telemetry. Exploitability is computed from declared artefacts, observed activation conditions, and enforced execution policies. CSAF VEX advisories are generated from combined static and runtime evidence, cryptographically signed, and validated through deterministic replay. Evaluation uses approximately 10000 component entries across synthetic agentic AI workloads of 50 to 5000 components, incorporating OSV, GitHub Advisory, KEV, and EPSS datasets.

2606.19388 2026-06-19 cs.SE cs.CL cs.HC 新提交

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

超越GUI范式:移动代理是否需要手机屏幕?

Li Gu, Zihuan Jiang, Linqiang Guo, Zhixiang Chi, Ziqiang Wang, Huan Liu, Yuanhao Yu, Tse-Hsun Chen, Yang Wang

AI总结 本文挑战移动代理的GUI主导范式,提出CLI应同等重要,通过实验证明CLI代理在AndroidWorld和MobileWorld上超越GUI基线,并引入CLI-Advantage任务套件展示其优势。

详情
AI中文摘要

近期移动代理的进展主要由GUI范式主导,其中代理感知UI信息并发出屏幕交互。然而,移动平台也提供了命令行接口(CLI),可直接访问设备服务和数据。我们认为CLI应与GUI同等重要。我们在AndroidWorld和MobileWorld上,使用四种模型API评估了三个编码代理(Claude Code、Terminus-2、mini-swe-agent),未进行任何移动特定后训练,并与三个可复现的GUI基线(GUI-Owl-1.5-32B、MAI-UI、Qwen3-VL-32B)进行比较。Claude Code(Opus 4.7)达到71.8%和51.9%,优于所有可复现的GUI基线(AndroidWorld上69.3/68.1/57.8%;MobileWorld上43.2/26.3/13.3%),而其他CLI配置也保持竞争力。为确立该范式的上限,我们提供了oracle CLI解决方案,在AndroidWorld上达到88.8%(103/116个任务可CLI解决),在MobileWorld上达到86.3%(101/117个任务可CLI解决),表明未来有大量改进空间。为覆盖GUI范围之外的日常用户意图,我们引入了\textbf{CLI-Advantage任务套件},包含五个类别的45个模板:批量操作、多条件过滤、聚合、跨应用工作流和隐藏设备状态。所有CLI代理在所有五个类别中均优于所有GUI基线,且每个任务步骤显著更少(10.7步 vs. 18.6步)。为支持未来移动CLI代理的研究,我们将开源代理实现、oracle解决方案、CLI-Advantage套件和评估基础设施。

英文摘要

Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct access to device services and data. We argue CLI deserves first-class consideration alongside GUI. We evaluate three coding agents (Claude Code, Terminus-2, mini-swe-agent) across four model APIs on AndroidWorld and MobileWorld without any mobile-specific post-training, comparing against three reproducible GUI baselines (GUI-Owl-1.5-32B, MAI-UI, Qwen3-VL-32B). Claude Code (Opus 4.7) reaches 71.8\% and 51.9\%, outperforming every reproducible GUI baseline (69.3/68.1/57.8\% on AndroidWorld; 43.2/26.3/13.3\% on MobileWorld), while every other CLI configuration remains competitive. To establish the paradigm's ceiling, we provide oracle CLI solutions that reach 88.8\% on AndroidWorld (103/116 tasks CLI-solvable) and 86.3\% on MobileWorld (101/117 tasks CLI-solvable), indicating substantial room for future improvement. To cover everyday user intents beyond the GUI scope, we introduce the \textbf{CLI-Advantage Task Suite}, comprising 45 templates across five categories: bulk operations, multi-condition filtering, aggregation, cross-app workflows, and hidden device state. Every CLI agent outperforms every GUI baseline in all five categories, with substantially fewer steps per task (10.7 vs.\ 18.6). To support future research on mobile CLI agents, we will open-source agent implementations, oracle solutions, the CLI-Advantage suite, and evaluation infrastructure.

2606.19387 2026-06-19 cs.SE cs.AI 新提交

Interpretable and Verifiable Hardware Generation with LLM-Driven Stepwise Refinement

可解释且可验证的硬件生成:基于LLM驱动的逐步细化

You Li, Samuel Mandell, David Z. Pan

AI总结 提出结合LLM创造力与形式化方法可解释性的硬件生成框架,通过迭代应用变换规则将设计规范转换为正确性有保证的RTL程序。

详情
AI中文摘要

大型语言模型(LLM)在软件开发中取得了显著成功。然而,它们容易产生幻觉,即可能引入微妙的语义和逻辑错误。由于芯片设计和制造的高风险,硬件工程师仍不愿依赖LLM进行寄存器传输级(RTL)生成。本文提出一种硬件生成框架,结合了LLM的创造力和广泛知识与形式化方法的可解释性和数学严谨性。具体而言,我们设计了一组覆盖各种设计决策和硬件特征的变换规则。通过迭代应用这些规则,LLM代理可以将设计规范转换为正确性有保证的RTL程序。实验结果证明了该框架的有效性和效率。

英文摘要

Large language models (LLMs) have achieved remarkable success in software development. However, they are susceptible to hallucinations, meaning that they can introduce subtle semantic and logical errors. Due to the high stakes in chip design and manufacturing, hardware engineers are still reluctant to rely on LLMs for register-transfer level (RTL) generation. In this paper, we propose a hardware generation framework that combines the creativity and broad knowledge of LLMs with the explainability and mathematical rigor of formal methods. Specifically, we devise a set of transformation rules that cover various design decisions and hardware features. By iteratively applying these rules, an LLM agent can convert a design specification into an RTL program with guaranteed correctness. Experimental results demonstrate the effectiveness and efficiency of the framework.

2606.19386 2026-06-19 cs.SE cs.AI cs.LG 新提交

Bistable by Construction: Wall-Clock-Calibrated State Monitors Have No Moment-Detection Regime at Agent Cadence

构造即双稳态:挂钟校准的状态监视器在代理节奏下不存在时刻检测区间

Manvendra Modgil

AI总结 本文发现挂钟校准的泄漏积分器监视器在代理流中无法作为瞬间检测器工作,揭示了校准类别的关键影响,并提出了上升沿触发作为替代方案。

Comments 10 pages, 5 figures. Sequel to arXiv:2606.04296. Pre-registered; falsification clauses honored (H5 unsupported; H7 strict band 16/20). Repo: https://github.com/2025eb1100268-tech/intervention-timing-saturation-trap

详情
AI中文摘要

自主代理的运行时监视器通常对累积的内部状态(行为基线、漂移统计量,或在我们之前工作中的建模情感状态)设置阈值。我们之前报告了一个状态饱和陷阱:在连续情感引擎上基于状态阈值的触发器在SWE-bench调试代理(Modgil 2026)上变成了近乎恒定的警报。发布后审计发现引擎在动作之间接收到的dt=0,因此其指数衰减从未生效:已发布的陷阱是一个纯累加器的结果。我们更正了记录(勘误,v2)并将该缺陷视为一次实验。它揭示的关键变量是监视器的动态是按样本时间(每次观测,如CUSUM)还是按挂钟时间(半衰期以秒计,如情感模型和EMA基线)校准的。在固定速率流上两者一致;在代理流上,动作间时间相差几个数量级,两者不再一致。在20条轨迹上对均匀间隔(dt在{0..600}秒内)的预注册扫描显示,挂钟水平触发器有两个工作区间:在dt<=1秒时恒定警报(20/20;中位数18次触发);在dt>=60秒时静默。每个临界dt都位于(1,30]秒内。真实代理运行测得的延迟中位数为1.53秒(p90为2.33秒);真实编码节奏位于陷阱区间内,在修正后的机制下证实了经验发现。该结构是校准类别的属性,而非引擎的属性:在原始误差流上的最小挂钟累加器重现了相同的悬崖,而相同流上的样本时间CUSUM严格与dt无关(20/20)。带有滞后的上升沿触发器在每个条件下每条轨迹触发0-3次。我们得出结论,挂钟校准的泄漏积分器监视器在代理流上不存在可充当时刻检测器的工作区间;转换检测在每个节奏下都逃脱了陷阱,但无法恢复人工干预时机。

英文摘要

Runtime monitors for autonomous agents commonly threshold an accumulated internal state - a behavioural baseline, a drift statistic, or, in our prior work, a modelled affective state. We previously reported a State Saturation Trap: threshold-on-state triggers over a continuous affect engine become near-constant alarms on SWE-bench debugging agents (Modgil 2026). A post-release audit found the engine received dt=0 between actions, so its exponential decay never operated: the published trap is a pure-accumulator result. We correct the record (erratum, v2) and treat the flaw as an experiment. The key variable it exposes is whether a monitor's dynamics are calibrated in sample time (per observation, as in CUSUM) or wall-clock time (half-lives in seconds, as in affect models and EMA baselines). On fixed-rate streams these coincide; on agent streams, where inter-action time varies by orders of magnitude, they do not. A pre-registered sweep over uniform intervals (dt in {0..600}s) on 20 trajectories shows the wall-clock level trigger has two regimes: at dt<=1s a constant alarm (20/20; median 18 firings); at dt>=60s silent. Every critical dt lies in (1,30]s. Real agent runs measure latency at median 1.53s (p90 2.33s); real coding cadence sits inside the trap regime, vindicating the empirical finding under a corrected mechanism. The structure is a property of the calibration class, not the engine: a minimal wall-clock accumulator over the raw error stream reproduces the same cliff, while a sample-time CUSUM over the identical stream is exactly dt-invariant (20/20). A rising-edge trigger with hysteresis fires 0-3 times per trajectory in every condition. We conclude that wall-clock-calibrated leaky-integrator monitors admit no regime in which they act as moment detectors on agent streams; transition detection escapes the trap at every cadence, but does not recover human intervention timing.
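The distinction between the two calibration classes can be made concrete with a small sketch. This is not the authors' code: the half-life, drift, threshold, and error stream below are assumed values picked only to show the qualitative behaviour. A wall-clock leaky integrator decays according to elapsed seconds, so its level depends on the inter-action interval `dt`, while a sample-time CUSUM updates once per observation and never sees `dt` at all.

```python
def wall_clock_level(errors, dt, half_life=10.0):
    """Leaky integrator whose decay is calibrated in seconds (wall-clock time)."""
    decay = 0.5 ** (dt / half_life)  # fraction of state surviving dt seconds
    state, trace = 0.0, []
    for e in errors:
        state = state * decay + e
        trace.append(state)
    return trace


def sample_time_cusum(errors, drift=0.3):
    """One-sided CUSUM: the update depends on the observation count, not on dt."""
    state, trace = 0.0, []
    for e in errors:
        state = max(0.0, state + e - drift)
        trace.append(state)
    return trace


def firings(trace, threshold):
    """Number of steps on which a level trigger would raise an alarm."""
    return sum(1 for s in trace if s >= threshold)


errors = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0] * 3  # hypothetical error stream

for dt in (1.0, 600.0):
    n = firings(wall_clock_level(errors, dt), threshold=2.0)
    print(f"dt={dt:>6}s  wall-clock level-trigger firings: {n} of {len(errors)}")
print("CUSUM firings (identical for every dt):",
      firings(sample_time_cusum(errors), threshold=2.0))
```

With these assumed parameters, the short interval lets the state pile up faster than it decays, so the trigger fires on most steps, while the long interval lets it decay almost fully between actions, so it never fires. That is the same two-sided shape the abstract describes, though the real thresholds and critical intervals come from the paper's experiments, not from this toy.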

2606.19382 2026-06-19 cs.SE cs.AI 新提交

DynAMO: Dynamic Asset Management Orchestration via Topological Multi-Agent Scheduling

DynAMO:基于拓扑多智能体调度的动态资产管理编排

Kanishk Kushwaha, Vikrant Vinod Bansode, Harsh Vardhan, Dhaval C. Patel

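AI summary Proposes the DynAMO engine, which uses a Plan-then-Execute architecture to generate verifiable workflow graphs and supports both sequential and parallel execution. By dynamically identifying independent tasks it improves efficiency, achieving a 1.6x latency reduction on an industrial benchmark while preserving correctness and safety.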

Comments 11 pages, 2 figures, 7 tables, 4 algorithms. Evaluated on the AssetOpsBench industrial benchmark. Code: https://github.com/kushwaha001/DynAMO

Details
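AI Chinese abstract (translated to English)

Although LLM-based agents offer end-to-end automation for industrial asset lifecycles, real-world Industry 4.0 deployments are hindered by latency, concurrency instability, and safety risks. We propose DynAMO (Dynamic Asset Management Orchestration), a deployment-ready engine that uses a Plan-then-Execute architecture to generate verifiable workflow graphs. DynAMO supports sequential workflows (topological execution) and parallel workflows (dependency-aware concurrency). By dynamically identifying independent tasks, DynAMO significantly improves efficiency through controlled reasoning overlap while preserving structural correctness and safety. Across six controlled experiments on the AssetOpsBench industrial benchmark, DynAMO shows substantial gains in performance and robustness. Parallel execution reduces median end-to-end latency by 1.6x relative to sequential orchestration, reaching 1.8x on highly parallelizable workflows. With realistic latencies added to external tool calls, a latency breakdown shows that LLM reasoning and orchestration still account for more than 90% of execution time, indicating that model inference is the main system bottleneck. Structured context pruning reduces inference latency by about 30%, and DynAMO maintains correct functional behaviour (task completion, agent sequencing, and output quality) under controlled fault injection while degrading gracefully. Reproducibility analysis further confirms stable execution across repeated runs, with parallel scheduling reducing latency variance. These findings establish DynAMO as a practical blueprint for scalable, safe, and latency-aware agent deployment in Industry 4.0 automation pipelines. Code is available at: https://github.com/kushwaha001/DynAMO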

English abstract

While LLM-powered agents offer end-to-end automation for industrial asset lifecycles, real-world Industry 4.0 deployment is hindered by latency, concurrency instability, and safety risks. We present DynAMO (Dynamic Asset Management Orchestration), a deployment-ready engine using a Plan-then-Execute architecture to generate verifiable workflow graphs. DynAMO supports both SequentialWorkflow (topological execution) and ParallelWorkflow (dependency-aware concurrency). By dynamically identifying independent tasks, DynAMO preserves structural correctness and safety while significantly improving efficiency through controlled reasoning overlap. Across six controlled experiments on the AssetOpsBench industrial benchmark, DynAMO demonstrates substantial performance and robustness gains. Parallel execution reduces end-to-end latency by a median of 1.6x over sequential orchestration, rising to 1.8x on highly parallelizable workflows. After instrumenting external tool calls with realistic latencies, a latency decomposition shows that LLM reasoning and orchestration still account for more than 90% of execution time, identifying model inference as the primary system bottleneck. Structured context pruning reduces inference latency by approximately 30%, and DynAMO maintains correct functional behaviour (task completion, agent sequencing, and output quality) while exhibiting graceful degradation under controlled fault injection. Reproducibility analysis further confirms stable execution under repeated runs, with parallel scheduling reducing latency variance. These findings establish DynAMO as a practical blueprint for scalable, safe, and latency-aware agent deployment in Industry 4.0 automation pipelines. Code is available at: https://github.com/kushwaha001/DynAMO
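The dependency-aware concurrency described in the abstract can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not DynAMO's implementation: the function `run_parallel`, the task names, and the batch-at-a-time scheduling policy are all hypothetical, and the actual engine lives in the linked repository.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def run_parallel(deps, run_task, max_workers=4):
    """Run tasks in dependency order, executing each ready batch concurrently.

    deps maps a task name to the set of tasks it depends on.
    Returns the batches in the order they were executed.
    Hypothetical helper for illustration only.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the plan has a cycle
    batches = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())  # tasks with all dependencies done
            list(pool.map(run_task, ready))     # block until the whole batch finishes
            sorter.done(*ready)
            batches.append(ready)
    return batches


# Hypothetical workflow graph: two independent fetches, then dependent steps.
plan = {
    "fetch_sensor_data": set(),
    "fetch_work_orders": set(),
    "detect_anomalies": {"fetch_sensor_data"},
    "summarize": {"detect_anomalies", "fetch_work_orders"},
}

batches = run_parallel(plan, lambda name: name)
print(batches)
```

Here the two fetch tasks share no dependency, so they land in the same batch and run concurrently. A purely sequential run would instead iterate over `TopologicalSorter(plan).static_order()`. Waiting for each whole batch is a simplification; a scheduler could also release a task as soon as its own dependencies finish.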