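arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI Large Models

AI Agent

Agents, tool calling, planning, workflows, multi-agent systems, and autonomous task execution.

Indexed for today/current date: 11. Signal sources: cs.AI, cs.CL, cs.LG, cs.SE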
2606.20363 2026-06-19 cs.AI New submission 90%

Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining

Yuexing Hao, Xiaomin Li

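Affiliations * Massachusetts Institute of Technology; Harvard University

Topic match Software agents: mines skill libraries from GUI trajectories for computer-using agents.

AI summary Proposes a three-stage pipeline that mines readable skill libraries from GUI trajectories, but finds that readability does not guarantee downstream policy gains: GRPO yields only a marginal improvement, exposing the limits of current methods.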

Abstract

Explicit skill libraries make computer-using agents easier to inspect, but it remains unclear whether such libraries can be mined from interaction data in a way that improves downstream policies. We study this question through a three-stage pipeline that segments GUI trajectories, clusters segments into candidate skills, and trains a skill-aware policy from the resulting annotations. The mined clusters are readable on the source benchmark: five of eight clusters have at least 0.95 purity against InteraSkill Workflows labels. However, readability does not imply transfer. GRPO improves IW skill-step accuracy only from 18.5% to 20.5%, leaves BrowseComp+ essentially unchanged, and underperforms trivial frequency priors on key source-domain metrics. We therefore present the method as a diagnostic study: trajectory mining can expose inspectable skill structure, but the current boundary detector, orderless segment representation, and offline reward model are insufficient for reliable cross-domain policy improvement.
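
The purity figure quoted in the abstract can be made concrete with a small sketch. It assumes the usual definition of cluster purity (the share of a cluster's segments that carry its most common reference label); the abstract does not spell out the exact computation, and the cluster assignments and labels below are made-up toy data.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, labels):
    """Return {cluster_id: purity}, where purity is the share of the majority label."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, labels):
        members[cid].append(label)
    return {
        cid: Counter(lbls).most_common(1)[0][1] / len(lbls)
        for cid, lbls in members.items()
    }

# Toy data (hypothetical): two mined clusters over six trajectory segments.
cluster_ids = [0, 0, 0, 0, 1, 1]
labels = ["login", "login", "login", "search", "export", "export"]

purity = cluster_purity(cluster_ids, labels)
# Clusters that would count as "readable" under a 0.95 purity threshold.
readable = [cid for cid, p in purity.items() if p >= 0.95]
print(purity, readable)
```

A high purity only says that a cluster lines up with an existing label; as the abstract stresses, it says nothing about whether a policy trained on those clusters transfers.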

2606.19388 2026-06-19 cs.SE cs.CL cs.HC New submission 90%

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

Li Gu, Zihuan Jiang, Linqiang Guo, Zhixiang Chi, Ziqiang Wang, Huan Liu, Yuanhao Yu, Tse-Hsun Chen, Yang Wang

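Affiliations * Mila – Québec AI Institute; Concordia University; University of Toronto; McMaster University

Topic match Software agents: studies mobile agents, comparing the GUI and CLI paradigms.

AI summary Challenges the GUI-dominated paradigm for mobile agents and argues that CLI deserves equal standing. In the experiments the strongest CLI agent (Claude Code with Opus 4.7) surpasses every reproducible GUI baseline on AndroidWorld and MobileWorld, and a new CLI-Advantage task suite shows where CLI agents have the edge.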

Abstract

Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct access to device services and data. We argue CLI deserves first-class consideration alongside GUI. We evaluate three coding agents (Claude Code, Terminus-2, mini-swe-agent) across four model APIs on AndroidWorld and MobileWorld without any mobile-specific post-training, comparing against three reproducible GUI baselines (GUI-Owl-1.5-32B, MAI-UI, Qwen3-VL-32B). Claude Code (Opus 4.7) reaches 71.8% and 51.9%, outperforming every reproducible GUI baseline (69.3/68.1/57.8% on AndroidWorld; 43.2/26.3/13.3% on MobileWorld), while every other CLI configuration remains competitive. To establish the paradigm's ceiling, we provide oracle CLI solutions that reach 88.8% on AndroidWorld (103/116 tasks CLI-solvable) and 86.3% on MobileWorld (101/117 tasks CLI-solvable), indicating substantial room for future improvement. To cover everyday user intents beyond the GUI scope, we introduce the CLI-Advantage Task Suite, comprising 45 templates across five categories: bulk operations, multi-condition filtering, aggregation, cross-app workflows, and hidden device state. Every CLI agent outperforms every GUI baseline in all five categories, with substantially fewer steps per task (10.7 vs. 18.6). To support future research on mobile CLI agents, we will open-source agent implementations, oracle solutions, the CLI-Advantage suite, and evaluation infrastructure.
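
As a quick check on the arithmetic in the abstract, the oracle percentages follow directly from the CLI-solvable task counts, and the gap to the best reported agent gives the remaining headroom. The sketch below uses only numbers quoted in the abstract; the helper name `pct` is illustrative.

```python
def pct(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)

# Task counts and agent scores are the ones quoted in the abstract.
oracle = {"AndroidWorld": pct(103, 116), "MobileWorld": pct(101, 117)}
claude_code = {"AndroidWorld": 71.8, "MobileWorld": 51.9}

# Headroom between the best reported CLI agent and the oracle, in percentage points.
headroom = {name: round(oracle[name] - claude_code[name], 1) for name in oracle}
print(oracle, headroom)
```

The headroom is about twice as large on MobileWorld as on AndroidWorld, which is consistent with the abstract's remark that substantial room for improvement remains.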

2606.20512 2026-06-19 cs.SE cs.LG New submission 85%

Probe-and-Refine Tuning of Repository Guidance for Coding Agents

Asa Shepard, Jeannie Albrecht

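Affiliations * Williams College

Topic match Software agents: focuses on optimizing repository guidance for coding agents.

AI summary Proposes probe-and-refine tuning, which uses synthetic bug-fix probes to iteratively diagnose and patch repository guidance files. On SWE-bench Verified with Qwen3.5-35B-A3B it reaches a 33.0% resolve rate, versus 28.3% for the static knowledge base and 25.5% for the unguided baseline.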

Abstract

LLM-based coding agents need higher-level operational knowledge about a repository (which files house which subsystems, how to run the test suite, which workflows have historically led to wrong fixes) that does not exist in the code itself. Engineers typically maintain AGENTS.md files to supply this context as instructions for coding agents, but whether they help is contested: recent studies disagree on whether LLM-generated guidance improves or harms agent performance. In this paper we show that how the guidance is produced is the decisive variable, and introduce probe-and-refine tuning: a procedure that uses synthetic bug-fix probes to iteratively diagnose and patch a repository's guidance file through single-shot LLM calls, with no agent loop or tool use during tuning. On SWE-bench Verified across four independent trials with Qwen3.5-35B-A3B at 200 steps, probe-and-refine achieves 33.0% mean resolve rate vs. 28.3% for the static knowledge base used to initialize it and 25.5% for an unguided baseline (p < 0.001 for both probe-and-refine contrasts). The improvement comes from coverage rather than precision: refined guidance produces evaluable patches for 14.5 percentage points (pp) more instances while per-patch precision remains statistically constant (~59%, p = 0.119), showing that improved guidance helps agents reach the correct file rather than improving the quality of the changes they make. Further, a step-budget experiment shows that guidance is what lets the agent use a larger step budget productively, and a cross-model experiment with NVIDIA-Nemotron-3-Nano-30B-A3B finds that the tuning loop degrades when the model cannot generate sufficiently diagnostic output, though per-patch precision remains constant even then.
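
The coverage-versus-precision distinction in the abstract can be illustrated with a minimal sketch. It assumes the resolve rate factors as coverage (instances with an evaluable patch, out of all instances) times precision (resolved patches, out of evaluable ones); the paper may define these quantities differently. The counts are hypothetical and are not the paper's data.

```python
def rates(total, evaluable, resolved):
    """Split the resolve rate into coverage and per-patch precision."""
    coverage = evaluable / total       # share of instances with an evaluable patch
    precision = resolved / evaluable   # share of evaluable patches that resolve the issue
    return coverage, precision, coverage * precision

# Hypothetical counts, chosen only to illustrate the decomposition.
base = rates(total=200, evaluable=80, resolved=48)
tuned = rates(total=200, evaluable=110, resolved=66)
print(base, tuned)
```

In this toy case precision is identical for both runs, so the entire gain in resolve rate comes from producing evaluable patches for more instances, which is the pattern the abstract reports.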

2606.20487 2026-06-19 cs.CL New submission 85%

Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems

Shu Yao, Yuhua Luo, Qian Long, Jingru Fan, Zhuoyuan Yu, Yuheng Wang, Lin Wu, Yufan Dang, Huatao Li, Chen Qian

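Affiliations * School of Artificial Intelligence, Shanghai Jiao Tong University; Shanghai Innovation Institute; Southeast University; Tsinghua University

Topic match Software agents: a hierarchical recovery framework for cross-device agent systems.

AI summary Proposes H-RePlan, a hierarchical replanning framework that uses unified API-CLI-GUI execution and a cross-layer failure abstraction to separate device-local strategy recovery from global replanning. On the HeraBench benchmark it substantially improves cross-device task completion and instruction adherence.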

Abstract

Real-world computer-use tasks often span multiple applications and devices, requiring agents to coordinate heterogeneous environments under dynamic runtime failures. Existing multi-device agent systems support task decomposition and cross-device assignment, but recovery remains largely coarse-grained: when execution fails, they typically retry the same strategy, reassign the subtask, or revise the global plan, without systematically modeling the device-local strategy space. This limits their ability to distinguish failures that can be repaired within the current device from those that require cross-device replanning. We propose H-RePlan, a hierarchical replanning framework for multi-device agents with unified API-CLI-GUI execution. H-RePlan equips each device with interchangeable execution strategies and separates device-local strategy recovery from orchestrator-level global replanning through a compact cross-layer failure abstraction. To evaluate this capability, we introduce HeraBench, a fault-injected benchmark that constructs cross-device workflows over Linux and Android devices and injects strategy- and device-level failures. Experiments show that H-RePlan substantially outperforms single-strategy and coarse-grained multi-device baselines, achieving higher completion, instruction adherence, and perfect-pass rates while reducing the token cost required for reliable end-to-end success. These results demonstrate that scope-aware hierarchical recovery is essential for robust multi-device agent execution.
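
The two recovery scopes described in the abstract can be sketched as follows. This is not H-RePlan's implementation: the `Failure` record, the `DeviceDown` exception, and both function names are hypothetical stand-ins for the idea that a device first exhausts its own strategies and only then reports a compact failure to the orchestrator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

class DeviceDown(Exception):
    """Hypothetical device-level failure: no local strategy can help."""

@dataclass
class Failure:
    """Compact failure summary passed from a device to the orchestrator (illustrative)."""
    device: str
    scope: str        # "strategy" (local options exhausted) or "device"
    tried: List[str]

def run_subtask(device: str, strategies: Dict[str, Callable[[], bool]]) -> Optional[Failure]:
    """Device-local recovery: try each strategy in order before reporting upward."""
    tried = []
    for name, attempt in strategies.items():
        tried.append(name)
        try:
            if attempt():
                return None  # success, nothing to report
        except DeviceDown:
            return Failure(device, "device", tried)
    return Failure(device, "strategy", tried)

def orchestrate(devices: Dict[str, Dict[str, Callable[[], bool]]]) -> str:
    """Global replanning: move to another device only after local recovery fails."""
    failures = []
    for device, strategies in devices.items():
        failure = run_subtask(device, strategies)
        if failure is None:
            return device
        failures.append(failure)
    raise RuntimeError(f"all devices failed: {failures}")

def api_fails() -> bool:
    return False

def cli_ok() -> bool:
    return True

def down() -> bool:
    raise DeviceDown()

winner = orchestrate({
    "android": {"api": down},
    "linux": {"api": api_fails, "cli": cli_ok, "gui": cli_ok},
})
print(winner)
```

Here the Android device fails at device scope and is skipped at once, while the Linux device recovers locally by falling back from the API strategy to the CLI strategy without involving the orchestrator.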

2606.20158 2026-06-19 cs.SE New submission 85%

N-Version Programming with Coding Agents

Javier Ron, Benoit Baudry, Martin Monperrus

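Topic match Software agents: N-version programming with coding agents.

AI summary Revisits N-version programming in the setting of contemporary AI coding agents. Replicating the Knight-Leveson experiment, it evaluates how diversity across agent systems, models, and implementation languages affects failure modes. Common-mode failures persist, but majority-voting three-version units cut failure counts substantially, which supports the strategy's engineering value.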

Abstract

This paper revisits the classical concept of N-version programming in the setting of contemporary AI coding agents. Revisiting the seminal Knight-Leveson experiment, we study whether diversity across agent systems, models, and implementation languages creates diverse failure modes. Using Knight and Leveson's Launch Interceptor Program specification, we evaluate 48 agent-generated implementations on a shared oracle and a campaign of 1,000,000 randomized test inputs. The results show substantial common-mode failure, in line with the findings of Knight and Leveson. Further analysis shows that many of those co-occurring failures can be traced to places where the specification is particularly hard or ambiguous. We also demonstrate that diversity from coding agents provides practical benefit: across majority-voting three-version units, the mean failure count drops from 387.44 for single versions to 130.99 for triples, and 11,844 N-version units exhibit zero observed failures. Our original results are the strongest evidence to date that N-version programming with coding agents is a useful engineering strategy.
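
A three-version unit with majority voting can be sketched in a few lines. The toy versions below are hypothetical and stand in for agent-generated implementations; they are built so that one fault is independent and one is shared, which shows both the benefit of voting and why common-mode failures defeat it. The last line counts the distinct triples that can be drawn from 48 implementations; whether the paper's N-version units are exactly these triples is an assumption.

```python
from collections import Counter
from math import comb

def majority_vote(outputs):
    """Return the output given by a strict majority of versions, or None if there is none."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

def unit_failures(versions, oracle, inputs):
    """Count inputs on which the voted answer of the unit disagrees with the oracle."""
    return sum(majority_vote([v(x) for v in versions]) != oracle(x) for x in inputs)

# Toy oracle and versions (hypothetical).
def oracle(x):
    return x % 2 == 0

def v1(x):
    return True if x == 3 else oracle(x)   # wrong on input 3

def v2(x):
    return False if x == 4 else oracle(x)  # wrong on input 4 (independent fault)

def v3(x):
    return True if x == 3 else oracle(x)   # shares v1's fault (common-mode failure)

inputs = range(10)
singles = [unit_failures([v], oracle, inputs) for v in (v1, v2, v3)]
triple = unit_failures([v1, v2, v3], oracle, inputs)
print(singles, triple, comb(48, 3))
```

Voting masks the independent fault on input 4 but not the shared fault on input 3, which mirrors the abstract's finding that failure counts drop sharply while common-mode failures remain.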

2606.19930 2026-06-19 cs.HC New submission 85%

MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization

Guangyi Liu, Pengxiang Zhao, Gao Wu, Yiwen Yin, Mading Li, Liang Liu, Congxiao Liu, Zhang Qi, Mengyan Wang, Liang Guo, Yong Liu

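Topic match Software agents: proposes MobileForge, an annotation-free adaptation system for mobile GUI agents.

AI summary Proposes MobileForge, in which the MobileGym environment handles task generation and evaluation, and Hierarchical Feedback-Guided Policy Optimization (HiFPO) turns trajectory outcomes, step-level feedback, and corrective hints into step-level GRPO updates. This enables annotation-free adaptation of mobile GUI agents and reaches 67.2% Pass@3 on AndroidWorld.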

Comments Project page: https://mobile-forge.github.io/

Abstract

MLLM-based mobile GUI agents have made substantial progress in UI understanding and action execution, but adapting them to real target apps remains costly because mobile apps are numerous, frequently updated, and hard to cover with human-written tasks, demonstrations, or reward labels. Existing annotation-free GUI learning reduces manual supervision, yet lacks a unified substrate connecting target-app exploration, curriculum mining, rollout execution, and feedback, while policy optimization often relies on isolated rollouts and coarse rewards that are hard to convert into reliable improvement signals. We present MobileForge, an annotation-free adaptation system for mobile GUI agents. MobileForge consists of MobileGym, which grounds task generation and rollout evaluation in real mobile app interaction, and Hierarchical Feedback-Guided Policy Optimization (HiFPO), which turns trajectory outcomes, step-level process feedback, and corrective hints into hint-contextualized step-level GRPO updates. Using only automatically generated annotation-free adaptation data, MobileForge adapts Qwen3-VL-8B to 67.2% Pass@3 on AndroidWorld, close to the closed-data GUI-specialized GUI-Owl-1.5-8B base model at 69.0%. The MobileForge-adapted ForgeOwl-8B further reaches 77.6% Pass@3 on AndroidWorld and 41.0% success on the out-of-domain MobileWorld GUI-only split, establishing the strongest open-data mobile GUI agent in our evaluation. Code, data, and trained models will be released at https://mobile-forge.github.io/.
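
For readers unfamiliar with GRPO, the core of a group-relative update is that each sampled candidate is scored against the other candidates in its group rather than against a learned value function. The sketch below shows only that normalization step, as commonly described for GRPO; how HiFPO combines trajectory outcomes, process feedback, and hints into the per-step reward is not specified in the abstract, so the reward values here are placeholders.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Placeholder rewards for four candidate actions sampled at the same step.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
print(flat)
```

Candidates above the group mean receive positive advantages and those below receive negative ones, while a group in which every candidate scores the same contributes nothing to the update.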

2606.19926 2026-06-19 cs.HC New submission 85%

MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management

Guangyi Liu, Gao Wu, Congxiao Liu, Pengxiang Zhao, Liang Liu, Mading Li, Qi Zhang, Mengyan Wang, Liang Guo, Yong Liu

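Topic match Software agents: proposes MemGUI-Agent, a long-horizon mobile GUI agent.

AI summary Proposes MemGUI-Agent, whose proactive context management mechanism (ConAct) treats context management as a first-class action, addressing prompt explosion and the dilution of key information in long-horizon tasks. The resulting 8B model achieves the best open-data 8B performance on MemGUI-Bench.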

Comments 33 pages, 6 figures. Project page: https://memgui-agent.github.io/

Abstract

MLLM-based mobile GUI agents have made substantial progress on short-horizon tasks, yet remain unreliable on long-horizon tasks that require retaining intermediate facts across many steps and app transitions. We attribute this limitation to ReAct-style prompting, which passively accumulates per-step records, leading to prompt explosion and dilution of critical cross-app facts. To address this, we introduce MemGUI-Agent, an end-to-end long-horizon mobile GUI agent with proactive context management. MemGUI-Agent is built on Context-as-Action (ConAct), which casts context management as first-class actions emitted by the same policy that selects UI actions. Instead of passively appending history, ConAct maintains three structured context fields: folded action history, folded UI state, and recent step record, preserving critical UI facts while keeping context compact. To make proactive context management learnable across model scales, we construct MemGUI-3K, a 2,956-trajectory dataset with full ConAct annotations for supervised training and offline analysis. Training an 8B model on MemGUI-3K produces MemGUI-8B-SFT, an 8B MemGUI-Agent that achieves the best open-data 8B performance on MemGUI-Bench and generalizes to the out-of-distribution MobileWorld benchmark. Code, data, and trained models will be released at https://memgui-agent.github.io/.
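
The three context fields named in the abstract suggest a simple data structure. The sketch below is an illustration only: the field names follow the abstract, but the `apply_step` function, its parameters, the two-step recency window, and the example strings are assumptions, not the ConAct action format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Context:
    folded_action_history: str = ""
    folded_ui_state: str = ""
    recent_steps: List[str] = field(default_factory=list)

    def prompt(self) -> str:
        """Render a compact prompt whose size does not grow with the number of steps."""
        return "\n".join([
            f"History: {self.folded_action_history}",
            f"UI facts: {self.folded_ui_state}",
            "Recent: " + " | ".join(self.recent_steps),
        ])

def apply_step(ctx: Context, ui_action: str,
               fold_history: Optional[str] = None,
               fold_ui: Optional[str] = None,
               keep_recent: int = 2) -> Context:
    """Apply one policy output: a UI action plus optional context-management actions."""
    if fold_history is not None:
        ctx.folded_action_history = fold_history  # policy rewrites the folded history
    if fold_ui is not None:
        ctx.folded_ui_state = fold_ui             # policy pins a fact worth keeping
    ctx.recent_steps = (ctx.recent_steps + [ui_action])[-keep_recent:]
    return ctx

ctx = Context()
apply_step(ctx, "open Mail")
apply_step(ctx, "read order email", fold_ui="order #1234 ships Friday")
apply_step(ctx, "open Calendar", fold_history="found shipping date in Mail")
print(ctx.prompt())
```

Unlike a ReAct-style transcript that appends every step, the prompt here stays bounded, and the cross-app fact survives the switch from Mail to Calendar because the policy chose to pin it.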

2606.14066 2026-06-19 cs.SE New submission 85%

FastContext: Training Efficient Repository Explorer for Coding Agents

Shaoqiu Zhang, Maoquan Wang, Yuling Shi, Yuhang Wang, Xiaodong Gu, Yongqiang Yao, Tori Gong, Sheng Chen, Rao Fu, Anisha Agarwal, Spandan Grag, Gabriel Ryan, Colin Merkel, Yufan Huang, Shengyu Fu

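Topic match Software agents: a dedicated exploration subagent.

AI summary Proposes FastContext, a dedicated exploration subagent that separates repository exploration from problem solving through parallel tool calls and focused context generation. On SWE-bench Multilingual, SWE-bench Pro, and SWE-QA it raises resolution rates by up to 5.5% and cuts coding-agent token consumption by up to 60%.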

Comments 34 pages, 7 figures

Abstract

Large Language Model (LLM) coding agents have achieved strong results on software engineering tasks, yet repository exploration remains a major bottleneck: locating relevant code consumes substantial token budget and pollutes the agent's context with irrelevant snippets. In most agents, the same model explores the repository and solves the task, leaving exploratory reads and searches in the solver's history. We present FastContext, a dedicated exploration subagent that separates repository exploration from solving. Invoked on demand, FastContext issues parallel tool calls and returns concise file paths and line ranges as focused context. FastContext is powered by specialized exploration models spanning 4B to 30B parameters. We bootstrap them from strong reference-model trajectories and refine them with task-grounded rewards for broad first-turn search, multi-turn evidence gathering, and precise citation generation. Across SWE-bench Multilingual, SWE-bench Pro, and SWE-QA, integrating FastContext into Mini-SWE-Agent improves end-to-end resolution rates by up to 5.5% while reducing coding-agent token consumption by up to 60%, with marginal overhead. These results show that repository exploration can be separated from solving and handled effectively by specialized models. Code and data: https://github.com/microsoft/fastcontext
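
The interface described in the abstract (parallel tool calls in, file paths and line ranges out) can be sketched with the standard library. This is not the FastContext implementation: the in-memory `REPO`, the `grep` tool, and the range-merging rule are hypothetical, and a real explorer would choose its queries with a trained model rather than take them as arguments.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Toy in-memory repository (hypothetical contents).
REPO = {
    "app/auth.py": "def login(user):\n    token = make_token(user)\n    return token\n",
    "app/util.py": "def make_token(user):\n    return user + '-tok'\n",
}

def grep(pattern):
    """Tool call: return (path, line_number) for every line matching the pattern."""
    rx = re.compile(pattern)
    return [(path, i)
            for path, text in REPO.items()
            for i, line in enumerate(text.splitlines(), start=1)
            if rx.search(line)]

def to_ranges(hits):
    """Merge adjacent hit lines in the same file into path:start-end citations."""
    ranges = []
    for path, line in sorted(set(hits)):
        if ranges and ranges[-1][0] == path and ranges[-1][2] == line - 1:
            ranges[-1][2] = line
        else:
            ranges.append([path, line, line])
    return [f"{path}:{start}-{end}" for path, start, end in ranges]

def explore(patterns):
    """Issue all searches in parallel and return only compact citations."""
    with ThreadPoolExecutor() as pool:
        batches = list(pool.map(grep, patterns))
    return to_ranges([hit for batch in batches for hit in batch])

citations = explore([r"make_token", r"def login"])
print(citations)
```

The solver then receives a handful of citations instead of the raw search output, which is the mechanism the abstract credits for the lower token consumption.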

2606.20520 2026-06-19 cs.CR cs.AI cs.DC cs.LG New submission 80%

Sovereign Execution Brokers: Enforcing Certificate-Bound Authority in Agentic Control Planes

Jun He, Deying Yu

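Topic match Software agents: authority control for autonomous agents at execution time.

AI summary Autonomous agents that perform mutations in production lack a mandatory authority-verification point. The paper proposes the Sovereign Execution Broker (SEB), which enforces authority at runtime through certificate verification, state checks, and scoped identities, and evaluates a prototype's security and performance on AWS and Kubernetes.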

Comments 19 pages, 6 figures, 10 tables

Abstract

Autonomous agents are increasingly connected to cloud, deployment, and data-control workflows, but production mutation authority should not reside inside non-deterministic reasoning processes. Existing access-control mechanisms authorize identities, while assurance layers certify proposed actions; neither alone provides a mandatory enforcement point for certified authority at the moment of mutation. This paper introduces the Sovereign Execution Broker (SEB), a runtime enforcement boundary for certificate-bound agentic infrastructure. SEB consumes certificates issued by the Sovereign Assurance Boundary (SAB), verifies that the requested mutation matches the certified execution contract, checks validity windows, policy epochs, revocation epochs, and live-state drift, mints scoped execution identity, invokes infrastructure APIs, and records signed decision and outcome records. By separating proposal, admission, and execution, SEB turns certified authority into a short-lived, revocable, auditable runtime capability, provided that production mutation APIs reject non-broker identities. We present the SEB execution model, certificate and replay-verification predicates, scoped identity semantics, bypass-prevention deployment patterns, failure behavior, and a concrete prototype implementation. We evaluate the prototype on AWS and Kubernetes clusters, measuring latency overheads, revocation propagation, drift detection, and security under fault injection.
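
The admission step listed in the abstract (contract match, validity window, policy epoch, revocation epoch, live-state drift) can be pictured as a chain of checks that must all pass before any mutation runs. The sketch below is a guess at the shape of such a predicate, not the SEB design: the `Certificate` fields, the epoch comparison rules, and the use of a state digest for drift detection are all assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Certificate:
    """Illustrative certificate fields; the real SAB certificate format is not given."""
    contract: str          # the exact mutation that was certified
    not_before: float
    not_after: float
    policy_epoch: int
    revocation_epoch: int
    state_digest: str      # digest of the state the certificate was issued against

def admit(cert, request, now, policy_epoch, revocation_epoch, live_state_digest):
    """Return (allowed, reason); the first failing check denies the request."""
    checks = [
        (request == cert.contract, "request does not match certified contract"),
        (cert.not_before <= now <= cert.not_after, "outside validity window"),
        (cert.policy_epoch == policy_epoch, "stale policy epoch"),
        (cert.revocation_epoch >= revocation_epoch, "certificate revoked"),
        (cert.state_digest == live_state_digest, "live state drifted"),
    ]
    for ok, reason in checks:
        if not ok:
            return False, reason
    return True, "admitted"

cert = Certificate("scale web to 3 replicas", 100.0, 200.0, 7, 2, "abc")
ok = admit(cert, "scale web to 3 replicas", 150.0, 7, 2, "abc")
late = admit(cert, "scale web to 3 replicas", 250.0, 7, 2, "abc")
drift = admit(cert, "scale web to 3 replicas", 150.0, 7, 2, "xyz")
print(ok, late, drift)
```

As the abstract notes, a check like this only constrains the agent if the production mutation APIs reject every identity other than the broker's.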

2606.19386 2026-06-19 cs.SE cs.AI cs.LG New submission 80%

Bistable by Construction: Wall-Clock-Calibrated State Monitors Have No Moment-Detection Regime at Agent Cadence

Manvendra Modgil

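Affiliations * Modint Intelligence

Topic match Software agents: studies runtime monitors for autonomous agents.

AI summary Finds that wall-clock-calibrated leaky-integrator monitors cannot work as moment detectors on agent streams and shows that the calibration class is the decisive factor. A rising-edge trigger with hysteresis escapes the trap at every cadence but does not recover human intervention timing.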

Comments 10 pages, 5 figures. Sequel to arXiv:2606.04296. Pre-registered; falsification clauses honored (H5 unsupported; H7 strict band 16/20) repo:https://github.com/2025eb1100268-tech/intervention-timing-saturation-trap

Abstract

Runtime monitors for autonomous agents commonly threshold an accumulated internal state - a behavioural baseline, a drift statistic, or, in our prior work, a modelled affective state. We previously reported a State Saturation Trap: threshold-on-state triggers over a continuous affect engine become near-constant alarms on SWE-bench debugging agents (Modgil 2026). A post-release audit found the engine received dt=0 between actions, so its exponential decay never operated: the published trap is a pure-accumulator result. We correct the record (erratum, v2) and treat the flaw as an experiment. The key variable it exposes is whether a monitor's dynamics are calibrated in sample time (per observation, as in CUSUM) or wall-clock time (half-lives in seconds, as in affect models and EMA baselines). On fixed-rate streams these coincide; on agent streams, where inter-action time varies by orders of magnitude, they do not. A pre-registered sweep over uniform intervals (dt in {0..600}s) on 20 trajectories shows the wall-clock level trigger has two regimes: at dt<=1s a constant alarm (20/20; median 18 firings); at dt>=60s silent. Every critical dt lies in (1,30]s. Real agent runs measure latency at median 1.53s (p90 2.33s); real coding cadence sits inside the trap regime, vindicating the empirical finding under a corrected mechanism. The structure is a property of the calibration class, not the engine: a minimal wall-clock accumulator over the raw error stream reproduces the same cliff, while a sample-time CUSUM over the identical stream is exactly dt-invariant (20/20). A rising-edge trigger with hysteresis fires 0-3 times per trajectory in every condition. We conclude that wall-clock-calibrated leaky-integrator monitors admit no regime in which they act as moment detectors on agent streams; transition detection escapes the trap at every cadence, but does not recover human intervention timing.
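
The difference between the two calibration classes can be seen in a toy simulation. The sketch below is not the paper's engine or data: the half-life, thresholds, CUSUM parameters, and error stream are hypothetical. It shows only the structural point that a wall-clock leaky integrator's alarm count depends on the spacing `dt` between actions (with `dt = 0` it never decays and becomes a pure accumulator), while a sample-time CUSUM ignores `dt` entirely.

```python
def wall_clock_alarms(errors, dt, half_life=10.0, threshold=3.0):
    """Leaky integrator whose decay is set in seconds; counts steps above threshold."""
    state, alarms = 0.0, 0
    for x in errors:
        # Decay by the wall-clock time elapsed since the previous action, then add.
        state = state * 0.5 ** (dt / half_life) + x
        alarms += state > threshold
    return alarms

def cusum_alarms(errors, dt, k=0.5, h=3.0):
    """Sample-time CUSUM: dt is accepted but never used, so it cannot affect the result."""
    s, alarms = 0.0, 0
    for x in errors:
        s = max(0.0, s + x - k)
        alarms += s > h
    return alarms

errors = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1] * 2  # toy error stream (hypothetical)
for dt in (0, 1, 60, 600):
    print(dt, wall_clock_alarms(errors, dt), cusum_alarms(errors, dt))
```

With closely spaced actions the level trigger is above threshold for most of the run, and with widely spaced actions it never fires, which is the two-regime behaviour the abstract describes; the CUSUM column is identical in every row.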

2606.17128 2026-06-19 cs.AR New submission 80%

Shift-Left High-Level Synthesis Verification via Knowledge-Augmented LLM Agent

Zhihan Xiao, Hongbing Lang, Zhe Zhao, Luke Ztz Hu, Songping Mai

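Topic match Software agents: a knowledge-augmented LLM agent for HLS verification.

AI summary Proposes a knowledge-augmented, agent-driven shift-left verification framework that combines dual-tier consistency checking, symbolic execution, and an HLS verification knowledge graph to check functional consistency between C and HLS-C automatically before synthesis, reaching 98.26% average coverage.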

Abstract

High-Level Synthesis (HLS) relies on transforming original C specifications into synthesizable HLS-oriented C (HLS-C) implementations. Functional consistency verification between original C specifications and HLS-C implementations is a critical yet labor-intensive task in HLS design flows. While Large Language Models (LLMs) have recently shown promise in automated testbench generation, their stochastic nature often leads to insufficient coverage, inconsistent verification environments, and unreliable equivalence checking results. To address these limitations, we propose a knowledge-augmented, agent-driven shift-left verification framework for automated functional consistency checking between original C and HLS-C implementations before synthesis. The framework introduces a Dual-Tier Consistency Checking mechanism that jointly enforces static structural alignment and dynamic behavioral equivalence between paired testbenches, while integrating symbolic execution and coverage-driven refinement to improve verification completeness. Furthermore, we construct a heterogeneous HLS Verification Knowledge Graph to provide topology-aware reasoning priors for testbench generation, and design an autonomous verification agent to orchestrate iterative refinement and failure diagnosis across heterogeneous toolchains. Experimental results on 107 HLS benchmark pairs demonstrate that the proposed framework achieves 0.9826 average coverage and 0.9533 dynamic consistency, outperforming representative AST-based, retrieval-augmented, and iterative agent-based baselines. Repository: https://github.com/cz-5f/HLS-LeVeri.git
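
The dynamic half of the consistency check, running the original and the HLS-oriented implementation on the same inputs and comparing outputs, can be illustrated with a small differential test. The sketch uses Python stand-ins rather than C and HLS-C, and both functions are hypothetical; the framework's own testbenches, symbolic execution, and coverage-driven refinement are not modelled.

```python
import random

def golden(xs):
    """Reference behaviour: sum of squares with unbounded integers."""
    return sum(x * x for x in xs)

def hls_style(xs, max_len=8):
    """Rewrite with a fixed trip count and a 32-bit accumulator, as HLS code often has."""
    acc = 0
    for i in range(max_len):
        if i < len(xs):
            acc = (acc + xs[i] * xs[i]) & 0xFFFFFFFF
    return acc

def first_mismatch(ref, dut, hi, trials=1000, seed=0, max_len=8):
    """Return the first random input on which ref and dut disagree, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(0, hi) for _ in range(rng.randint(0, max_len))]
        if ref(xs) != dut(xs):
            return xs
    return None

small = first_mismatch(golden, hls_style, hi=100)      # values too small to overflow
large = first_mismatch(golden, hls_style, hi=100_000)  # squares can exceed 32 bits
print(small, large)
```

The two versions agree on the narrow input range and diverge once the accumulator overflows, which is why the input ranges a testbench exercises matter as much as the comparison itself.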