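arXivDaily: arXiv daily digest, updated Monday through Friday

AI Large Models

AI Agent

Agents, tool calling, planning, workflows, multi-agent systems, and autonomous task execution.

Indexed for today/current date: 11 Sources: cs.AI, cs.CL, cs.LG, cs.SE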
2606.20058 2026-06-19 cs.AI New submission 90%

Autonomous Event-Driven Multi-Agent Orchestration for Enterprise AI at Scale

Harsh Rao Dhanyamraju, Leonidas Raghav, Aaron Lee

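Affiliation: SAP SE

Topic match: Multi-agent. Proposes a multi-agent orchestration framework for enterprise-scale event-driven tasks.

AI summary: To address the performance degradation of multi-agent systems as enterprise AI scales, the paper introduces a Task Manager built on priority inference, related-event merging, and preemption. Evaluated on 208 production-derived scenarios, it cuts high-priority queue latency by 14-75% and improves related-event correctness by more than 20 percentage points at enterprise scale.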

English abstract

Enterprise AI aims to move toward continuous event monitoring, detection, and action across specialist agents, yet existing multi-agent systems largely assume discrete request-response workflows and remain underexplored at enterprise scale. We evaluate DAG Plan and Execute and ReAct across 208 production-derived enterprise scenarios spanning Persona (<10 agents), Department (20-80), and Enterprise (200) scales, and introduce a Task Manager for continuous operation via priority inference, related-event merging, and preemption. Results show that scale, not task complexity, dominates orchestration performance: both architectures perform well at small scale but degrade at enterprise scale as agent discovery noise becomes the primary bottleneck, with simple tasks degrading more sharply than complex ones. DAG Plan and Execute offers higher precision and structured parallelization at smaller scales, but its higher overhead worsens at enterprise scale; ReAct is more robust by handling failures incrementally. The Task Manager reduces high-priority queue latency by 14-75% and improves related-event correctness by over 20 percentage points at enterprise scale.
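The three Task Manager mechanisms named in the abstract can be illustrated with a small queue. This is a minimal sketch, not the paper's implementation: the `Task` and `TaskManager` names, the correlation `key`, and the "lower number is more urgent" convention are assumptions, and priorities are passed in directly rather than inferred.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Task:
    key: str                      # hypothetical correlation key for related events
    priority: int                 # assumed convention: lower number = more urgent
    events: list = field(default_factory=list)


class TaskManager:
    def __init__(self):
        self._heap = []           # (priority, sequence, key); may hold stale entries
        self._pending = {}        # key -> Task waiting to run
        self._seq = count()
        self.running = None

    def _enqueue(self, task):
        self._pending[task.key] = task
        heapq.heappush(self._heap, (task.priority, next(self._seq), task.key))

    def _pop(self):
        while self._heap:
            priority, _, key = heapq.heappop(self._heap)
            task = self._pending.get(key)
            if task is not None and task.priority == priority:
                del self._pending[key]
                return task       # anything else was a stale heap entry
        return None

    def submit(self, key, priority, event):
        """Merge the event into a related task if one exists, else queue a new task."""
        if self.running is not None and self.running.key == key:
            self.running.events.append(event)
        elif key in self._pending:
            task = self._pending[key]
            task.events.append(event)
            if priority < task.priority:      # escalate the merged task
                task.priority = priority
                self._enqueue(task)
        else:
            self._enqueue(Task(key, priority, [event]))

    def schedule(self):
        """Run the most urgent task, preempting a less urgent running one."""
        candidate = self._pop()
        if candidate is None:
            return self.running
        if self.running is None:
            self.running = candidate
        elif candidate.priority < self.running.priority:
            self._enqueue(self.running)       # preempted task returns to the queue
            self.running = candidate
        else:
            self._enqueue(candidate)          # not urgent enough to preempt
        return self.running
```

Submitting two events with the same key yields one task carrying both events, and a later, more urgent submission displaces the running task on the next `schedule()` call.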

2606.19782 2026-06-19 cs.AI cs.CL New submission 90%

AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA

Aravind Narayanan, Shaina Raza

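Affiliation: Vector Institute

Topic match: Multi-agent. A multi-agent pipeline for financial chart question answering with an emphasis on auditability.

AI summary: Proposes AgentFinVQA, a multi-agent pipeline that decomposes each query into steps and records every step in a traceable Model Evaluation Packet (MEP), making financial chart QA auditable and deployable on-premise. On FinMME it improves accuracy by 7.68 percentage points over a matched zero-shot baseline with Gemini-3 Flash and by 4.84 points with a locally served open-weights Qwen3.6-27B-FP8.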

English abstract

Financial chart question answering in regulated settings demands more than accuracy: practitioners must know which answers to trust before acting on them, and many institutions cannot send client data to external model providers. Yet existing chart-QA agents are accuracy-focused and opaque, and most assume proprietary API access; to our knowledge, none combines auditability with on-premise deployability without significant accuracy compromise. We present AgentFinVQA, a multi-agent pipeline that decomposes each query into planning, OCR, legend grounding, visual inspection, and verification, recording every step in a traceable Model Evaluation Packet (MEP) per sample. On FinMME, AgentFinVQA improves $+7.68$ pp over a primary-backbone matched zero-shot baseline with a proprietary backbone (Gemini-3 Flash; 71.24% vs. 63.56%, McNemar $p \approx 1.1 \times 10^{-16}$), and $+4.84$ pp with open-weights Qwen3.6-27B-FP8 served locally. The verifier's verdict also serves as a useful confidence signal (68.2% vs. 55.6% exact accuracy on confirmed vs. revised answers), enabling human-in-the-loop review routing. Error analysis shows that question misunderstanding, legend confusion and extraction error account for nearly two-thirds of failures and are the categories least detected by the verifier, identifying clear directions for future work. Together these results show that auditable, on-premise financial chart QA is practical and that the open-weights system keeps most of the accuracy gains while enabling full data residency. We release our code to support reproducible evaluation.
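The staged decomposition and per-sample trace described above could look roughly like the following. This is a sketch under stated assumptions: the stage callables, the packet fields, and the rule that sends every non-confirmed answer to a human are illustrative, not taken from the released code.

```python
from typing import Callable

# Stage order as listed in the abstract.
STAGES = ["planning", "ocr", "legend_grounding", "visual_inspection", "verification"]


def run_pipeline(sample_id: str, question: str,
                 stages: dict[str, Callable[[dict], dict]]) -> dict:
    """Run every stage in order and record each output in one trace per sample."""
    state = {"question": question}
    packet = {"sample_id": sample_id, "steps": []}   # hypothetical MEP-like record
    for name in STAGES:
        output = stages[name](state)                 # each stage returns a dict
        state.update(output)
        packet["steps"].append({"stage": name, "output": output})
    packet["answer"] = state.get("answer")
    packet["verdict"] = state.get("verdict")
    return packet


def needs_human_review(packet: dict) -> bool:
    """Assumed routing rule: anything the verifier did not confirm goes to a reviewer."""
    return packet["verdict"] != "confirmed"
```

Because the packet keeps every intermediate output, a reviewer can see which stage produced the value that the verifier later revised.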

2606.19758 2026-06-19 cs.MA New submission 90%

SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design

Kun Zeng, Yu Huo, Siyu Zhang, Yuecheng Zhuo, Yuquan Lu, Haoyue Liu, Siyue Chen, Xiaoying Tang

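Topic match: Multi-agent. Compositional multi-agent design via skill-incidence graphs.

AI summary: Proposes SIGMA, a framework that uses a skill-agent incidence graph to construct agents as task-conditioned bundles of reusable skills and then decodes a communication topology over them. It outperforms baselines across six benchmarks and is robust to unseen skill libraries.

Comments: EMNLP 2026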

English abstract

Existing graph-based multi-agent system (MAS) designers mainly improve collaboration by optimizing communication topologies over predefined agents, roles, or groups. However, because each node remains a closed-set entity, these methods struggle to generalize to tasks that require unseen combinations of capabilities. We propose SIGMA, a skill-incidence graph framework that constructs agents as task-conditioned bundles of reusable skills. Given a task and a skill library, SIGMA predicts a skill-agent incidence matrix, composes agent node embeddings from selected skills, and decodes a communication topology over the constructed agents. During execution, skill-specific mailboxes route messages to the relevant assigned capabilities, making the incidence structure directly operational. Across six reasoning and coding benchmarks with three base LLMs, SIGMA achieves the best average performance and improves over CARD, the strongest non-compositional topology-based baseline, by 2.06, 2.36, and 1.75 points, respectively. It also shows stronger robustness to unseen skill libraries, with an average performance drop of only 0.96 points. These results suggest that compositional node construction is a complementary and important axis for multi-agent design beyond communication topology optimization. Code is available at https://anonymous.4open.science/r/SIGMA-2338/.
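To make the skill-agent incidence matrix concrete, the sketch below composes agent embeddings from selected skills and builds skill-specific mailboxes. It assumes a binary incidence matrix and mean pooling; the paper's actual composition and routing functions are not given in the abstract, so both are placeholders.

```python
def compose_agents(incidence, skill_embeddings):
    """incidence[a][s] is 1 when agent a holds skill s.

    Returns one embedding per agent as the mean of its selected skill
    embeddings (mean pooling is an assumption made for this sketch).
    """
    agents = []
    for row in incidence:
        selected = [emb for use, emb in zip(row, skill_embeddings) if use]
        if not selected:
            raise ValueError("every agent needs at least one skill")
        dim = len(selected[0])
        agents.append([sum(e[d] for e in selected) / len(selected) for d in range(dim)])
    return agents


def build_mailboxes(incidence, skill_names):
    """Map each skill to the agents holding it, so messages can be routed by skill."""
    boxes = {name: [] for name in skill_names}
    for agent_idx, row in enumerate(incidence):
        for use, name in zip(row, skill_names):
            if use:
                boxes[name].append(agent_idx)
    return boxes
```

With this layout, changing one row of the incidence matrix changes both the agent's embedding and the mailboxes it listens on, which is one way to read "making the incidence structure directly operational".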

2606.18325 2026-06-19 cs.CR cs.AI New submission 90%

Agentra: A Supervisable Multi-Agent Framework for Enterprise Intrusion Response

Raj Patel, Shaswata Mitra, Michele Guida, Stefano Iannucci, Sudip Mittal, Shahram Rahimi

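Affiliations: The University of Alabama, Alabama, USA; Roma Tre University, Rome, Italy

Topic match: Multi-agent. Proposes a supervisable multi-agent intrusion response framework.

AI summary: Proposes Agentra, a supervisable multi-agent intrusion response framework that turns alerts into structured response plans through role-scoped agents, a Planner-Validator review loop, a security gateway, and risk-score gating. On a 120-event corpus it raises FP-aware IRS F1 from 0.61 to 0.84 and brings the projected harmful-action rate back to the static baseline level of 0.0%.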

English abstract

Enterprise intrusion response still depends on static playbooks and analyst-driven triage, creating delay between alert generation and containment. We present Agentra, a supervisable multi-agent Intrusion Response System (IRS) framework that converts alerts from IDS, EDR, and XDR platforms into structured incident response plans grounded in MITRE ATT&CK, MITRE D3FEND, and NIST CSF 2.0. Agentra decomposes response reasoning across role-scoped agents, validates proposed plans through a bounded Planner-Validator review loop, screens retrieved threat intelligence through a Moderator security gateway, gates actions through an Action Catalog and risk score, and records decisions in an append-only audit log. We evaluate Agentra against a static OASIS CACAO v2.0 cyber-playbook baseline on a 120-event corpus drawn from ThreatHunter-Playbook, Splunk BOTSv3, and DARPA OpTC. The strongest configuration improves FP-aware IRS F1 from 0.61 to 0.84 and restores the projected harmful-action rate to the static baseline level of 0.0% after Planner-only configurations introduce unsafe overreaction. These results indicate that multi-agent response planning can improve ontology-grounded IRS coverage while preserving analyst approval and auditability.
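Gating actions through a catalog and a risk score, with every decision logged, can be sketched as follows. The catalog entries, the risk formula (base risk times asset criticality), the threshold, and the decision labels are all hypothetical; the abstract does not specify them.

```python
# Hypothetical catalog: action name -> base risk in [0, 1].
ACTION_CATALOG = {
    "block_ip": 0.3,
    "reset_credentials": 0.4,
    "isolate_host": 0.6,
}


class AuditLog:
    """Append-only by construction: entries can be added and read, never edited."""

    def __init__(self):
        self._entries = []

    def append(self, entry):
        self._entries.append(dict(entry))     # copy so callers cannot mutate it later

    def entries(self):
        return tuple(self._entries)           # read-only snapshot


def gate(action, asset_criticality, log, threshold=0.5):
    """Reject unknown actions; escalate risky ones to an analyst (assumed policy)."""
    if action not in ACTION_CATALOG:
        risk, decision = None, "reject"
    else:
        risk = ACTION_CATALOG[action] * asset_criticality   # illustrative formula
        decision = "allow" if risk < threshold else "needs_analyst_approval"
    log.append({"action": action, "risk": risk, "decision": decision})
    return decision
```

Actions outside the catalog never reach execution, which is one simple way to keep a planner from introducing unsafe overreactions.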

2606.19911 2026-06-19 cs.AI cs.CL cs.IR New submission 85%

Multi-Agent Transactive Memory

To Eun Kim, Xuhong He, Dishank Jain, Ambuj Agrawal, Negar Arabzadeh, Fernando Diaz

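Affiliations: Carnegie Mellon University; University of California, Berkeley

Topic match: Multi-agent. Proposes a multi-agent transactive memory framework for knowledge reuse across heterogeneous agents.

AI summary: Proposes MATM, a framework that stores agent trajectories in a shared repository and retrieves them across a heterogeneous agent population, improving downstream task performance and reducing interaction steps.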

English abstract

The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human-generated artifacts to support human problem solving, retrieval systems can organize agent-generated artifacts for reuse across agent populations. We extend retrieval-augmented generation - which demonstrates the value of human-authored artifacts to individual agents - to retrieval of agent-generated artifacts supporting a population of agents. In particular, agent trajectories encode reusable procedural knowledge, yet these artifacts are typically discarded after a single use or retained only by the producing agent, forcing newly instantiated agents to repeatedly rediscover existing solutions. We propose Multi-Agent Transactive Memory (MATM), a framework for population-level storage and retrieval of agent-generated trajectories, where producer agents contribute trajectories to a shared repository and consumer agents retrieve them to improve task execution. We focus on interactive environments (ALFWorld and WebArena), where trajectories are long and encode especially rich procedural structure. Our experiments demonstrate that retrieving trajectories from MATM improves downstream task performance and reduces interaction steps without coordination or joint training. These results position MATM as a design pattern for population-level experience sharing in open agent ecosystems.
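The producer/consumer pattern behind a shared trajectory repository can be shown in a few lines. This sketch swaps in token-overlap (Jaccard) scoring as a stand-in retriever; the `TrajectoryStore` name and its methods are hypothetical, and the paper's retrieval method is not stated in the abstract.

```python
class TrajectoryStore:
    """Shared repository: producers contribute trajectories, consumers retrieve them."""

    def __init__(self):
        self._items = []                      # list of (task description, steps)

    def contribute(self, task, steps):
        self._items.append((task, list(steps)))

    def retrieve(self, task, k=1):
        """Return the k stored trajectories whose task text best overlaps the query."""
        query = set(task.lower().split())

        def score(item):
            tokens = set(item[0].lower().split())
            union = query | tokens
            return len(query & tokens) / len(union) if union else 0.0

        return sorted(self._items, key=score, reverse=True)[:k]
```

A consumer agent would place the retrieved steps in its context before acting, with no coordination with the producer beyond the shared store.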

2606.19537 2026-06-19 cs.MA cs.DC New submission 85%

Mesh Inference: A Formal Model of Collective Intelligence Without a Center

Hongwei Xu

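Topic match: Multi-agent. A mathematical model of center-free collaborative inference among multiple agents.

AI summary: Presents a formal model of mesh inference in which agents collaborate without a center by locally relaxing a coupled free energy. It shows convergence to a unique answer, identification-completeness, and observation-only operation, and analyzes the latency cost in the linear-Gaussian regime.

Comments: 21 pages, 2 figures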

English abstract

We present a formal model of mesh inference: how a population of independent agents, each holding private state and exchanging only admitted, typed observations, derives a conclusion none of them holds alone, with no central coordinator and no agent exposed. No agent shares weights, gradients, or hidden state, and the agents may span different teams, networks, and organizations. Motivated by the observation that asking a model is energy-minimizing inference, we model the mesh as a coupled free energy that each agent relaxes locally. We show that a single admission/emission policy governs three properties. First, mesh inference converges to a unique answer for any admission, symmetric or not, because the coupling is always an M-matrix. Second, it is identification-complete: it derives the centralized optimum exactly when the contributing views are carrier-connected. Third, it is observation-only: no node transmits its internals, and confidentiality is the dual of identification. Content-addressed lineage is the only global side-channel. In the linear-Gaussian regime every derived answer is determined, hence equal to the centralized optimum, at O(diam^2) latency, the measured price of removing the center. One such derivation is one turn of a center-free learning loop, which we formalize as architecture rather than prove. The open problem we state is when asking improves the collective rather than corrupting it: whether the non-linear closure derives an upgraded answer or a confident error. To our knowledge, this is the first formal model of mesh inference.
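A generic quadratic toy model gives some intuition for local relaxation converging to a unique answer. It is not the paper's formal model: each agent i holds a private observation `y[i]` with self-weight `a[i] > 0` and admits neighbours' values with weights `w[i][j] >= 0`, and these symbols are assumptions. The resulting linear system is strictly diagonally dominant with non-positive off-diagonal entries, which is a standard sufficient condition for a nonsingular M-matrix and for Jacobi iteration to converge.

```python
def relax(y, a, w, sweeps=200):
    """Jacobi-style local relaxation: each agent only reads admitted neighbour values.

    Agent i repeatedly sets
        x[i] = (a[i]*y[i] + sum_j w[i][j]*x[j]) / (a[i] + sum_j w[i][j]).
    w does not need to be symmetric.
    """
    n = len(y)
    x = list(y)
    for _ in range(sweeps):
        x = [
            (a[i] * y[i] + sum(w[i][j] * x[j] for j in range(n) if j != i))
            / (a[i] + sum(w[i][j] for j in range(n) if j != i))
            for i in range(n)
        ]
    return x


def residual(x, y, a, w):
    """Largest violation of the stationarity equations; near zero at the fixed point."""
    n = len(y)
    return max(
        abs(
            (a[i] + sum(w[i][j] for j in range(n) if j != i)) * x[i]
            - sum(w[i][j] * x[j] for j in range(n) if j != i)
            - a[i] * y[i]
        )
        for i in range(n)
    )
```

On a three-agent path graph, the end agents never exchange values directly, yet all three settle on the solution of the joint linear system.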

2606.19494 2026-06-19 cs.AI New submission 85%

Hidden Anchors in Multi-Agent LLM Deliberation

Apurba Pokharel, Ram Dantu

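Affiliation: University of North Texas

Topic match: Multi-agent. A hidden-anchor dynamical model of multi-agent LLM deliberation.

AI summary: Models multi-agent LLM deliberation as a closed-loop dynamical system in which each agent holds a hidden internal belief (its anchor). The model explains how deliberation can move beyond the convex hull of the initial beliefs, and the recovered anchors are tested on whether they predict held-out runs.

Comments: 13 pages, 6 figures, 7 tables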

English abstract

Multi-agent LLM deliberation, where agents exchange and revise answers over several rounds, is increasingly used to improve reasoning and accuracy, yet how and why it works is rarely modelled. Such deliberation mirrors how humans reach decisions. As social animals we are pulled both by the group, the herd effect that classical opinion-dynamics models such as DeGroot and Friedkin-Johnsen capture, and by our own internal belief, which they do not. We model multi-agent deliberation as a closed-loop dynamical system in which each agent carries a hidden internal belief, its anchor, that continually pulls its opinion regardless of its neighbours. We show this anchor can be recovered from the deliberation alone, and that it explains a behaviour classical consensus rules forbid: an agent's confidence in the correct answer can climb past where any agent started, escaping the space (convex hull) formed by the initial beliefs. Checking whether the recovered anchor also predicts held-out runs (generalizes) gives a simple test for when a model is truly driven by such an anchor. Across three open-weight model families this is a spectrum, not all-or-nothing. All anchors exert about equally strong influence, but they differ in where the anchor sits, and only when it sits far from the initial opinions does deliberation escape the hull and need the full closed-loop model.
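The hull-escape effect is easy to reproduce in a toy update rule. The sketch assumes a fully connected group, a single scalar confidence per agent, and one shared pull strength; the `deliberate` function and its parameters are illustrative and not the paper's fitted model.

```python
def deliberate(initial, anchors, pull, rounds=50):
    """Each round, every agent mixes the group mean with its own hidden anchor.

    pull = 0 reduces to plain averaging (DeGroot-style), which can never leave
    the range spanned by the initial opinions.
    """
    x = list(initial)
    for _ in range(rounds):
        mean = sum(x) / len(x)
        x = [(1 - pull) * mean + pull * anchor for anchor in anchors]
    return x
```

With initial confidences of 0.4, 0.5, and 0.6 and anchors at 0.9, any positive pull drives every agent toward 0.9, above the highest starting value, whereas pure averaging stays inside the interval from 0.4 to 0.6.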

2606.18413 2026-06-19 cs.AI cs.HC New submission 85%

Searching for Synergy in Shared Workspace Human-AI Collaboration

Nachiket Kotalwar, Rohini Das, Carolyn Rose

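Affiliation: Carnegie Mellon University

Topic match: Multi-agent. Studies collaboration in shared-workspace human-AI teams, which involves multi-agent coordination.

AI summary: Studies shared-workspace human-AI teams in the Collaborative Gym environment. Without structure for coordination, adding collaborators can lower performance, while scaffolding that combines shared group memory with simulated human-in-the-loop gates improves team performance.

Comments: Accepted at ICML 2026 Workshop on Human-AI Co-Creativity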

English abstract

Automated AI agents are increasingly capable, yet many scientific and professional tasks require human judgment and contextual expertise. We study shared-workspace human-AI teams, where AI agents and human collaborators must coordinate responsibilities before submitting a final answer. Using the Collaborative Gym environment with DiscoveryBench tasks, we examine when adding simulated human collaborators improves performance and when process loss turns additional collaborators into coordination overhead. Across 1,482 sessions, adding relevant collaborators can lower performance when teams lack structure to coordinate their contributions. We then evaluate scaffolding that combines shared group memory with simulated human-in-the-loop (HITL) gates, where selected actions require approval from a designated simulated participant. This scaffolding yields higher mean performance, most clearly in three-person teams, with clearer responsibility signals and stronger routing of expertise to team actions. Overall, how human-AI teams coordinate and integrate expertise matters as much as the capability available to them.
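The scaffolding idea, shared group memory plus approval gates on selected actions, might be wired together as in the sketch below. The `SharedWorkspace` class, the action names, and the approver callback signature are hypothetical and are not the Collaborative Gym API.

```python
class SharedWorkspace:
    def __init__(self, gated_actions, approver):
        self.memory = []                  # shared group memory: (author, text) notes
        self.gated = set(gated_actions)   # actions that require approval
        self.approver = approver          # callable(action, memory) -> bool
        self.executed = []

    def note(self, author, text):
        self.memory.append((author, text))

    def act(self, author, action):
        """Execute the action unless it is gated and the designated approver declines."""
        if action in self.gated and not self.approver(action, self.memory):
            self.note("gate", f"{action} requested by {author} was declined")
            return False
        self.executed.append((author, action))
        self.note(author, f"performed {action}")
        return True
```

Because the approver reads the same memory that all participants write to, responsibility for a gated action is visible to the whole team.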

2606.20243 2026-06-19 cs.SE cs.MA New submission 80%

Phoenix: Safe GitHub Issue Resolution via Multi-Agent LLMs

Kipngeno Koech, Muhammad Adam, Baimam Boukar Jean Jacques, Joao Barros

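Topic match: Multi-agent. Multi-agent collaboration on software engineering tasks.

AI summary: Presents Phoenix, a multi-agent LLM system with six specialized agents and seven layered safety controls. It resolves 75% of a 24-instance slice of SWE-bench Lite and achieves 100% correctness preservation in a pilot on 42 real issues.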

English abstract

We present Phoenix, a multi-agent LLM system that resolves GitHub issues from triage through pull-request creation, combining seven layered safety controls with a baseline-aware test evaluation strategy. Phoenix decomposes the work across six specialized agents: planner, reproducer, coder, tester, failure analyst, and Pull Request (PR) agent, all coordinated by a label-based GitHub webhook state machine. Every change is checked against a baseline test run before a pull request is opened. On a 24-instance slice of SWE-bench Lite, run on the production webhook path, Phoenix oracle-resolves 75% of instances with no pass-to-pass regressions on successful runs; this curated slice is not directly comparable to full-split leaderboard results, and we discuss the limits of the comparison. A complementary pilot on 42 real issues across 14 repositories yields 100% correctness preservation (CP; mean 122s on the hard tier). Manual inspection shows that about half of the resulting pull requests are well-targeted fixes. The other half place code at incorrect paths, a planner localization limitation we are addressing with retrieval. We also report the deployment failure modes (WAF filtering, token expiry, permission boundaries, flaky CI) that motivated each safety mechanism.
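A baseline-aware check compares test outcomes before and after a change, so that tests already failing on the base commit are not blamed on the patch. The sketch below assumes results arrive as `{test_name: passed}` dictionaries and treats a test missing from the candidate run as a regression; both choices are assumptions, not Phoenix's documented behaviour.

```python
def compare_to_baseline(baseline, candidate):
    """Classify tests by how their outcome changed relative to the baseline run."""
    # pass -> fail (or pass -> missing): the change broke something that worked
    regressions = sorted(
        name for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    )
    # fail -> pass (or newly added and passing): the change fixed or added coverage
    fixes = sorted(
        name for name, passed in candidate.items()
        if passed and not baseline.get(name, False)
    )
    return {
        "regressions": regressions,
        "fixes": fixes,
        "safe_to_open_pr": not regressions,
    }
```

A run with an empty `regressions` list corresponds to what the abstract calls no pass-to-pass regressions.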

2606.19725 2026-06-19 cs.SE cs.AI cs.MA New submission 80%

Library-Aware Doubles and Iterative Repair for Large Language Model-Generated Unit Tests in OpenSIL Firmware

Ma Toan Bach, Yuchi Zheng, Haingo Razafindranto, Tanvir Alam, Aric Leather, Ranveer Sandhu, Jitesh Arora

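Affiliations: School of Software Design and Data Science; Seneca Polytechnic; Advanced Micro Devices Canada

Topic match: Multi-agent. A multi-agent pipeline for test generation and repair.

AI summary: Because unit tests for openSIL firmware often fail under strict build constraints, the paper proposes an LLM-guided multi-agent workflow for automated test generation and iterative repair. It produces compilable tests for 73 of 76 functions, and mean line coverage reaches 98.8% on a 48-function subset when line-coverage guidance is used.

Comments: 20 pages, 10 figures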

English abstract

Validating changes in low-level C firmware is expensive because unit tests (UTs) are fragile under strict build constraints, where missing headers, unresolved symbols, and dependency mismatches frequently prevent compilation and linking. This study introduces an automated UT authoring workflow for the Open-Source Silicon Initialization Library (openSIL) firmware codebase maintained by Advanced Micro Devices (AMD) that reduces manual effort through a large language model (LLM) guided multi-agent pipeline. The workflow combines automated generation of test scaffolds, library-aware creation or reuse of stubs, mocks, and fakes, and an iterative compile-dispatch repair loop driven by build logs and line-coverage feedback. We evaluate the approach using compilation success, repair iterations, dispatch success, and line coverage, with time, cost, and token usage as secondary measures. Across 76 functions under test, the workflow generated compilable UTs for 73 functions. In a configuration without line coverage guidance or retrieval augmentation, mean line coverage reached 73.9%. On a 48-function subset evaluated under both configurations, mean line coverage reached 98.8% with line-coverage guidance alone and reached 94.7% when combined with vector-database retrieval. Results show that automated generation-and-repair pipelines can substantially improve UT creation efficiency and coverage for constrained firmware environments while reducing manual debugging effort.
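The iterative repair loop can be reduced to a bounded retry in which each failed build log feeds the next repair attempt. In this sketch `build` and `repair` are caller-supplied placeholders (in the paper's setting they would wrap the firmware toolchain and an LLM call), and the coverage-feedback half of the loop is left out for brevity.

```python
def repair_loop(test_source, build, repair, max_iters=5):
    """Try to build the test; on failure, pass the build log to the repair step.

    build(source)       -> (ok: bool, log: str)
    repair(source, log) -> new source
    Returns (final_source or None, history of (attempt, ok) pairs).
    """
    history = []
    for attempt in range(1, max_iters + 1):
        ok, log = build(test_source)
        history.append((attempt, ok))
        if ok:
            return test_source, history
        test_source = repair(test_source, log)   # the log drives the next edit
    return None, history                         # give up after the iteration budget
```

The length of `history` is the number of repair iterations used, one of the measures the abstract lists.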

2606.19356 2026-06-19 cs.CL cs.AI New submission 80%

Trustworthy Multi-Agent Systems: Mitigating Semantic Drift with the Argent Signaling Protocol

Anantha Sharma

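Affiliation: Synechron Inc

Topic match: Multi-agent. A signaling protocol that improves reliability in multi-agent systems.

AI summary: Proposes the Argent Signaling Protocol (ASP), which attaches structured quality signals to responses so that a controller can tell repairable failures from containment failures. It raises pass rates in document-grounded QA and blocks ungrounded outputs from propagating in a multi-agent setup.

Comments: 17 pages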

English abstract

When multi-agent LLM systems produce bad answers, not all failures are equal: some answers are grounded in the right material but incomplete, while others are simply ungrounded and should be stopped. Current retry strategies treat both cases identically (try again and hope for the best), leaving human supervisors unable to tell whether a retry was warranted or whether the system should have halted instead. We introduce the Argent Signaling Protocol (ASP), a compact machine-readable header that accompanies every AI-generated response with structured quality signals: certainty (@C), grounding (@G), stochasticity (@S), and an assumption index that classifies the evidentiary basis of each claim. These signals enable a controller to distinguish repairable failures from containment failures and route each case differently. We evaluate ASP in two modes. In standalone mode, a 27-question document-grounded QA benchmark over the Array BioPharma/Ono license agreement compares baseline prompts against ASP-instrumented controller actions across three local GGUF models. On Qwen (0.8B), ASP improves pass rate from 11.1% to 33.3% and mean term coverage from 36.7% to 65.4%; on Dobby (8B), ASP produces 4 fail-to-pass recoveries, raising pass rate from 33.3% to 44.4%; on SmolLM3 (3B), ASP alternates between repair and containment per question. Aggregate improvement is meaningful (12/81 to 21/81 passes). In multi-agent mode, an ASP sidecar sits between a retrieval agent and a downstream decision agent; the sidecar blocks 100% of ungrounded upstream outputs from reaching the downstream agent (24/27 blocked, 0 ungrounded propagations).
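A controller that routes on the header signals might look like the sketch below. The wire format (`@C=0.82 @G=0.10 @S=0.30`), the thresholds, and the route names are assumptions made for illustration; the abstract names the signals but does not give the header syntax, and the assumption index is omitted here.

```python
import re

# Hypothetical header syntax: "@C=<float> @G=<float> @S=<float>".
SIGNAL = re.compile(r"@([CGS])=([0-9]*\.?[0-9]+)")


def parse_header(line):
    """Extract certainty (C), grounding (G), and stochasticity (S) as floats."""
    return {name: float(value) for name, value in SIGNAL.findall(line)}


def route(signals, g_min=0.5, c_min=0.7):
    """Assumed policy: ungrounded answers are contained, grounded-but-unsure ones repaired."""
    if signals.get("G", 0.0) < g_min:
        return "contain"      # do not retry and do not pass downstream
    if signals.get("C", 0.0) < c_min:
        return "repair"       # grounded but incomplete: a targeted retry is warranted
    return "pass"
```

A sidecar between two agents would call `route` on each upstream response and forward only those that come back as `pass`. A missing grounding signal defaults to containment, which is the conservative choice.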