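arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI Large Models

AI Agent

Agents, tool calling, planning, workflows, multi-agent systems, and autonomous task execution.

Papers included for today/current date: 16. Signal sources: cs.AI, cs.CL, cs.LG, cs.SE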
2606.20373 2026-06-19 cs.SE cs.AI New submission 90%

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Zepeng Li, Jie Ren, Zhanyong Tang, Jie Zheng, Zheng Wang

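Affiliations * Shaanxi Normal University; Northwest University; University of Leeds

Topic match Workflow automation: a multi-agent framework that automatically tunes compiler performance

AI summary Proposes AutoPass, a multi-agent framework that queries compiler-internal state and the intermediate representation and uses runtime feedback to iteratively refine compiler options; it needs no training and achieves geometric-mean speedups of 1.043x on x86-64 and 1.117x on ARM64 over LLVM -O3.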

Abstract

Large Language Models (LLMs) show promise for code compilation tasks, but applying them to runtime performance tuning is difficult due to complex microarchitectural effects and noisy runtime measurements. We present AutoPass, a multi-agent framework for compiler performance tuning that uses compiler and runtime evidence to guide LLM-generated optimization decisions. Rather than treating the compiler as a black box like prior auto-tuning schemes, AutoPass opens up the compiler to the LLM, enabling it to query compiler-internal optimization states and analyze the intermediate representation to orchestrate compiler options. The search process iteratively refines optimization configurations using measured runtime feedback to diagnose regressions and guide latency-improving edits. AutoPass operates in an inference-only, training-free setting and requires no offline training or task-specific fine-tuning, making it readily applicable to new benchmarks and platforms. We implement AutoPass on the LLVM compiler and evaluate it on server-grade x86-64 and embedded ARM64 systems. AutoPass outperforms expert-tuned heuristics and classical autotuning methods, achieving geometric-mean speedups of 1.043x and 1.117x over LLVM -O3 on x86-64 and ARM64, respectively.
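
The headline numbers above are geometric-mean speedups over a baseline. As a minimal sketch of how such a figure is typically computed (the function name and the runtimes below are hypothetical and do not come from the paper), the per-benchmark speedup is the baseline runtime divided by the tuned runtime, and the geometric mean is taken over benchmarks:

```python
import math


def geometric_mean_speedup(baseline_times, tuned_times):
    """Geometric mean of per-benchmark speedups (baseline time / tuned time)."""
    if len(baseline_times) != len(tuned_times) or not baseline_times:
        raise ValueError("need two non-empty lists of equal length")
    # Averaging in log space avoids overflow when many ratios are multiplied.
    log_sum = sum(math.log(b / t) for b, t in zip(baseline_times, tuned_times))
    return math.exp(log_sum / len(baseline_times))


# Hypothetical runtimes in seconds for three benchmarks (illustrative only).
o3_times = [2.0, 4.0, 1.0]
tuned_times = [1.0, 4.0, 2.0]
# Speedups of 2x, 1x and 0.5x cancel out to a geometric mean of about 1.0.
print(geometric_mean_speedup(o3_times, tuned_times))
```

The geometric mean is the usual choice for ratios because a 2x gain on one benchmark and a 2x loss on another cancel out, which an arithmetic mean would not reflect.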

2606.20318 2026-06-19 cs.DB New submission 90%

AgenticDB: Agentic Performance Reconfiguration for Database Workloads

Xinyue Yang, Chaozheng Wang, Chen Zheng, Heng Zhang, Yanjun Wu

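Topic match Workflow automation: an agent framework that automatically reconfigures database performance

AI summary Proposes AgenticDB, a framework that performs DBMS-level and OS-level reconfiguration through runtime interaction, diagnosing bottlenecks and accumulating experience; on MySQL and PostgreSQL it improves over the strongest baseline by 118.1% on average.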

Abstract

Database configuration tuning is critical for workload performance, but practical tuning on real deployments remains difficult. Existing automatic tuners mostly formulate tuning as iterative search over DBMS knob values. This formulation leads to high execution cost, prematurely narrows the configuration space, and leaves practical requirements insufficiently addressed: diagnosing runtime bottlenecks from system feedback, exploring OS-level reconfiguration opportunities, executing changes robustly, and learning from previous trials and tasks. We propose AgenticDB, an agentic framework for database workload reconfiguration. AgenticDB implements a context-grounded harness that interacts with the target database environment by proposing DBMS- and OS-level changes, applying them under safety constraints, observing workload performance and runtime states, and using execution feedback to guide subsequent decisions. This runtime interaction enables AgenticDB to diagnose bottlenecks, explore a broader DBMS- and OS-level reconfiguration space, avoid unsafe or unsupported actions, and accumulate experience within and across reconfiguration tasks. As a result, AgenticDB turns database tuning into a self-refining reconfiguration process in which runtime feedback iteratively improves later decisions. We conduct extensive experiments on MySQL and PostgreSQL using YCSB, Sysbench, and TPC-H workloads. The results show that AgenticDB achieves the best final performance on all evaluated workloads, improving over the strongest baseline by 118.1% on average and reducing aggregate time-to-best by 22.6%. The results also demonstrate that its OS-level action space, robust execution lifecycle, and memory-enhanced planning contribute to more effective and practical database reconfiguration.

2606.19790 2026-06-19 cs.CE New submission 90%

The Orchestration Gap: Why Process Automation Stalls in Operationally Complex Industries

Jiechao Gao, Yuandong Pan, Yuangang Li, Jie Wang, Kincho Law, Michael Lepech

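Topic match Workflow automation: analyzes the orchestration gap that multi-agent systems face when automating complex industries.

AI summary Introduces the concept of the "orchestration gap", analyzes why multi-agent systems fall short when automating complex industries such as logistics and healthcare, and lays out a staged automation path built on constraint enforcement and explainability.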

Abstract

Agentic systems have advanced quickly on digitally native tasks, yet they have barely touched the industries where coordinated automation could matter most: logistics, healthcare operations, construction, and the many sectors whose work is spread across incompatible tools and many hands. We argue that the reason is a missing abstraction. The value in these settings does not come from a single capable model invocation; it comes from orchestration, the runtime that coordinates multi-step workflows, enforces hard domain constraints, manages human approval, and bridges legacy systems. We develop this idea into a usable conceptual frame. We give an operational test for which workflows are orchestration-bound, a decomposition that separates how tangled a workflow is from how much of its effort is coordination and what that coordination is worth, and a feature-level account of why today's multi-agent frameworks leave a specific gap. We then advance our central claim: the right automation path is staged, and which architectural guarantee carries the most weight depends on a sector's dominant source of friction. Constraint enforcement is load-bearing under regulatory friction; explainability is load-bearing under liability friction. We close with the research program this view implies.

2606.19382 2026-06-19 cs.SE cs.AI New submission 90%

DynAMO: Dynamic Asset Management Orchestration via Topological Multi-Agent Scheduling

Kanishk Kushwaha, Vikrant Vinod Bansode, Harsh Vardhan, Dhaval C. Patel

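Affiliations * Gati Shakti Vishwavidyalaya; IBM Research

Topic match Workflow automation: proposes a multi-agent orchestration engine that generates workflow graphs.

AI summary Proposes the DynAMO engine, which uses a Plan-then-Execute architecture to generate verifiable workflow graphs, supports sequential and parallel execution, and improves efficiency by dynamically identifying independent tasks; on an industrial benchmark, parallel execution reduces median end-to-end latency by 1.6x while preserving correctness and safety.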

Comments 11 pages, 2 figures, 7 tables, 4 algorithms. Evaluated on the AssetOpsBench industrial benchmark. Code: https://github.com/kushwaha001/DynAMO

Abstract

While LLM-powered agents offer end-to-end automation for industrial asset lifecycles, real-world Industry 4.0 deployment is hindered by latency, concurrency instability, and safety risks. We present DynAMO (Dynamic Asset Management Orchestration), a deployment-ready engine using a Plan-then-Execute architecture to generate verifiable workflow graphs. DynAMO supports both SequentialWorkflow (topological execution) and ParallelWorkflow (dependency-aware concurrency). By dynamically identifying independent tasks, DynAMO preserves structural correctness and safety while significantly improving efficiency through controlled reasoning overlap. Across six controlled experiments on the AssetOpsBench industrial benchmark, DynAMO demonstrates substantial performance and robustness gains. Parallel execution reduces end-to-end latency by a median of 1.6x over sequential orchestration, rising to 1.8x on highly parallelizable workflows. After instrumenting external tool calls with realistic latencies, a latency decomposition shows that LLM reasoning and orchestration still account for more than 90% of execution time, identifying model inference as the primary system bottleneck. Structured context pruning reduces inference latency by approximately 30%, and DynAMO maintains correct functional behaviour (task completion, agent sequencing, and output quality) while exhibiting graceful degradation under controlled fault injection. Reproducibility analysis further confirms stable execution under repeated runs, with parallel scheduling reducing latency variance. These findings establish DynAMO as a practical blueprint for scalable, safe, and latency-aware agent deployment in Industry 4.0 automation pipelines. Code is available at: https://github.com/kushwaha001/DynAMO
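
The abstract contrasts topological execution with dependency-aware concurrency. The sketch below illustrates that general idea with Python's standard `graphlib` module; it is not DynAMO's implementation, and `run_workflow`, the task names, and the wave-by-wave scheduling are assumptions made for illustration. Tasks whose prerequisites are complete are dispatched together, and each group finishes before the next one starts:

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def run_workflow(deps, run_task, max_workers=4):
    """Run tasks in dependency order, executing independent tasks concurrently.

    deps maps each task name to the set of task names it depends on.
    Returns the list of "waves"; tasks inside one wave had no mutual dependencies.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())  # all tasks whose inputs are done
            list(pool.map(run_task, ready))     # blocks until the whole wave finishes
            sorter.done(*ready)
            waves.append(ready)
    return waves


# Hypothetical asset-management workflow (task names are illustrative only).
deps = {
    "fetch_sensor_data": set(),
    "fetch_work_orders": set(),
    "detect_anomaly": {"fetch_sensor_data"},
    "plan_maintenance": {"detect_anomaly", "fetch_work_orders"},
}
print(run_workflow(deps, lambda name: name))
```

A sequential workflow corresponds to `max_workers=1`. Waiting for a whole wave is the simplest correct schedule; a finer-grained scheduler could call `done()` as each task completes so that newly unblocked tasks start earlier.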

2606.20002 2026-06-19 cs.LG cs.AI cs.CL New submission 85%

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

Yanxi Chen, Weijie Shi, Yuexiang Xie, Boyi Hu, Yaliang Li, Bolin Ding, Jingren Zhou

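Affiliations * Alibaba Group

Topic match Workflow automation: trains LLMs to act as long-lifecycle agents.

AI summary Proposes the Connect the Dots framework, which uses end-to-end reinforcement learning to train LLMs to self-update their context over long task sequences and to generalize to new domains; experiments validate cross-domain generalization.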

Comments Work in progress; we will continuously update the codebase and arXiv version

Abstract

This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experiences, and iteratively self-updating its context about the environment, thereby achieving progressively better performance on future tasks conditioned on the updated context. Major components of the CoD framework include: (1) algorithm design and infrastructure for end-to-end reinforcement learning (RL) with long rollout sequences interleaving solve-task and update-context episodes; (2) tasks and environments for incentivizing and eliciting the targeted meta-capability in LLMs during training, as well as for faithfully measuring progress during evaluation. We present proof-of-concept implementations of the CoD framework, including a GRPO-style RL algorithm with fine-grained credit assignment, as well as tasks and environments tailored to the targeted meta-capability (rather than domain-specific LLM capabilities or standard task-by-task RL). Empirical results validate the efficacy of end-to-end RL training in the CoD setting, and demonstrate the potential for out-of-distribution generalization -- within the training domains, across different domains, and from CoD to Ralph-loop settings -- of the elicited meta-capability. Our investigation of CoD connects several lines of prior work, and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, we release our implementations at https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod.

2606.19795 2026-06-19 cs.SE cs.AI New submission 85%

Agentic Electronic Design Automation: A Handoff Perspective

Jiawei Liu, Peiyi Han, Yuntao Lu, Su Zheng, Fengyu Yan, Bei Yu

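Affiliations * The Chinese University of Hong Kong; Primarius Technologies

Topic match Workflow automation: discusses handoffs and automation for LLM-based agents in EDA.

AI summary Taking handoff validity as its organizing principle, this survey groups agentic systems in the EDA flow into three classes and proposes a five-layer agent communication protocol to address state transfer and validation across stages and tools.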

Abstract

Electronic design automation (EDA) is inherently multi-stage and handoff-heavy. Design artifacts, flow scripts, and engineering decisions cross tool, session, and organizational boundaries before final implementation, signoff, or release. Each transfer carries explicit and implicit requirements that may not be fully captured by stage-local checks. LLM-based agents now invoke EDA tools directly, embed retrieved knowledge in executable scripts, and hand off state across sessions and stages. Once their outputs condition downstream engineering decisions, the transferred object must satisfy a handoff contract and meet the assumptions of its next consumer. This survey introduces handoff validity as its organizing principle. A handoff is valid when the transferred object satisfies the consumer's acceptance conditions and carries sufficient context, evidence, and provenance for downstream use. We review 82 systems and classify them into three boundary classes. Stage-Bound systems establish validity within a single EDA stage or bounded verification task. Flow-Bound systems preserve coherent workflow state across tools, invocations, and sessions. Organization-Bound systems maintain source grounding, provenance, scope, and admissibility across knowledge and authority boundaries. For each class, we analyze handoff contracts, handoff objects, coordination mechanisms, and open questions. These analyses motivate a five-layer EDA agent communication protocol (EACP), covering the agent discovery, agent message, tool invocation, workflow orchestration, and security and IP protocols. We aim to provide a common vocabulary and research agenda for trustworthy agentic EDA.

2606.19390 2026-06-19 cs.SE cs.AI New submission 85%

Execution-bound advisory automation for agentic AI: a reproducible AIBOM-driven CSAF-VEX framework

Petar Radanliev, Omar Santos, Carsten Maple, Kay Atefi

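Affiliations * University of Oxford; Cisco Systems; The Alan Turing Institute; University of Warwick – WMG; University of Hull

Topic match Workflow automation: proposes a protocol-driven framework for automation around agentic AI workloads.

AI summary Proposes a protocol-driven framework that binds SBOM and AIBOM artefacts to deterministic environment capture and structured runtime telemetry, generates CSAF VEX advisories from combined static and runtime evidence, validates them through cryptographic signing and deterministic replay, and is evaluated on synthetic agentic AI workloads.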

Journal ref Execution-bound advisory automation for agentic AI: a reproducible AIBOM-driven CSAF-VEX framework. Front Artif Intell 9, (May 2026), 1826384

Abstract

A protocol-driven framework is presented that binds SBOM and AIBOM artefacts to deterministic environment capture and structured runtime telemetry. Exploitability is computed from declared artefacts, observed activation conditions, and enforced execution policies. CSAF VEX advisories are generated from combined static and runtime evidence, cryptographically signed, and validated through deterministic replay. Evaluation uses approximately 10,000 component entries across synthetic agentic AI workloads of 50 to 5,000 components, incorporating OSV, GitHub Advisory, KEV, and EPSS datasets.

2606.20394 2026-06-19 cs.RO math.OC New submission 85%

Agentic AutoResearch for Space Autonomy: An Auditable, LLM-Driven Research Agent for Aerospace Control Problems

Amit Jain, Richard Linares

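Affiliations * Department of Aeronautics and Astronautics

Topic match Workflow automation: an LLM-driven research agent that automatically develops spacecraft control policies

AI summary Proposes the AutoResearch framework, in which a large language model acts as an offline research agent that iteratively develops spacecraft control policies, while a built-in credibility layer audits each result to rule out seed noise; validated on rendezvous and docking problems.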

Abstract

Spacecraft guidance, navigation, and control functions are increasingly realized as learned policies distilled from expert solvers. Developing such a policy is itself a research process: an investigator selects an architecture and hyperparameters, runs experiments, and must determine whether an apparent improvement is genuine or merely seed noise. This paper presents AutoResearch, a framework in which a large language model autonomously drives that loop for aerospace control problems, coupled with a credibility layer, built into the loop, that certifies each reported result against the problem's own measured seed noise. The language model serves only as the offline research agent that develops the control policy; the trained policy it produces is then deployed onboard the spacecraft, while the model itself never operates the vehicle. At each iteration the agent reads a plain-language problem description and the run history, proposes a single edit to the training script, executes it, and logs the outcome. No reported result is credited until it passes the same three checks: measured per-problem seed noise, reseeded verification of the best configuration, and leave-one-out pruning of the agent's edits. The same loop is applied, unchanged, to two aerospace control problems: a Clohessy-Wiltshire relative rendezvous and a safety-constrained collision-avoidance docking past a keep-out zone, each calibrated against a known optimal control benchmark. In both, the audited policy clears the measured seed noise by many standard deviations; an undirected search over the same parameters does not. On the docking problem the gap becomes categorical: undirected search yields no feasible policy, while the learned policy stays outside the keep-out zone on every seed.
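
The abstract describes certifying a result only if it clears the problem's measured seed noise. One simple way to express that kind of check is sketched below; it is an illustration, not the paper's credibility layer. The function name, the 3-sigma default, the use of the sample standard deviation, and the "higher is better" score convention are all assumptions:

```python
import statistics


def clears_seed_noise(baseline_scores, candidate_scores, min_sigmas=3.0):
    """Check whether a candidate beats the baseline by more than seed noise.

    Seed noise is estimated as the sample standard deviation of the baseline
    configuration's score across random seeds (higher score = better).
    Returns (passes, sigmas), where sigmas is the mean gap in noise units.
    """
    noise = statistics.stdev(baseline_scores)
    if noise == 0:
        raise ValueError("baseline shows no seed-to-seed variation")
    gap = statistics.mean(candidate_scores) - statistics.mean(baseline_scores)
    sigmas = gap / noise
    return sigmas >= min_sigmas, sigmas


# Hypothetical scores from four seeds each (illustrative numbers only).
baseline = [0.70, 0.72, 0.68, 0.70]
candidate = [0.80, 0.81, 0.79, 0.80]
print(clears_seed_noise(baseline, candidate))
```

Re-running the best configuration with fresh seeds before calling this function corresponds to the reseeded verification step the abstract mentions; the leave-one-out pruning of edits is not covered by this sketch.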

2606.18191 2026-06-19 cs.AI cs.MA New submission 85%

DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction

Md Tawkat Islam Khondaker, Raymond Li, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Issam H. Laradji

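Affiliations * ServiceNow AI Research

Topic match Workflow automation: evaluates how well AI agents predict personalized workflows

AI summary Introduces the DRFLOW benchmark for evaluating how well AI agents predict personalized workflows from heterogeneous sources, with 100 tasks across five domains and seven diagnostic metrics; experiments show that existing agents still perform poorly on these metrics.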

Abstract

Deep research (DR) systems are increasingly used for complex information-seeking tasks, but existing work mainly focuses on generating reports and summaries. In contrast, many enterprise tasks require an agent to identify a concrete workflow, that is, a sequence of action-steps. For example, rather than summarizing budgeting policies, an agent should be able to determine the steps needed to answer a question such as: "How do I request new headcount given a fixed budget?" Therefore, we introduce DRFLOW, a benchmark for evaluating personalized workflows predicted by agents from heterogeneous sources. Each task requires the agent to identify relevant evidence from scattered sources, then use that evidence to predict the correct action-step sequence for the user's task. DRFLOW contains 100 tasks across five domains, with 1,246 reference workflow steps grounded in more than 3,900 sources. We define seven diagnostic metrics covering factual grounding, step recovery, structural ordering, condition resolution, and personalization. We further present DRFLOW-Agent (DRFA), a workflow-oriented reference agent for predicting personalized workflows. We show that although DRFA improves over strong baseline agents (by up to 10.02% in average F1 score), substantial room for improvement remains across these workflow metrics, indicating that predicting complete and correct personalized workflows remains a challenging frontier for deep research.

2606.19821 2026-06-19 cs.AI cs.LG New submission 80%

TelcoAgent: A Scalable 5G Multi-KPM Forecasting With 3GPP-Grounded Explainability

Geon Kim, Dara Ron, Sukhdeep Singh, Suyog Moogi, Pranshav Gajjar, V V N K Someswara Rao Koduri, Een Kee Hong, Vijay K. Shah

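Affiliations * NextG Wireless Lab, North Carolina State University; Kyung Hee University

Topic match Workflow automation: a multi-agent pipeline for 5G KPM forecasting and explainability.

AI summary Proposes the TelcoAgent framework, which uses foundation models for zero-shot forecasting of multiple KPMs and provides actionable diagnostics through a 3GPP knowledge graph and an explanation pipeline.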

Comments 6 pages, 6 figures. Submitted to IEEE GLOBECOM 2026

Abstract

Key Performance Measurement (KPM) forecasting is essential for proactive network management of 5G and next-generation telecom networks. However, existing machine learning (ML) approaches face significant limitations in scalability and explainability, restricting their effectiveness in real-world deployments. We propose TelcoAgent, a foundation model-based framework that enables accurate, scalable, and explainable forecasting of multiple KPMs across diverse network cells without the need for site-specific training. Specifically, the framework comprises three key components: (i) an automated three-agent pipeline that constructs a 3rd Generation Partnership Project (3GPP) knowledge graph directly from specification documents, (ii) a scalable, time-series foundation model (TSFM)-based prediction pipeline to deliver accurate, zero-shot forecasting, and finally (iii) a reasoning and explanation pipeline that provides actionable, domain-grounded diagnostics. Evaluated using a 3-month, real-world, city-scale 5G KPM dataset from a U.S.-based network operator, TelcoAgent demonstrates high forecasting accuracy for all 7 considered KPMs per cell across 200 cells, while delivering explainable insights and actionable instructions to address network degradations.

2606.19605 2026-06-19 cs.SE cs.AI New submission 80%

FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines

Paul Kassianik, Baturay Saglam, Huaibo Zhao, Blaine Nelson, Supriti Vijay, Aman Priyanshu, Amin Karbasi

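Affiliations * Foundation AI–Cisco Systems Inc.; Yale University

Topic match Workflow automation: the framework automatically diagnoses and optimizes pipeline bottlenecks, which falls under workflow automation

AI summary Proposes the FAPO framework, which automatically diagnoses pipeline bottlenecks and iteratively optimizes prompts or chain structure; it beats the GEPA baseline in 15 of 18 model-benchmark comparisons, with a mean gain of 14.1 percentage points.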

Abstract

Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants repeatedly to optimize against a score function. It first tries prompt edits and, only when prompt optimization appears insufficient, changes chain structure within the permitted scope when attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the baseline GEPA in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean ± trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security CVE-to-CWE task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.
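
The abstract counts wins with "non-overlapping" mean plus-or-minus trial-standard-deviation ranges. A minimal sketch of such a criterion follows; the function name is hypothetical, and the choice of the sample standard deviation is an assumption, since the abstract does not say which estimator is used:

```python
import statistics


def wins_with_non_overlapping_ranges(a_scores, b_scores):
    """True if A's (mean - std) lies strictly above B's (mean + std).

    Each argument holds one score per trial (needs at least two trials);
    std is the sample standard deviation across trials.
    """
    a_low = statistics.mean(a_scores) - statistics.stdev(a_scores)
    b_high = statistics.mean(b_scores) + statistics.stdev(b_scores)
    return a_low > b_high


# Hypothetical accuracy scores in percent over three trials each.
print(wins_with_non_overlapping_ranges([80, 82, 84], [70, 72, 74]))  # clear separation
print(wins_with_non_overlapping_ranges([75, 80, 85], [72, 74, 76]))  # higher mean, but ranges overlap
```

This is a stricter bar than comparing means alone: the second call has the higher mean on the first argument yet does not count as a win under this rule.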

2606.19501 2026-06-19 cs.AI cs.CL cs.LG q-fin.RM New submission 80%

DeXposure-Claw: An Agentic System for DeFi Risk Supervision

Aijie Shu, Bowei Chen, Wenbin Wu, Cathy Yi-Hsuan Chen, Fengxiang He

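Affiliations * University of Edinburgh; University of Glasgow; University of Cambridge

Topic match Workflow automation: a DeFi risk supervision agent built on graph time series

AI summary To address the tendency of LLM agents to raise false alarms in DeFi supervision, proposes DeXposure-Claw, a system that forecasts exposure networks with a graph time-series foundation model and combines deterministic monitors with confidence gates to emit auditable supervisory tickets; it also builds the six-axis evaluation benchmark DeXposure-Bench, and experiments support the system's effectiveness.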

Abstract

Decentralized finance exposes supervisors to fast-moving, networked credit risks. General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing evaluations offer no regulator-aligned way to measure the resulting false alarms. We introduce DeXposure-Claw, a forecast-grounded agentic supervision system that routes LLM decisions through structured evidence: (1) DeXposure-FM, a graph time-series foundation model, forecasts future exposure networks; (2) deterministic monitors and stress scenarios then turn those forecasts into typed alerts, attribution signals, and scenario evidence; and (3) data-health and confidence gates constrain escalation before DeXposure-Claw emits auditable supervisory tickets with rationales. We further develop DeXposure-Bench, a six-axis evaluation harness, whose decision axis scores tickets against a regulator-aligned absolute-loss ground truth and an explicit false-intervention rate. Experiments on five years of weekly real data fully support our system. Code is at https://github.com/EVIEHub/DeXposure-Claw.

2606.19812 2026-06-19 cs.AI cs.LG New submission 75%

Human-on-the-Loop Orchestration for AI-Assisted Legal Discovery

Anushree Sinha, Srivaths Ranganathan, Abhishek Dharmaratnakar, Debanshu Das

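Affiliations * Google LLC, Mountain View, CA, USA

Topic match Workflow automation: human-on-the-loop orchestration for AI-assisted legal discovery that reduces errors.

AI summary To address the legal risk created by multi-step reasoning errors of AI agents in e-discovery, proposes a four-layer verification architecture; human-on-the-loop escalation thresholds reduce privilege-waiver risk by up to 61%.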

Abstract

Autonomous Large Language Model (LLM) agents are increasingly deployed in electronic discovery (e-discovery), where compounding errors across multi-step reasoning chains can constitute legal malpractice. Unlike single-turn retrieval, agentic workflows operating over privileged document corpora exhibit a class of failure we term "trajectory collapse": an early misclassification silently propagates, rendering an entire privilege review invalid. This paper makes three contributions. First, we propose a structured taxonomy of agentic failures in legal information retrieval, organized by functional stage. Second, we introduce a four-layer verification architecture -- spanning planning, reasoning, execution, and uncertainty quantification -- designed to intercept these failures before they compound. Third, we present a preliminary simulation study on a synthetic e-discovery corpus that demonstrates how mandatory Human-on-the-Loop (HOTL) escalation thresholds reduce privilege-waiver risk relative to fully autonomous baselines. Our results suggest that calibrated uncertainty thresholds can reduce privilege-waiver risk by up to 61% versus fully autonomous deployment, while routing fewer than one quarter of documents to attorney review.
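
The escalation idea in the abstract, routing uncertain documents to an attorney while the agent handles the rest, can be illustrated with a small threshold rule. This is only a sketch under assumptions: `route_for_review`, the uncertainty scores, and the 0.30 threshold are hypothetical and are not taken from the paper, which does not describe its thresholds in the abstract:

```python
def route_for_review(uncertainty_by_doc, threshold):
    """Split documents into automatically handled and attorney-review queues.

    uncertainty_by_doc maps a document id to an uncertainty score in [0, 1];
    documents at or above the threshold are escalated to a human reviewer.
    """
    escalated = sorted(d for d, u in uncertainty_by_doc.items() if u >= threshold)
    automatic = sorted(d for d, u in uncertainty_by_doc.items() if u < threshold)
    return automatic, escalated


# Hypothetical uncertainty scores for four documents.
scores = {"doc1": 0.05, "doc2": 0.40, "doc3": 0.12, "doc4": 0.31}
automatic, escalated = route_for_review(scores, threshold=0.30)
print(escalated)                     # ['doc2', 'doc4']
print(len(escalated) / len(scores))  # 0.5, the fraction sent to attorney review
```

Lowering the threshold sends more documents to review and raising it sends fewer, which is the trade-off between review workload and residual risk that a calibrated threshold is meant to balance.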

2606.19602 2026-06-19 cs.AI New submission 75%

Configurable Clinical Information Extraction with Agentic RAG: What Works, What Breaks, and Why

Osman Alperen Çinar-Koraş, Marie Bauer, Sameh Khattab, Merlin Engelke, Moon Kim, Stephan Settelmeier, Shigeyasu Sugawara, Fabian Freisleben, Felix Nensa, Jens Kleesiek

Affiliations * Institute for Artificial Intelligence in Medicine (IKIM), University Medicine Essen; Faculty of Computer Science, University of Duisburg-Essen; Department of Physics, TU Dortmund University; Lamarr Institute for Machine Learning and Artificial Intelligence, TU Dortmund University; Advanced Clinical Research Center, Fukushima Medical University; Department of Cardiology and Vascular Medicine, University Hospital Essen

Topic match Workflow automation: an agentic RAG pipeline that automatically reasons over and verifies clinical information

AI summary To address missing metadata in clinical documents, proposes ACIE, an agentic RAG system deployed at University Medicine Essen that reasons over complete patient contexts and grounds answers in cited sources for verification; clinicians accepted 96.5% of extractions across 7,326 judgments.

Abstract

Patient contexts span hundreds of heterogeneous documents and thousands of structured data points, yet the document-level metadata that AI systems need for retrieval and triage is absent or incomplete. Standard retrieval-augmented generation fails on this data, mishandling temporal reasoning, cross-document dependencies, and missing metadata. We deploy ACIE (Agentic Clinical Information Extraction) at University Medicine Essen: an on-premise agentic RAG pipeline that reasons over complete patient contexts and grounds every answer in source passages for clinician verification. We quantify the metadata gap, trace the architectural decisions it shaped, and evaluate extraction alongside an independent retrospective lymphoma registry study, in which nuclear-medicine physicians verify every extracted value against its cited sources. Across 7,326 judgments, clinicians accepted 96.5% of extractions, with per-type acceptance ranging from 80% to 99%.

2606.19852 2026-06-19 cs.CL cs.LG 新提交 70%

Prompt, Plan, Extract: Zero-Shot Agentic LLMs Workflows for Lung Pathology Extraction from Clinical Narratives

提示、规划、提取:用于从临床叙述中提取肺部病理学的零样本智能体LLM工作流

Aman Pathak, Cheng Peng, Mengxian Lyu, Ziyi Chen, Reema Solan, Sankalp Talankar, Yasir Khan, Hiren Mehta, Aokun Chen, Yi Guo, Yonghui Wu

Affiliations * Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida; Division of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, College of Medicine, University of Florida; College of Nursing, Florida State University

Topic match Workflow automation: an agentic workflow for clinical information extraction.

AI summary Proposes a zero-shot agentic workflow that uses open-source large language models to extract 13 CAP fields from lung resection pathology reports, reaching a Micro-F1 of 0.893 without training, close to the supervised baseline (0.960).

Comments 7 pages, 2 figures, 3 tables. Affiliations: (1) Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; (2) Division of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA; (3) College of Nursing, Florida State University, Tallahassee, FL, USA

Details
AI Chinese abstract

从病理报告中提取信息对于癌症分期和肿瘤登记人群至关重要。然而关键数据仍嵌入在叙述性报告中,使得手动提取劳动密集且易出错。传统的监督自然语言处理流程通过完全监督的命名实体识别和关系提取来解决这一问题,但需要昂贵的人工标注,并且当上游实体缺失时会出现级联故障。在本研究中,我们开发了一个零样本智能体工作流,并评估了五个开源生成式大语言模型(LLMs),以从肺切除病理报告中填充13个美国病理学家学会的概要字段。我们使用一种新颖的、与注册对齐的评估框架,将它们与最先进的监督GatorTron NER-RE基线进行比较。基线达到了0.960的Micro-F1,而最佳零样本模型(GPT-OSS-20B)达到了0.893的Micro-F1(召回率:0.949),在没有任务特定训练的情况下准确提取了复杂关系(如病理分期)。这些结果表明,开源零样本智能体LLMs是提取肺部病理信息的低成本解决方案。

English abstract

Information extraction from pathology reports is essential for cancer staging and tumor registry population. Yet key data remains embedded in narrative reports, making manual extraction labor-intensive and error-prone. Traditional supervised Natural Language Processing pipelines address this through fully supervised Named Entity Recognition and Relation Extraction, but require expensive manual annotation and suffer cascading failures when upstream entities are missed. In this study, we developed a zero-shot, agentic workflow and evaluated five open-source generative Large Language Models (LLMs) to populate 13 College of American Pathologists synoptic fields from lung resection pathology reports. We compared them against a state-of-the-art supervised GatorTron NER-RE baseline using a novel, registry-aligned evaluation framework. The baseline achieved a Micro-F1 of 0.960, while the best zero-shot model (GPT-OSS-20B) achieved a Micro-F1 of 0.893 (recall: 0.949), accurately extracting complex relations like Pathologic Stage without task-specific training. These results suggest that open-source, zero-shot agentic LLMs are a low-cost solution for extracting lung pathology information.
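For readers unfamiliar with the metric, the following is a minimal sketch of a generic micro-averaged F1 over extracted fields. It is not the paper's registry-aligned evaluation framework, which the abstract does not describe. The function name `micro_f1`, the exact-match rule, and the field names and values in the example are all assumptions made for illustration.

```python
def micro_f1(gold, pred):
    """Micro-averaged precision, recall and F1 over field-level extractions.

    gold, pred: lists of dicts (one per report) mapping field name -> value.
    A prediction counts as correct only on exact match (an assumption).
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        for field in set(g) | set(p):
            g_val, p_val = g.get(field), p.get(field)
            if p_val is not None and p_val == g_val:
                tp += 1  # extracted and correct
            else:
                if p_val is not None:
                    fp += 1  # extracted but wrong or spurious
                if g_val is not None:
                    fn += 1  # gold value missed or wrong
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical two-report example (field names and values are illustrative).
gold = [{"stage": "pT2a", "histology": "adenocarcinoma"}, {"stage": "pT1b"}]
pred = [{"stage": "pT2a", "histology": "squamous"},
        {"stage": "pT1b", "margin": "negative"}]
precision, recall, f1 = micro_f1(gold, pred)
```

Micro-averaging pools true positives, false positives, and false negatives across all fields and reports before computing the score, so frequent fields weigh more than rare ones.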

2606.20360 2026-06-19 astro-ph.IM 新提交 60%

Lightstack: A Python Package for Creating Photometric Data Cubes

Lightstack: 用于创建测光数据立方体的Python包

Andressa Wille, Rafael S. de Souza, Ana L. Chies-Santos, Thallis Pessi, Emille E. O. Ishida, Alberto Krone-Martins

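Topic match Workflow automation: a Python package that automates the creation of photometric data cubes

AI summary Presents the Lightstack Python package, which combines standalone images into photometric data cubes in three steps (cropping, stacking, and PSF matching) to support multi-band photometric studies.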

Comments 4 pages, 1 figure, published in RNAAS

Journal ref Research Notes of the AAS, Volume 10, Number 6, 2026

Details
AI Chinese abstract

多波段测光追踪了跨广泛波长的多种物理过程。近几十年来,这一领域由多成像数据集的快速增长所驱动,例如来自哈勃空间望远镜和詹姆斯·韦伯空间望远镜的高分辨率观测,以及即将由罗曼空间望远镜和鲁宾天文台实现的大规模巡天。在这项工作中,我们介绍了lightstack,一个用于将独立图像组合成测光数据立方体的Python包。工作流程包括三个主要步骤:从所有可用滤光片的拼接图像中裁剪感兴趣区域;堆叠图像以构建数据立方体;对立方体执行PSF匹配。该包旨在为涉及多波段测光的研究准备数据。代码以MIT许可证发布,并在GitHub上提供,同时附有Jupyter教程笔记本。本出版物使用的版本(v0.2.1)已存档于Zenodo。

English abstract

Multi-band photometry traces diverse physical processes across a wide range of wavelengths. In recent decades, this field has been driven by the rapid growth of multi-imaging datasets, ranging from high-resolution observations by the Hubble Space Telescope and the James Webb Space Telescope to the forthcoming large-scale surveys enabled by the Roman Space Telescope and Rubin Observatory. In this work, we present lightstack, a Python package for combining standalone images into photometric data cubes. The workflow consists of three main steps: cropping a region of interest from a mosaic across all available filters; stacking the images to construct the data cube; and performing PSF matching on the cube. This package is intended for preparing data for studies involving multi-band photometry. The code is released under an MIT license and is available on GitHub together with a Jupyter tutorial notebook. The version used for this publication (v0.2.1) is archived on Zenodo.
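The three steps named in the abstract (crop, stack, PSF match) can be illustrated with a small NumPy sketch. This is not lightstack's API: every function name here is hypothetical, and the PSF matching assumes each filter has a circular Gaussian PSF of known width, so matching to the broadest PSF reduces to a Gaussian blur with sigma = sqrt(sigma_target^2 - sigma_filter^2). Real pipelines typically derive matching kernels from measured PSFs.

```python
import numpy as np


def crop(image, row, col, size):
    """Cut a size x size region whose top-left corner is (row, col)."""
    return image[row:row + size, col:col + size]


def gaussian_kernel(sigma):
    """1-D normalized Gaussian kernel truncated at about 4 sigma."""
    half = max(1, int(np.ceil(4 * sigma)))
    x = np.arange(-half, half + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()


def blur(image, sigma):
    """Separable Gaussian blur; the image must be wider than the kernel."""
    if sigma <= 0:
        return image.copy()
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(np.convolve, 1, image, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="same")


def build_cube(mosaics, row, col, size):
    """Steps 1 and 2: crop the same region in every filter, then stack."""
    return np.stack([crop(m, row, col, size) for m in mosaics], axis=0)


def match_psf(cube, psf_sigmas):
    """Step 3: blur every plane to the broadest (Gaussian) PSF in the cube."""
    target = max(psf_sigmas)
    out = np.empty_like(cube, dtype=float)
    for i, sigma in enumerate(psf_sigmas):
        out[i] = blur(cube[i], np.sqrt(target ** 2 - sigma ** 2))
    return out


# Two synthetic 64 x 64 "mosaics", each with one point source at (32, 32).
mosaics = [np.zeros((64, 64)) for _ in range(2)]
for m in mosaics:
    m[32, 32] = 1.0

cube = build_cube(mosaics, row=16, col=16, size=32)  # shape (2, 32, 32)
matched = match_psf(cube, psf_sigmas=[1.0, 2.0])     # sigmas in pixels (assumed)
```

In this toy case the plane that already has the broadest PSF is left unchanged, while the sharper plane is smoothed and its total flux is preserved because the kernel is normalized and the source sits far from the image edge.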