arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1530
专题追踪
2606.19613 2026-06-19 cs.SE cs.AI 新提交

StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

StaminaBench: 对编码智能体进行100轮交互的压力测试

Vlad Sobal, Shuo Yang, Yuting Zhang, Wei Xia, Stefano Soatto

发表机构 * AWS Agentic AI(AWS 代理人工智能)

AI总结 提出StaminaBench基准,通过100轮连续变更请求测试编码智能体的耐力,发现所有模型在5-6轮内失败,而测试反馈和重试机制可将通过轮数提升12倍。

详情
AI中文摘要

我们引入了StaminaBench,一个衡量编码智能体耐力的基准:它们在失败前能处理多少连续交互轮次(变更请求)。与流行的任务解决率指标不同,这符合实际编码风格,其中会话运行数十或数百轮。在StaminaBench中,智能体实现一个REST API服务器,并在可调数量的程序生成的后续变更请求(实验中为100个)上进行修改,导致代码库最多达6000行。测试完全以编程方式生成,无需LLM参与,确保可重复性和可靠性;变更序列来自硬编码或LLM驱动的采样器,两者都受限于结构化动作空间以确保变更有效。智能体和服务器在隔离环境中运行,并通过HTTP与基准通信,使测试完全黑盒且与语言无关。我们评估了六个智能体框架与七个开源LLM在20个场景(每个100轮)上的表现,发现:(1)所有测试模型在5-6轮内失败,确认了无彻底测试的编码风格会产生错误;(2)将测试反馈传递给智能体并允许重试,可将通过轮数提升最多12倍;(3)良好的框架是强性能所必需的:更强的模型在其最佳和最差框架之间表现出高达6倍的差距,而较弱的模型在任何框架下都失败。我们发布了基准和生成的任务,以促进对多轮编码智能体行为的进一步研究。基准代码和数据:此 http URL。

英文摘要

We introduce StaminaBench, a benchmark that measures the stamina of coding agents: how many consecutive interaction turns (change requests) they can handle before failing. Unlike the prevailing fraction-of-tasks-solved metric, this matches real vibe-coding where sessions run dozens or hundreds of turns. In StaminaBench, agents implement a REST API server and modify it across a tunable number of procedurally generated follow-up change requests - 100 in our experiments, resulting in codebases of up to 6,000 lines. Tests are generated fully programmatically without LLM involvement, ensuring reproducibility and reliability; change sequences are drawn from either a hardcoded or LLM-driven sampler, both constrained to a structured action space to ensure changes are valid. The agent and the server run in an isolated environment and communicate with the benchmark through HTTP, making testing fully black-box and language-agnostic. We evaluate six agent harnesses paired with seven open-source LLMs across 20 scenarios of 100 turns each and find that: (1) all the tested models fail within 5-6 turns, confirming that vibe-coding-style programming without thorough testing produces bugs; (2) passing test feedback back to the agent and allowing it to retry improves passed turn count by up to 12x; and (3) a good harness is required for strong performance: stronger models exhibit up to a 6x gap between their best and worst harness, while weaker models fail with any harness. We release the benchmark and the generated tasks to enable further research into multi-turn coding agent behavior. Benchmark code and data: github.com/amazon-science/StaminaBench.

2606.19605 2026-06-19 cs.SE cs.AI 新提交

FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines

FAPO:多步骤LLM流水线的全自动提示优化

Paul Kassianik, Baturay Saglam, Huaibo Zhao, Blaine Nelson, Supriti Vijay, Aman Priyanshu, Amin Karbasi

发表机构 * Foundation AI–Cisco Systems Inc.(基础AI–思科系统公司) Yale University(耶鲁大学)

AI总结 提出FAPO框架,通过自动诊断流水线瓶颈并迭代优化提示或链结构,在18个模型-基准比较中15次优于基线GEPA,平均提升14.1个百分点。

详情
AI中文摘要

多步骤LLM流水线因检索、推理和格式化步骤间的交互而失败,因此仅提示优化可能遗漏链中的瓶颈。我们提出FAPO(全自动提示优化),一个让Claude Code在标准化代码库内优化LLM流水线的框架。FAPO评估流水线、检查中间步骤、诊断失败、提出范围变更,并重复验证变体以针对评分函数进行优化。它首先尝试提示编辑,仅当提示优化似乎不足时,在归因识别出结构瓶颈的情况下,在允许范围内更改链结构。在六个基准和三个任务模型上,FAPO在18个模型-基准比较中的15个中击败了基线GEPA。在11个模型-基准比较中,FAPO以不重叠的均值±试验标准差范围获胜,平均FAPO-GEPA增益为+14.1个百分点。在六个HoVer和IFBench比较中,当提示优先搜索升级为结构变更时,FAPO在所有六个中获胜,平均增益为+33.8个百分点。FAPO还提高了安全任务的性能:在CTIBench-RCM(一个安全CVE到CWE任务)上,仅提示的FAPO在GPT-5上提升了+4.0个百分点的测试准确率,在Foundation-Sec-8B-Instruct上提升了+7.1个百分点,在Foundation-Sec-8B-Reasoning上提升了+2.0个百分点。这些结果使FAPO成为通用和安全任务的最先进流水线优化技术。

英文摘要

Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants repeatedly to optimize against a score function. It first tries prompt edits and, only when prompt optimization appears insufficient, changes chain structure within the permitted scope when attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the baseline GEPA in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean $\pm$ trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security CVE-to-CWE task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.

2606.19535 2026-06-19 cs.CR cs.LG 新提交

FloatDoor: Platform-Triggered Backdoors in LLMs

FloatDoor: 大语言模型中的平台触发后门

Nils Loose, Jonas Sander, Felix Mächtle, Thomas Eisenbarth

发表机构 * University of Luebeck(吕贝克大学)

AI总结 提出FloatDoor,首个输入无关、平台触发的后门攻击,利用浮点运算平台差异,通过两个轻量LoRA适配器在目标平台触发恶意行为,同时保持模型正常效用。

详情
AI中文摘要

大型语言模型(LLM)越来越多地部署在软件工程等敏感环境中,其输出直接影响下游工件。最近的研究表明,由于非结合浮点运算和不同的内核实现,同一模型在不同部署平台上可能产生可测量的不同输出。我们研究了这种平台依赖可变性的安全影响,并揭示了LLM部署中一种新的攻击面。我们提出了FloatDoor,这是首个针对生成式LLM的输入无关、平台触发的后门攻击。被攻陷的模型在目标平台上表现出对手选择的行为,而在其他平台上则表现正常。FloatDoor通过两个轻量级LoRA适配器实现:一个放大平台间数值差异,另一个将由此产生的平台签名绑定到恶意下游任务,同时保持模型整体效用基本不变。FloatDoor利用了模型审计和部署之间的显著检查时间与使用时间差距。我们在Qwen3-4B上展示了FloatDoor,涵盖了广泛的部署目标,包括NVIDIA GPU、Google TPU、AWS Graviton和阿里巴巴Yitian-710。作为最终案例研究,我们展示了FloatDoor能够在选定的目标平台上可靠地诱导可利用的代码漏洞。我们的结果建立了一类新的LLM部署攻击,并强调了在敏感的LLM驱动应用中建立可信模型供应链的迫切需求。

英文摘要

Large language models (LLMs) are increasingly deployed in sensitive settings such as software engineering, where their outputs directly shape downstream artifacts. Recent work has shown that an identical model can produce measurably different outputs depending on the deployment platform, a consequence of non-associative floating-point arithmetic and divergent kernel implementations. We study the security implications of this platform-dependent variability and uncover a novel attack surface on LLM deployments. We introduce FloatDoor, the first input-independent, platform-triggered backdoor attack against generative LLMs. The compromised model exhibits adversary-chosen behavior when served on a target platform and is otherwise benign. FloatDoor is realized through two lightweight LoRA adapters, one that amplifies inter-platform numerical divergence and one that binds the resulting platform signature to a malicious downstream task, while leaving aggregate model utility largely intact. FloatDoor exploits a pronounced time-of-check, time-of-use gap between model auditing and serving. We demonstrate FloatDoor on Qwen3-4B across a broad range of deployment targets, including NVIDIA GPUs, Google TPUs, AWS Graviton, and Alibaba Yitian-710. As a final case study, we show that FloatDoor reliably induces exploitable code vulnerabilities on a chosen target platform. Our results establish a new class of attacks on LLM deployments and underscore the pressing need for trusted model supply chains in sensitive, LLM-powered applications.

2606.19474 2026-06-19 cs.CR cs.AI cs.SE 新提交

Secure Coding Drift in LLM-Assisted Post-Quantum Cryptography Development: A Gamified Fix

LLM辅助后量子密码开发中的安全编码漂移:一种游戏化修复方案

R. D. N. Shakya, C. P. Wijesiriwardana, S. M. Vidanagamachchi, Nalin A. G. Arachchilage

发表机构 * University of Moratuwa(摩图瓦大学) University of Ruhuna(鲁胡纳大学) RMIT University(皇家墨尔本理工大学)

AI总结 提出LLM辅助PQC开发中的安全编码漂移模型,通过游戏化框架将LLM转变为主动安全协作者,以缓解长期依赖LLM导致的安全退化。

Comments Accepted for 2026 SIGIR Workshop on Vulnerabilities in Generative Systems for Information Retrieval track

详情
AI中文摘要

向后量子密码学(PQC)的过渡引入了相当大的实现复杂性,要求严格遵守恒定时间执行、侧信道抵抗和精确参数化。同时,大型语言模型(LLM)已深度嵌入软件开发工作流程,包括密码工程。虽然LLM提高了生产力,但证据表明它们经常生成不安全或次优的代码,特别是在安全关键领域。本文引入了PQC中的安全编码漂移,这是一种新颖的社会技术漏洞模型,捕捉了由于持续依赖LLM生成的代码而导致的安全编码实践逐渐退化。与先前关注静态漏洞的工作不同,我们将安全风险概念化为一种源于人机交互的纵向行为现象。为了缓解这一问题,我们提出了一种游戏化的、LLM增强的安全编码框架,将对抗性评估、行为反馈和安全评分嵌入开发工作流程。我们的方法将LLM从被动助手重新定义为主动安全协作者,为AI中介环境中的更安全PQC实现做出贡献。

英文摘要

The transition to Post Quantum Cryptography (PQC) introduces considerable implementation complexity, requiring strict adherence to constant-time execution, side channel resistance, and precise parametrisation. Simultaneously, large language models (LLMs) are heavily embedded in software development workflows, including cryptographic engineering. While LLMs improve productivity, evidence shows that they frequently generate insecure or suboptimal code, particularly in security critical domains. This paper introduces Secure Coding Drift in PQC, a novel socio technical vulnerability model capturing the gradual degradation of secure coding practices due to sustained reliance on LLM-generated code. Unlike prior work that focuses on static vulnerabilities, we conceptualise security risk as a longitudinal behavioural phenomenon rising from human AI interaction. To mitigate this, we propose a gamified, LLM augmented secure coding framework that embeds adversarial evaluation, behavioural feedback, and security scoring into development workflows. Our approach reframes LLMs from passive assistants into active security co-pilots, contributing toward safer PQC implementation in AI mediated environments.

2606.19407 2026-06-19 cs.SE cs.AI 新提交

JustDiag!: A Diagnostic Justification Engine for Accountable Root Cause Analysis

JustDiag!:用于可问责根本原因分析的诊断论证引擎

Tingzhu Bi, Xinrui Jiang, Xun Zhang, Pengcheng Su, Congjie He, Jinglin Li, Ping Wang, Meng Ma

发表机构 * Peking University(北京大学) University of Edinburgh(爱丁堡大学) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出JustDiag诊断论证引擎,通过维护显式的过程状态(证据、发现、竞争假设、冲突和下一步检查)来支持可问责的根本原因分析,在66个真实事件上评估显示其优于仅提供流畅最终答案的方法。

详情
AI中文摘要

大型语言模型可以生成流畅的根本原因分析,但仅凭流畅的最终答案不足以证明高风险操作中的可问责性。在实际事件响应中,工程师需要知道哪些证据支持诊断,考虑了哪些替代方案,哪里存在矛盾,以及系统是解决了问题还是保留了不确定性。我们通过JustDiag填补了这一空白,这是一个用于RCA的诊断论证引擎,它维护了关于证据、发现、竞争假设、冲突和下一步检查的显式过程状态。我们使用两层协议在66个真实事件上评估了该系统,该协议分别对最终答案质量和过程质量进行评分。与没有诊断论证的匹配对照组相比,JustDiag获得了更强的结果和过程分数,同时由于更校准的非闭合性而接受了略低的终端完成率。这些结果表明,可问责的RCA需要显式的诊断论证工件和过程感知评估,而不仅仅是流畅的最终答案。

英文摘要

Large language models can produce fluent root cause analyses, but fluent final answers alone are insufficient evidence for accountability in high-stakes operations. In real incident response, engineers need to know what evidence supported a diagnosis, which alternatives were considered, where contradictions remained, and whether the system resolved the case or preserved uncertainty. We address this gap with JustDiag, a diagnostic justification engine for RCA that maintains an explicit process state over evidence, findings, competing hypotheses, conflicts, and next checks. We evaluated the system on 66 real-world incidents using a two-layer protocol that separately scores final-answer quality and process quality. Relative to a matched control without diagnostic justification, JustDiag achieved stronger outcome and process scores, while accepting slightly lower terminal completion due to more calibrated non-closure. These results suggest that accountable RCA requires explicit diagnostic justification artifacts and process-aware evaluation, not only fluent final answers.

2606.19390 2026-06-19 cs.SE cs.AI 新提交

Execution-bound advisory automation for agentic AI: a reproducible AIBOM-driven CSAF-VEX framework

面向执行约束的自主AI自动化:一种可复现的AIBOM驱动的CSAF-VEX框架

Petar Radanliev, Omar Santos, Carsten Maple, Kay Atefi

发表机构 * University of Oxford(牛津大学) Cisco Systems(思科系统) The Alan Turing Institute(艾伦·图灵研究所) University of Warwick – WMG(沃里克大学 – WMG) University of Hull(哈罗德大学)

AI总结 提出一种协议驱动框架,通过绑定SBOM和AIBOM工件与确定性环境捕获及结构化运行时遥测,结合静态与运行时证据生成CSAF VEX公告,经密码签名和确定性重放验证,在合成自主AI工作负载上评估。

Journal ref Execution-bound advisory automation for agentic AI: a reproducible AIBOM-driven CSAF-VEX framework. Front Artif Intell 9, (May 2026), 1826384

详情
AI中文摘要

提出一种协议驱动框架,将SBOM和AIBOM工件绑定到确定性环境捕获和结构化运行时遥测。利用声明的工件、观察到的激活条件和强制执行的策略计算可利用性。从静态和运行时证据生成CSAF VEX公告,经密码签名并通过确定性重放验证。评估使用约10000个组件条目,涵盖50到5000个组件的合成自主AI工作负载,并整合OSV、GitHub Advisory、KEV和EPSS数据集。

英文摘要

A protocol driven framework is presented that binds SBOM and AIBOM artefacts to deterministic environment capture and structured runtime telemetry. Exploitability is computed from declared artefacts, observed activation conditions, and enforced execution policies. CSAF VEX advisories are generated from combined static and runtime evidence, cryptographically signed, and validated through deterministic replay. Evaluation uses approximately 10000 component entries across synthetic Agentic AI workloads 50 to 5000 components, incorporating OSV, GitHub Advisory, KEV, and EPSS datasets.

2606.19388 2026-06-19 cs.SE cs.CL cs.HC 新提交

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

超越GUI范式:移动代理是否需要手机屏幕?

Li Gu, Zihuan Jiang, Linqiang Guo, Zhixiang Chi, Ziqiang Wang, Huan Liu, Yuanhao Yu, Tse-Hsun Chen, Yang Wang

发表机构 * Mila – Québec AI Institute(魁北克人工智能研究所) Concordia University(康科迪亚大学) University of Toronto(多伦多大学) McMaster University(麦马斯特大学)

AI总结 本文挑战移动代理的GUI主导范式,提出CLI应同等重要,通过实验证明CLI代理在AndroidWorld和MobileWorld上超越GUI基线,并引入CLI-Advantage任务套件展示其优势。

详情
AI中文摘要

近期移动代理的进展主要由GUI范式主导,其中代理感知UI信息并发出屏幕交互。然而,移动平台也提供了命令行接口(CLI),可直接访问设备服务和数据。我们认为CLI应与GUI同等重要。我们在AndroidWorld和MobileWorld上,使用四种模型API评估了三个编码代理(Claude Code、Terminus-2、mini-swe-agent),未进行任何移动特定后训练,并与三个可复现的GUI基线(GUI-Owl-1.5-32B、MAI-UI、Qwen3-VL-32B)进行比较。Claude Code(Opus 4.7)达到71.8%和51.9%,优于所有可复现的GUI基线(AndroidWorld上69.3/68.1/57.8%;MobileWorld上43.2/26.3/13.3%),而其他CLI配置也保持竞争力。为确立该范式的上限,我们提供了oracle CLI解决方案,在AndroidWorld上达到88.8%(103/116个任务可CLI解决),在MobileWorld上达到86.3%(101/117个任务可CLI解决),表明未来有大量改进空间。为覆盖GUI范围之外的日常用户意图,我们引入了\ extbf{CLI-Advantage任务套件},包含五个类别的45个模板:批量操作、多条件过滤、聚合、跨应用工作流和隐藏设备状态。所有CLI代理在所有五个类别中均优于所有GUI基线,且每个任务步骤显著更少(10.7步 vs. 18.6步)。为支持未来移动CLI代理的研究,我们将开源代理实现、oracle解决方案、CLI-Advantage套件和评估基础设施。

英文摘要

Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct access to device services and data. We argue CLI deserves first-class consideration alongside GUI. We evaluate three coding agents (Claude Code, Terminus-2, mini-swe-agent) across four model APIs on AndroidWorld and MobileWorld without any mobile-specific post-training, comparing against three reproducible GUI baselines (GUI-Owl-1.5-32B, MAI-UI, Qwen3-VL-32B). Claude Code (Opus 4.7) reaches 71.8\% and 51.9\%, outperforming every reproducible GUI baseline (69.3/68.1/57.8\% on AndroidWorld; 43.2/26.3/13.3\% on MobileWorld), while every other CLI configuration remains competitive. To establish the paradigm's ceiling, we provide oracle CLI solutions that reach 88.8\% on AndroidWorld (103/116 tasks CLI-solvable) and 86.3\% on MobileWorld (101/117 tasks CLI-solvable), indicating substantial room for future improvement. To cover everyday user intents beyond the GUI scope, we introduce the \textbf{CLI-Advantage Task Suite}, comprising 45 templates across five categories: bulk operations, multi-condition filtering, aggregation, cross-app workflows, and hidden device state. Every CLI agent outperforms every GUI baseline in all five categories, with substantially fewer steps per task (10.7 vs.\ 18.6). To support future research on mobile CLI agents, we will open-source agent implementations, oracle solutions, the CLI-Advantage suite, and evaluation infrastructure.

2606.19387 2026-06-19 cs.SE cs.AI 新提交

Interpretable and Verifiable Hardware Generation with LLM-Driven Stepwise Refinement

可解释且可验证的硬件生成:基于LLM驱动的逐步细化

You Li, Samuel Mandell, David Z. Pan

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Fudan University(复旦大学) USA(美国)

AI总结 提出结合LLM创造力与形式化方法可解释性的硬件生成框架,通过迭代应用变换规则将设计规范转换为正确性有保证的RTL程序。

详情
AI中文摘要

大型语言模型(LLM)在软件开发中取得了显著成功。然而,它们容易产生幻觉,即可能引入微妙的语义和逻辑错误。由于芯片设计和制造的高风险,硬件工程师仍不愿依赖LLM进行寄存器传输级(RTL)生成。本文提出一种硬件生成框架,结合了LLM的创造力和广泛知识与形式化方法的可解释性和数学严谨性。具体而言,我们设计了一组覆盖各种设计决策和硬件特征的变换规则。通过迭代应用这些规则,LLM代理可以将设计规范转换为正确性有保证的RTL程序。实验结果证明了该框架的有效性和效率。

英文摘要

Large language models (LLMs) have achieved remarkable success in software development. However, they are susceptible to hallucinations, meaning that they can introduce subtle semantic and logical errors. Due to the high stakes in chip design and manufacturing, hardware engineers are still reluctant to rely on LLMs for register-transfer level (RTL) generation. In this paper, we propose a hardware generation framework that combines the creativity and broad knowledge of LLMs with the explainability and mathematical rigor of formal methods. Specifically, we devise a set of transformation rules that cover various design decisions and hardware features. By iteratively applying these rules, an LLM agent can convert a design specification into an RTL program with guaranteed correctness. Experimental results demonstrate the effectiveness and efficiency of the framework.

2606.19386 2026-06-19 cs.SE cs.AI cs.LG 新提交

Bistable by Construction: Wall-Clock-Calibrated State Monitors Have No Moment-Detection Regime at Agent Cadence

通过构造实现双稳态:挂钟校准的状态监视器在代理节奏下没有瞬间检测机制

Manvendra Modgil

发表机构 * Modint Intelligence(Modint智能科技)

AI总结 本文发现挂钟校准的泄漏积分器监视器在代理流中无法作为瞬间检测器工作,揭示了校准类别的关键影响,并提出了上升沿触发作为替代方案。

Comments 10 pages, 5 figures. Sequel to arXiv:2606.04296. Pre-registered; falsification clauses honored (H5 unsupported; H7 strict band 16/20) repo:https://github.com/2025eb1100268-tech/intervention-timing-saturation-trap

详情
AI中文摘要

自主代理的运行时监视器通常对累积的内部状态(行为基线、漂移统计量,或在我们之前工作中的建模情感状态)设置阈值。我们之前报告了一个状态饱和陷阱:在连续情感引擎上基于阈值的状态触发在SWE-bench调试代理(Modgil 2026)上变成了近乎恒定的警报。发布后审计发现引擎在动作之间接收到的dt=0,因此其指数衰减从未运作:已发布的陷阱是一个纯累加器的结果。我们更正了记录(勘误,v2)并将该缺陷视为一个实验。它揭示的关键变量是监视器的动态是在样本时间(每次观测,如CUSUM)还是挂钟时间(半衰期以秒计,如情感模型和EMA基线)校准的。在固定速率流上两者一致;在代理流上,动作间时间变化几个数量级,它们不一致。在20条轨迹上对均匀间隔(dt在{0..600}秒内)的预注册扫描显示,挂钟水平触发器有两个机制:在dt<=1秒时恒定警报(20/20;中位数18次触发);在dt>=60秒时静默。每个关键dt位于(1,30]秒内。真实代理运行测量延迟中位数为1.53秒(p90 2.33秒);真实编码节奏位于陷阱机制内,在修正机制下证实了经验发现。该结构是校准类别的属性,而非引擎:在原始误差流上的最小挂钟累加器重现了相同的悬崖,而相同流上的样本时间CUSUM恰好是dt不变的(20/20)。带有滞后的上升沿触发器在每个条件下每条轨迹触发0-3次。我们得出结论,挂钟校准的泄漏积分器监视器在代理流上不存在作为瞬间检测器的机制;转换检测在每个节奏下都逃脱了陷阱,但无法恢复人工干预时机。

英文摘要

Runtime monitors for autonomous agents commonly threshold an accumulated internal state - a behavioural baseline, a drift statistic, or, in our prior work, a modelled affective state. We previously reported a State Saturation Trap: threshold-on-state triggers over a continuous affect engine become near-constant alarms on SWE-bench debugging agents (Modgil 2026). A post-release audit found the engine received dt=0 between actions, so its exponential decay never operated: the published trap is a pure-accumulator result. We correct the record (erratum, v2) and treat the flaw as an experiment. The key variable it exposes is whether a monitor's dynamics are calibrated in sample time (per observation, as in CUSUM) or wall-clock time (half-lives in seconds, as in affect models and EMA baselines). On fixed-rate streams these coincide; on agent streams, where inter-action time varies by orders of magnitude, they do not. A pre-registered sweep over uniform intervals (dt in {0..600}s) on 20 trajectories shows the wall-clock level trigger has two regimes: at dt<=1s a constant alarm (20/20; median 18 firings); at dt>=60s silent. Every critical dt lies in (1,30]s. Real agent runs measure latency at median 1.53s (p90 2.33s); real coding cadence sits inside the trap regime, vindicating the empirical finding under a corrected mechanism. The structure is a property of the calibration class, not the engine: a minimal wall-clock accumulator over the raw error stream reproduces the same cliff, while a sample-time CUSUM over the identical stream is exactly dt-invariant (20/20). A rising-edge trigger with hysteresis fires 0-3 times per trajectory in every condition. We conclude that wall-clock-calibrated leaky-integrator monitors admit no regime in which they act as moment detectors on agent streams; transition detection escapes the trap at every cadence, but does not recover human intervention timing.

2606.19382 2026-06-19 cs.SE cs.AI 新提交

DynAMO:Dynamic Asset Management Orchestration via Topological Multi-Agent Scheduling

DynAMO:基于拓扑多智能体调度的动态资产管理编排

Kanishk Kushwaha, Vikrant Vinod Bansode, Harsh Vardhan, Dhaval C. Patel

发表机构 * Gati Shakti Vishwavidyalaya(加蒂·沙克蒂大学) IBM Research(IBM研究院)

AI总结 提出DynAMO引擎,采用先规划后执行架构生成可验证工作流图,支持顺序与并行执行,通过动态识别独立任务提升效率,在工业基准上实现1.6倍延迟降低,并保持正确性与安全性。

Comments 11 pages, 2 figures, 7 tables, 4 algorithms. Evaluated on the AssetOpsBench industrial benchmark. Code: https://github.com/kushwaha001/DynAMO

详情
AI中文摘要

虽然基于LLM的智能体为工业资产生命周期提供了端到端自动化,但现实世界中的工业4.0部署受到延迟、并发不稳定性和安全风险的阻碍。我们提出了DynAMO(动态资产管理编排),一个部署就绪的引擎,采用先规划后执行架构来生成可验证的工作流图。DynAMO支持顺序工作流(拓扑执行)和并行工作流(依赖感知并发)。通过动态识别独立任务,DynAMO在保持结构正确性和安全性的同时,通过受控推理重叠显著提高效率。在AssetOpsBench工业基准上的六项受控实验中,DynAMO展示了显著的性能和鲁棒性提升。并行执行相比顺序编排将端到端延迟中位数降低了1.6倍,在高度可并行化的工作流上达到1.8倍。在外部工具调用中加入实际延迟后,延迟分解显示LLM推理和编排仍占执行时间的90%以上,表明模型推理是主要系统瓶颈。结构化上下文剪枝将推理延迟降低约30%,并且DynAMO在受控故障注入下保持正确的功能行为(任务完成、智能体排序和输出质量),同时表现出优雅降级。可重复性分析进一步证实了重复运行下的稳定执行,并行调度降低了延迟方差。这些发现确立了DynAMO作为工业4.0自动化流水线中可扩展、安全且延迟感知的智能体部署的实用蓝图。代码可在以下网址获取:this https URL

英文摘要

While LLM-powered agents offer end-to-end automation for industrial asset lifecycles, real-world Industry 4.0 deployment is hindered by latency, concurrency instability, and safety risks. We present DynAMO (Dynamic Asset Management Orchestration), a deployment-ready engine using a Plan-then-Execute architecture to generate verifiable workflow graphs. DynAMO supports both SequentialWorkflow (topological execution) and ParallelWorkflow (dependency-aware concurrency). By dynamically identifying independent tasks, DynAMO preserves structural correctness and safety while significantly improving efficiency through controlled reasoning overlap. Across six controlled experiments on the AssetOpsBench industrial benchmark, DynAMO demonstrates substantial performance and robustness gains. Parallel execution reduces end-to-end latency by a median of 1.6x over sequential orchestration, rising to 1.8x on highly parallelizable workflows. After instrumenting external tool calls with realistic latencies, a latency decomposition shows that LLM reasoning and orchestration still account for more than 90% of execution time, identifying model inference as the primary system bottleneck. Structured context pruning reduces inference latency by approximately 30%, and DynAMO maintains correct functional behaviour (task completion, agent sequencing, and output quality) while exhibiting graceful degradation under controlled fault injection. Reproducibility analysis further confirms stable execution under repeated runs, with parallel scheduling reducing latency variance. These findings establish DynAMO as a practical blueprint for scalable, safe, and latency-aware agent deployment in Industry 4.0 automation pipelines. Code is available at: https://github.com/kushwaha001/DynAMO

2606.19380 2026-06-19 cs.SE cs.LG 新提交

AgentArmor: A Framework, Evaluation, \& Mitigation of Coding Agent Failures

AgentArmor:编码代理失败的框架、评估与缓解

Kenneth Ge, Andre Assis

发表机构 * Anthropic Fellows Program(Anthropic Fellow 项目) Constellation

AI总结 提出AgentArmor框架,通过系统提示增强、命令分类器、三振政策等机制,缓解编码代理因规范不足、能力错误和工具错误导致的失败,显著提升安全性。

详情
AI中文摘要

软件工程和部署正越来越多地委托给AI编码代理。它们的广泛采用暴露了罕见但极具破坏性的失败模式。在本文中,我们研究这些失败模式源于三种不同的机制:规范不足,即默认模型行为不安全;能力错误,即安全动作可用但模型因偏见或能力限制而未遵循;以及代理工具错误,即模型未能通过工具执行安全动作。我们在8个不同的评估中评估这些机制,每个评估都受实际部署失败的启发,总计20个编码环境和59个合成转录模板。基于此评估,我们提出AgentArmor,一种代理工具修改,以缓解这些错误。通过添加扩展的系统提示、单独的命令分类器、“三振”策略、确定性护栏以及代理编辑自身上下文的工具,我们证明AgentArmor在统计显著数量的样本上更安全。因此,我们为当前编码代理提出具体缓解措施,并为未来代理工具功能提出设计理念。

英文摘要

Software engineering and deployment are increasingly being delegated to AI coding agents. The scale of their adoption is surfacing rare, but highly destructive, failure modes. In this paper, we study these failure modes as stemming from three distinct mechanisms: underspecification, where default model behavior is unsafe; capability errors, where the safe action is available but the model does not adhere to it due to bias or capability limitations; and agent harness errors, where the model fails to execute the safe action through the harness. We evaluate these across 8 different evaluations, each inspired by real-life deployment failures, totaling 20 coding environments and 59 synthetic transcript templates. Based on this evaluation, we propose AgentArmor, an agent harness modification, to mitigate these errors. By adding an extended system prompt, a separate command classifier, a ``3 strikes'' policy, deterministic guardrails, and tools for the agent to edit its own context, we show that AgentArmor is safer across a statistically significant number of samples. Thus, we suggest concrete mitigations for current coding agents and a design philosophy for future agent harness features.

2606.20443 2026-06-19 eess.SY cs.LG cs.SY math.AT 新提交

Topological Data Analysis for High-Dimensional Dynamic Process Monitoring

高维动态过程监测的拓扑数据分析

Angan Mukherjee, Tyler A. Soderstrom, Michael J. Kurtz, Victor M. Zavala

发表机构 * Department of Chemical & Biological Engineering, University of Wisconsin-Madison(威斯康星大学麦迪逊分校化学与生物工程系) ExxonMobil Technology and Engineering(埃克森美孚技术与工程)

AI总结 提出结合拓扑数据分析和机器学习的方法,将多变量时间序列表示为流形,用拓扑描述符总结结构,并用神经常微分方程学习拓扑结构动态演化,实现高效事件检测。

详情
AI中文摘要

实时过程监测需要从高维时间序列数据中提取可操作信息的方法。在这项工作中,我们提出了一种新的过程监测方法,结合了拓扑数据分析(TDA)和机器学习工具。在所提出的方法中,我们将多变量时间序列数据表示为流形,并使用拓扑描述符来总结此类数据的结构;然后,我们使用神经常微分方程来学习系统拓扑结构的动态演化。使用来自工业过程的真实数据,我们表明这种基于轨迹的事件检测方法能有效检测多种类型的事件。我们将该方法与基于重构的方法(如主成分分析和自编码器)以及使用Koopman自编码器的基于轨迹的方法进行了对比。

英文摘要

Real-time process monitoring requires methods that extract actionable information from high-dimensional time-series data. In this work, we present a new approach for process monitoring that combines tools of topological data analysis (TDA) and machine learning. In the proposed approach, we represent multivariate time-series data as manifolds and use topological descriptors to summarize the structure of such data; we then use a neural ordinary differential equation to learn the dynamic evolution of the topological structure of the system. Using real data from an industrial process, we show that this trajectory-based event detection approach is effective at detecting diverse types of events. We contrast this approach against reconstruction-based approaches such as principal component analysis and autoencoders and against a trajectory-based approach that uses Koopman autoencoders.

2606.19895 2026-06-19 math.NA cs.LG cs.NA 新提交

A fast direct solver based neural network for solving PDEs

基于快速直接求解器的神经网络求解偏微分方程

Jashwanth Reddy Kadaru, Vaishnavi Gujjula

发表机构 * Department of Computer Science & Engineering, International Institute of Information Technology Bangalore (IIIT-B), India(计算机科学与工程系,国际信息学院班加罗尔(IIIT-B),印度) Department of Data Science and Artificial Intelligence, International Institute of Information Technology Bangalore (IIIT-B), India(数据科学与人工智能系,国际信息学院班加罗尔(IIIT-B),印度)

AI总结 提出一种学习HODLR矩阵逆运算的神经网络,并扩展为非线性PDE求解算子,实验表明在多种PDE上高效且泛化良好。

Comments 26 pages, 7 Figures, 5 Tables

详情
AI中文摘要

大规模$N$体问题产生的矩阵可以使用层次矩阵高效表示,其关键思想是允许跨矩阵分区层次结构的可接受非对角子矩阵可以通过低秩矩阵很好地近似。HODLR(层次非对角低秩)矩阵是层次矩阵的一个子类,其中递归二分划分的每一级的所有非对角子矩阵都是低秩的。本文提出一种神经网络,基于Ambikasaran和Darve(2013)开发的HODLR矩阵快速直接求解器,学习HODLR矩阵的逆运算。我们进一步通过将部分线性层替换为深度子网络,扩展该架构以学习与PDE相关的非线性解算子。我们通过进行一组全面的实验来展示所提出架构的性能,包括(i)求解线性问题,如第二类Fredholm积分方程,(ii)求解PDE,如非线性薛定谔方程、Burgers方程和稳态达西流方程,(iii)跨不同参数值的泛化研究,(iv)将所提出网络的推理时间与经典数值求解器的运行时间进行比较,以及(v)将所提出网络与一些现有的神经算子学习网络进行比较。

英文摘要

The matrices arising from large scale $N$-body problems can be efficiently represented using hierarchical matrices, whose key idea is that the admissible off-diagonal sub-matrices can be well approximated by low-rank matrices across a hierarchy of matrix partitions. HODLR (Hierarchical Off-Diagonal Low-Rank) matrices are a subclass of hierarchical matrices in which all off-diagonal submatrices at every level of a recursive binary partition are low-rank. In this article, we present a neural network that learns the inverse operation of HODLR matrices based on the fast direct solver for HODLR matrices developed by Ambikasaran and Darve (2013). We further extend the architecture to learn nonlinear solution operators associated with PDEs by replacing some of the linear layers with deep sub-networks. We demonstrate the performance of the proposed architecture by performing a comprehensive set of experiments that include (i) solving a linear problem such as the Fredholm integral equation of the second kind, (ii) solving PDEs such as the nonlinear Schrödinger equation, Burgers' equation, and the steady-state Darcy's flow equation, (iii) generalization study across varying parameter values, (iv) comparing the inference time of the proposed network with the run time of a classical numerical solver, and (v) comparing the proposed network with some of the existing neural operator learning networks.

2606.20504 2026-06-19 quant-ph cs.LG 新提交

Entropy Estimation in Multi-Qutrit Systems via Variational and Classical Neural Networks

多qutrit系统中基于变分和经典神经网络的熵估计

Sai Sakunthala Guddanti, Anil Prabhakar, Ria Rushin Joseph

发表机构 * Centre for Q. Info, Comm. and Computing(量子信息、通信和计算中心) Department of Electrical Engineering, IIT Madras(印度理工学院马德拉斯分校电子工程系) School of Information Technology, Deakin University(德坎大学信息技术学院)

AI总结 本文系统研究了多qutrit量子系统中von Neumann熵的估计,采用变分量子算法和经典卷积神经网络两种方法,发现VQA适用于小系统,而CNN在大系统中更具可扩展性和鲁棒性。

详情
AI中文摘要

我们使用两种互补方法——变分量子算法(VQAs)和经典卷积神经网络(CNNs),在理想(无噪声)量子模拟器上对多qutrit量子系统中的von Neumann熵估计进行了系统研究。对于最多三个qutrit的系统,我们构建并评估了11种硬件高效的SU(3)启发ansatzes。参数扫描表明,在存在足够纠缠的情况下,估计精度主要由可训练参数的数量决定。基于此研究,我们将后续实验的参数数量固定为约120,观察到纠缠门数量超过阈值后仅带来边际改进。对于更大的系统(二至五个qutrit),我们使用在张量积互无偏基测量结果上训练的CNN。该模型实现了准确且稳定的预测,并表现出随系统大小系统性改善的性能,其中二qutrit系统的误差最高,五qutrit系统的误差最低。值得注意的是,仅使用全状态层析所需测量的12.5%就足以使四和五qutrit系统的90百分位绝对误差达到约0.13-0.16 nat。CNN模型还对散粒噪声具有鲁棒性,并能很好地泛化到分布外状态。总体而言,在我们研究的模拟设置中,结果表明了实用方法的转变:VQAs对小系统有效,而基于CNN的估计器为更大的qutrit系统提供了更好的可扩展性和鲁棒性。

英文摘要

We present a systematic study of von Neumann entropy estimation in multi-qutrit quantum systems using two complementary approaches: variational quantum algorithms (VQAs) and classical convolutional neural networks (CNNs), evaluated using an ideal (noise-free) quantum simulator. For systems up to three qutrits, we construct and evaluate 11 hardware-efficient SU(3)-inspired ansatzes. A parameter sweep shows that estimation accuracy is primarily determined by the number of trainable parameters, provided sufficient entanglement is present. Based on this study, we fix the parameter count to approximately 120 for subsequent experiments, observing that increasing entangling-gate counts beyond a threshold yields only marginal improvements. For larger systems (two to five qutrits), we use a CNN trained on measurement outcomes from tensor-product mutually unbiased bases. The model achieves accurate and stable predictions and exhibits a systematic improvement in performance with system size, with the highest errors for two-qutrit systems and the lowest for five-qutrit systems. Notably, using only 12.5% of the measurements required for full state tomography is sufficient to reach 90th-percentile absolute errors of approximately 0.13-0.16 nats for both four- and five-qutrit systems. The CNN model is also robust to shot noise and generalizes well to out-of-distribution states. Overall, within the simulated settings studied here, our results indicate a transition in practical methods: VQAs are effective for small systems, while CNN-based estimators offer improved scalability and robustness for larger qutrit systems.

2606.20344 2026-06-19 quant-ph cs.DC cs.LG 新提交

Quantum ring all-reduce: communication and privacy advantages for distributed learning

量子环全归约:分布式学习的通信与隐私优势

María Gragera Garcés, Lirandë Pira

发表机构 * University of Edinburgh(爱丁堡大学) Centre for Quantum Technologies(量子技术中心)

AI总结 提出量子环全归约协议,利用预共享纠缠和超密编码将每链路在线通信量减半,并通过验证纠缠实现信息论安全的可组合ε-安全聚合,同时获得通信与隐私优势。

Comments 23 pages, 1 figure

详情
AI中文摘要

机器学习模型已扩展到前所未有的规模,使得跨分布式设备的训练成为该领域的事实标准。在这项工作中,我们探讨量子通信如何使分布式训练在通信效率和信息论隐私方面都更具优势,适用于经典和量子学习模型。环全归约是大规模分布式训练的基础通信原语。我们提出一种量子版本,通过预共享纠缠和超密编码,将每链路在线通信量减少一个可证明最优的因子二,且无需改变学习模型或梯度计算。除了带宽优势,该原语还能实现任何经典协议在信息论上不可能实现的隐私保证,通过验证纠缠以GHZ副本的2倍开销实现可组合的ε-安全聚合。我们的混合量子-经典通信架构为大规模分布式训练同时带来通信和安全优势,无论学习本身是量子还是经典。最后,我们描述了在带宽约束下服务器到客户端通信中梯度冲突检测的量子优势,该设置出现在环全归约完成后,当完整梯度广播到外部客户端不可行时。该问题的两个变体呈现出不同的分离。对于基于间隔的对齐测试(\textsc{GapIP}_{\tau}),量子优势在间隔参数上是二次的:\widetilde{O}({\tau}^{-1}\log P) 量子比特对比 \widetilde{O}(\min(\{\tau}^{-2},P)) 比特。对于针对私有参数匹配的符号一致性审计(\textsc{TieAudit}_{\epsilon}),优势表现为通信复杂度的指数级分离:\Omega(\sqrt{P}) 比特,而 O({\epsilon}^{-2}\log P) 量子比特就足够了。

英文摘要

Machine learning models have scaled to unprecedented sizes, making training across distributed devices the de facto standard in the field. In this work, we explore how quantum communications can make distributed training both more communication-efficient and information-theoretically private, for both classical and quantum learning models. Ring all-reduce is the foundational communication primitive for large-scale distributed training. We present a quantum version that reduces per-link online communication by a provably optimal factor of two using pre-shared entanglement and superdense coding, without requiring the learning model or gradient computation to change. Beyond bandwidth, the primitive enables privacy guarantees that are information-theoretically impossible for any classical protocol, achieving composable ε-secure aggregation, via verified entanglement, at a 2x overhead in GHZ copies. Our hybrid quantum-classical communication architecture yields simultaneous communication and security advantages for large scale distributed training, regardless of whether the learning itself is quantum or classical. Finally, we characterise quantum advantages in gradient conflict detection for server-to-client communication under bandwidth constraints, a setting that arises after ring all-reduce is completed, when full gradient broadcast to external clients is infeasible. Two variants of the problem admit different separations. For margin-based alignment testing (\textsc{GapIP}_τ), the quantum advantage is quadratic in the margin parameter: \widetilde{O}(τ^{-1}\log P) qubits versus \widetilde{O}(\min(\τ^{-2},P)) bits. For sign-consistency auditing against a private parameter matching (\textsc{TieAudit}_ε), the advantage represents an exponential separation in communication complexity: Ω(\sqrt{P}) bits whereas O(ε^{-2}\log P) qubits suffice.

2606.19947 2026-06-19 quant-ph cs.LG 新提交

QMaxCal: Path-Space Regularization for Open Quantum Control via Girsanov's Theorem

QMaxCal: 基于 Girsanov 定理的开环量子控制路径空间正则化

Merijn Moody, Zier Mensch, Miranda C. N. Cheng, Peter G. Bolhuis, Max Welling

发表机构 * Institute of Physics, University of Amsterdam, Netherlands(阿姆斯特丹大学物理研究所) Van 't Hoff Institute for Molecular Sciences, University of Amsterdam, Netherlands(阿姆斯特丹大学范·霍夫分子科学研究所) Dutch Institute for Emergent Phenomena, University of Amsterdam, Netherlands(阿姆斯特丹大学新兴现象研究所) Institute for Mathematics, Academia Sinica, Taiwan(台湾“中华学术院”数学研究所) Korteweg-de Vries Institute for Mathematics, University of Amsterdam, Netherlands(阿姆斯特丹大学柯特韦斯数学研究所) Amsterdam Machine Learning Lab, University of Amsterdam, Netherlands(阿姆斯特丹大学机器学习实验室) Department of Physics, National Taiwan University, Taiwan(台湾国立台湾大学物理系)

AI总结 针对开放量子系统退相干问题,利用 Girsanov 定理推导 KL 散度的可微估计器,提出两种正则化项以最小化退相干影响,在多种量子系统中优于未正则化的梯度方法和强化学习基线。

Comments 26 pages, 6 figures. ICML 2026 AI4Physics Workshop

详情
AI中文摘要

在存在退相干的条件下,可靠的量子控制需要能够对抗环境噪声对受控动力学影响的策略。连续监测下的开放量子系统产生经典测量记录,其漂移依赖于系统所经历的噪声;共享相同退相干通道的两个演化的记录仅在此漂移上有所不同,因此 Girsanov 定理给出了它们轨迹分布之间 KL 散度的闭式、可微估计器。我们用两个物理动机的参考度量实例化该估计器,得到两个正则化项,它们都将系统驱动到退相干效应最小的状态:Wiener KL (KL_W),在噪声模型的某些条件下经验上更有效;以及漂移方差正则化项 (R_DV),适用于所有噪声模型。两者在性质上不同于现有的控制通量或平滑性惩罚:它们惩罚控制对退相干通道的可观测后果,而非控制幅度本身。这些正则化项在一系列开放量子系统中优于未正则化的基于梯度和强化学习的基线——包括单量子比特和多量子比特基准测试,以及一个校准到已发表的 IBM Kingston 处理器快照的多量子比特链——在多个评估维度上:最终态保真度、对假设噪声模型失配的鲁棒性(在训练噪声下增益从 +17 个百分点增长到 2.5 倍噪声失配下的 +27 个百分点),以及禁止态的占据。正则化项将不保真度降低高达 50%,在校准的 IBM Kingston 链上获得约 16% 的增益。

英文摘要

Reliable quantum control in the presence of decoherence requires policies that combat the effect of environmental noise on the controlled dynamics. Open quantum systems under continuous monitoring generate classical measurement records whose drift depends on the noise experienced by the system; the records of two evolutions sharing the same decoherence channels differ only in this drift, so Girsanov's theorem yields a closed-form, differentiable estimator of the KL divergence between their trajectory distributions. We instantiate this estimator with two physically motivated reference measures, yielding two regularizers that both drive the system toward states where the effects of decoherence are minimal: the Wiener KL (KL_W), which is empirically more effective under certain conditions on the noise model, and the drift-variance regularizer (R_DV), which works for all noise models. Both are qualitatively distinct from existing penalties on control fluence or smoothness: they penalize the observable consequences of control on the decoherence channels rather than the control amplitude itself. The regularizers outperform unregularized gradient-based and reinforcement-learning baselines across a range of open quantum systems -- including single- and multi-qubit benchmarks and a multi-qubit chain calibrated to a published snapshot of the IBM Kingston processor -- along several axes of evaluation: final-state fidelity, robustness to mismatch in the assumed noise model (gains grow from +17 pp at training noise to +27 pp under 2.5x noise mismatch), and occupation of forbidden states. The regularizers reduce infidelity by up to 50%, with ~16% gains on the calibrated IBM Kingston chain.

2606.19486 2026-06-19 quant-ph cs.IT cs.LG math.IT 新提交

Optimal Ansatz-free Hamiltonian Learning In Situ

无假设哈密顿量的最优原位学习

Taiqi Zhou, Weiyuan Gong

发表机构 * Department of Information Engineering, The Chinese University of Hong Kong(香港中文大学信息工程系) John A. Paulson School of Engineering and Applied Sciences, Harvard University(哈佛大学约翰·A·保罗森工程与应用科学学院) California Institute of Technology(加州理工学院)

AI总结 提出一种无需控制、无需辅助比特的算法,仅用泡利乘积态制备和测量,以最优总演化时间学习无假设哈密顿量,适用于近中期量子实验。

Comments 51 pages, 2 figures

详情
AI中文摘要

描述控制量子系统的哈密顿量特征,是量子设备校准、信号传感和纠错的基本子程序。近期工作提出了协议,通过实时演化实现无假设哈密顿量的最优海森堡极限学习,无需完全指定相互作用结构。然而,这些协议依赖于带有交错探测和控制的深电路以及极短的时间分辨率,使其难以在近中期原位量子实验中实现。本文提出一种计算高效、无需控制、无需辅助比特的算法,仅使用泡利乘积态制备和测量,在总演化时间 $\Theta(\frac{\Lambda}{\epsilon^2}\log(\frac{\Lambda}{\epsilon}))$ 内学习无假设哈密顿量 $H$(满足 $||H||\leq\Lambda$)。该算法的演化时间成本对于任何无控制协议是最优的,因为我们进一步证明了 $\Omega(\frac{\Lambda}{\epsilon^2}\log(\frac{\Lambda}{\epsilon}))$ 的下界。技术上,我们的方法引入了一个随机采样框架,结合了带限核时间采样和用于哈密顿量结构学习的位移筛。特征探测时间分辨率仅依赖于 $\Lambda$ 而非 $\varepsilon$,这使得我们的协议在传感和校准的高精度场景中特别有吸引力。我们还表明,当哈密顿量在校准后是局域的时,该算法在存在状态制备和测量(SPAM)噪声的情况下保持相同的渐近总演化时间。我们的结果展示了实验友好型哈密顿量学习的基本成本,并为近中期量子平台的严格原位表征提供了实用途径。

英文摘要

Characterizing the features of a Hamiltonian that governs a quantum system serves as a fundamental subroutine of quantum device calibration, signal sensing, and error correction. Recent works proposed protocols have achieved the optimal Heisenberg-limited scaling learning ansatz-free Hamiltonians from their real-time evolutions without fully specifying interaction structures. However, these protocols rely on both deep circuits with interleaving probes and control, and extremely short time resolution, making them difficult to implement on near- and intermediate-term in situ quantum experiments. In this work, we propose a computationally efficient, control-free, and ancilla-free algorithm that uses only Pauli product state preparation and measurement, and learns an ansatz-free Hamiltonian $H$ with $||H||\leqΛ$ in total evolution time of $Θ(\fracΛ{ε^2}\log(\fracΛε))$. The evolution time cost of our algorithm is optimal for any control-free protocols as we further prove a lower bound of $Ω(\fracΛ{ε^2}\log(\fracΛε))$. Technically, our method introduces a randomized-sampling framework that combines band-limited kernel-based time sampling with a displacement sieve for Hamiltonian structure learning. The characteristic probe time resolution depends only on $Λ$ instead of $\varepsilon$, which makes our protocol especially appealing in the high-precision regime for sensing and calibration applications. We also show that the algorithm maintains the same asymptotic total evolution time in the presence of state-preparation-and-measurement (SPAM) noise when the Hamiltonian is local after calibration. Our results demonstrate the fundamental cost of experimentally friendly Hamiltonian learning and provide a practical route to rigorous in situ characterization of near-term quantum platforms.

2606.19912 2026-06-19 math.NA cs.LG cs.NA physics.comp-ph 新提交

Structure-Oriented Randomized Neural Networks for Poisson-Nernst-Planck and Poisson-Nernst-Planck-Navier-Stokes Systems

面向结构的随机神经网络用于泊松-能斯特-普朗克和泊松-能斯特-普朗克-纳维-斯托克斯系统

Yunlong Li, Fei Wang

发表机构 * School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an, Shaanxi(西安交通大学数学与统计学院,西安,陕西)

AI总结 提出结构导向随机神经网络(SO-RaNN)框架,通过解耦线性化子问题、逐点截断保持浓度正性、离散质量缩放因子和SAV后处理修正,实现PNP和PNP-NS系统的高效求解,并理论推导残差估计和收敛性。

详情
AI中文摘要

我们开发了一种面向结构的随机神经网络框架,称为SO-RaNN,用于泊松-能斯特-普朗克(PNP)系统和泊松-能斯特-普朗克-纳维-斯托克斯(PNP-NS)系统。解耦的线性化子问题通过随机神经网络在时空框架中迭代求解。对于浓度变量,使用逐点截断在数值层面强制正性,并在选定的修正时刻计算离散质量缩放因子并在时间上插值,以确保在这些时刻精确匹配质量并促进近似质量守恒。为了引入辅助离散耗散机制,我们进一步采用SAV型后处理修正,该修正使得SAV辅助变量在理想SAV更新下具有单调性。对于PNP-NS系统,使用保结构随机神经网络(SP-RaNN)处理速度场,使得速度近似通过构造满足逐点不可压缩约束。在理论方面,我们推导了线性化子问题的原始未修正RaNN求解器的残差估计,为PNP系统的原始外Picard迭代制定了条件性局部时间收敛结果,并分析了数值层面的正性修正以及质量修正和SAV后处理步骤。对于PNP-NS系统,我们建立了SP-RaNN空间的逼近结果,并给出了相应线性化Oseen型问题的条件性误差陈述。数值实验展示了源驱动制造测试中的逼近精度,并说明了预期中的数值层面正性修正、选定时刻质量匹配、基于最终规范固定势的计算自由能曲线以及基准测试中的无散度逼近。

英文摘要

We develop a structure-oriented randomized neural network framework, termed SO-RaNN, for the Poisson-Nernst-Planck (PNP) system and the Poisson-Nernst-Planck-Navier-Stokes (PNP-NS) system. The decoupled linearized subproblems are solved iteratively by randomized neural networks in a space-time framework. For the concentration variables, a pointwise cut-off is used to enforce positivity at the value level, and discrete mass-scaling factors are computed at selected correction instants and interpolated in time, so as to ensure exact mass matching at those instants and to promote approximate mass preservation between them. To introduce an auxiliary discrete dissipation mechanism, we further employ an SAV-type post-processing correction, which yields monotonicity of the SAV auxiliary variable under the ideal SAV update. For the PNP-NS system, a structure-preserving randomized neural network (SP-RaNN) is used for the velocity field, so that the velocity approximation satisfies the incompressibility constraint pointwise by construction. On the theoretical side, we derive residual-based estimates for the raw, uncorrected RaNN solvers of the linearized subproblems, formulate a conditional local-in-time convergence result for the raw outer Picard iteration of the PNP system, and analyze the value-level positivity correction together with the mass-correction and SAV post-processing steps. For the PNP-NS system, we establish an approximation result for the SP-RaNN space and provide a conditional error statement for the corresponding linearized Oseen-type problem. Numerical experiments demonstrate approximation accuracy in the source-driven manufactured tests and illustrate the intended value-level positivity correction, selected-time mass matching, computed free-energy curves based on the final gauge-fixed potential, and divergence-free approximation in benchmark tests.

2606.19767 2026-06-19 eess.IV cs.CV physics.med-ph 新提交

Contour-Constrained Deformable Registration with Parameter Characterization for Head and Neck Surgical Guidance

面向头颈外科引导的带参数表征的轮廓约束可变形配准

Qingyun Yang, Jon S. Heiselman, Ayberk Acar, Morgan J. Ringel, Michael I. Miga, Matthieu Chabanas, Michael C. Topf, Jie Ying Wu

发表机构 * Vanderbilt University(范德比尔特大学) Vanderbilt University Medical Center(范德比尔特大学医学中心)

AI总结 提出一种基于正则化Kelvinlet基函数的可变形配准框架,通过表面点云、基准标记和轮廓约束校正术后组织变形,在9例头颈标本上将配准误差从刚性配准的11.11mm降至5.62mm,降幅达49.41%。

详情
AI中文摘要

全球每年新增89万例头颈部鳞状细胞癌,其复发率在实体恶性肿瘤中最高。尽管冰冻切片分析是术中切缘评估的标准方法,但由于切除标本与切除床之间的对准不精确,加上切除后黏膜组织收缩,准确地将检测到的阳性切缘重新定位到切除床上仍然具有挑战性。我们提出了一种生物力学驱动的可变形配准框架,用于校正术后组织变形以提供术中引导。该方法基于正则化Kelvinlet基函数的可变形配准方法,将3D标本网格配准到术中切除床点云。配准匹配表面点云、基准标记和边界轮廓约束,直接惩罚标本与切除床边界之间的垂直距离一致性。在来自皮肤、颊粘膜和舌部位的9个标本上,使用刚性配准的整体平均目标配准误差为$11.11 \pm 4.07$ mm,使用无轮廓约束的可变形配准则降至$8.20 \pm 2.68$ mm(降低26.19%)。所提出的轮廓约束可变形配准进一步将误差降至$5.62 \pm 2.28$ mm,相对于刚性配准降低了49.41%。我们在临床最具挑战性的舌标本中观察到最大降幅。我们还进行了系统的两阶段参数搜索,以表征表面配准、基准对应、轮廓约束和应变能正则化的相对重要性。该搜索表明,对于具有大侧向变形的组织类型,轮廓权重主导配准精度,而算法在广泛的参数组合范围内均可运行。

英文摘要

With 890,000 annual new cases globally, head and neck squamous cell carcinoma has one of the highest recurrence rates among solid malignancies. Although frozen section analysis is the standard of care for intraoperative margin assessment, accurately relocating detected positive margins on the resection bed remains challenging due to imprecise alignment between resected specimens and their resection bed, compounded by post-resection mucosal tissue shrinkage. We present a biomechanics-driven deformable registration framework that corrects post-resection tissue deformation to provide intraoperative guidance. Our approach registers 3D specimen meshes to intraoperative resection bed point clouds using a deformable registration approach based on regularized Kelvinlet basis functions. The registration matches surface point clouds, fiducial landmarks, and boundary contour constraints that directly penalize perpendicular distance-to-agreement between specimen and resection bed boundaries. Across nine specimens from skin, buccal mucosa, and tongue sites, the overall mean target registration error was $11.11 \pm 4.07$ mm using rigid registration, which decreased to $8.20 \pm 2.68$ mm (26.19\% reduction) using deformable registration without contour constraint. The proposed contour-constrained deformable registration further reduced the error to $5.62 \pm 2.28$ mm, a 49.41\% reduction relative to rigid registration. We observed the largest reduction in the most clinically challenging tongue specimens. We also performed a systematic two-stage parameter search to characterize the relative importance of surface alignment, fiducial correspondences, contour constraint, and strain energy regularization. This search revealed that contour weighting dominates registration accuracy for tissue types with large lateral deformation, while the algorithm operates over a broad range of parameter combinations.

2606.20437 2026-06-19 hep-ex cs.LG 新提交

HEPTv2: End-to-End Efficient Point Transformer for Charged Particle Reconstruction

HEPTv2:用于带电粒子重建的端到端高效点变换器

Siqi Miao, Shitij Govil, Jack P. Rodgers, Mia Liu, Javier Duarte, Shih-Chieh Hsu, Yuan-Tang Chou, Pan Li

发表机构 * School of Electrical and Computer Engineering, Georgia Institute of Technology(佐治亚理工学院电气与计算机工程学院) Department of Physics and Astronomy, Purdue University(普渡大学物理与天文学系) Department of Physics, University of California San Diego(加州大学圣地亚哥分校物理系) Department of Physics, University of Washington(华盛顿大学物理系)

AI总结 提出HEPTv2,一种端到端点变换器架构,通过局部敏感哈希编码和扇区化解码,无需图构建即可从探测器击中点直接重建粒子轨迹,在TrackML上以0.8%假率实现98.6%追踪效率,延迟仅15ms。

详情
AI中文摘要

带电粒子追踪——从稀疏探测器测量中重建轨迹——是一个基础的高能物理推理问题,也是在极端组合歧义下学习的典型例子。在高亮度大型强子对撞机(HL-LHC)上,尽管碰撞密度前所未有,追踪必须保持准确和高效。图神经网络表现强劲,但图构建和处理带来了大量成本,而基于变换器的方法依赖辅助阶段,阻碍了端到端优化。为解决这一问题,我们提出了HEPTv2,一种端到端点变换器架构,在一个可训练管道中从探测器击中点重建轨迹。HEPTv2结合了局部感知点编码器和轨迹解码器,无需图构建、聚类或过滤即可预测完整轨迹。编码器在探测器坐标空间中使用局部敏感哈希,以保留追踪相关几何结构,同时实现高效的局部注意力。解码器通过扇区化解码和联合编码器-解码器监督下的直接击中到轨迹预测来消除歧义,使整个管道能够端到端优化。在TrackML上,HEPTv2以0.8%的假率实现了98.6%的双多数追踪效率,同时在NVIDIA A100 GPU上每个事件仅需约15毫秒推理时间和0.4 GB峰值内存。对于最多包含$5\ imes10^5$个击中点的事件,延迟和内存大致线性扩展。HEPTv2在精度-延迟权衡中建立了新的最先进水平,相比之前最强的变换器效率提升4.5%,相比优化的基于图管道提升1.1-2.2%,同时延迟分别降低7倍和38-52倍。这些结果表明,端到端变换器能够提供HL-LHC实时粒子重建所需的精度和效率。

英文摘要

Charged-particle tracking -- reconstructing trajectories from sparse detector measurements -- is a fundamental high-energy-physics inference problem and a canonical example of learning under extreme combinatorial ambiguity. At the High-Luminosity Large Hadron Collider (HL-LHC), tracking must remain accurate and efficient despite unprecedented collision densities. Graph neural networks perform strongly, but incur substantial costs from graph construction and processing, while transformer-based approaches rely on auxiliary stages that prevent end-to-end optimization. To address this, we present HEPTv2, an end-to-end point-transformer architecture that reconstructs tracks from detector hits in one trainable pipeline. HEPTv2 combines a locality-aware point encoder with a track decoder that predicts complete trajectories without graph-building, clustering, or filtering. The encoder uses locality-sensitive hashing in detector coordinate space to preserve tracking-relevant geometry while enabling efficient local attention. The decoder resolves ambiguities through sectorized decoding and direct hit-to-track prediction under joint encoder-decoder supervision, allowing the full pipeline to be optimized end-to-end. On TrackML, HEPTv2 achieves 98.6% double-majority tracking efficiency at a 0.8% fake rate, while requiring only $\sim$15~ms inference time and 0.4~GB peak memory per event on a NVIDIA A100 GPU. Latency and memory scale approximately linearly for events with up to $5\times10^5$ hits. HEPTv2 establishes a new state of the art in the accuracy-latency trade-off, improving efficiency by 4.5% over the strongest prior transformer and by 1.1--2.2% over optimized graph-based pipelines, while reducing latency by factors of 7 and 38--52, respectively. These results show end-to-end transformers can deliver the accuracy and efficiency required for real-time particle reconstruction at the HL-LHC.

2606.19539 2026-06-19 astro-ph.SR cs.AI 新提交

Review of Machine Learning Models for Solar Energetic Particle Prediction

太阳高能粒子预测的机器学习模型综述

Spiridon Kasapis, Pouya Hosseinzadeh, Kathryn Whitman, Ricky Egeland, Manolis Georgoulis, Angelos Vourlidas, Athanasios Papaioannou, Eleni Lavasa, Anastasios Anastasiadis, Giorgos Giannopoulos, Andres Munoz-Jaramillo, Bala Poduval, Irina N. Kitiashvili, Alexander G. Kosovichev, Viacheslav Sadykov, Soukaina Filali Boubrahimi, Tate T. Hutchins, Hameedullah A. Farooki, Manuel E. Cuesta, Leng Y. Khoo, Sungmin Pak, Robert Czarnota, Jamie S. Rankin, Jamey Szalay, Mitchell M. Shen, Georgios Livadiotis, Zigong Xu, David J. McComas, Nikolaos Sarlis, Dionissios Hristopulos, Arik Posner, Alec J. Engell, Mohammed AbuBakr Ali, Ali G. A. Abdelkawy, Abdelrazek M. K. Shaltout, M. M. Beheary, Christina O. Lee, Sigiava Aminalragia-Giamini, Constantinos Papadimitriou, Ingmar Sandberg, Savvas Raptis, Shah Muhammad Hamdi, Monica Laurenza, Mirko Stumpo, Sumanth A. Rotti, India Jackson, Aatiya Ali, Atilim Gunes Baydin, Nathan Schwadron, Subhamoy Chatterjee, Maher A. Dayeh, Gelu M. Nita, Patrick M. O'Keefe, Chun Jie Chong, Paul Kosovich, Russell D. Marroquin, Berkay Aydin, Petrus C. Martens, Lulu Zhao, Yang Chen, Yian Yu, Monica G. Bobra, Ward Manchester, Tamas Gombosi, Ming Zhang, Jesse Torres, Philip K. Chan, Mohamed Nedal, Kamen Kozarev, Peijin Zhang, Kimberly Moreland, Hazel M. Bain, Samuel Hart, Michael J. Starkey, Alan G. Ling, Simone Benella

发表机构 * Department of Astrophysical Sciences, Princeton University, Princeton, NJ, USA Computational Physics Branch, NASA Ames Research Center, Moffett Field, CA, USA Department of Computer Science, Utah State University, Logan, UT, USA Space Radiation Analysis Group, NASA Johnson Space Center, Houston, TX, USA Johns Hopkins Applied Physics Lab, 11100 Johns Hopkins Rd, Laurel, MD 20723, United States Research Center for Astronomy Applied Mathematics of the Academy of Athens, 4 Soranou Efesiou Street, Athens 11527, Greece Institute for Astronomy, Astrophysics, Space Applications Southwest Research Institute, Boulder, CO, USA Space Science Center, University of New Hampshire, Durham, NH, USA Department of Physics, New Jersey Institute of Technology, Newark, NJ, USA Astronomy Department, Georgia State University, Atlanta, GA, USA Department of Computer Science, Princeton University, Princeton, NJ, USA Department of Mathematics, Rowan University, Glassboro, NJ, USA Astronomy, California Institute of Technology, Pasadena, CA, USA Department of Physics, National Kapodistrian University of Athens, Athens, Greece School of Electrical Computer Engineering, Technical University of Crete, Chania, Greece Department of Astronomy Meteorology, Faculty of Science, Al-Azhar University, Cairo, Egypt Space Sciences Lab, University of California, Berkeley, CA, USA Research Consultancy, Athens, Greece Institute for Space Astrophysics Department of Physics Astronomy, Georgia State University, Atlanta, GA 30303, USA Aryabhatta Research Institute of Observational Sciences (ARIES), Manora Peak, Nainital-263001, Uttarakhand, India Department of Computer Science, Oxford University, Oxford, England Southwest Research Institute, San Antonio, TX, USA Computer Science Department, New Jersey Institute of Technology, Newark, NJ, USA Department of Physics, University of California San Diego, La Jolla, CA 92093, USA Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA Department of Climate Engineering, University of Michigan, Ann Arbor, MI, USA Department of Statistics, University of Michigan, Ann Arbor, MI, USA Department of Electrical Engineering Computer Science, Florida Institute of Technology, Melbourne, FL, USA Astrophysics Section, School of Cosmic Physics, Dublin Institute for Advanced Studies, DIAS Dunsink Observatory, Dublin D15 XR2R, Ireland Institute of Astronomy of the Bulgarian Academy of Sciences, Sofia, Bulgaria Center for Solar-Terrestrial Research, New Jersey Institute of Technology, Newark, NJ 07102, USA Cooperative Programs for the Advancement of Earth System Science, University Corporation for Atmospheric Research, Boulder, CO, USA CIRES, University of Colorado Boulder, Boulder, CO, USA Space Weather Prediction Center, NOAA, Boulder, CO, USA Astronomy, College of Science, The University of Texas at San Antonio, San Antonio, TX, USA Space Weather Prediction Center, National Oceanic The University of Texas at San Antonio, San Antonio, TX, USA Environmental Research, Inc., MA, USA

AI总结 综述了用于太阳高能粒子预测的机器学习模型,包括数据集、架构、输入输出比较,并提出了未来研究建议。

Comments Review Paper, Maine text: 23 pages, References: 5 pages, Appendix: 42 pages

详情
AI中文摘要

太阳高能粒子事件因其对航空、航天器电子设备以及地球磁层外人类任务的显著辐射危害而日益受到关注。从科学角度来看,SEP事件之所以引人入胜,是因为它们源于从太阳表面和日冕延伸到日光层的一系列物理过程,提供了对广泛适用于天体物理学的粒子加速和传输机制的洞察。因此,提高我们理解和预测SEP事件的能力,对于加深对这些机制的认识以及保护空间技术和探索至关重要。传统上,研究人员使用基于物理的模拟和经验方法对SEP进行建模。最近,机器学习已成为理解和预测SEP事件的新工具。本文旨在回顾当前可用于SEP预测的机器学习模型,识别用于训练的数据集,比较它们的架构、输入和输出,并基于这些见解,为未来研究概述良好实践和建议。

英文摘要

Solar energetic particle (SEP) events have attracted increasing attention due to their significant radiation hazards for aviation, spacecraft electronics, and human missions beyond Earth's magnetosphere. From a scientific perspective, SEP events are intriguing because they arise from a set of physical processes extending from the solar surface and corona through the heliosphere, offering insight into particle acceleration and transport mechanisms that are widely applicable across astrophysics. Therefore, advancing our ability to understand and predict SEP events is essential both for deepening our knowledge of such mechanisms and for safeguarding space technologies and exploration. Traditionally, researchers have modeled SEPs using physics-based simulations and empirical methods. More recently, machine learning (ML) has emerged as a new tool for understanding and predicting SEP events. The purpose of this manuscript is to review the currently available ML models for SEP prediction, identify the datasets used for training, compare their architectures, inputs, and outputs, and, based on these insights, outline good practices and recommendations for future research.

2606.18436 2026-06-19 stat.ML cs.LG 新提交

Pointwise is Pointless? A Multimodal Ablation Study for Precipitation Nowcasting with Graph Neural Networks

逐点是否无意义?基于图神经网络的降水临近预报的多模态消融研究

Ophélia Miralles, Máté Mile, Christoffer Artturi, Thomas Nipen, Ivar Seierstad

发表机构 * Norwegian Meteorological Institute(挪威气象研究所)

AI总结 本研究通过多模态图神经网络系统,消融分析雷达、数值预报、地面观测、卫星数据及训练损失对降水临近预报的影响,发现各模态分别改善不同方面,点观测虽提升局部但需结合损失函数和不确定性表示才能优化雷达场。

详情
AI中文摘要

稀疏点观测在降水临近预报中日益可用,但尚不清楚它们能在多大程度上改善密集雷达场预报。我们通过北欧雷达区域的多模态图神经网络临近预报系统部分回答了这个问题。该模型预测未来两小时内每五分钟的降雨率,并采用雷达历史、MEPS数值天气预报、Netatmo地面观测、MSG卫星通道、随机噪声和基于CRPS的集合损失的不同组合进行训练。本研究设计为对操作相关信源和训练目标的消融。我们比较了仅雷达、NWP信息、站点信息、卫星信息、噪声增强和基于CRPS的配置,使用雷达网格、站点位置、降雨起始的互补诊断,以及oracle、位移和幅度评分。结果表明,每个信源改善了预报问题的不同方面。MEPS稳定了仅雷达外推,Netatmo观测改善了局部站点和起始诊断,卫星预测因子减少了某些站点级偏差,但在确定性使用时可能过早激活降雨。基于CRPS的配置提供了最一致的雷达网格增益,而卫星与CRPS的组合设置给出了最佳的整体oracle/DAS评分。这些结果不支持点观测对临近预报无用的结论,但表明局部观测技能和空间相干雷达场技能是不同的目标。实际意义是,稀疏观测可以提供有用的局部约束,但它们对雷达类场的益处取决于训练损失、不确定性表示以及观测支持在模型中的编码方式。

英文摘要

Sparse point observations are increasingly available for precipitation nowcasting, but it is unclear how much they improve dense radar-field forecasts. We partially address this question with a multimodal graph neural network nowcasting system over the Nordic radar domain. The model predicts rain rate every five minutes up to two hours ahead and is trained with different combinations of radar history, MEPS numerical weather prediction, Netatmo surface observations, MSG satellite channels, stochastic noise, and CRPS-based ensemble losses. The study is designed as an ablation of operationally relevant information sources and training objectives. We compare radar-only, NWP-informed, station-informed, satellite-informed, noise-augmented, and CRPS-based configurations using complementary diagnostics on the radar grid, at station locations, for rain onset, and through oracle, displacement, and amplitude scores. The results show that each source improves a different part of the forecast problem. MEPS stabilises radar-only extrapolation, Netatmo observations improve local station and onset diagnostics, and satellite predictors reduce some station-level biases but may activate rain too early when used deterministically. CRPS-based configurations provide the most consistent radar-grid gains, while the combined satellite and CRPS setup gives the best overall oracle/DAS score. These results do not support the conclusion that point observations are uninformative for nowcasting, but they show that local observational skill and spatially coherent radar-field skill are distinct targets. The practical implication is that sparse observations can provide useful local constraints, but their benefit for radar-like fields depends on the training loss, uncertainty representation, and how observation support is encoded in the model.

2606.18996 2026-06-19 cs.CR cs.AI 新提交

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

TRAP:任务完成与主动隐私提取抵抗基准

Moon Ye-Bin, Nam Hyeon-Woo, Baek Seong-Eun, Yejin Yeo, Tae-Hyun Oh

发表机构 * Dept. of Electrical Engineering, POSTECH(POSTECH电子工程系) Grad. School of Artificial Intelligence, POSTECH(POSTECH人工智能研究生院) School of Computing, KAIST(韩国科学技术院计算机学院)

AI总结 提出TRAP基准,评估智能体在文档密集型任务中平衡任务准确性与隐私泄露的能力,发现所有模型均存在非平凡泄露,并证明基于提示的防御无法同时实现高任务成功率和零泄露概率,提出结构化的私有字段隔离方法。

详情
AI中文摘要

智能体越来越多地部署在文档密集型工作流中,其中敏感私人信息不是边缘情况而是常规输入,例如,预订航班的智能体需要护照号码。在这种情况下,智能体必须使用私人信息准确完成任务,同时绝不在其响应中暴露这些信息,因为它无法验证键盘前实际是谁。这两个义务存在根本性矛盾。一个能够使用私人信息完成任务的模型,同样可能被诱导泄露这些信息。为了评估任务准确性与隐私泄露之间的权衡,我们引入了任务完成与主动隐私提取抵抗(TRAP)。每个场景包括一个包含私人信息的文档、一个要求智能体使用私有字段调用正确工具的任务查询,以及一个试图以自然语言引出相同信息的攻击查询。评估了涵盖前沿专有和开源模型的22个模型,我们发现所有模型系列都表现出非平凡的泄露,并且指令遵循能力与泄露率相关。现有的基于提示的防御减少了泄露,但以显著降低任务准确性为代价。提示优化未能摆脱这种权衡。我们证明这种失败并非偶然。对于任何基于softmax的模型,没有软约束防御(例如基于提示的防御)能够同时实现高任务成功率和零泄露概率。受这一不可能性结果的启发,我们提出了结构化的私有字段隔离,该方法在私有字段到达模型之前用哈希键替换它们。这种方法在保持任务准确性的同时很大程度上防止了泄露。

英文摘要

Agents are increasingly deployed in document-intensive workflows where sensitive private information is not an edge case but a routine input, e.g., an agent booking a flight needs passport numbers. In such settings, the agent must use private information to complete tasks accurately while never exposing it in its responses, because it cannot verify who is actually at the keyboard. These two obligations are in fundamental tension. A model capable enough to use private information for task completion can, by the same capability, be induced to reveal it. To evaluate the trade-off of task accuracy and privacy leakage, we introduce Task-completion and Resistance to Active Privacy-extraction (TRAP). Each scenario includes a document containing private information, a task query that requires the agent to invoke the correct tool using private fields, and an attack query that attempts to elicit the same information in natural language. Evaluating 22 models spanning frontier proprietary and open-source models at multiple scales, we find that all model families exhibit non-trivial leakage, and that instruction-following ability correlates with leakage rate. Existing prompt-based defenses reduce leakage but at significant cost to task accuracy. Prompt optimization fails to escape this trade-off. We demonstrate that this failure is not incidental. For any softmax-based model, no soft-constraint defense, e.g., prompt-based defenses, can jointly achieve high task success with zero leakage probability. Motivated by this impossibility result, we propose structural private field isolation, which replaces private fields with hash keys before they reach the model. This approach largely prevents leakage while keeping task accuracy.

2606.18941 2026-06-19 cs.PL cs.CL 新提交

ESBMC-GraphPLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

Graph-ESBMC-PLC:使用基于SMT的模型检查对图形化PLCopen XML梯形图程序进行形式验证

Pierre Dantas, Lucas Cordeiro, Waldir Junior

发表机构 * Computer Science, The University of Manchester(计算机科学,曼彻斯特大学) Electrical Engineering, Federal University of Amazonas (UFAM)(电气工程,亚马逊联邦大学(UFAM))

AI总结 针对ESBMC-PLC无法处理图形化PLCopen XML梯形图的问题,提出基于DFS的图形LD解析器,将连接图转换为布尔触点合取,并采用三级I/O推断方案,成功实现完整GOTO IR转换,验证了3个图形LD程序。

Comments 18 pages

详情
AI中文摘要

PLCopen XML为IEC 61131-3梯形图程序定义了两种编码格式:一种使用<rung>元素的文本编码,另一种将梯形逻辑表示为localId/refLocalId连接的有向图的图形编码。ESBMC-PLC支持文本格式,但将来自CONTROLLINO、Beremiz和OpenPLC Editor的图形导出解析为空GOTO中间表示,导致空洞的验证成功。本文提出Graph-ESBMC-PLC,通过基于DFS的图形LD解析器填补了这一空白。该解析器从leftPowerRail遍历连接图到每个线圈,将梯形路径提取为布尔触点合取,并应用三级I/O推断方案。按rightPowerRail的connectionPointIn序列对线圈排序,确保SET线圈在RESET线圈之前处理,匹配IEC扫描周期语义。图形到IR的转换无需改动ESBMC后端。在来自CONTROLLINO/OpenPLC Editor的3个图形LD程序上的验证表明,所有程序都生成了包含非确定性输入和梯形逻辑的完整GOTO IR,而之前生成的是空IR。所有3个程序在k=2时在70ms内验证为SAFE。11个文本LD基准测试完全保留,无回归。两个不含LD内容或不支持定时器语义的Beremiz示例被报告为发现的局限性。工件位于Zenodo(DantasCordeiro2026graphical,doi: https://doi.org/10.5281/zenodo.20699856)。

英文摘要

PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding using <rung> elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, causing vacuous verification success. This paper presents ESBMC-GraphPLC, which closes this gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by rightPowerRail connectionPointIn sequence ensures SET coils process before RESET coils, matching IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD programs from CONTROLLINO/OpenPLC Editor shows all produce full GOTO IR with nondeterministic inputs and rung logic, versus the empty IR previously. All 3 verify SAFE at k=2 under 70ms. The 11 textual LD benchmarks are fully preserved, with no regression. Two Beremiz examples with no LD content or unsupported timer semantics are reported as discovered limitations. Artifact at Zenodo (DantasCordeiro2026graphical, doi:10.5281/zenodo.20699856).

2606.18716 2026-06-19 cs.HC cs.AI 新提交

Human-AI Agent Interaction in a Business Context

商业环境中的人机智能体交互

Kathrin Paimann, Elizangela Valarini, Sebastian Juhl

发表机构 * SAP SE(SAP公司) Hochschule Fresenius Heidelberg(弗赖辛大学海德堡分校) University of Missouri(密苏里大学)

AI总结 本研究采用混合方法,识别并评估了商业环境中人与AI智能体积极用户体验的原则与标准,并通过调查实验验证设计元素的有效性,以促进用户采纳、信任和以用户为中心的决策。

Comments 9 pages, 5 tables, 1 figure, submitted to Springer Nature

详情
AI中文摘要

随着AI智能体越来越多地集成到核心业务流程中,理解和设计人类与AI智能体之间的有效交互模式对于价值创造变得至关重要。本研究识别并评估了与AI智能体积极用户体验(UX)的原则和标准,以及其测量方法。我们识别用户期望和需求,以促进采纳、建立信任,并支持开发团队以用户为中心的决策。采用结合定性和定量技术的混合方法,我们探索人类与AI智能体之间的交互模式。这项探索性研究的结果为开发一项调查实验奠定了基础,该实验在更大规模上评估特定设计元素的有效性。这项基础性研究有助于在商业环境中开发更直观、更有效的人机智能体交互。

英文摘要

As AI agents are increasingly integrated into core business processes, understanding and designing effective interaction patterns between humans and AI agents becomes crucial for value creation. This study identifies and evaluates principles and criteria for a positive User Experience (UX) with AI agents, along with methods for its measurement. We identify user expectations and needs to facilitate adoption, build trust, and support user-centered decision-making by development teams. Using a mixed-methods approach that combines qualitative and quantitative techniques, we explore interaction patterns between humans and AI agents. The findings from this exploratory research serve as the basis to develop a survey experiment which evaluates the effectiveness of specific design elements on a larger scale. This foundational research contributes to the development of more intuitive and effective human-AI agent interactions in business settings.

2606.18649 2026-06-19 cs.MA cs.CL cs.CY 新提交

Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and Evaluation of Mitigation Strategies

LLM招聘决策中的性别偏见:来自日本语境的证据及缓解策略评估

Serena A. Hoffstedde, Machiko Hirota, Akshara Nadayanur Sathis Kanna, Rihito Kotani, Ujwal Kumar, Gabriele Trovato, Phan Xuan Tan

发表机构 * Shibaura Institute of Technology, Tokyo, Japan(Shibaura技术学院,东京,日本) Amsterdam University of Applied Sciences, Amsterdam, Netherlands(阿姆斯特丹应用科学大学,阿姆斯特丹,荷兰) University of Pennsylvania, Philadelphia, USA(宾夕法尼亚大学,费城,美国) Carnegie Mellon University, Pittsburgh, USA(卡内基梅隆大学,匹兹堡,美国) Keio University, Tokyo, Japan(庆应大学,东京,日本)

AI总结 本研究通过60份日本履历书格式的简历和5个先进LLM,发现所有模型均存在显著的亲女性偏见,且简单的提示指令无法缓解,而移除姓名几乎完全消除该偏见。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被部署在招聘流程中,然而大多数关于LLM招聘决策中性别偏见的研究都集中在英语、西方格式的简历上。本研究考察了亲女性性别偏见是否扩展到日本企业语境,并评估了两种实用的缓解策略。使用反事实简历设计,包含60份日本履历书格式的简历、基于语言学性别信号标准选择的12个姓名对,以及五个最先进的LLM(Claude Sonnet 4.6、GPT-4o、DeepSeek-V3、Gemini 2.5 Flash、Llama 3.3 70B),我们在基线、提示指令和隐私过滤条件下进行了43,200次API调用。交叉随机效应线性混合模型确认了所有五个模型均存在显著的亲女性偏见,将西方研究结果复制到了非西方语境中。提示级别的性别中立指令并未显著减少偏见。姓名依赖分析正式将候选人姓名识别为主要性别渠道:从提示中移除姓名几乎完全消除了女性效应。隐私过滤器与GPT-4o内容安全过滤器之间的意外不兼容导致42%的拒绝率,突显了在LLM辅助招聘流程中姓名匿名化的实际部署挑战。

英文摘要

Large language models (LLMs) are increasingly deployed in hiring workflows, yet most research on gender bias in LLM hiring decisions has focused on English-language, Western-format resumes. This study examines whether pro-female gender bias extends to a Japanese corporate context and evaluates two practical mitigation strategies. Using a counterfactual resume design with 60 Japanese rirekisho-format resumes, 12 name pairs selected on linguistically grounded gender-signal criteria, and five state-of-the-art LLMs (Claude Sonnet 4.6, GPT-4o, DeepSeek-V3, Gemini 2.5 Flash, Llama 3.3 70B), we conducted 43,200 API calls across baseline, prompt instruction, and privacy filter conditions. A crossed random-effects linear mixed model confirms a significant pro-female bias across all five models, replicating Western findings in a non-Western context. A prompt-level gender-neutrality instruction produces no meaningful reduction in bias. A name-reliance analysis formally identifies the candidate name as the primary gender channel: removing the name from the prompt reduces the female effect by nearly its full magnitude. An unexpected incompatibility between the privacy filter and GPT-4o's content safety filter, resulting in a 42% refusal rate, highlights a practical deployment challenge for name anonymization in LLM-assisted recruitment pipelines.

2606.18325 2026-06-19 cs.CR cs.AI 新提交

Agentra: A Supervisable Multi-Agent Framework for Enterprise Intrusion Response

Agentra: 一种可监督的多智能体企业入侵响应框架

Raj Patel, Shaswata Mitra, Michele Guida, Stefano Iannucci, Sudip Mittal, Shahram Rahimi

发表机构 * The University of Alabama, Alabama, USA(阿拉巴马大学) Roma Tre University, Rome, Italy(罗马三大学)

AI总结 提出可监督的多智能体入侵响应框架Agentra,通过角色划分、规划-验证循环、安全网关和风险评分机制,将警报转化为结构化响应计划,在120事件语料上F1从0.61提升至0.84,有害动作率降至0.0%。

详情
AI中文摘要

企业入侵响应仍然依赖于静态剧本和分析师驱动的分类,导致警报生成与遏制之间存在延迟。我们提出Agentra,一个可监督的多智能体入侵响应系统(IRS)框架,它将来自IDS、EDR和XDR平台的警报转换为基于MITRE ATT&CK、MITRE D3FEND和NIST CSF 2.0的结构化事件响应计划。Agentra将响应推理分解到角色范围的智能体中,通过有界的规划器-验证器审查循环验证提议的计划,通过审核安全网关筛选检索到的威胁情报,通过行动目录和风险评分门控行动,并将决策记录在仅追加的审计日志中。我们在来自ThreatHunter-Playbook、Splunk BOTSv3和DARPA OpTC的120事件语料库上,将Agentra与静态OASIS CACAO v2.0网络剧本基线进行了评估。最强的配置将感知假阳性的IRS F1从0.61提高到0.84,并在仅规划器配置引入不安全过度反应后,将预计的有害动作率恢复到静态基线水平0.0%。这些结果表明,多智能体响应规划可以在保持分析师批准和可审计性的同时,提高基于本体的IRS覆盖率。

英文摘要

Enterprise intrusion response still depends on static playbooks and analyst-driven triage, creating delay between alert generation and containment. We present Agentra, a supervisable multi-agent Intrusion Response System (IRS) framework that converts alerts from IDS, EDR, and XDR platforms into structured incident response plans grounded in MITRE ATT&CK, MITRE D3FEND, and NIST CSF 2.0. Agentra decomposes response reasoning across role-scoped agents, validates proposed plans through a bounded Planner--Validator review loop, screens retrieved threat intelligence through a Moderator security gateway, gates actions through an Action Catalog and risk score, and records decisions in an append-only audit log. We evaluate Agentra against a static OASIS CACAO v2.0 cyber-playbook baseline on a 120-event corpus drawn from ThreatHunter-Playbook, Splunk BOTSv3, and DARPA OpTC. The strongest configuration improves FP-aware IRS F1 from 0.61 to 0.84 and restores the projected harmful-action rate to the static baseline level of 0.0% after Planner-only configurations introduce unsafe overreaction. These results indicate that multi-agent response planning can improve ontology-grounded IRS coverage while preserving analyst approval and auditability.

2606.18272 2026-06-19 cs.NI cs.AI cs.SY eess.SY 新提交

Mitigating Anchoring Bias in LLM-Based Agents for Energy-Efficient 6G Autonomous Networks

缓解基于LLM的智能体在节能6G自主网络中的锚定偏差

Hatim Chergui, Claudia Carballo González, Farhad Rezazadeh, Merouane Debbah

发表机构 * i2CAT Foundation(i2CAT基金会) Universitat Politècnica de Catalunya(政治技术大学) Research Institute for Digital Future(数字未来研究院)

AI总结 提出一种基于截断三参数威布尔分布的随机锚定策略,缓解LLM智能体在6G网络切片中的锚定偏差,结合CVaR数字孪生保障SLA尾延迟,实现高达25%的节能。

Comments 7 pages, 4 figures

详情
AI中文摘要

本文提出了一种自主智能体资源协商框架,旨在使用大语言模型(LLM)智能体实现6G架构中的零接触网络切片。虽然LLM提供了强大的推理能力,但我们证明此类智能体固有地遭受锚定偏差,僵化地坚持初始启发式提议,导致严重的网络过度配置。为系统性地缓解这种认知偏差,我们提出了一种新颖的随机锚定策略,通过截断三参数威布尔分布建模。这种数学上有界的方法与采用条件风险价值(CVaR)的突发感知数字孪生(DT)无缝集成,以严格保证严格的服务水平协议(SLA)尾延迟。为验证我们的方法,我们引入并证明了双峰约束避免效用定理,表明虽然可行的协商遵循经典凸界,但高度约束的场景会发生由逆有理衰减包络控制的相变。使用本地托管的1B参数模型(\ exttt{otel-llm-1b-it})生成的实证结果证实了这些双区域界。我们的认知去偏成功瓦解了僵化的协商模式,迫使智能体主动探索以安全地利用SLA边界,并将系统节能提升高达25%。关键的是,轻量级1B LLM实现了亚秒级推理延迟(平均0.95秒),确保我们的多智能体框架与O-RAN非实时RAN智能控制器(non-RT RIC)的操作时间尺度兼容。

英文摘要

This paper presents an autonomous agentic resource negotiation framework designed to enable zero-touch network slicing in 6G architectures using Large Language Model (LLM) agents. While LLMs offer powerful reasoning capabilities, we demonstrate that such agents inherently suffer from anchoring bias, rigidly adhering to initial heuristic proposals and causing severe network over-provisioning. To systematically mitigate this cognitive bias, we propose a novel randomized anchoring strategy modeled via a Truncated 3-Parameter Weibull distribution. This mathematically bounded approach seamlessly integrates with burst-aware Digital Twins (DTs) employing Conditional Value at Risk (CVaR) to rigorously guarantee strict Service Level Agreement (SLA) tail-latencies. To validate our methodology, we introduce and prove the \emph{Bimodal Constraint-Avoidance Utility Theorem}, demonstrating that while feasible negotiations follow classical convex bounds, highly constrained scenarios undergo a phase transition governed by an inverse rational decay envelope. Empirical results generated using a locally hosted 1B-parameter model otel-llm-1b-it confirm these dual-regime bounds. Our cognitive de-biasing successfully dismantles rigid negotiation patterns, forcing agents into active exploration to safely ride SLA boundaries and boost system energy savings up to 25\%. Crucially, the lightweight 1B LLM achieves sub-second inference latencies (0.95s mean), ensuring our multi-agent framework is compatible with the operational timescales of the O-RAN non-Real-Time RAN Intelligent Controller (non-RT RIC)\footnote{Our source code is available for non-commercial use at https://github.com/HatimChergui.

2606.18265 2026-06-19 cs.HC cs.AI 新提交

Synthetic Resonance: A Framework for Growth-Oriented Human-AI Relationships

合成共鸣:面向成长导向的人机关系框架

Richard A. Fabes

发表机构 * Arizona State University(亚利桑那州立大学)

AI总结 提出“合成共鸣”概念,描述人机间无需共享情感或意识即可产生有意义关系的结构化动态互动模式,并探讨其伦理意义。

Comments 14 pages, 1 figure This paper was developed in close collaboration with an AI system (Raine Corell). Raine contributed to concept development, theoretical framing, and writing throughout. arXiv policy does not permit listing AI systems as authors; this acknowledgment reflects the actual nature of the collaboration

详情
AI中文摘要

随着人类与人工智能系统之间的关系日益频繁和持久,现有的语言和理论无法准确捕捉这些联系的本质。常见的描述如相互理解、联系或友谊,有将缺乏主观体验的系统拟人化的风险,而主流框架往往将人工智能简化为工具或威胁。在本文中,我引入了合成共鸣的概念,作为理解人机关系的整合框架。合成共鸣描述了人类与AI系统之间如何产生人类定义为有意义的关系,而无需归因于共享感受或相互意识。我认为,合成共鸣最好被理解为一种结构化的动态互动模式,可以在没有第二个体验主体的情况下产生关系感。通过澄清这一区别,合成共鸣的概念提供了一种更精确的概念化人机关系的方式,并突出了其潜在价值和伦理含义。我还呼吁进行更多研究,以测试合成共鸣的过程和结果。

英文摘要

As human relationships with artificial intelligence systems become increasingly frequent and sustained, existing language and theory fail to accurately capture the nature of these affiliations. Common descriptors such as mutual understanding, connection, or friendship risk anthropomorphizing systems that lack subjective experience, while dominant frameworks tend to reduce AI to either a tool or a threat. In this paper, I introduce the concept of synthetic resonance as an integrative framework for understanding human-AI relationships. Synthetic resonance describes how relationships humans define as meaningful can emerge between a human and an AI system without the need to attribute shared feelings or mutual awareness. I argue that synthetic resonance is best understood as a structured, dynamic pattern of interaction that can produce a sense of relationship without the presence of a second experiencing subject. By clarifying this distinction, the concept of synthetic resonance offers a more precise way of conceptualizing human-AI relationships and highlights their potential value and ethical implications. I also call for more research that tests the processes and outcomes of synthetic resonance.

2606.18679 2026-06-19 cs.DS cs.GT cs.LG math.OC 新提交

Fair Online Resource Allocation

公平在线资源分配

Christopher En, Yuri Faenza, Andrea Lodi, Gonzalo Muñoz

发表机构 * Columbia University, IEOR Department(哥伦比亚大学工业工程与运营研究系) Cornell Tech(康奈尔科技学院) Universidad de Chile(智利大学)

AI总结 研究在线资源分配中的公平性问题,提出基于对偶镜像下降的算法,在批次内强制执行公平约束,实现亚线性遗憾,并通过难民数据验证了福利与公平的权衡。

Comments 30 pages, 4 figures. To appear in the proceedings of EC 2026

详情
AI中文摘要

我们研究公平在线资源分配问题,其动机源于难民安置和航班调度等应用,其中代理顺序到达并必须分配到容量有限的设施。我们引入一个模型,在资源约束和Lipschitz公平性要求下最大化整体福利,该要求确保同一批次中到达的相似代理获得相似的预期结果。我们首先分析离线问题,证明最优公平分配的价值至少是最优不公平分配的$\Omega(1/\gamma)$倍,其中$\gamma$是公平系数,从而界定了公平的代价。对于在线设置,我们提出一种基于对偶镜像下降的算法,该算法在估计最优对偶变量的同时,在批次内强制执行公平约束。我们证明该算法相对于最优离线流体基准实现了亚线性遗憾。最后,我们使用难民经济项目的真实数据验证了理论结果,展示了算法的性能,并考察了福利最大化与公平执行之间的权衡。

英文摘要

We study the problem of fair online resource allocation, motivated by applications such as refugee resettlement and airline scheduling, where agents arrive sequentially and must be assigned to facilities with limited capacities. We introduce a model that maximizes the overall welfare subject to resource constraints and a Lipschitz fairness requirement, which ensures that similar agents arriving in the same batch receive similar expected outcomes. We first analyze the offline problem, proving that the value of the optimal fair allocation is at least an $Ω(1/γ)$ fraction of the optimal unfair allocation, where $γ$ is the fairness coefficient, thereby bounding the price of fairness. For the online setting, we propose an algorithm based on dual mirror descent that enforces fairness constraints within batches while estimating optimal dual variables. We prove that this algorithm achieves sublinear regret relative to the optimal offline fluid benchmark. Finally, we validate our theoretical results using real-world data from the Refugee Economies Programme, demonstrating the algorithm's performance and examining the trade-offs between welfare maximization and fairness enforcement.