arXivDaily: arXiv daily academic digest, updated Monday through Friday
2210.02573 2026-05-27 cs.LG

Efficient Learning of Mesh-Based Physical Simulation with BSMS-GNN

Yadi Cao, Menglei Chai, Minchen Li, Chenfanfu Jiang

Affiliations: Department of Computer Science, UCLA, Los Angeles, USA; AR Perception, Google, Los Angeles, USA; Department of Mathematics, UCLA, Los Angeles, USA

AI summary: To address the scaling complexity and over-smoothing of graph neural networks in large-scale mesh-based physical simulation, the paper proposes BSMS-GNN, built on a bi-stride pooling strategy inspired by bipartite graph determination. It needs no manually drawn coarse meshes, avoids wrong edges across geometry boundaries, and markedly improves accuracy and computational efficiency.

Comments: Update summary: fixed the missing remark for Yadi and Menglei (* notes that the work was partially done while they were at Snap Inc.)

Abstract

Learning physical simulation on large-scale meshes with flat Graph Neural Networks (GNNs) and stacked Message Passings (MPs) is challenging due to the scaling complexity w.r.t. the number of nodes and over-smoothing. There has been growing interest in the community in introducing multi-scale structures to GNNs for physical simulation. However, current state-of-the-art methods are limited by their reliance on the labor-intensive drawing of coarser meshes or on building coarser levels based on spatial proximity, which can introduce wrong edges across geometry boundaries. Inspired by bipartite graph determination, we propose a novel pooling strategy, bi-stride, to tackle the aforementioned limitations. Bi-stride pools nodes on every other frontier of the breadth-first search (BFS), without the need for the manual drawing of coarser meshes and avoiding the wrong edges introduced by spatial proximity. Additionally, it enables a one-MP scheme per level and non-parametrized pooling and unpooling by interpolations, resembling U-Nets, which significantly reduces computational costs. Experiments show that the proposed framework, BSMS-GNN, significantly outperforms existing methods in terms of both accuracy and computational efficiency in representative physical simulations.
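The pooling rule in the abstract (keep the nodes on every other BFS frontier) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `bi_stride_pool`, the adjacency-dict input, and the seed choice are all assumptions made for illustration, and the abstract does not say how seeds are chosen or how edges of the coarser level are rebuilt.

```python
from collections import deque


def bi_stride_pool(adjacency, seed=0):
    """Return the sorted node ids kept after one bi-stride pooling step.

    adjacency: dict mapping node id -> iterable of neighbour ids (undirected).
    seed: node where the first BFS starts (hypothetical choice; the abstract
          does not specify how seeds are picked).
    """
    depth = {}
    # Start from the seed, then from any node still unvisited, so that
    # disconnected components are covered as well.
    for start in [seed] + sorted(adjacency):
        if start in depth:
            continue
        depth[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nbr in adjacency[node]:
                if nbr not in depth:
                    depth[nbr] = depth[node] + 1
                    queue.append(nbr)
    # Keep every other BFS frontier: the nodes at even depth.
    return sorted(n for n, d in depth.items() if d % 2 == 0)


# A path graph 0-1-2-3-4: BFS depths from node 0 are 0, 1, 2, 3, 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(bi_stride_pool(path))  # [0, 2, 4]
```

Because the selection depends only on graph connectivity, two nodes that are close in space but far apart along the mesh are never treated as neighbours, which is the property the abstract contrasts with proximity-based coarsening.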

2302.13473 2026-05-27 cs.LG

Towards Interpretable Federated Learning

Anran Li, Rui Liu, Ming Hu, Yuanyuan Chen, Shipeng Wang, Lizhen Cui, Han Yu

Affiliations: Department of Biomedical Informatics and Data Science, School of Medicine at Yale University; School of Computer Science and Engineering, Nanyang Technological University; School of Software, Shandong University; Joint SDU-NTU Centre for Artificial Intelligence Research, Shandong University

AI summary: This paper presents the first survey of interpretable federated learning (IFL), proposing a unique taxonomy that covers model interpretation, debugging, and data contribution evaluation, and analyzing representative methods, evaluation metrics, and future directions.

Comments: Survey of interpretable federated learning

Abstract

Federated learning (FL) enables multiple data owners to build machine learning models collaboratively without exposing their private local data. In order for FL to achieve widespread adoption, it is important to balance the need for performance, privacy preservation and interpretability, especially in mission-critical applications such as finance and healthcare. Thus, interpretable federated learning (IFL) has become an emerging topic of research attracting significant interest from academia and industry alike. Its interdisciplinary nature can be challenging for new researchers to pick up. In this paper, we bridge this gap by providing (to the best of our knowledge) the first survey on IFL. We propose a unique IFL taxonomy which covers relevant works enabling FL models to explain the prediction results, support model debugging, and provide insights into the contributions made by individual data owners or data samples, which, in turn, is crucial for allocating rewards fairly to motivate active and reliable participation in FL. We conduct a comprehensive analysis of the representative IFL approaches, the commonly adopted performance evaluation metrics, and promising directions towards building versatile IFL techniques.

2009.11997 2026-05-27 cs.LG cs.AI cs.RO

Continual Model-Based Reinforcement Learning with Hypernetworks

Yizhou Huang, Kevin Xie, Homanga Bharadhwaj, Florian Shkurti

Affiliations: Division of Engineering Science, University of Toronto, Canada; Department of Computer Science, University of Toronto, Canada

AI summary: Proposes HyperCRL, which uses task-conditional hypernetworks to continually learn dynamics models across a sequence of tasks, avoiding retraining from scratch and keeping storage overhead fixed; it outperforms existing continual learning methods on robot locomotion and manipulation tasks.

Comments: Updated link to project website in the abstract. 7 pages (+2 pages in appendix), 8 figures. In proceedings of the 2021 IEEE International Conference on Robotics and Automation.

Abstract

Effective planning in model-based reinforcement learning (MBRL) and model-predictive control (MPC) relies on the accuracy of the learned dynamics model. In many instances of MBRL and MPC, this model is assumed to be stationary and is periodically re-trained from scratch on state transition experience collected from the beginning of environment interactions. This implies that the time required to train the dynamics model, and the pause required between plan executions, grows linearly with the size of the collected experience. We argue that this is too slow for lifelong robot learning and propose HyperCRL, a method that continually learns the encountered dynamics in a sequence of tasks using task-conditional hypernetworks. Our method has three main attributes: first, it includes dynamics learning sessions that do not revisit training data from previous tasks, so it only needs to store the most recent fixed-size portion of the state transition experience; second, it uses fixed-capacity hypernetworks to represent non-stationary and task-aware dynamics; third, it outperforms existing continual learning alternatives that rely on fixed-capacity networks, and performs competitively with baselines that remember an ever-increasing coreset of past experience. We show that HyperCRL is effective in continual model-based reinforcement learning in robot locomotion and manipulation scenarios, such as tasks involving pushing and door opening. Our project website with videos is at https://rvl.cs.toronto.edu/blog/hypercrl

1909.08210 2026-05-27 cs.LG stat.ML

Reformulation of RBM to Unify Linear and Nonlinear Dimensionality Reduction

Jiangsheng You, Chun-Yen Liu

Affiliations: Aspen Technology Inc

AI summary: Using maximum a posteriori estimation and the expectation-maximization algorithm, this paper reformulates the restricted Boltzmann machine as a deterministic model, proposes a contrastive divergence algorithm that needs no MCMC, and unifies linear and nonlinear dimensionality reduction for scalar and vector variables.

Comments: 16 pages with 7 figures

Abstract

A restricted Boltzmann machine (RBM) is a two-layer neural network with shared weights and has been extensively studied for dimensionality reduction, data representation and recommendation systems in the literature. The traditional RBM requires a probabilistic interpretation of the values on both layers and a Markov chain Monte Carlo (MCMC) procedure to generate samples during training. The contrastive divergence (CD) algorithm trains the RBM efficiently, but its convergence has not been proved mathematically. In this paper, using a maximum a posteriori (MAP) estimate and the expectation maximization (EM) algorithm, we show that the CD algorithm without MCMC is convergent for the conditional likelihood objective function. Another key contribution in this paper is the reformulation of the RBM into a deterministic model. Within the reformulated RBM, the CD algorithm without MCMC approximates the gradient descent (GD) method. This reformulated RBM can take continuous scalar and vector variables on the nodes with flexibility in choosing the activation functions. Numerical experiments show its capability in both linear and nonlinear dimensionality reduction, and, for nonlinear dimensionality reduction, the reformulated RBM can outperform principal component analysis (PCA) by choosing the proper activation functions. Finally, we demonstrate its application to vector-valued nodes for the CIFAR-10 dataset (color images) and multivariate sequence data, which cannot be configured naturally with the traditional RBM. This work not only provides theoretical insights regarding the traditional RBM but also unifies linear and nonlinear dimensionality reduction for scalar and vector variables.

2605.27371 2026-05-27 cs.CY cs.AI

Algorithmic Monocultures in Hiring

Rishi Bommasani, Sarah H. Bana, Kathleen A. Creel, Dan Jurafsky, Percy Liang

Affiliations: Stanford University; Chapman University; Northeastern University

AI summary: Studies how algorithmic monoculture in hiring leads to the same individuals and racial groups being rejected; analyzing 4 million applications from 3 million applicants, it finds clear racial disparities and homogeneous outcomes.

Comments: Published at FAccT 2026. Website: https://algorithmichiring.github.io/

Abstract

Many employers screen job applicants with algorithms built by the same few algorithm vendors. We hypothesize that algorithmic monoculture leads to the same individuals and members of the same racial groups facing rejection. We acquire and analyze a novel dataset of 3 million applicants submitting 4 million applications where all the applications are screened by algorithms built by the same vendor. We find clear racial disparities in applicant outcomes. Of all applications submitted by Asian and Black applicants, 14.74% and 25.87% are submitted to positions that adversely impact Asian and Black applicants, respectively, according to U.S. employment discrimination standards. Individuals also receive homogeneous outcomes: 4% of all applicants who apply to 10 positions are recommended for rejection from all positions, a rate higher than expected by chance. To better understand this homogeneity, we leverage the deterministic replicability of hiring algorithms to generate the outcomes applicants would have received if they had applied to all positions. We show that applicants would need to apply widely in order to ensure their applications are considered by a human.
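Two quantities in the abstract lend themselves to a small worked sketch: an adverse-impact check and a chance baseline for being rejected everywhere. The abstract does not name the standard it applies, so the four-fifths rule of thumb used below is an assumption, as is the independence assumption behind the baseline; the paper's own definitions may differ. The selection rates are made-up numbers, not values from the dataset.

```python
def adverse_impact(selection_rates, threshold=0.8):
    """Flag groups whose selection rate falls below `threshold` times the
    highest group's rate (the four-fifths rule of thumb; an assumption here,
    since the abstract does not name the standard it applies)."""
    top = max(selection_rates.values())
    return {group: rate / top < threshold
            for group, rate in selection_rates.items()}


def chance_of_all_rejections(rejection_rate, positions):
    """Probability of rejection from every position if decisions were
    independent across positions (a simplifying assumption)."""
    return rejection_rate ** positions


# Hypothetical selection rates, not taken from the paper's dataset.
rates = {"group_a": 0.50, "group_b": 0.35, "group_c": 0.45}
print(adverse_impact(rates))
# {'group_a': False, 'group_b': True, 'group_c': False}

# With a hypothetical 50% per-position rejection rate, ten independent
# rejections in a row would be rare.
print(chance_of_all_rejections(0.5, 10))  # 0.0009765625
```

Under this kind of independence baseline, an observed all-rejection share well above the computed probability would point to correlated decisions across positions, which is the homogeneity the abstract describes.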

2605.27360 2026-05-27 cs.NI cs.AI

GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesis, Research, and Testing

Tamerlan Aghayev, Maxime Elkael, Michele Polese, Minh Dat Nguyen, Gabriele Gemmi, Andrea Lacava, Ali Saeizadeh, Reshma Prasad, Paolo Testolina, Angelo Feraudo, Soumendra Nanda, Pedram Johari, Salvatore D'Oro, Tommaso Melodia

Affiliations: Institute for Intelligent Networked Systems

AI summary: Proposes the GENESIS framework, which uses three composable primitives (agents, skills, and hooks) and a knowledge layer (SYNAPSE) to turn intents into solutions validated by over-the-air experiments, with the aim of accelerating 6G radio access network R&D.

Comments: 18 pages, 16 figures

Abstract

Cellular research and development (R&D) is throttled by six structural processes that each consume months of manual engineering work per iteration: (i) synthesizing new features from standards or research papers into production code; (ii) conformance and interoperability testing; (iii) hardening against field anomalies and diverse deployment environments; (iv) data-driven optimization of network functionalities; (v) discovering and prototyping novel waveforms, functionalities, and capabilities for future standards; and (vi) securing the stack against vulnerabilities. Although Large Language Models (LLMs) have compressed comparable R&D work in general software engineering from days to minutes, their known pitfalls worsen on Radio Access Network (RAN) use cases: they hallucinate Application Programming Interfaces (APIs) and misread specifications, which kills interoperability of RAN components at the first mistake, and they rely heavily on simulation to design algorithms, an approach notorious for breaking when transferred to real hardware. To address these challenges, we present GENESIS, an agentic Artificial Intelligence (AI) framework that converts intents (e.g., a specification clause, a telemetry anomaly, or a research hypothesis) into solutions validated with over-the-air experiments, fed back into a persistent knowledge base. GENESIS is built on three composable primitives (agents, skills, hooks) and a knowledge layer (SYNAPSE) that doubles as the source of ground truth and the recipient of every artifact the framework produces, making capabilities compound across runs.

2605.27332 2026-05-27 cs.SE cs.AI cs.CV

EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering

Zhifei Dou, Shabnam Hassani, Ou Wei

Affiliations: Huawei Research Canada

AI summary: Proposes EdgeFlow, which adds a Canny edge map to the vision language model (VLM) input as a structural prior, improving flowchart-to-Mermaid conversion accuracy without training data or fine-tuning; on an industrial dataset, node F1 improves by 17.39 percentage points and edge F1 by 16.94 percentage points.

Comments: 10 pages

Abstract

Flowcharts are widely used in industrial requirements, but usually remain embedded as static images. Vision Language Models (VLMs) show promise in converting these flowcharts into machine-readable models for requirements engineering (RE) activities, yet, when directly applied to flowchart conversion, they often fail on topology-critical visual details. To address this, we propose EdgeFlow, which augments a VLM's original input with a deterministically extracted Canny edge map, acting as a structural prior, to improve flowchart-to-Mermaid conversion without requiring annotated training data or domain-specific model fine-tuning. We evaluate EdgeFlow on IndusReqFlow, a dataset sourced from real-world requirements. Compared with off-the-shelf VLMs, EdgeFlow improves node-level F1 by 17.39 percentage points and edge-level F1 by 16.94 percentage points. At the path level, EdgeFlow improves path F1 by 11.06 percentage points, enabling better support for model-based testing. These results demonstrate that EdgeFlow provides a practical, training-free means to improve topology-preserving flowchart-to-Mermaid conversion for industrial RE. Cross-dataset evaluation results on a public synthetic benchmark show no significant improvement; this highlights the need for diverse benchmarks incorporating industrial data for the comprehensive evaluation of future VLM-based RE tools.
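The node-level and edge-level F1 scores reported above can be pictured as set comparisons between a predicted graph and a reference graph. The sketch below assumes exact matching of (source, target) pairs; the matching rule actually used in the paper is not stated in the abstract, and the example flowchart is invented for illustration.

```python
def f1_score(predicted, reference):
    """Set-based F1 between predicted and reference items.

    Items can be node labels or (source, target) edge tuples. Exact matching
    is an assumption; the paper may match items more leniently.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical flowchart edges, e.g. parsed from Mermaid lines like "A --> B".
ref_edges = {("Start", "Check"), ("Check", "Approve"), ("Check", "Reject")}
pred_edges = {("Start", "Check"), ("Check", "Approve"), ("Approve", "Reject")}
print(round(f1_score(pred_edges, ref_edges), 3))  # 0.667
```

A gain quoted in percentage points is the plain difference between two such scores multiplied by 100, so moving from 0.60 to 0.77 would be a 17-point improvement rather than a 17% relative one.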

2605.27328 2026-05-27 cs.SE cs.AI cs.MA

Governed Evolution of Agent Runtimes through Executable Operational Cognition

Mariano Garralda-Barrio

Affiliations: Independent Researcher

AI summary: This paper proposes a framework for governed runtime evolution of agent-generated artifacts in multi-agent systems through executable operational cognition, introducing the HarnessMutation mechanism for lifecycle-aware runtime adaptation under validation, traceability, evaluation, and rollback constraints.

Comments: 14 pages, 4 figures, 1 table. Reference implementation and associated source code available at: https://github.com/mgarralda/governed-runtime

Abstract

Recent advances in agentic systems increasingly treat code as an executable operational substrate rather than as a disposable output artifact. Prior work such as Code as Agent Harness frames validated agent-generated artifacts as runtime entities that can be created, executed, revised, persisted, and reused within long-running cognitive loops. However, the governance, lifecycle management, and operational evolution of such artifacts remain under-specified. This paper proposes a framework for governed runtime evolution in multi-agent systems through executable operational cognition. We formalize agent-generated artifacts as persistent runtime capabilities that progressively become part of the operational substrate rather than transient intermediate outputs. Building on this perspective, we introduce HarnessMutation as a governed mechanism for lifecycle-aware runtime adaptation operating under explicit validation, traceability, evaluation, and rollback constraints. Rather than treating runtime adaptation as unrestricted self-modification, the proposed framework models evolution as a bounded and observable process over persistent operational memory. It further shows how these ideas can be operationalized over modern agent runtimes and governance-oriented orchestration systems, providing a conceptual foundation for adaptive infrastructures whose evolution remains explicit, auditable, and constrained.

2605.27246 2026-05-27 cs.LO cs.AI math.LO

Many Logics, One Methodology: A Plea for Logical Pluralism in Formalised Reasoning (preprint)

Christoph Benzmüller, Daniel Kirchner, Luca Pasetto

Affiliations: University of Luxembourg

AI summary: Building on the LogiKEy logic-pluralistic knowledge representation and reasoning methodology, this paper argues for supporting logical pluralism at the object-logic level within a unifying meta-logical framework, and warns that logical imperialism hinders interdisciplinary reuse.

Comments: 21 pages, 6 figures; to appear (preprint)

Abstract

This position statement looks back on two decades of work on shallow embeddings of non-classical logics in classical higher-order logic (HOL), a line of research that expanded into a range of logic embeddings in HOL and inspired the LogiKEy logic-pluralistic knowledge representation and reasoning methodology. This paper advances the case for logical pluralism at the object-logic level within a unifying meta-logical framework such as LogiKEy, grounding the argument in computational metaphysics. More broadly, it advocates principled support for logical pluralism in modern proof assistants, and cautions against logical imperialism (the rigid adoption of a single foundational logic for large-scale theory developments), which impedes the interdisciplinary reuse that LogiKEy is designed to enable.

2605.27210 2026-05-27 quant-ph cs.AI

Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation

Juan Cruz-Benito, Ismael Faro

Affiliations: IBM Research

AI summary: This paper ports Microsoft's QuantumKatas quantum computing curriculum from Q# to Qiskit and builds an evaluation framework for systematically assessing large language models on quantum computing tasks.

Abstract

We adapt Microsoft's QuantumKatas -- a well-established quantum computing curriculum -- from Q# to Qiskit, the most widely-adopted quantum computing framework, and package it with an evaluation framework for systematic LLM assessment. The resulting benchmark comprises 350 tasks across 26 categories, spanning fundamental gates through advanced algorithms (Grover's, Simon's, Deutsch-Jozsa), error correction, key distribution, and quantum games. Each task includes a natural language prompt, canonical solution, and deterministic test verification via classical circuit simulation. By building on the QuantumKatas' proven pedagogical design rather than creating tasks from scratch, we inherit a principled difficulty progression and comprehensive concept coverage while contributing the framework adaptation, evaluation infrastructure, and empirical analysis. We evaluate 16 LLMs across 7 prompting configurations -- a total of 39,200 model runs -- to demonstrate the benchmark's utility. Three key findings emerge: (1) the benchmark effectively differentiates model capabilities, with best-configuration pass rates ranging from 32.3% to 83.1% and a 26.1 pp average gap between frontier and open-source models; (2) models perform well at implementing known algorithms (SimonsAlgorithm 82.1%, BasicGates 81.6%) but struggle with problem encoding (SolveSATWithGrover 34.4%, DistinguishUnitaries 40.0%); and (3) chain-of-thought prompting shows a modestly bimodal effect -- it is the best strategy for three models (two of them explicitly reasoning-tuned per vendor documentation) but degrades performance for the rest, leaving it mid-pack in aggregate (56.3% mean) behind few-shot-5 (57.8%). We release the benchmark, evaluation framework, and baseline results to support research on LLM capabilities in quantum computing.
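The idea of "deterministic test verification via classical circuit simulation" can be sketched without any quantum SDK. The block below is a pure-Python stand-in, not Qiskit code and not the benchmark's harness: the task (prepare the |-> state), the gate lists, and the helper names `run` and `equal_up_to_global_phase` are all hypothetical. It simulates a candidate single-qubit circuit and compares the result with a reference solution up to a global phase.

```python
import math

S = 1 / math.sqrt(2)
H = [[S, S], [S, -S]]   # Hadamard
X = [[0, 1], [1, 0]]    # Pauli-X
Z = [[1, 0], [0, -1]]   # Pauli-Z


def run(gates, state=(1, 0)):
    """Apply single-qubit gates in order to a state vector (default |0>)."""
    a, b = state
    for g in gates:
        a, b = g[0][0] * a + g[0][1] * b, g[1][0] * a + g[1][1] * b
    return a, b


def equal_up_to_global_phase(u, v, tol=1e-9):
    """True if normalized states u and v differ only by a global phase."""
    overlap = sum(complex(x).conjugate() * y for x, y in zip(u, v))
    return abs(abs(overlap) - 1) < tol


reference = run([X, H])  # stand-in canonical solution: prepares |->
print(equal_up_to_global_phase(run([H, Z]), reference))  # True
print(equal_up_to_global_phase(run([H]), reference))     # False
```

Because the simulation is exact and has no sampling noise, the same candidate always gets the same verdict. As a check on the scale quoted in the abstract, 16 models times 7 prompting configurations times 350 tasks gives the stated 39,200 runs.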

2605.27205 2026-05-27 eess.IV cs.AI

TWIST: Closed-Loop token Synchronization for Application-Aware Wireless Digital Twins

Sige Liu, Kezhi Wang

Affiliations: Department of Computer Science, Brunel University London

AI summary: Proposes the TWIST framework, which uses closed-loop token synchronization and mode-conditioned unequal error protection to achieve application-aware state synchronization for wireless digital twins under limited communication resources, improving traffic-state inference while lowering synchronization cost.

Abstract

Wireless digital twins require repeated synchronization between a time-evolving physical scene and its digital counterpart under limited and time-varying communication resources. For perception-centric twins, pixel-domain transmission or uniformly protected bitstreams can be mismatched to the semantic state consumed by twin-side applications. This paper proposes TWIST, a closed-loop token synchronization framework for application-aware wireless digital twins. TWIST represents each physical observation as a token and synchronizes this state over a wireless link, rather than optimizing visual reconstruction. Token positions are grouped by task relevance and protected through mode-conditioned unequal error protection under low-, medium-, and high-synchronization modes. At the twin side, decoding confidence converts unreliable hard token decisions into erasures, which are restored by a completion model before updating the semantic twin state. The recovered state supports traffic-state inference and generates compact feedback statistics, including channel quality, receiver uncertainty, semantic drift, and application priority, for subsequent mode adaptation. Experiments on a dynamic road-scene digital-twin scenario show that TWIST improves traffic-state inference and semantic twin-state synchronization compared with fixed-mode and channel-only adaptation strategies, while reducing the average synchronization cost relative to always-high transmission.
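The twin-side step described above (low-confidence hard decisions become erasures, which a completion model then restores) can be sketched as follows. Everything concrete here is an assumption for illustration: the 0.6 confidence threshold, the token values, and especially the `complete` function, which simply falls back to the previous twin state and stands in for the learned completion model that the abstract mentions but does not specify.

```python
ERASED = None


def to_erasures(tokens, confidences, threshold=0.6):
    """Replace hard token decisions with erasures when confidence is low.

    The threshold value is hypothetical; the abstract gives no number.
    """
    return [t if c >= threshold else ERASED
            for t, c in zip(tokens, confidences)]


def complete(received, previous_state):
    """Fill erased positions from the previous twin state.

    A deliberately simple stand-in for the completion model.
    """
    return [p if r is ERASED else r
            for r, p in zip(received, previous_state)]


decoded = [12, 7, 93, 4]            # hard token decisions after decoding
confidence = [0.95, 0.30, 0.88, 0.55]
previous = [12, 8, 90, 5]           # last synchronized twin state

received = to_erasures(decoded, confidence)
print(received)                      # [12, None, 93, None]
print(complete(received, previous))  # [12, 8, 93, 5]
```

Marking a doubtful token as erased, rather than trusting a possibly wrong value, lets the completion step treat it as missing information, and the share of erased positions is one natural candidate for the receiver-uncertainty feedback listed in the abstract.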

2605.27139 2026-05-27 eess.IV cs.CV physics.ins-det

Unsupervised Deep Image Prior for Sparse-View and Limited-Angle Electron Tomography

Serge Brosset, Daniel del Pozo Bueno, Thomas David, Laure Guetaz, Philippe Ciuciu, Zineb Saghi

Affiliations: Univ. Grenoble Alpes, CEA, Leti; Univ. Grenoble Alpes, CEA, Liten; CEA, Joliot, NeuroSpin; Inria, MIND, Université Paris-Saclay

AI summary: Proposes an unsupervised deep image prior method that achieves electron tomography reconstruction performance comparable to supervised methods under sparse-view and limited-angle conditions, and applies it to experimental data to validate its reliability.

Comments: 22 pages, 12 figures

Abstract

Electron tomography (ET) plays an important role in the three-dimensional (3D) characterization of nanomaterials. However, under limited-angle and sparse-view conditions, conventional algorithms produce degraded reconstructions, which compromise the quality and interpretability of resulting 3D data. In this paper, we present deep image prior (DIP), an unsupervised deep learning (DL) approach, for highly degraded tomography acquisitions and demonstrate, using simulated data, that its performance is comparable to that of supervised approaches requiring training datasets, even for tilt ranges as limited as 60° and tilt increments of 10°. We then apply it to experimental data and show that it enables reliable 3D quantification under both sparse-view and limited-angle conditions, highlighting its potential for a wide range of materials and acquisition modalities.

2605.27135 2026-05-27 cs.CR cs.CV

Do Modern Post-Hoc Watermarking Methods Beat Broken-Arrows?

Enoal Gesny, Eva Giboulot

Affiliations: Inria

AI summary: Through a fair comparison of the robustness and security of modern and classic post-hoc watermarking methods under a range of attacks, this paper finds that classic methods do better in realistic scenarios.

Abstract

With the rapid proliferation of generative models, such as diffusion models, digital watermarking has emerged as a crucial solution for identifying AI-generated images. Modern post-hoc watermarking schemes use neural networks to achieve an extremely low false-alarm rate while remaining robust to common image transformations. However, there is a lack of comparison between these modern methods and classic ones, particularly in real-world scenarios where robustness and security take precedence over achieving an extremely low false-alarm probability. In this paper, we propose a fair comparison of robustness and security between modern and classic post-hoc watermarking across various types of classic augmentations and recent sophisticated attacks. Our experiments show that, in a realistic scenario, classic watermarking outperforms modern techniques in terms of security while maintaining robustness.

2605.27131 2026-05-27 cs.ET cs.AI cs.DB

Beyond the Data Mesh Illusion: Designing Modern AI-augmented Lakehouses to Bridge the Gap Between Theory and Practice

Oliver Angélil, Jan Migon

Affiliations: ishango.ai, Zurich, Switzerland; Independent Researcher

AI summary: To address the tension between domain self-service and holistic governance in enterprise data platforms, the paper proposes an AI-augmented hub-and-spoke model on a modern lakehouse architecture. A central Center of Excellence provides shared services and AI-enabled governance while domain teams progressively take on more responsibility, balancing flexibility and control; the architecture is evaluated through three metrics: data product adoption, time-to-find, and time-to-insight.

Comments: 11 pages, 5 figures

Abstract

Enterprise data platforms face an enduring tension between domain self-service and holistic governance. The data mesh paradigm proposed decentralized domain ownership as a remedy, but pure implementations frequently underdeliver: teams inherit new responsibilities without the platform maturity, tooling, or coordination mechanisms needed to exercise them effectively. This paper argues that the flexibility-versus-control trade-off can be relaxed through an AI-augmented hub-and-spoke model layered on a modern lakehouse architecture. A central hub (Center of Excellence) provides shared platform services, policy automation, and AI-enabled governance, automatically standardizing data products, generating quality rules, drafting data contracts, and reviewing changes for regressions. Domain spokes own business semantics, product backlogs, and local iteration cadence, progressively assuming greater responsibility as they mature. The same LLMs that automate governance tasks also lower the barrier for domain practitioners to develop genuine cross-functional expertise spanning business and data engineering, enabling spoke teams to take on greater end-to-end ownership without proportionally increasing their dependence on the hub. Natural-language conversational interfaces further democratize access for business users, exposing historically underutilized enterprise data. On the organizational side, we propose a staged framework that shifts ownership from hub to spokes, avoiding both centralized bottlenecks and uncoordinated decentralization. We evaluate the architecture through three outcome metrics (data product adoption, time-to-find, and time-to-insight) that tie platform success to measurable business value rather than internal activity.

2605.27110 2026-05-27 cs.CR cs.CL

BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning

Xuan Luo, Yue Wang, Geng Tu, Jing Li, Ruifeng Xu

Affiliations: The Harbin Institute of Technology; The Hong Kong Polytechnic University; Shenzhen University; Shenzhen Loop Area Institute

AI summary: Proposes BAIT, a three-step jailbreak framework that identifies a protection boundary, refines it, and requests a detailed example, exploiting the model's own reasoning and consistency tendency to reach disclosure of malicious goals; experiments show attack success rates significantly above conventional methods on multiple benchmarks.

Abstract

In this work, we propose BAIT (Boundary-Aware Iterative Trap), a three-step jailbreak framework that approaches malicious goals through internal disclosure. BAIT first asks the model to identify the protection boundary, then requires it to refine that boundary, and finally requests a detailed example. By expanding each step upon the model's previous responses, BAIT turns the model's own reasoning and consistency tendency into a disclosure pathway. Experiments on AdvBench, JailbreakBench, AIR-Bench, and SORRY-Bench demonstrate that BAIT consistently achieves strong attack success rates across top-tier large language models, significantly outperforming conventional jailbreak baselines. Further analysis reveals that: 1) prevention-oriented framing significantly outperforms direct knowledge requests; 2) the refinement step plays a critical role in disclosure escalation; and 3) the first two steps have a certain chance of eliciting harmful content while triggering little filtering.

2605.27093 2026-05-27 stat.ML cs.LG

Gaussian Process-based learning with new MCMC-based implementation of Wishart prior on correlation matrix

基于高斯过程的学习:相关矩阵上Wishart先验的新MCMC实现

Kane Warrior, Dalia Chakrabarty

发表机构 * Department of Mathematics, University of York(约克大学数学系)

AI总结 提出一种自组装Wishart先验用于协方差矩阵,结合MCMC对核超参数进行贝叶斯推断,通过回溯窗口引入自适应性,有效诊断弱信息输入。

详情
AI中文摘要

在输入-输出关系的概率监督学习中(作为高斯过程(GP)的样本函数),通常为核的超参数指定先验,这些超参数参数化GP的协方差函数,其中(所得多元正态)似然的诱导协方差矩阵控制学习和预测。当所寻求的函数高度多元时,必须同时学习多个长度尺度参数,使得推断困难。我们为协方差矩阵开发了一种“自组装”Wishart先验,同时使用MCMC对核超参数进行贝叶斯推断。该构造使用最近MCMC迭代的回溯窗口来定义依赖于时间步长的尺度矩阵,从而为链引入自适应性。结果表明,在基于GP的学习范式中,对协方差矩阵的直接先验指定可用于诊断弱信息输入。我们通过两个不同的实证示例支持我们的先验开发——一个基于合成数据,另一个基于真实世界数据集。

英文摘要

In probabilistic supervised learning of an input-output relationship - as a sample function of a Gaussian Process (GP) - priors are typically specified for the hyperparameters of the kernel that parametrises the covariance function of the GP, where the induced covariance matrix of the (resulting multivariate Normal) likelihood governs the learning and prediction. When the sought function is highly multivariate, multiple lengthscale parameters must be learnt simultaneously, making inference difficult. We develop a "self-assembled" Wishart prior for the covariance matrix, while undertaking Bayesian inference on the kernel hyperparameters using MCMC. The construction uses a look-back window over recent MCMC iterations to define a time-step dependent scale matrix, thereby introducing adaptiveness to the chain. Results suggest that direct prior specification on the covariance matrix can be useful for diagnosing weakly informative inputs within the GP-based learning paradigm. We support our prior development with two distinct empirical illustrations - one on synthetic data, and another on a real-world dataset.

2605.27076 2026-05-27 cs.MA cs.LG

Cost of Structural Learning Under Censored Feedback: A Threshold-Bandit Approach

删失反馈下结构学习的代价:一种阈值-老虎机方法

Michael Ledford, William Regli

发表机构 * University of Maryland, College Park(马里兰大学学院公园分校)

AI总结 针对任务仅当联盟达到未知规模阈值时才产生奖励的删失反馈问题,提出阈值激活合作多臂老虎机模型,并通过集中式算法C-TAC实现O(log T)累积遗憾,以及去中心化事件触发协议D-TAC在保持可行性对齐的同时减少23倍通信。

详情
AI中文摘要

在许多多智能体应用中,任务仅当由满足未知规模阈值的联盟执行时才产生奖励;否则,反馈完全被删失。这种删失造成了可识别性问题:智能体无法区分随机失败与协调不足。我们将此设置形式化为阈值激活合作多臂老虎机(TAC-MAB),并在集中式和去中心化协调下进行分析。我们证明集中式算法(C-TAC)实现了累积遗憾O(log T),该遗憾分解为结构搜索项(捕获在删失反馈下解决可行性的代价)和统计监控项(用于价值估计)。然后我们引入D-TAC,一种去中心化事件触发协议,其中智能体仅在其结构信念改变时进行同步。实验表明,在保守信念融合下,D-TAC相对于集中式基线实现了23倍的通信减少,同时保持了可行性对齐。这些结果刻画了在删失反馈下学习的协调代价,并表明无需持续同步即可实现接近集中式的通信效率。

英文摘要

In many multi-agent applications, tasks yield rewards only when executed by a coalition meeting an unknown size threshold; otherwise, feedback is fully censored. This censorship creates an identifiability problem: agents cannot distinguish stochastic failure from insufficient coordination. We formalize this setting as the Threshold-Activated Cooperative Multi-Armed Bandit (TAC-MAB) and analyze it under both centralized and decentralized coordination. We show that a centralized algorithm (C-TAC) achieves cumulative regret O(log T), decomposed into a structural-search term that captures the cost of resolving feasibility under censored feedback and a statistical-monitoring term for value estimation. We then introduce D-TAC, a decentralized event-triggered protocol in which agents synchronize only when their structural beliefs change. Empirically, D-TAC achieves a 23x reduction in communication relative to the centralized baseline while preserving feasibility alignment under conservative belief fusion. These results characterize the coordination cost of learning under censored feedback and show that near-centralized communication efficiency is achievable without continuous synchronization.
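The identifiability problem described in this abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's C-TAC or D-TAC algorithm: the function names `pull` and `success_rates`, the Bernoulli reward model, and the parameter values (threshold 3, success probability 0.3) are all hypothetical choices made for illustration.

```python
import random

def pull(coalition_size, threshold, success_prob, rng):
    """Return the observed reward for one task attempt (hypothetical model).

    Below the hidden threshold the feedback is censored: the observation is
    always 0, whatever the task's underlying success probability is.
    """
    if coalition_size < threshold:
        return 0  # censored: looks exactly like a stochastic failure
    return 1 if rng.random() < success_prob else 0

def success_rates(threshold, success_prob, max_size, rounds, seed=0):
    """Empirical reward rate for each coalition size 1..max_size."""
    rng = random.Random(seed)
    rates = {}
    for size in range(1, max_size + 1):
        total = sum(pull(size, threshold, success_prob, rng) for _ in range(rounds))
        rates[size] = total / rounds
    return rates

rates = success_rates(threshold=3, success_prob=0.3, max_size=5, rounds=2000)
for size, rate in rates.items():
    print(size, round(rate, 3))
```

A single observation of 0 is ambiguous in this model: it may come from a coalition that is too small or from an unlucky draw with a sufficient coalition. Only repeated trials at a given size separate the two cases, which is the kind of structural-search cost the abstract refers to.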

2605.27051 2026-05-27 cs.SE cs.AI

ConVer: Using Contracts and Loop Invariant Synthesis for Scalable Formal Software Verification

ConVer:使用合约和循环不变式合成实现可扩展的形式化软件验证

Muhammad A. A. Pirzada, Weiqi Wang, Yiannis Charalambous, Konstantin Korovin, Lucas C. Cordeiro

发表机构 * The University of Manchester(曼彻斯特大学)

AI总结 提出一种自上而下的组合验证工具ConVer,利用大语言模型合成函数合约,并通过CEGAR-CEGIS循环迭代精炼合约,以解决大规模C程序形式化验证中的状态空间爆炸问题。

Comments 12 pages; 6 figures

详情
AI中文摘要

大型C程序的形式化验证受到状态空间爆炸的阻碍:有界模型检验(BMC)工具必须通过展开所有嵌套结构来编码整个状态空间直至预定边界。我们提出了ConVer,一种自上而下的组合验证工具。给定一个带有顶层断言的C程序,ConVer自上而下地分解验证:它使用大语言模型(LLM)从系统属性中合成函数合约,然后在CEGAR-CEGIS循环中交替进行系统级和函数级检查,每当检查失败时通过SMART ICE学习精炼合约。我们在四个难度递增的基准测试套件上评估了ConVer,并与其他最先进(SOTA)工具进行了比较。在包含45个简单C程序的Frama-C基准测试中,ConVer在三个LLM后端上实现了82-96%的验证成功率,其中93-95%的收敛程序仅需一次CEGAR-CEGIS迭代。在X.509解析器基准测试(6个程序)和LF2C-Simple套件(17个程序)上,ConVer分别实现了33-50%和82-88%的成功率。在包含11个递归和循环密集型程序的VerifyThis套件上,预抽象策略实现了55-64%的成功率。此外,我们提出了ESBMC-LF,一个预处理工具,它将LF模型转换为C语言,同时保留LF文件的属性,使ConVer能够验证它们。我们使用ESBMC-LF将LF验证器基准测试转换为C语言;我们将这些称为LF-Hard。我们表明,ConVer总体上成功验证了67%的LF-Hard基准测试。

英文摘要

Formal verification of large C programs is impeded by state-space explosion: Bounded Model Checking (BMC) tools must encode the entire state space up to the predetermined bound by unrolling all nested constructs. We present ConVer, a top-down compositional verification tool. Given a C program with a top-level assertion, ConVer decomposes verification top-down: it uses a large language model (LLM) to synthesise function contracts from the system property, then alternates system-level and function-level checks in a CEGAR-CEGIS loop, refining contracts whenever a check fails via SMART ICE learning. We evaluate ConVer on four benchmark suites of increasing difficulty and against other state-of-the-art (SOTA) tools. On the Frama-C benchmark of 45 simple C programs, ConVer achieves 82-96% verification success across three LLM backends, with 93-95% of converged programs requiring only a single CEGAR-CEGIS iteration. On the X.509 parser benchmark (6 programs) and LF2C-Simple suite (17 programs), ConVer achieves 33-50% and 82-88% success respectively. On the VerifyThis suite of 11 recursive and loop-intensive programs, the Pre-Abstraction strategy achieves 55-64% success. In addition, we present ESBMC-LF, a preprocessor tool that converts LF models to C while preserving the properties of the LF files, enabling ConVer to verify them. We transpile the LF Verifier Benchmarks to C using ESBMC-LF; we denote these as LF-Hard. We show that ConVer successfully verifies 67% of LF-Hard benchmarks overall.

2605.27043 2026-05-27 stat.ML cs.LG stat.ME

Causal Representation Learning for Generalisable Recommendation

因果表示学习用于可泛化推荐

Yorgos Felekis, Michael O'Riordan, Oriol Corcoll, Ciarán M. Gilligan-Lee

发表机构 * University of Warwick(沃里克大学) Spotify(Spotify公司) University College London(伦敦大学学院)

AI总结 针对推荐系统中训练分布与部署分布不一致导致的泛化问题,提出基于因果表示学习的信息论解缠标准及其可计算变分下界,仅利用混淆日志即可提升模型在分布偏移下的泛化能力,在Spotify A/B测试、KuaiRand数据集和合成基准上验证了有效性。

详情
AI中文摘要

基于观测数据训练的预测模型在部署时往往无法泛化到所遇到的分布,尤其是当训练数据是被优化系统的产物时。推荐系统是一个典型例子:它们是在被部署策略、过去用户行为和平台过滤混淆的交互日志上训练的。因此,训练分布与在服务时评分的候选分布存在显著差异,这种差距使得离线指标无法可靠预测在线性能。我们通过一种受因果表示学习(CRL)启发的方法来解决分布偏移问题。我们提出了一种信息论解缠标准,并证明其最优值仅取决于输入的因果成分。然后,我们推导出一个可处理的变分下界,使得该标准仅从有限观测数据中即可优化。我们的方法范围比大多数CRL文献更窄,因为我们目标是改善分布偏移下的泛化能力,而非完全识别所有潜在因果因素。这个更窄的目标使得该方法实用,仅需要现有的混淆日志,适用于任何标准监督模型,且不增加推理时间成本。我们的主要评估是在Spotify上对数百万用户进行的A/B测试,应用于个性化播放列表生成的排序器。一个容量匹配的CRL变体在离线性能上相当,但在在线听众参与度上带来了显著提升。在公开的KuaiRand推荐数据集和具有已知因果结构的合成基准上的补充证据显示了相同模式:与基线离线持平,在分布偏移下获得收益。在所有三种设置中,加入我们的因果解缠目标都带来了更有意义的分布外泛化。

英文摘要

Predictive models trained on observational data often fail to generalise to the distributions they encounter when deployed, especially when the training data is a product of the system being optimised. Recommender systems are a canonical example: they are trained on interaction logs confounded by the deployed policy, past user behaviour, and platform filtering. As a result, the training distribution differs substantially from the candidate distribution scored at serving time, a gap that makes offline metrics unreliable predictors of online performance. We address the distribution shift problem with a method motivated by causal representation learning (CRL). We propose an information-theoretic disentanglement criterion and prove that its optimum depends only on the causal components of the input. We then derive a tractable variational lower bound that makes the criterion optimisable from finite observational data alone. The scope of our method is narrower than that of much of the CRL literature, in that we target better generalisation under distribution shift, not full identification of all latent causal factors. This narrower target is what makes the method practical, requiring only the existing confounded logs, applying to any standard supervised model, and adding no inference-time cost. Our headline evaluation is an A/B test with millions of users on Spotify, applied to a production ranker for personalised playlist generation. A capacity-matched CRL variant performed on par offline but delivered substantial online gains in listener engagement. Complementary evidence on the public KuaiRand recommendation dataset and a synthetic benchmark with known causal structure shows the same pattern: offline parity with baseline, gains under distribution shift. Across all three settings, adding our causal disentanglement objective yields meaningfully better out-of-distribution generalisation.

2605.27042 2026-05-27 cs.CR cs.AI

Lessons from Penetration Tests on Large-Scale Agent Systems

大规模智能体系统渗透测试的经验教训

Kevin Eykholt, Dhilung Kirat, Xiaokui Shu, Jiyong Jang, Frederico Araujo, Ian Molloy

发表机构 * IBM Research(IBM研究院)

AI总结 本文通过对2025年专有智能体产品的两次渗透测试,评估了AI智能体的安全态势是否有所改善,并指出许多安全漏洞并非全新,而是反映了先前计算系统中长期存在的重复性弱点类别。

Comments Accepted at SAGAI 2026

详情
AI中文摘要

随着AI系统获得越来越多的自主性和执行能力,发现的安全漏洞数量持续上升。然而,许多这些漏洞并非根本上的新颖,而是反映了先前计算系统中长期观察到的重复性弱点类别。具有执行能力的AI智能体实际上是无限的自修改程序,与计算栈的多个层进行广泛交互。这种广泛的交互表面给开发者带来了显著的安全负担,他们必须推理并保护复杂的跨层行为。先前的研究主要集中在开源智能体和智能体框架中的漏洞。相比之下,专有智能体系统——在更严格的编码标准和正式审查流程下开发——是否表现出类似的安全弱点仍不清楚。在本文中,我们展示了2025年对专有智能体产品进行的两次渗透测试的结果,并评估了自这些评估以来AI智能体的安全态势是否有所改善。

英文摘要

As AI systems gain increasing autonomy and execution capability, the number of discovered security vulnerabilities continues to rise. However, many of these vulnerabilities are not fundamentally novel, but instead reflect recurring classes of weaknesses long observed in prior computing systems. Execution-capable AI agents are effectively unbounded, self-modifying programs that interact extensively with multiple layers of the computing stack. This broad interaction surface imposes a significant security burden on developers, who must reason about and secure complex cross-layer behaviors. Prior research has primarily focused on vulnerabilities in open-source agents and agent frameworks. In contrast, it remains unclear whether proprietary agent systems -- developed under stricter coding standards and formal review processes -- exhibit similar security weaknesses. In this paper, we present findings from two penetration tests conducted in 2025 against proprietary agent products and evaluate whether the security posture of AI agents has improved since these assessments.

2605.27039 2026-05-27 eess.AS cs.SD

Why Can't They Remember? Uncovering Representation and Retrieval Bottlenecks in Multi-Turn Acoustic Memory

为什么它们记不住?揭示多轮声学记忆中的表征和检索瓶颈

Yang Xiao, Siyi Wang, Han Yin, Hong Jia, Vidhyasaharan Sethu, Eun-Jung Holden, Ting Dang

发表机构 * The University of Melbourne(墨尔本大学) KAIST(韩国科学技术院) The University of Auckland(奥克兰大学) UNSW Sydney(新南威尔士大学悉尼分校)

AI总结 本文通过引入EnvMem基准,发现大型音频语言模型在多轮交互中非语音信息记忆失败的主要原因是表征轨迹漂移,而非注意力分配不足。

详情
AI中文摘要

大型音频语言模型(LALMs)处理语音和环境声学线索,但在多轮交互中难以保留非语音信息。语义(语音)和声学(非语音)理解之间的性能差距仍未被充分理解,其表征和检索的底层机制尚不清楚。本文引入EnvMem,一个受控的多轮基准,用于研究这一差距并识别表征(即潜在嵌入)和检索层面(即注意力分配)失败的根源。我们进一步进行事后干预以探究表征结构和注意力动态。我们的结果揭示表征轨迹漂移是关键失败模式,同时表明注意力分配在解释观察到的退化中作用有限。总体而言,我们提供了一个系统框架,用于分析和改进长上下文LALMs中的非语言记忆,为未来鲁棒声学记忆建模的数据和训练设计提供启示。

英文摘要

Large audio language models (LALMs) process both speech and environmental acoustic cues, yet struggle to retain non-speech information across multi-turn interactions. The performance gap between semantic (speech) and acoustic (non-speech) understanding remains poorly understood, and the underlying mechanisms of representation and retrieval are still unclear. This work introduces EnvMem, a controlled multi-turn benchmark designed to study this gap and identify the root causes of failures at the representation (i.e., latent embeddings) and retrieval levels (i.e., attention allocation). We further conduct post-hoc interventions to probe representational structure and attention dynamics. Our results reveal representational trajectory drift as the key failure mode, while showing that attention allocation plays a limited role in explaining the observed degradation. Overall, we provide a systematic framework for analyzing and improving non-linguistic memory in long-context LALMs, shedding light on future data and training design for robust acoustic memory modeling.

2605.27014 2026-05-27 cs.LO cs.AI

ReasonOps: A Unified Operational Paradigm for Trustworthy Verified LLM Reasoning

ReasonOps: 可信验证的LLM推理的统一操作范式

Adnan Rashid

发表机构 * School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology(国立科技大学电气工程与计算机科学学院(SEECS))

AI总结 本文提出ReasonOps,一种将推理视为持续监控、可验证、可靠性感知的操作过程的统一范式,整合语义解释、自动形式化、符号推理、定理证明、运行时保证、概率可靠性估计和自适应修正,以解决当前LLM推理中的逻辑不一致、幻觉符号转换等问题。

Comments 5 Pages

详情
AI中文摘要

大型语言模型(LLM)已将人工智能从主要生成系统转变为日益强大的推理代理。最近在定理证明、自动形式化、符号推理和工具增强语言模型方面的进展表明,在机器辅助形式推理方面取得了实质性进展。然而,当前的推理系统仍然存在隐藏的逻辑不一致、幻觉符号转换、无支持的定理应用以及有限可靠性保证。现有方法在形式验证、运行时保证、神经符号推理和可信人工智能(AI)研究社区之间仍然分散。本文介绍了ReasonOps,一种用于可信验证推理系统的统一操作范式。受DevOps和MLOps等操作生态系统的启发,ReasonOps将推理视为一个持续监控、可验证、可靠性感知的操作过程,而不是一个孤立的推理任务。所提出的范式将语义解释、自动形式化、符号推理、定理证明、运行时保证、概率可靠性估计和自适应修正整合到一个统一的推理生命周期中。本文进一步介绍了ReasonOps架构,使用自主制动系统分析示例演示了其工作流程,并讨论了其在未来安全关键自主AI系统中的潜在作用。我们认为,像ReasonOps这样的操作推理范式可能成为下一代可信AI生态系统的基础设施。

英文摘要

Large Language Models (LLMs) have transformed artificial intelligence from primarily generative systems into increasingly capable reasoning agents. Recent advances in theorem proving, autoformalization, symbolic reasoning, and tool-augmented language models demonstrate substantial progress toward machine-assisted formal reasoning. However, current reasoning systems still suffer from hidden logical inconsistencies, hallucinated symbolic transitions, unsupported theorem applications, and limited reliability guarantees. Existing approaches remain fragmented across formal verification, runtime assurance, neuro-symbolic reasoning and trustworthy Artificial Intelligence (AI) research communities. This paper introduces ReasonOps, a unified operational paradigm for trustworthy verified reasoning systems. Inspired by operational ecosystems such as DevOps and MLOps, ReasonOps treats reasoning as a continuously monitored, verifiable, reliability-aware operational process rather than an isolated inference task. The proposed paradigm integrates semantic interpretation, autoformalization, symbolic reasoning, theorem proving, runtime assurance, probabilistic reliability estimation, and adaptive correction into a unified reasoning lifecycle. The paper further presents the ReasonOps architecture, demonstrates its workflow using an autonomous braking system analysis example, and discusses its potential role in future safety-critical autonomous AI systems. We argue that operational reasoning paradigms such as ReasonOps may become foundational infrastructure for next-generation trustworthy AI ecosystems.

2605.26990 2026-05-27 stat.ML cs.LG

Constrained Bayesian Experimental Design via Online Planning

通过在线规划的约束贝叶斯实验设计

Yujia Guo, Daolang Huang, Xinyu Zhang, Sammie Katt, Samuel Kaski, Ayush Bharti

发表机构 * ELLIS Institute Finland(芬兰ELLIS研究所) Department of Computer Science, Aalto University, Finland(芬兰阿尔托大学计算机科学系) Department of Computer Science, University of Manchester, UK(英国曼彻斯特大学计算机科学系)

AI总结 提出一种结合离线预训练摊销策略和后验网络与在线多步前瞻规划(场景树)的方法,以在动态约束下优化贝叶斯实验设计,相比现有方法获得更优信息序列且计算开销适中。

Comments 24 pages, 9 figures. Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

贝叶斯实验设计(BED)是一个用于数据高效顺序实验设计的理论框架。然而,现有的BED方法无法适应实际任务中由于预算限制、成本变化或物理约束(限制设计随时间演化)而产生的动态约束。在本文中,我们介绍了一种新的BED方法,通过将离线预训练的摊销策略和后验网络与使用场景树的在线多步前瞻规划相结合,实现了实验设计的约束优化。我们通过实验证明,在多种约束BED任务中,我们的方法相比现有方法产生了更信息丰富的设计序列,同时仅增加了适度的额外计算开销。

英文摘要

Bayesian experimental design (BED) is a principled framework for data-efficient design of sequential experiments. However, existing BED methods are unable to adapt to dynamic constraints inherent in real-world tasks due to budget limitations, varying costs, or physical constraints that restrict how designs evolve over time. In this paper, we introduce a novel approach to BED that enables constrained optimization of experimental designs by combining offline pre-training of an amortized policy and a posterior network with online multi-step lookahead planning using scenario trees. We empirically demonstrate that our method yields substantially more informative design sequences than existing methods across a range of constrained BED tasks, while incurring only a modest additional computational overhead.

2605.26973 2026-05-27 stat.ML cond-mat.dis-nn cs.LG cs.NE q-bio.NC

Signal-to-Noise Ratio and Sample Size Govern Representational Alignment in Neural Networks

信噪比与样本量控制神经网络中的表征对齐

Ali Hussaini Umar, Alessandro Laio

发表机构 * SISSA Trieste, Italy(意大利特里斯特SISSA研究所) Theoretical and Scientific Data Science (TSDS) group at the International School for Advanced Studies (SISSA)(国际先进研究学院(SISSA)理论与科学数据科学(TSDS)小组) Condensed Matter and Statistical Physics section at the International Centre for Theoretical Physics (ICTP)(国际理论物理中心(ICTP)凝聚态与统计物理部门)

AI总结 通过理论和实验证明,信噪比和训练样本量以单调和非单调方式分别影响神经网络表征对齐,且对齐程度在插值阈值附近最小,与泛化误差解耦。

详情
AI中文摘要

已知神经网络会发展出潜在表征,这些表征是“对齐”的,即在不同架构、训练协议或训练数据集训练的网络之间结构相似。我们在一个受控环境中研究这一现象,使用被噪声过程的独立实现扰动的训练集,训练一组网络执行回归和分类任务。我们表明,信噪比(SNR)和训练样本量以定性相似的方式影响对齐,无论是在真实世界数据集上训练的网络,还是在极其简单的具有单个隐藏层的线性网络中(其对齐可以解析估计)。在线性和非线性网络、回归和分类任务以及合成和真实数据中,我们一致观察到,对齐随SNR单调变化,但随训练样本量非单调变化。特别地,对齐在插值阈值附近最小,且更强的对齐不一定对应更好的泛化误差。这些发现揭示了数据质量和数量对对齐的非平凡依赖关系,且与泛化性能解耦。

英文摘要

Neural networks are known to develop latent representations that are "aligned", namely structurally similar across networks trained with different architectures, training protocols, or training datasets. We study this phenomenon in a controlled setting, where we train an ensemble of networks on regression and classification tasks using training sets perturbed by independent realizations of a noise process. We show that the signal-to-noise ratio (SNR) and the training sample size influence the alignment in qualitatively similar ways in networks trained on real-world datasets and in an extremely simple linear network with a single hidden layer, for which the alignment can be estimated analytically. Across linear and nonlinear networks, regression and classification tasks, and both synthetic and real-world data, we consistently observe that alignment varies monotonically with SNR but non-monotonically with training sample size. In particular, the alignment is minimized near the interpolation threshold, and a stronger alignment does not necessarily correspond to better generalization error. These findings reveal a non-trivial dependence of alignment on data quality and quantity, decoupled from generalization performance.

2605.26925 2026-05-27 quant-ph cs.LG

Adaptive Reinforcement Learning for Robust Open Quantum System Control: A Multi-Task Framework with Temporal Optimization

自适应强化学习用于鲁棒开放量子系统控制:一种带有时间优化的多任务框架

Haftu W. Fentaw, Steve Campbell, Simon Caton

发表机构 * Centre for Quantum Engineering, Science, and Technology, University College Dublin(都柏林大学量子工程、科学与技术中心)

AI总结 提出一种多任务软演员-评论家(SAC)强化学习框架,用于开放量子系统控制,同时学习最优脉冲序列并发现特定问题的演化时间T和控制脉冲段数N,在51种哈密顿量变化下实现高保真度状态转移,并展现出优于GRAPE的鲁棒性。

详情
AI中文摘要

我们提出了一种多任务软演员-评论家(SAC)强化学习框架,用于跨不同哈密顿量的开放系统量子控制,该框架学习最优脉冲序列,同时发现特定问题的演化时间T和控制脉冲段数N。在51种哈密顿量变化上的实验结果表明,多任务SAC模型能够生成控制脉冲,在环境噪声下将系统从初始状态驱动到目标状态,并具有高保真度,为适用于实际噪声量子器件的通用量子控制奠定了必要基础。通过逐步扩展训练哈密顿量集,我们研究了使用给定数量样本哈密顿量训练的单个多任务模型是否能够成功完成来自同一哈密顿量空间但训练中未遇到的哈密顿量的状态转移任务。此外,我们的鲁棒性不保真度度量(RIM)分析表明,与GRAPE优化的控制相比,SAC训练的策略对脉冲幅度扰动和退相干率变化表现出更优越的鲁棒性。

英文摘要

We present a Multi-task Soft Actor-Critic (SAC) Reinforcement Learning framework designed for open-system quantum control across diverse Hamiltonians, which learns optimal pulse sequences while simultaneously discovering problem-specific evolution time T and number of control pulse segments N. Experimental results across 51 Hamiltonian variations demonstrate that the multi-task SAC model is able to generate control pulses that can drive a system, under environment noise, from its initial state to its target state with high fidelities, establishing essential foundations for universal quantum control applicable to realistic noisy quantum devices. Through progressive expansion of the training Hamiltonian set, we investigate whether a single multi-task model trained using a given number of sample Hamiltonians can successfully accomplish state-transfer tasks for Hamiltonians drawn from the same Hamiltonian space but not encountered during training. In addition, our Robustness Infidelity Measure (RIM) analysis reveals that SAC-trained policies exhibit superior robustness to pulse amplitude perturbations and decoherence rate variations compared to GRAPE-optimized controls.

2605.26898 2026-05-27 cs.SE cs.AI

Strategies for Guiding LLMs to Use Software Design Patterns: A Case of Singleton

引导LLM使用软件设计模式的策略:以单例模式为例

Viktor Kjellberg, Farnaz Fotrousi, Miroslaw Staron

发表机构 * University of Gothenburg and Chalmers University of Technology(哥德堡大学和查尔姆斯理工大学)

AI总结 通过实验比较四种提示策略(指令、二元自动反馈、详细自动反馈、少样本详细反馈),评估13个LLM在164个Java编码挑战中生成遵循单例模式的代码的能力,发现迭代二元反馈在保持或提升功能性的同时最佳地实现了单例模式对齐。

Comments Accepted at PROMISE 2026

详情
AI中文摘要

大型语言模型(LLM)可以从自然语言提示生成功能性源代码,但往往无法一致地遵循更高级别的架构结构或设计模式。由于LLM在软件工程中的应用日益增多,它们将既定设计原则应用于生成代码的能力对于软件产品的长期成功至关重要。因此,本文的目标是确定引导LLM将设计模式融入生成源代码的策略。我们设计了一个计算实验,评估13个LLM生成遵循单例设计模式的代码的能力,使用了四种提示策略:指令、二元自动反馈、详细自动反馈以及带少样本提示的详细反馈,在HumanEval-X的164个Java编码挑战中进行。我们的结果表明,引导LLM包含设计模式的最佳策略在很大程度上取决于模型类型。尽管如此,总体而言,迭代二元反馈在保持或改善代码功能性的同时,提供了与单例模式的最佳对齐。通过指令引导,Llama 3.3在100%的情况下生成了单例类,并改善了代码功能性,使通过的测试数量增加了34.1个百分点。通过指令和二元反馈引导,它取得了类似的结果。Qwen 3(8B)使用二元反馈将单例模式对齐度提高到99.2%,功能性提高到58.6%。我们的结果表明,即使是简单的策略也可以用于引导LLM使用设计模式。

英文摘要

Large Language Models (LLMs) can generate functional source code from natural-language prompts, but often fail to consistently follow higher-level architectural structures or design patterns. Since LLMs are increasingly used in software engineering, their ability to apply established design principles to generated code is crucial to the long-term success of software products. Therefore, the goal of this paper is to identify strategies for guiding LLMs to incorporate design patterns into the generated source code. We designed a computational experiment to evaluate the ability of 13 LLMs to generate code that follows the Singleton design pattern, using four prompting strategies: instructions, binary automated feedback, extensive automated feedback, and extensive feedback with few-shot prompts, on 164 Java coding challenges from HumanEval-X. Our results show that the optimal strategy to guide LLMs to include design patterns depends heavily on the type of model. Still, overall, iterative binary feedback provides the best alignment with Singleton while preserving or improving the code's functionality. When guided with instructions, Llama 3.3 generated Singleton classes in 100% of cases and improved code functionality, increasing the number of tests passed by 34.1 percentage points. It achieved a similar result with guidance through instructions and binary feedback. Qwen 3 (8B) increased the alignment with Singleton to 99.2% and the functionality to 58.6% using binary feedback. Our results suggest that even simple strategies can be used to guide LLMs to use design patterns.

2605.26886 2026-05-27 cs.DS cs.LG

Parsimonious Learning-Augmented Online Metric Matching

简约学习增强的在线度量匹配

Yongho Shin, Phanu Vajanopath

发表机构 * Institute of Computer Science, University of Wrocław, Wrocław, Poland(波兰弗罗茨瓦夫大学计算机科学研究所)

AI总结 针对在线度量匹配问题,提出一种简约学习增强算法,通过虚拟预测填补缺失预测,并建立性能下界,实验验证了其有效性。

Comments To appear in ICML 2026

详情
AI中文摘要

近年来,学习增强算法受到了广泛关注,尤其是在在线优化领域。由于生成预测的高计算成本,越来越多的研究关注于学习增强算法中性能保证与预测使用数量之间的权衡,例如缓存和度量任务系统问题。在本文中,我们将这一研究方向扩展到在线度量匹配,开发了简约学习增强算法并建立了其性能下界。我们的方法将“跟随预测”框架扩展到简约设置,通过在缺乏实际预测时使用一种在线度量匹配算法来填充虚拟预测,该算法在执行过程中保持良好中间匹配。我们通过实证评估补充了理论结果,证明了我们方法的实际有效性。

英文摘要

Learning-augmented algorithms have received significant attention in recent years, particularly in the context of online optimization. Motivated by the high computational cost of generating predictions, a growing line of work studies the tradeoff between performance guarantees and the number of predictions used in learning-augmented algorithms for problems such as caching and metrical task systems. In this paper, we extend this line of research to online metric matching by developing parsimonious learning-augmented algorithms and establishing lower bounds on their performance. Our approach extends the Follow-the-Prediction framework to the parsimonious setting by filling in a virtual prediction in the absence of an actual prediction, using an online metric matching algorithm that maintains good intermediate matchings throughout its execution. We complement our theoretical results with an empirical evaluation, demonstrating the practical effectiveness of our approach.

2605.26870 2026-05-27 cs.MA cs.AI cs.HC

Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study

学术研究中的持久性AI智能体:单研究者实施案例研究

Anas H. Alzahrani

发表机构 * Department of Preventive Medicine and Public Health, Faculty of Medicine, King Abdulaziz University(预防医学与公共卫生系,医学院,国王阿卜杜勒阿齐兹大学)

AI总结 通过单研究者案例研究,分析了持久性AI智能体在真实学术环境中的架构、使用、产出和治理,发现缓存主导的工作流可能将经济单位从每token成本转向每完成工件成本。

Comments 19 pages, 2 figures, 3 main tables; supplementary appendix with 6 tables, 2 figures, and a reproducibility methods section. Describes 17 configured agents in a persistent research environment and introduces the PARE-M (Persistent Agentic Research Environment Measurement) framework

详情
AI中文摘要

背景:大型语言模型通常作为模型、基准或简短对话片段进行评估。当智能体持久嵌入真实学术研究环境,具有持久记忆、本地文件、外部工具、计划例程、委派角色和明确安全协议时,会发生什么知之甚少。方法:从2026年1月31日至5月25日进行了一项结构化自我观察的实施案例研究。分析单元是持久的人-智能体环境:研究者、智能体运行时、记忆层、工具、仓库、计划任务、专门智能体角色和治理规则。结果使用PARE-M(持久智能体研究环境测量)组织,这是一个涵盖架构、利用、工件生产、资源使用、可重复性和治理的测量框架。结果:可恢复的主智能体遥测包含96个活跃日中的75,671条去重记录,其中8,059条用户角色消息和23,710条助手角色消息。工作空间包括502个记忆相关文件、17个配置的智能体目录和57个技能文件。活跃系统时间为579.7小时(30分钟上限间隙估计)。记忆衍生记录识别出482个输出代理事件和889个失败、验证、纠正或协议代理事件。一个严格的2026年5月轨迹子集捕获了627个模型完成事件和73.95百万记录token,其中82.9%为缓存读取。结论:工作流以缓存为主导,表明持久智能体环境可能将经济单位从每token成本转向每完成工件成本。未来评估应使用工件级分母、可重复解析规则、纠正分类法和治理事件的独立编码。

英文摘要

Background: Large language models are typically evaluated as models, benchmarks, or short conversational episodes. Less is known about what happens when an agent is embedded persistently in a real academic research environment with durable memory, local files, external tools, scheduled routines, delegated roles, and explicit safety protocols. Methods: A structured self-observed implementation case study was conducted from January 31 to May 25, 2026. The unit of analysis was the persistent human-agent environment: researcher, agent runtime, memory layer, tools, repositories, scheduled jobs, specialized agent roles, and governance rules. Outcomes were organized using PARE-M (Persistent Agentic Research Environment Measurement), a measurement framework covering architecture, utilization, artifact production, resource use, reproducibility, and governance. Results: Recoverable main-agent telemetry contained 75,671 de-duplicated records across 96 active days, with 8,059 user-role and 23,710 assistant-role messages. The workspace included 502 memory-related files, 17 configured agent directories, and 57 skill files. Active system time was 579.7 hours (30-minute capped-gap estimate). Memory-derived records identified 482 output-proxy events and 889 failure, verification, correction, or protocol-proxy events. A strict May 2026 trajectory subset captured 627 model-completed events and 73.95 million recorded tokens, of which 82.9% were cache reads. Conclusions: The workflow was cache-dominant, suggesting that persistent agentic environments may shift the economic unit from cost per token to cost per completed artifact. Future evaluations should use artifact-level denominators, reproducible parsing rules, correction taxonomies, and independent coding of governance events.
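The reported token figures can be unpacked with a few lines of arithmetic. This is a sketch that uses only numbers stated in the abstract; the variable names are illustrative. Two caveats are assumptions of this sketch and not claims from the paper: token and event counts come from the May 2026 subset while hours and active days cover the whole study period, so the two groups are kept separate, and tokens per model-completed event is only a rough proxy, since the abstract gives neither prices nor a count of completed artifacts.

```python
# Figures from the May 2026 trajectory subset
total_tokens = 73.95e6       # recorded tokens
cache_read_share = 0.829     # fraction of tokens that were cache reads
completed_events = 627       # model-completed events

# Figures from the full observation period
active_hours = 579.7         # 30-minute capped-gap estimate
active_days = 96

cache_read_tokens = total_tokens * cache_read_share
non_cache_tokens = total_tokens - cache_read_tokens
tokens_per_event = total_tokens / completed_events
hours_per_active_day = active_hours / active_days

print(f"cache-read tokens:    {cache_read_tokens:,.0f}")
print(f"non-cache tokens:     {non_cache_tokens:,.0f}")
print(f"tokens per event:     {tokens_per_event:,.0f}")
print(f"hours per active day: {hours_per_active_day:.2f}")
```

Roughly 61.3 million of the 73.95 million recorded tokens are cache reads, leaving about 12.6 million non-cache tokens, which is the sense in which the workflow is described as cache-dominant.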

2605.26856 2026-05-27 q-bio.NC cs.AI cs.RO

The Sensation Modulating Network: Haltability as the architectural ground for object-directed phenomenology

感觉调节网络:可停性作为对象导向现象学的架构基础

G. Nagarjuna, Durgaprasad Karnam

发表机构 * Indian Institute of Science Education and Research (IISER) Pune(印度科学教育与研究学院(IISER)浦那) Centre for Educational Technology (CET), Indian Institute of Technology Bombay(教育技术中心(CET),印度理工学院孟买)

AI总结 本文提出感觉调节网络(SMN)作为具身认知的架构,通过对手动力学和可停性机制,将对象导向现象学(胡塞尔意义)的意向性建立在身体组织的结构特征上,从而调和认知主义与4E认知的争论。

Comments 64 pages, main body 38 pages + References 6, Appendices 20 pages, Tables 3, and Figures 21

详情
AI中文摘要

认知科学仍然分裂为认知主义——它解释了递归和语言,但无法将形式符号扎根于意义——和4E方法——它将认知扎根于身体,但很少详细说明身体的架构以支持生成性。我们认为这一僵局源于对具身代理架构的不完整描述,并提出一个架构:感觉调节网络(SMN),即认知代理被构想为整个身体,在每个解剖尺度上由对手动力学组织,由感觉调节器构建,这些调节器通过一个基底感知和行动,配对成协调动作区,由全身广播网络路由。三个承诺赋予了SMN其效力。可停性——将对抗性可供性招募到共激活平衡中——提供了对象导向现象学(在胡塞尔意义上)所需的架构位置:对手性使得共激活成为可能,共激活使得停止成为可能,停止使得注意成为可能,注意使得意向指向成为可能,而无需在顶层添加任何模块。可自我调节动作模式(SMAP)的双信号特性使得自我/世界区分成为布线的结构特征,而非代理应用的范畴。四级动作模式层级——基础、可停、可协商、交易——提供了从自主规律性到公共惯例化的单一轨迹,将基于语法的生成性条件定位为架构转变。SMN调和了认知主义与4E的争论:递归存在于可协商动作模式的可修改动力学中,具身性存在于支持它们的对手基底中。附录中给出了一个初步的形式化方法和八个预测寄存器(七个可测试,一个假设性),以及参考模拟。

英文摘要

Cognitive science remains split between cognitivism - which accounts for recursion and language but cannot ground formal symbols in meaning - and 4E approaches - which ground cognition in the body but rarely specify the body's architecture in enough detail to support generativity. We argue the impasse stems from an incomplete account of the embodied agent's architecture, and propose one: the Sensation Modulating Network (SMN), the cognitive agent conceived as the whole body, organized at every anatomical scale by opponent dynamics, built from Sensation Modulators that sense and act through one substrate, paired into Coordinated Action Zones routed by a body-wide broadcast network. Three commitments give the SMN its purchase. Haltability - the recruitment of antagonistic affordance into co-activated equilibrium - provides the architectural locus that object-directed phenomenology, in Husserl's sense, requires: opponency enables co-activation, co-activation enables halt, halt enables attention, attention enables intentional directedness, with no module added on top. The dual-signal property of self-modulatable action patterns (SMAPs) makes the self/world distinction a structural feature of the wiring rather than a category the agent applies. And a four-level action-pattern hierarchy - Basal, Haltable, Negotiable, Transactional - gives a single trajectory from autonomic regularity to public conventionalization, locating the conditions for grammar-grounded generativity as architectural transitions. The SMN reconciles the cognitivism-4E debate: recursion lives in the modifiable dynamics of Negotiable Action Patterns, embodiment in the opponent substrate that supports them. A tentative formalism and eight predicted registers (seven testable, one hypothetical), with reference simulations, are given in an appendix.

2605.26821 2026-05-27 hep-ph cs.LG hep-ex

Particle-Lund Multimodality in Jet Taggers

喷注标记器中的粒子-Lund多模态

Loukas Gouskos, Benedikt Maier

发表机构 * Brown University(布朗大学) Imperial College of Science, Technology and Medicine(帝国理工学院科学、技术与医学学院)

AI总结 提出PLuM多模态架构,联合处理粒子成分与Lund平面分裂,通过交叉注意力机制研究显式QCD层次结构是否补充原始粒子表示,发现对顶夸克和H→bb标记有系统性提升,在HH(4b)分析中背景抑制提高25%。

详情
AI中文摘要

Lund平面提供了喷注内QCD辐射的物理动机层次表示,而基于变换器的标记器通过直接从原始粒子成分及其成对关系中学习达到了最先进的性能。我们研究变换器是否从成分级输入隐式捕获层次QCD结构,或者显式物理表示是否仍然具有互补性。为了测试这一点,我们引入了PLuM,一种多模态架构,将粒子成分和Lund平面分裂投影到共享潜在空间,并用统一变换器联合处理两者。交叉注意力允许模型探测结构化QCD信息是否提供了超出粒子单独编码的区分能力。我们观察到顶夸克和H→bb标记的系统性增益,而在H→cc或H→4q拓扑中没有发现可比改进。这种选择性增强表明,即使在高度表达性的架构中,关于b喷注形成的显式层次信息仍然与原始粒子表示互补,而其他拓扑已经在成分级被很好地捕获。对于高影响LHC分析,如洛伦兹增强的双希格斯玻色子搜索中的四b夸克末态(HH(4b)),增益显著:在25%的双希格斯效率工作点,PLuM的背景抑制比基线高25%。我们的结果表明,在变换器时代,QCD辐射的物理结构化表示仍然保留区分价值,激励进一步研究深度学习算法如何编码喷注动力学的不同方面。

英文摘要

The Lund plane offers a physics-motivated, hierarchical representation of QCD radiation within jets, while transformer-based taggers have reached state-of-the-art performance by learning directly from raw particle constituents and their pairwise relations. We investigate whether transformers implicitly capture hierarchical QCD structure from constituent-level inputs, or whether explicit physics representations remain complementary. To test this, we introduce PLuM, a multimodal architecture that projects particle constituents and Lund plane splittings into a shared latent space, processing both jointly with a unified transformer. Cross-attention allows the model to probe whether structured QCD information provides discriminating power beyond what particles alone encode. We observe systematic gains for top-quark and $\mathrm{H}\to\mathrm{b}\bar{\mathrm{b}}$ tagging, while finding no comparable improvement for $\mathrm{H}\to\mathrm{c}\bar{\mathrm{c}}$ or $\mathrm{H}\to 4\mathrm{q}$ topologies. This selective enhancement suggests that explicit hierarchical information about b-jet formation remains complementary to raw particle representations even in highly expressive architectures, while other topologies are already well-captured at constituent level. For high-impact LHC analyses such as Lorentz-boosted di-Higgs searches in the four $\mathrm{b}$ quark final state ($\mathrm{H}\mathrm{H}(4\mathrm{b})$), the gains are substantial: at a $25\%$ di-Higgs efficiency working point, PLuM achieves $25\%$ higher background rejection than the baseline. Our results indicate that physically structured representations of QCD radiation retain discriminating value in the transformer era, motivating further study into how different aspects of jet dynamics are encoded by deep learning algorithms.
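The headline comparison ("25% higher background rejection at a 25% efficiency working point") can be made concrete with a small sketch. It assumes the common jet-tagging convention that background rejection is the inverse of the background efficiency at the score threshold that retains a fixed fraction of signal; the abstract does not spell out its definition, so this convention, the function name `background_rejection`, and the toy scores below are all assumptions made for illustration, not data from the paper.

```python
def background_rejection(signal_scores, background_scores, signal_efficiency):
    """Return 1 / (background efficiency) at a fixed signal efficiency.

    The threshold is the lowest score among the top `signal_efficiency`
    fraction of signal jets; background jets at or above it count as passing.
    """
    ranked = sorted(signal_scores, reverse=True)
    n_keep = max(1, round(signal_efficiency * len(ranked)))
    threshold = ranked[n_keep - 1]
    n_bkg_pass = sum(1 for score in background_scores if score >= threshold)
    if n_bkg_pass == 0:
        return float("inf")  # no background survives the cut
    return len(background_scores) / n_bkg_pass

# Toy tagger scores, invented purely to illustrate the arithmetic
signal = [0.9, 0.8, 0.7, 0.6]
baseline_background = [0.95] * 5 + [0.1] * 15   # 5 of 20 pass the cut
improved_background = [0.95] * 4 + [0.1] * 16   # 4 of 20 pass the cut

r_base = background_rejection(signal, baseline_background, 0.25)
r_new = background_rejection(signal, improved_background, 0.25)
relative_gain = r_new / r_base - 1
print(r_base, r_new, relative_gain)
```

In this toy case the rejection rises from 4 to 5, a relative gain of 25% at the same signal efficiency, which is the kind of ratio such a comparison expresses.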