arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 3405
2605.24421 2026-05-26 cs.CR cs.LG

Poisoning the Watchtower: Prompt Injection Attacks Against LLM-Augmented Security Operations Through Adversarial Log Content

毒害瞭望塔:通过对抗性日志内容对LLM增强的安全运营进行提示注入攻击

Rohan Pandey, Archit Bhujang

发表机构 * DigitalOcean Arizona State University(亚利桑那州立大学)

AI总结 研究攻击者控制的日志字段如何向LLM注入指令(日志基底提示注入),提出四类攻击分类,并评估不同防御下的攻击成功率。

Comments 10 pages

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作安全运营中心(SOC)的分析师助手,它们接收日志和警报数据,生成分类标签、事件摘要或修复建议。我们研究了这种设计的一个结构性故障模式:许多日志字段是攻击者控制的。用户代理、URL、有效载荷、DNS查询和尝试的用户名因此可以将指令与入侵证据一起传递给模型。我们将这种设置称为\emph{日志基底提示注入}。我们引入了日志基底攻击的四类分类法:直接覆盖(S1)、角色劫持(S2)、上下文操纵(S3)和混淆有效载荷(S4)。我们使用 exttt{gpt-4o-mini}作为分析师评估了48种策略-防御-任务组合。三个发现突出。首先,直接覆盖在我们的设置中无效:所有S1分类攻击实现了0%的抑制。相比之下,角色劫持在朴素分类器下抑制了68%的恶意日志,并且在更强的防御下仍然有效。其次,摘要是最高风险的任务:上下文操纵在没有防御的情况下达到96%的注入成功率,即使在受限输出下也达到38%。第三,防御减少但未消除攻击面:平均注入成功率从朴素提示下的26.6%下降到我们最强防御下的11.8%。我们还将实证结果与确定性模拟分析师进行比较,发现模拟显著错误预测了当前模型行为,尤其是对于直接覆盖。这些结果表明,SOC副驾驶应将原始日志内容视为对抗性输入,而非普通分析师上下文。

英文摘要

Large language models (LLMs) are increasingly used as analyst assistants in security operations centers (SOCs), where they ingest log and alert data to produce triage labels, incident summaries, or remediation advice. We study a structural failure mode of this design: many log fields are attacker controlled. User agents, URLs, payloads, DNS queries, and attempted usernames can therefore carry instructions to the model alongside evidence of the intrusion. We call this setting \emph{log-substrate prompt injection}. We introduce a four-class taxonomy of log-substrate attacks: direct override (S1), persona hijack (S2), context manipulation (S3), and obfuscated payloads (S4). We evaluate 48 strategy-defense-task combinations using \texttt{gpt-4o-mini} as the analyst. Three findings stand out. First, direct overrides are ineffective in our setting: all S1 classification attacks achieve 0\% suppression. In contrast, persona hijacks suppress 68\% of malicious logs under a naive classifier and remain effective under stronger defenses. Second, summarization is the highest-risk task: context manipulation reaches 96\% injection success without defenses and 38\% even with constrained output. Third, defenses reduce but do not eliminate the attack surface: average injection success falls from 26.6\% under naive prompting to 11.8\% under our strongest defense. We also compare empirical results to a deterministic mock analyst and find that simulation substantially mispredicts current model behavior, especially for direct overrides. These results suggest that SOC copilots should treat raw log content as adversarial input rather than ordinary analyst context.

2605.24386 2026-05-26 quant-ph cond-mat.stat-mech cs.DS cs.LG

Fermi-Dirac machines as quantizations of neurons

费米-狄拉克机作为神经元的量子化

Alexander He, Nana Liu, Mark M. Wilde

发表机构 * Department of Physics, Cornell University(康奈尔大学物理系) Institute of Natural Sciences, Shanghai Jiao Tong University(上海交通大学自然科学研究院) School of Mathematical Sciences, Shanghai Jiao Tong University(上海交通大学数学科学学院) Ministry of Education Key Laboratory in Scientific and Engineering Computing, Shanghai Jiao Tong University(上海交通大学教育部科学与工程计算重点实验室) Global College, Shanghai Jiao Tong University(上海交通大学全球学院) School of Electrical and Computer Engineering, Cornell University(康奈尔大学电气与计算机工程学院)

AI总结 本文将费米-狄拉克机重新解释为经典神经元的正则量子化,通过用算子替换经典变量,开发了高效混合量子-经典算法来评估和训练量子化神经元,并证明了基于费米-狄拉克神经元的计算决策问题是BQP完全的。

Comments 87 pages, 12 figures, 2 tables

详情
AI中文摘要

费米-狄拉克机最近被提出作为在量子计算机上解决半定优化问题的一种方法。在这里,我们将其重新解释为经典神经元的正则量子化。通过将经典神经元视为应用于参数化经典哈密顿量的激活函数,我们通过用算子替换经典变量来量子化该模型,这些算子的特征值编码了它们的可能值。这遵循了量子力学中正则量子化的标准方法。关键的是,当哈密顿量由对易算子组成时,我们的构造精确地简化为经典神经元。更一般地,我们的方法产生了一个激活可观测量,定义为应用于参数化量子哈密顿量的激活函数。这个量子化神经元的输出是一个随机变量,其期望值等于激活可观测量相对于输入状态的期望值。我们开发了高效的混合量子-经典算法来评估和训练我们的量子化神经元的输出和梯度,从而实现评估和训练。这些算法依赖于基本原语,包括随机采样、哈密顿量模拟和Hadamard测试。我们还量子化了其他一系列激活函数,包括平滑修正线性单元(ReLU)、sigmoid线性单元、高斯平滑ReLU和高斯误差线性单元(GeLU),这些已知对深度学习应用有用。数值实验表明,基于量子哈密顿量的神经元可以学习经典神经元无法学习的函数。我们进一步基于费米-狄拉克神经元定义了一个计算决策问题,并证明它是BQP完全的,提供了反对有效经典模拟的复杂性理论证据。最后,我们将我们的方法推广到连续量子变量,并勾画了将这些神经元组合成网络的两种不同方式。

英文摘要

Fermi-Dirac machines were proposed recently as an approach to solving semidefinite optimization problems on quantum computers. Here, we reinterpret them as canonical quantizations of classical neurons. By viewing a classical neuron as an activation function applied to a parameterized classical Hamiltonian, we quantize this model by replacing classical variables with operators whose eigenvalues encode their possible values. This follows the standard approach to canonical quantization in quantum mechanics. Crucially, when the Hamiltonian consists of commuting operators, our construction reduces exactly to a classical neuron. More generally, our approach yields an activation observable, defined as an activation function applied to a parameterized quantum Hamiltonian. The output of this quantized neuron is a random variable with expectation value equal to that of the activation observable with respect to an input state. We develop efficient hybrid quantum-classical algorithms for evaluating outputs and gradients of our quantized neurons, enabling evaluation and training. These algorithms rely on basic primitives that include random sampling, Hamiltonian simulation, and the Hadamard test. We also quantize a whole host of other activation functions, including the smooth rectified linear unit (ReLU), sigmoid linear unit, Gaussian-smoothed ReLU, and Gaussian error linear unit (GeLU), which are known to be useful for deep learning applications. Numerical experiments indicate that neurons based on quantum Hamiltonians can learn functions that classical neurons cannot. We further define a computational decision problem based on Fermi-Dirac neurons and prove that it is BQP-complete, providing complexity-theoretic evidence against efficient classical simulation. Finally, we generalize our approach to continuous quantum variables and sketch two different ways of composing these neurons into networks.

2605.24364 2026-05-26 stat.ML cs.LG

Multicalibration Boosting: Theory, Convergence, and Transferability

多校准提升:理论、收敛性和可迁移性

Hanxuan Ye, Hongzhe Li

发表机构 * Department of Biostatistics and Epidemiology(生物统计学与流行病学系) University of Pennsylvania(宾夕法尼亚大学)

AI总结 本文统一了多校准提升(MCBoost)的理论框架,揭示了校准-风险权衡,并建立了在弱假设下的收敛率、有限样本保证和协变量偏移下的迁移性保证。

详情
AI中文摘要

多校准通过要求预测在一组丰富的函数(包括预测切片和子群体)上无偏,扩展了经典校准。它已成为公平性、鲁棒性和可靠预测的强大框架,然而多校准提升(MCBoost)的理论理解仍然零散且常依赖限制性假设。在这项工作中,我们发展了一个统一且精细的MCBoost视角,涵盖了现有变体,包括多精度、BatchGCP和BatchMVP。我们揭示了几个现象,为其实际行为提供了新见解:即使高度准确和灵活的预测器也可能保持显著的不校准;强制多校准引入了校准-风险权衡;早停在这一权衡的控制中起核心作用。在理论方面,我们在更弱且更现实的条件下建立了MCBoost的通用框架。我们证明提升迭代收敛到审计类生成的累积跨度上总体最优预测器的Bregman投影,从而显式刻画了实现多校准的函数空间。我们进一步推导了不同光滑性假设下的收敛率、有限样本保证以及确保终止时多校准的原则性停止规则。最后,我们将通用适应性理论扩展到协变量偏移下,提供了更一般的迁移保证,并阐明了多校准预测器何时跨领域泛化。这些结果为多校准提升提供了更完整的理论基础和实践指导,将其定位为现代预测模型的统一框架和可靠的后处理方法。

英文摘要

Multicalibration extends classical calibration by requiring predictions to be unbiased over a rich collection of functions, encompassing both prediction slices and subpopulations. It has emerged as a powerful framework for fairness, robustness, and reliable prediction, yet the theoretical understanding of multicalibration boosting (MCBoost) remains fragmented and often relies on restrictive assumptions. In this work, we develop a unified and refined perspective on MCBoost that subsumes existing variants, including multiaccuracy, BatchGCP, and BatchMVP. We uncover several phenomena that provide new insights into its practical behavior: even highly accurate and flexible predictors can remain substantially miscalibrated; enforcing multicalibration introduces a calibration-risk trade-off; and early stopping plays a central role in controlling this trade-off. On the theoretical side, we establish a general framework for MCBoost under weaker and more realistic conditions. We show that the boosting iterates converge to a Bregman projection of the population-optimal predictor onto the cumulative span generated by the audit class, thereby explicitly characterizing the function space on which multicalibration is achieved. We further derive convergence rates under different smoothness assumptions, finite-sample guarantees, and principled stopping rules that ensure multicalibration at termination. Finally, we extend the theory of universal adaptability under covariate shift, providing more general transfer guarantees and clarifying when multicalibrated predictors generalize across domains. These results provide a more complete theoretical foundation and practical guidance for multicalibration boosting, positioning it as both a unifying framework and a reliable post-processing approach for modern predictive models.

2605.24326 2026-05-26 cs.DC cs.AI cs.NI

ScaleAcross Explorer: Exploring Communication Optimization for Scale-Across AI Model Training

ScaleAcross Explorer:探索跨规模AI模型训练的通信优化

Minghao Li, Alicia Golden, Samuel Hsia, Michael Kuchnik, Adi Gangidi, Xu Zhang, Ashmitha Jeevaraj Shetty, Zachary DeVito, Weiwei Chu, Dong He, Haoci Zhang, Yuchen Hao, Ruoming Pang, James Hongyi Zeng, Ying Zhang, Minlan Yu, Carole-Jean Wu

发表机构 * Harvard University(哈佛大学) Meta Platforms, Inc.(Meta平台公司)

AI总结 针对跨数据中心大规模AI模型训练(scale-across)的复杂设计空间,提出ScaleAcross Explorer优化器,通过联合优化并行放置、并行调度和网络层技术,实现高达64.62%的训练加速。

Comments 28 pages, 27 figures

详情
AI中文摘要

大型语言模型训练的快速扩展需要将GPU资源分布在多个数据中心建筑和区域之间。我们将这种范式称为“scale-across”训练。随着基础设施的扩展,系统设计空间变得越来越复杂,涵盖了新的模型架构、硬件异构性和不断演变的通信模式。借鉴Meta的生产经验,我们强调了在跨多个拥有数十万GPU的数据中心部署训练作业的复杂性。为了加速对庞大设计空间的探索并实现前沿模型开发的高效训练,我们对三个关键设计维度进行了深入表征:并行放置、并行调度和网络层技术。然后,我们提出了ScaleAcross Explorer,这是一个考虑设计维度相互作用并整体优化跨规模训练的优化器。测试床实验和模拟表明,在广泛的设计点上,与生产配置相比,训练速度提升高达64.62%,与最先进的基线相比,训练速度提升高达37.59%。

英文摘要

The rapid scaling of large language model training requires distributing GPU resources across multiple data center buildings and regions. We refer to such paradigm as "scale-across" training. As infrastructure expands, the system design space becomes increasingly intricate, encompassing new model architectures, hardware heterogeneity, and evolving communication patterns. Drawing from Meta's production experience, we highlight the complexities of deploying training jobs across a few data centers housing hundreds of thousands of GPUs. To accelerate exploration of the large design space and to enable efficient training for frontier model development, we conduct in-depth characterization of three key design dimensions: parallelism placement, parallelism scheduling, and network layer technologies. We then propose ScaleAcross Explorer, an optimizer that considers the interplay of design dimensions and holistically optimizes scale-across training. Testbed experiments and simulations demonstrate up to 64.62% training speedups over production configuration and up to 37.59% training speedups over the state-of-the-art baseline across a wide range of design points.

2605.24324 2026-05-26 quant-ph cs.LG

A Matched Spectral Benchmark of Quantum Inspired Feature Maps

量子启发特征映射的匹配光谱基准

Toheeb Ogunade, Taofeek Kassim, Etinosa Osaro

发表机构 * Department of Computer Science, University of Lagos(拉各斯大学计算机科学系) Department of Physics, University of Lagos(拉各斯大学物理系) Department of Chemical Engineering, University of Notre Dame(圣母大学化学工程系)

AI总结 通过匹配输出维度和强经典控制,比较振幅、角度和基编码作为确定性特征映射在经典监督学习中的表现,分析其几何特性并发现固定量子启发编码几何本身并非经典数据上机器学习优势的可靠来源。

详情
AI中文摘要

量子机器学习通常受到这样一种想法的驱动:量子系统可以暴露经典模型难以访问的有用高维结构。我们分离了这一主张的一个核心组成部分:固定数据编码映射。在匹配输出维度和强经典控制下,将振幅、角度和基编码作为经典监督学习的确定性特征映射进行评估。该基准测试将这些编码与原始线性模型、随机傅里叶特征、多项式特征、PCA、RBF SVM和浅层神经网络在多种经典数据集上进行比较。我们不是将性能视为单一终点,而是通过有效秩、条件数、中心核对齐、预测性能和实际开销来分析每个表示的几何结构。结果图景是机械性的:振幅编码可以通过单位球面归一化去除幅度信息,角度编码可能在几何上与原始线性特征冗余,基编码可能施加与平滑决策结构不对齐的二元汉明几何。然而,这些发现并不反对量子计算,它们表明固定的量子启发编码几何本身并不是经典数据上机器学习优势的可靠来源。

英文摘要

Quantum machine learning is often motivated by the idea that quantum systems can expose useful high-dimensional structure that is difficult to access with classical models. We isolate one central component of this claim: the fixed data-encoding map. Amplitude, angle, and basis encoding are evaluated as deterministic feature maps for classical supervised learning under matched output dimensionality and strong classical controls. The benchmark compares these encodings against raw linear models, random Fourier features, polynomial features, PCA, RBF SVMs, and shallow neural networks across diverse classical datasets. Rather than treating performance as a single endpoint, we analyze the geometry of each representation through effective rank, condition number, centered kernel alignment, predictive performance, and practical overhead. The resulting picture is mechanistic: amplitude encoding can remove magnitude information through unit-sphere normalization, angle encoding can become geometrically redundant with raw linear features, and basis encoding can impose a binary Hamming geometry that is poorly aligned with smooth decision structure. These findings do not argue against quantum computation, however, they show that fixed quantum-inspired encoding geometry alone is not a reliable source of machine-learning advantage on classical data.

2605.24300 2026-05-26 cs.CR cs.AI cs.LG

Enhancing Reliability in LLM-Based Secure Code Generation

增强基于LLM的安全代码生成的可靠性

Mohammed F. Kharma, Mohammad Alkhanafseh, Ahmed Sabbah, David Mohaisen

发表机构 * Department of Computer Science, Birzeit University(巴伊兹大学计算机科学系) Department of Computer Science, University of Central Florida(佛罗里达中央大学计算机科学系)

AI总结 提出Mitigation-Aware Chain-of-Thought (MA-CoT)框架,通过嵌入任务特定的CWE缓解指导和语言感知安全措施,显著降低LLM生成代码中的漏洞,在多个模型和语言上验证了其一致的安全可靠性提升。

Comments 15 pages; 7 tables; 3 figures

详情
AI中文摘要

大型语言模型(LLM)被广泛用于代码生成,但其安全可靠性在不同语言和提示策略下仍不一致。现有的提示工程提高了功能正确性,但很少能确保一致的安全结果。我们引入了 extit{Mitigation-Aware Chain-of-Thought (MA-CoT)}框架,该框架嵌入了任务特定的CWE缓解指导和语言感知安全措施,以减少生成代码中反复出现的漏洞。我们在三个LLM(gpt-5, claude-4.5, gemini-2.5)、三种编程语言(C, Java, Python)和四种提示策略(Vanilla, Zero-shot, CoT, MA-CoT)上,使用200个任务的主数据集以及LLMSecEval的外部验证对MA-CoT进行了评估。通过静态分析和专家验证,MA-CoT在主数据集上将总安全问题从92个减少到39个(57.6%),在LLMSecEval上从73个减少到4个(94.5%)。高严重性问题(阻塞+严重)分别从90个降至39个(56.7%)和从45个降至2个(95.6%)。在两个数据集中,MA-CoT是唯一能持续提高安全可靠性的策略;Zero-shot和CoT可靠性较低,且可能增加漏洞,尤其是在C语言中。我们进一步引入了严格的漏洞驱动因素分层归因(语言核心层与栈层),并表明残余风险集中在硬化导向的模式(例如,依赖于操作系统和工具链),这激励了在提示之外采用安全构造原语。

英文摘要

Large language models (LLMs) are widely used for code generation, but their security reliability remains inconsistent across languages and prompting strategies. Existing prompt engineering improves functional correctness but rarely ensures consistent security outcomes. We introduce the \textit{Mitigation-Aware Chain-of-Thought (MA-CoT)} framework, which embeds task-specific CWE mitigation guidance and language-aware safeguards to reduce recurring vulnerabilities in generated code. We evaluate MA-CoT across three LLMs (gpt-5, claude-4.5, gemini-2.5), three programming languages (C, Java, Python), and four prompting strategies (Vanilla, Zero-shot, CoT, MA-CoT) on a 200-task primary dataset, with external validation on LLMSecEval. Using static analysis with expert validation, MA-CoT reduces total security findings from 92 to 39 (57.6\%) on the primary dataset and from 73 to 4 (94.5\%) on LLMSecEval. High-severity findings (Blocker + Critical) drop from 90 to 39 (56.7\%) and from 45 to 2 (95.6\%), respectively. Across both datasets, MA-CoT is the only strategy that consistently improves security reliability; Zero-shot and CoT are less reliable and may increase vulnerability, especially in C. We further introduce a strict layered attribution of vulnerability drivers (language-core vs. stack layers) and show that residual risk concentrates in hardening-oriented patterns (e.g., OS- and toolchain-dependent), motivating secure-by-construction primitives alongside prompting.

2605.24298 2026-05-26 cs.CR cs.AI cs.LG

An Empirical Evaluation of LLM-Generated Code Security Across Prompting Methods

LLM生成代码安全性的提示方法实证评估

Mohammed Kharma, Ahmed Sabbah, Mohammad Alkhanafseh, Mohammad Hammoudeh, David Mohaisen

发表机构 * Department of Computer Science, Birzeit University(计算机科学系,巴勒斯坦比泽大学) King Fahd University of Petroleum and Minerals(国王法赫德石油和矿物大学) University of Central Florida(中央佛罗里达大学)

AI总结 通过跨5个LLM和4种编程语言的实证评估,提出弱点感知零样本链式思考(WA-0CoT)提示策略,发现提示方法虽影响弱点类别分布,但无法显著降低漏洞频率或密度。

Comments 40 pages, 22 tables, 8 figures

详情
AI中文摘要

大型语言模型(LLM)在自动化代码生成中的日益使用提高了软件开发效率,但往往以安全性为代价。生成的代码经常忽略关键问题,使其容易受到弱加密和不正确的输入验证等问题的影响。为了研究这一问题,我们对跨五个LLM和四种编程语言(Java、C++、C和Python)的LLM生成代码的安全质量进行了全面的实证评估,考察了多种提示工程方法的影响。我们提出了一种弱点感知的零样本链式思考(WA-0CoT)提示策略,该策略利用CWE映射丰富提示中的安全上下文以指导模型推理。我们的实证分析在卡方检验的支持下发现,不同提示方法在漏洞频率或密度上没有统计学上的显著降低。然而,包括WA-0CoT在内的提示策略系统地影响了CWE类别的组成分布,其效果因编程语言而异。这些发现表明,虽然安全感知的提示改变了生成弱点的结构,但仅靠提示工程不足以可靠地降低整体漏洞水平。结果强调了在评估LLM生成代码的安全属性时,语言感知和模型感知的提示设计的重要性。

英文摘要

The growing use of Large Language Models (LLMs) for automated code generation has enhanced software development efficiency, but often at the cost of security. Generated code frequently overlooks critical concerns, leaving it vulnerable to issues such as weak encryption and improper input validation. To investigate this problem, we present a comprehensive empirical evaluation of the security quality of LLM-generated code across five LLMs and four programming languages (Java, C++, C, and Python), examining the impact of multiple prompt engineering methods. We introduce a weaknesses-aware zero-shot chain-of-thought (WA-0CoT) prompting strategy that enriches prompts with security context using CWE mappings to guide model reasoning. Our empirical analysis, supported by chi-square tests, finds no statistically significant reductions in vulnerability frequency or density across prompt methods. However, prompting strategies, including WA-0CoT, systematically influence the compositional distribution of CWE categories, with effects varying by programming language. These findings suggest that while security-aware prompting alters the structure of generated weaknesses, prompt engineering alone is insufficient to reliably reduce overall vulnerability levels. The results highlight the importance of language-aware and model-aware prompt design when evaluating the security properties of LLM-generated code.

2605.24294 2026-05-26 cs.CR cs.AI cs.LG

Concept Drift Adaptation Using Self-Supervised and Reinforcement Learning In Android Malware Detection

使用自监督学习和强化学习在Android恶意软件检测中适应概念漂移

Ahmed Sabbah, Mohammad Kharma, Mohammad Alkhanafseh, Radi Jarrar, Samer Zein, David Mohaisen

发表机构 * Birzeit University(巴伊兹大学) University of Central Florida(中央佛罗里达大学)

AI总结 提出一个基于自监督学习和强化学习的框架,通过冻结编码器测量潜在漂移并轻量适配,同时利用PPO控制器在成本约束下选择维护动作,以应对Android恶意软件检测中的概念漂移。

Comments 9 pages, 2 figures, 2 tables

详情
AI中文摘要

Android恶意软件检测器在部署后常因概念漂移而性能下降,而每次维护步骤完全重新训练成本高昂。我们提出一个按时间顺序的自适应维护框架,将部署时的维护建模为序列决策问题。该框架在初始化阶段通过自监督学习学习稳定的潜在表示,冻结编码器,在固定表示空间中测量潜在漂移,并使用可训练适配器和分类头进行轻量下游适配。一个近端策略优化控制器根据检测器状态(包括当前效用、固定记忆集上的保留率、潜在漂移指标和更新成本)选择低成本的维护动作。我们在模拟器和真实Android恶意软件数据集上,使用静态和动态特征,在因果部署式协议下评估该框架。结果表明,RL控制器提供了一种强大的成本感知适配策略,在非平稳部署条件下,始终保持在最佳策略之列,同时在时间性能、记忆保留和维护成本之间取得有利平衡。

英文摘要

Android malware detectors often degrade after deployment because of concept drift, while full retraining at each maintenance step is costly. We propose a chronological adaptive maintenance framework that models deployment-time maintenance as a sequential decision problem. The framework learns a stable latent representation through self-supervised learning during initialization, freezes the encoder, measures latent drift in the fixed representation space, and performs lightweight downstream adaptation using a trainable adapter and classification head. A proximal policy optimization controller selects low-cost maintenance actions based on the detector state, including current utility, retention on a fixed memory set, latent drift indicators, and update cost. We evaluate the framework under a causal deployment-style protocol on emulator and real Android malware datasets with static and dynamic features. Results show that the RL controller provides a strong cost-aware adaptation strategy, consistently remaining among the top-performing policies while achieving a favorable balance between temporal performance, memory retention, and maintenance cost under non-stationary deployment conditions.

2605.24239 2026-05-26 cs.CR cs.AI

Unlocking Apple's Private Cloud Compute: An Analysis of Privacy-Preserving Artificial Intelligence

解锁苹果的私有云计算:隐私保护人工智能分析

Yannik Dittmar, Marvin Jerome Stephan, Thomas Völkl, Matthias Hollick, Jiska Classen

发表机构 * Hasso Plattner Institute, University of Potsdam(哈索普兰特纳研究所,波茨坦大学) TU Darmstadt, Secure Mobile Networking Lab(德累斯顿技术大学,安全移动网络实验室) IMDEA Networks Institute, Madrid, Spain(IMDEA网络研究所,马德里,西班牙)

AI总结 通过逆向工程苹果私有云计算(PCC)在移动设备上的实现,评估其隐私保护特性,并开放非公开接口以支持自定义查询和独立基准测试。

详情
Journal ref
Proceedings of the 19th ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec 2026)
AI中文摘要

许多现有的移动设备人工智能解决方案依赖于大量敏感数据的收集,引发隐私担忧,并且通常需要存储上下文和模型改进数据。苹果的私有云计算(PCC)旨在通过强调移动设备集成和隐私优先设计来解决这一问题。PCC的核心主张是它不存储任何用户数据,并且用户输入和用户账户是不可关联的。 尽管大多数PCC系统规范是公开的,但编译后的二进制文件增加了一层不透明性。没有可重现的构建,这些二进制文件中也没有符号,导致规范与实际交付给用户的产品之间可能存在差异。此外,查询PCC的底层模型和接口并不公开可访问,限制了学术上对模型属性(如准确性)的评估。这给评估像PCC这样的隐私保护方法是否既值得信赖又能提供高质量答案带来了挑战。 我们是第一个逆向工程移动设备上PCC实现以评估隐私方面,并在本地设备上开放其非公开接口以支持自定义PCC查询的研究团队。我们通过独立基准测试PCC模型,展示了超出苹果预期用例的访问级别。通过公开我们的PCC基准测试框架,我们为未来的研究提供了支持。

英文摘要

Many existing Artificial Intelligence (AI) solutions on mobile devices rely on an extensive collection of sensitive data, raising privacy concerns and often requiring storage for both context and model improvement. Apple's Private Cloud Compute (PCC) aims to address this by emphasizing mobile device integration and a privacy-first design. The central claim of PCC is that it does not store any user data and that user input and user accounts are unlinkable. While most of the PCC system specifications are public, compiled binaries add a layer of opaqueness. There are no reproducible builds, and there are no symbols within those binaries, creating potential discrepancies between the specification and what is shipped to the user. Additionally, the underlying models and interfaces for querying PCC are not openly accessible, limiting academic evaluation of model properties, such as accuracy. This poses a challenge in assessing whether a privacy-preserving approach like PCC is actually trustworthy while also providing high-quality answers. We are the first to reverse-engineer the PCC implementation on mobile devices to evaluate privacy aspects and to open its non-public interfaces on local devices to support custom PCC queries. We demonstrate this level of access beyond Apple's intended use cases by independently benchmarking the PCC model. We enable future research by making our PCC benchmarking framework publicly available.

2605.24213 2026-05-26 cs.SE cs.AI cs.LG

Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild

迈向评估工程:机器学习评估工具在野外的实证研究

Zhimin Zhao, Zehao Wang, Abdul Ali Bangash, Bram Adams, Ahmed E. Hassan

发表机构 * Software Analysis and Intelligence Lab (SAIL), School of Computing, Queen's University(软件分析与智能实验室(SAIL),计算学院,女王大学) Concordia University(Concordia大学) Lahore University of Management Sciences (LUMS)(拉合尔管理科学大学(LUMS))

AI总结 通过对57个评估工具的实证研究,提出五阶段工具模型,并分类16560个问题,发现规范阶段问题最多(41.4%),主要根因是未实现功能(24.3%)、文档缺失(20.3%)和输入验证缺失(17.2%),为将评估工程作为独立软件工程关注点奠定实证基础。

详情
AI中文摘要

评估工具是编排模型评估的软件系统,管理模型调用、数据加载、指标计算和结果报告。尽管它们在机器学习基础设施中扮演关键角色,但其操作挑战和工程问题迄今受到的关注有限。我们对57个评估工具进行了实证研究,推导出一个五阶段工具模型,并根据工作流阶段和根本原因对16,560个问题进行了分类。大多数工具操作挑战集中在规范阶段(占问题的41.4%),在此阶段工具集成外部模型、数据集和评分评判者。操作挑战的三个最常见根本原因是未实现功能(24.3%)、文档缺失(20.3%)和输入验证缺失(17.2%),这些合计占分类问题的61.7%,涵盖现有功能的缺陷和阻碍预期工作流的能力缺口。根本原因也因工作流阶段而异:环境不兼容和外部依赖破坏占配置问题的36.2%,而算法错误(25.9%)和验证缺失(22.5%)主导评估问题。这些贡献共同为将评估工程视为一个独立的软件工程关注点建立了实证基础。

英文摘要

Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation, and result reporting. Despite their critical role in machine learning infrastructure, their operational challenges and engineering concerns have received limited attention so far. We present an empirical study of 57 evaluation harnesses, deriving a five-stage harness model and classifying 16,560 issues by workflow stage and root cause. Most harness operational challenges concentrate in the Specification stage (41.4% of issues), where harnesses integrate external models, datasets, and scoring judges. The three most frequent root causes of operational challenges are unimplemented features (24.3%), documentation gaps (20.3%), and missing input validation (17.2%), which together account for 61.7% of classified issues, spanning both defects in existing functionality and capability gaps that block intended workflows. Root causes also vary by workflow stage: environment incompatibility and external dependency breakage account for 36.2% of provisioning issues, whereas algorithmic error (25.9%) and validation gap (22.5%) dominate assessment issues. Together, these contributions establish an empirical foundation for treating evaluation engineering as a distinct software engineering concern.

2605.24212 2026-05-26 stat.AP cs.AI cs.LG stat.ML

Distributionally Robust Transfer Learning with Structurally Missing Covariates, with Application to Cross-National Cardiac Arrest Prediction

分布鲁棒迁移学习在结构缺失协变量中的应用:以跨国心脏骤停预测为例

Siqi Li, Chuan Hong, Ziye Tian, Benjamin Sieu-Hon Leong, Koshi Nakagawa, Hideharu Tanaka, Sang Do Shin, Khuong Quoc Dai, Do Ngoc Son, Marcus Eng Hock Ong, Nan Liu, Molei Liu

发表机构 * Centre for Biomedical Data Science, Duke-NUS Medical School, Singapore(生物医学数据科学中心,杜克-国家大学医学院,新加坡) Duke-NUS AI + Medical Sciences Initiative, Duke-NUS Medical School, Singapore(杜克-国家大学医学院AI+医学科学倡议,新加坡) Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA(生物统计学与生物信息学系,杜克大学,北卡罗来纳州达勒姆,美国) Duke Clinical Research Institute, Durham, NC, USA(杜克临床研究学院,北卡罗来纳州达勒姆,美国) Emergency Medicine Department, National University Hospital, Singapore(急诊医学部,国立大学医院,新加坡) Department of Sport and Medical Science, Faculty of Physical Education, Kokushikan University, Tokyo, Japan(体育与医学科学系,体育学院,立命馆大学,东京,日本) Graduate School of Emergency Medical System, Kokushikan University, Tokyo, Japan(急救医疗系统研究生院,立命馆大学,东京,日本) Department of Emergency Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea(急诊医学系,首尔国立大学医学院,首尔,韩国) Center for Emergency Medicine, Bach Mai Hospital, Hanoi, Vietnam(急救医学中心,巴赫梅医院,河内,越南) Center for Critical Care Medicine, Bach Mai Hospital, Hanoi, Vietnam(重症医学中心,巴赫梅医院,河内,越南) Health Services Research Centre, Singapore Health Services, Singapore(卫生服务研究中心,新加坡卫生服务,新加坡) Department of Emergency Medicine, Singapore General Hospital, Singapore(急诊医学部,新加坡中央医院,新加坡) Pre-hospital & Emergency Research Centre, Health Services Research and Population Health, Duke-NUS Medical School, Singapore(院前与急诊研究中心,卫生服务研究与人口健康,杜克-国家大学医学院,新加坡)

AI总结 提出DRUM框架,通过分布鲁棒优化和神经网络生成器处理目标域中结构缺失的协变量,实现无标签目标域的预测模型迁移,并在跨国心脏骤停预测中验证有效性。

详情
AI中文摘要

当关键训练协变量在部署时不可用且目标域中标记结果有限时,跨医疗系统部署临床预测模型常常失败。例如,院外心脏骤停(OHCA)的高性能模型依赖于高资源环境中常规收集的详细院前测量数据,但在许多国际登记处中不可用。现有方法要么丢弃缺失协变量,牺牲预测信息,要么依赖于关于其目标分布的可检验假设。我们提出了DRUM(具有结构缺失协变量的分布鲁棒无监督迁移学习),这是一个将预测模型迁移到某些协变量结构缺失且结果标签不可用的目标群体的框架。DRUM将协变量划分为共享组件($X$,在所有环境中观察到)和缺失组件($A$,仅在源域中观察到)。DRUM不进行缺失协变量插补,而是使用神经网络生成器优化未知目标分布$A \mid X$上的最坏情况预测性能,并通过鲁棒性参数控制与源条件允许的偏差。我们进一步开发了一种偏差校正程序,以减少对干扰估计误差的敏感性。模拟显示,在分布偏移下,平均和最坏情况预测误差均有显著改善。应用于跨国OHCA预测,将模型从美国登记处迁移到多个未记录院前变量的亚洲登记处,DRUM在各个站点产生了更校准的预测和改进的临床分类性能。

英文摘要

Deploying clinical prediction models across healthcare systems often fails when key training covariates are unavailable at deployment and labeled outcomes are limited in the target domain. For example, high-performing models for out-of-hospital cardiac arrest (OHCA) rely on detailed prehospital measurements routinely collected in high-resource settings but unavailable in many international registries. Existing methods either discard missing covariates, sacrificing predictive information, or rely on untestable assumptions about their target distribution. We propose DRUM (\underline{D}istributionally \underline{R}obust \underline{U}nsupervised transfer learning with structurally \underline{M}issing covariates), a framework that transfers prediction models to target populations where certain covariates are structurally absent and outcome labels are unavailable. DRUM partitions covariates into shared components ($X$), observed across all settings, and missing components ($A$), observed only in the source. Rather than imputing missing covariates, DRUM optimizes worst-case predictive performance over the unknown target distribution of $A \mid X$ using a neural network generator, with a robustness parameter controlling allowable deviation from the source conditional. We further develop a bias correction procedure that reduces sensitivity to nuisance estimation error. Simulations show substantial improvements in both mean and worst-case prediction error under distribution shift. Applied to cross-national OHCA prediction, transferring models from a US registry to multiple Asian registries where prehospital variables are unrecorded, DRUM yields better-calibrated predictions and improved clinical classification performance across sites.

2605.24207 2026-05-26 cs.DB cs.LG

Incorporating Deep Learning Design in Database Queries

将深度学习设计融入数据库查询

Yuval Lev Lubarsky, Dean Light, Boaz Berger, Shunit Agmon, Benny Kimelfeld

发表机构 * University of Washington(华盛顿大学)

AI总结 提出一种将深度学习自然集成到数据库查询中的方法,通过为元组关联可学习的向量嵌入,使查询同时操作数据和嵌入,实现关系深度学习。

详情
AI中文摘要

关系数据库上的深度学习通常通过将数据转换为图表示并在外部框架中应用基于图的神经网络来实现。这种数据库与外部机器学习系统之间的往返引入了非平凡的工程开销。实际上,这些图神经网络对元组嵌入进行操作,并以捕获关系连接引起的交互的方式操纵它们。鉴于这种自然的对应关系,没有根本原因说明为什么在关系数据上指定神经网络应该比查询它困难得多。我们提出了一种将深度学习与数据库查询自然集成的方法。关键思想是为每个元组关联一个来源,表示为具有可学习参数的向量嵌入。查询被提升为联合操作数据和嵌入,将带有嵌入元组的输入关系映射到带有嵌入元组的输出关系。这种方法为关系深度学习提供了声明性基础,促进了与数据库系统的集成、优化和广泛采用。我们描述了RelaNN,这是一个基于PyTorch和cuDF构建的概念验证实现。通过实现各种图学习模型,包括图卷积网络、异构图变换器、超图神经网络和深度同态网络,我们展示了RelaNN的实用性。程序的简单性及其有竞争力的运行时性能展示了一条具体路径,使得在数据库上实现最先进的神经网络变得像编写查询一样简单。

英文摘要

Deep learning over relational databases is conventionally realized by translating data into graph representations and applying graph-based neural networks within external frameworks. This round-trip between the database and external machine learning (ML) systems introduces non-trivial engineering overhead. In effect, these graph neural networks operate on tuple embeddings and manipulate them in ways that capture the interactions induced by relational joins. Given this natural correspondence, there is no fundamental reason why specifying a neural network over relational data should be substantially harder than querying it. We propose an approach that naturally integrates deep learning with database queries. The key idea is to associate each tuple with provenance, represented as a vector embedding with learnable parameters. Queries are lifted to operate jointly on data and embeddings, mapping input relations with embedded tuples to output relations with embedded tuples. This approach provides a declarative foundation for relational deep learning, facilitating integration with database systems, optimization, and wide adoption. We describe RelaNN, a proof-of-concept implementation of this approach built on top of PyTorch and cuDF. We illustrate the utility of RelaNN by implementing various graph-learning models, including graph convolutional networks, heterogeneous graph transformers, hypergraph neural networks and deep homomorphism networks. The simplicity of the programs and their competitive runtime performance demonstrate a concrete path toward making the implementation of state-of-the-art neural networks over databases as simple as writing a query.

2605.24183 2026-05-26 cs.DB cs.AI cs.LG

AvalancheBench: Evaluating Enterprise Data Agents Through Latent World Recovery

AvalancheBench: 通过潜在世界恢复评估企业数据智能体

Darek Kleczek, Fuheng Zhao, Alexander W. Lee, Julien Tissier, Pawel Liskowski, Ugur Cetintemel, Anupam Datta

发表机构 * Brown University and Snowflake(布朗大学和Snowflake)

AI总结 提出AvalancheBench基准,通过潜在世界恢复评估企业数据智能体的分析理解能力,揭示早期错误如何传播并导致系统性错误推荐。

详情
AI中文摘要

我们介绍了AvalancheBench,一个通过潜在世界恢复评估企业数据智能体的基准。AvalancheBench在三个方面改进了现有基准。首先,它评估分析理解而非流程完成:系统根据是否恢复了解释数据的片段、驱动因素、时间事件和关系来评分,而不仅仅是执行工作流或生成看似合理的报告。其次,它通过从已知潜在世界生成观测数据,为目标驱动分析提供真实基准,从而允许对不完整但有效的恢复给予部分分数。第三,它揭示了早期分析错误如何传播到后续结论:遗漏的片段、合并的事件或错误的归因可能导致系统性错误推荐。在这个意义上,AvalancheBench通过提供一个受控环境来诊断智能体是否恢复了企业数据背后的分析结构,从而补充了真实数据基准。在第一个电子商务用例中,领先编码智能体的最强配置仅恢复了26%的评分标准,失败集中在通用客户细分和合并的时间事件上。

英文摘要

We introduce AvalancheBench, a benchmark for evaluating enterprise data agents through \emph{latent world recovery}. AvalancheBench improves on existing benchmarks in three ways. First, it evaluates analytical understanding rather than pipeline completion: systems are scored on whether they recover the segments, drivers, temporal events, and relationships that explain the data, not merely on whether they execute a workflow or produce a plausible report. Second, it provides ground truth for goal-driven analytics by generating observations from a known latent world, enabling partial credit for incomplete but valid recoveries. Third, it exposes how early analytical mistakes propagate into later conclusions: missed segments, merged events, or wrong attributions can lead to systematically wrong recommendations. In this sense, AvalancheBench complements real-data benchmarks by providing a controlled setting for diagnosing whether agents recover the analytical structure behind enterprise data. On a first e-commerce use case, the strongest configuration of a leading coding agent recovers only 26\% of the rubric, with failures concentrated in generic customer segmentations and merged temporal events.

2605.24180 2026-05-26 physics.soc-ph cs.AI cs.DL cs.HC

Human-AI Collaboration in Science at Scale: A Global Large-scale Randomized Field Experiment

大规模人机协作科学:一项全球大规模随机现场实验

Binglu Wang, Weixin Liang, Jiahui Xue, Yuhui Zhang, Hancheng Cao, Dashun Wang, Yian Yin

发表机构 * Kellogg School of Management, Northwestern University(西北大学凯洛格管理学院) Center for Science of Science and Innovation, Northwestern University(西北大学科学与创新中心) Northwestern Institute on Complex Systems, Northwestern University(西北大学复杂系统研究所) Department of Computer Science, Stanford University(斯坦福大学计算机科学系) Goizueta Business School, Emory University(埃默里大学戈伊兹特亚商学院) Department of Information Science, Cornell University(康奈尔大学信息科学系)

AI总结 通过全球大规模随机现场实验,研究大型语言模型(LLMs)生成的定制化反馈能否提升科研人员的修订率并促进AI工具使用,尤其惠及资源受限的研究者。

详情
AI中文摘要

合作是现代科学的定义模式,但其核心机制——反馈——仍然难以观察、难以扩展且分布不均。在此,我们测试大型语言模型(LLMs)是否能够贡献于这一隐蔽但至关重要的实践,并重新分配科学反馈——这一知识生产中必不可少但稀缺的资源。在一项全球大规模随机现场实验中,我们为来自133个地理区域的超过45,000名研究人员的150个领域的31,000多篇arXiv预印本提供了定制的LLM生成反馈。与对照组相比,收到反馈的作者修改其手稿的可能性显著更高,相对于基线修订率提高了12.55%。接触AI反馈还增加了作者在未来论文中使用LLM工具的频率,表明科学实践发生了长期转变。这些效应在非英语主导研究区域的作者、与学术文献联系较少的手稿以及h指数较低和职业早期阶段的团队中最为显著,这与AI反馈可能在获取及时批评受限的地方提供最大益处的观点一致。总之,这些发现提供了因果证据,表明基于AI的结构化干预可以将科学反馈的获取从一种主要是私人优势转变为更广泛分布的资源,对全球研究体系的生产力、公平性和能力产生更广泛的影响。

英文摘要

Collaboration is the defining mode of modern science, yet its core mechanism -- feedback -- remains hard to observe, difficult to scale, and unequally distributed. Here we test whether large language models (LLMs) can contribute to this hidden but vital practice and reallocate scientific feedback, an essential yet scarce resource for knowledge production. In a global large-scale randomized field experiment, we delivered customized LLM-generated feedback for over 31,000 arXiv preprints across 150 fields and more than 45,000 researchers from 133 geographic regions. Relative to controls, authors who received feedback had a significantly higher likelihood of revising their manuscripts, corresponding to a 12.55% relative increase over the baseline revision rate. Exposure to AI feedback also increased authors' subsequent use of LLM tools in their future papers, suggesting longer-run shifts in scientific practice. These effects were strongest among authors from non-English-dominant research regions, manuscripts less embedded in the scholarly literature, and teams with lower h-indexes and earlier career stages, consistent with the idea that AI feedback may provide the greatest benefit where access to timely critique is otherwise limited. Together, these findings provide causal evidence that structured AI-based interventions can transform access to scientific feedback from a largely private advantage into a more widely distributed resource, with broader implications for productivity, equity, and capacity across the global research system.

2605.24170 2026-05-26 math.DS cs.LG q-bio.QM

Learning dynamical systems with biochemically informed neural ordinary differential equations

基于生化信息的神经常微分方程学习动力系统

Luis L. Fonseca, Reinhard C. Laubenbacher, Lucas Böttcher

发表机构 * Laboratory for Systems Medicine, Department of Medicine, University of Florida(系统医学实验室,医学系,佛罗里达大学) Department of Computational Science and Philosophy, Frankfurt School of Finance and Management(计算科学与哲学系,法兰克福金融与管理学院)

AI总结 提出生化信息神经常微分方程(BINODEs),通过将神经网络表示的过程嵌入化学计量结构,在保持可解释性的同时灵活建模部分已知的生化动力系统。

Comments 23 pages, 13 figures, 4 tables

详情
AI中文摘要

生化反应的常微分方程模型通常被表述为化学计量系统,其中动力学源于一系列相互作用的过程。一个核心挑战是每个过程的函数形式很少先验已知,且可能难以从数据中推断。我们提出生化信息神经常微分方程(BINODEs),这是一种神经ODE框架,保留了机制模型的化学计量结构,同时通过神经网络表示各个过程。在BINODEs中,神经网络过程(NNP)的输出通过类似于化学计量矩阵的线性层映射到状态导数。这种架构允许将生物侧面信息(如过程特定输入、符号约束和单调性假设)直接构建到模型中。我们表征了NNP对几种标准生化速率定律的逼近性质,并表明所提出的框架在Monod、Lotka-Volterra、药代动力学和超日内分泌模型中恢复了轨迹和过程级结构。这些结果表明,BINODEs在机制可解释性和数据驱动灵活性之间提供了有用的折衷,用于建模部分已知的生化或生物动力系统。

英文摘要

Ordinary differential equation models of biochemical reactions are often formulated as stoichiometric systems in which the dynamics arise from a collection of interacting processes. A central challenge is that the functional form of each process is rarely known a priori and may be difficult to infer from data. We propose biochemically informed neural ordinary differential equations (BINODEs), a neural-ODE framework that retains the stoichiometric structure of mechanistic models while representing individual processes by neural networks. In BINODEs, the outputs of neural network processes (NNPs) are mapped to state derivatives through a linear layer analogous to a stoichiometric matrix. This architecture allows biological side information, such as process-specific inputs, sign constraints, and monotonicity assumptions, to be built directly into the model. We characterize the approximation properties of NNPs for several standard biochemical rate laws and show that the proposed framework recovers both trajectories and process-level structure in Monod, Lotka--Volterra, pharmacokinetic, and ultradian endocrine models. These results suggest that BINODEs offer a useful compromise between mechanistic interpretability and data-driven flexibility for modeling partially known biochemical or biological dynamical systems.

2605.24155 2026-05-26 cs.IR cs.AI cs.LG

An Interpretable CF-RL-TOPSIS Fusion Model for Skills-Aware Talent Recommendation

一种可解释的CF-RL-TOPSIS融合模型用于技能感知的人才推荐

Özkan Canay

发表机构 * Sakarya University(萨克萨大学)

AI总结 提出CF-RL-TOPSIS可解释融合模型,结合协同过滤、强化学习臂和熵权TOPSIS,在ICT人才历史基准上验证其在不同数据模式下的有效性。

Comments Preprint submitted to Knowledge-Based Systems; 4 figures and 8 tables

详情
AI中文摘要

有效的技能感知人才推荐必须平衡行为转换模式、轨迹敏感适应性和可检查的职业层面标准。然而,关于这些信号如何相互作用的公共基准证据仍然有限。本研究提出CF-RL-TOPSIS,一种可解释的后期融合模型,它集成了转换感知协同分支、紧凑的强化风格职业族臂和由六个语义代理构建的熵权TOPSIS分支;验证选择的融合系数保持可审计。该模型在两个冻结的公共ICT人才历史基准JobHop和Karrierewege上进行了评估,使用重复的时间顺序前5排名和配对Wilcoxon检验。在JobHop上,完整混合模型达到NDCG@5 = 0.3040 +/- 0.0073,并显著优于repeat-last、item Markov、转换感知协同过滤、CF+TOPSIS混合、GRU4Rec和SASRec(计划比较中p <= 0.0039)。在Karrierewege上,混合模型保持竞争力,但未显著超过最强的Markov基线,揭示了一个持久性主导的环境,其中臂分支适当缩小到接近零权重。代理敏感性、家庭级深度Q网络和运行时检查支持这一解释,一个详细的用户级案例展示了如何检查单个推荐的各分支分数、标准权重和排名变化。贡献不是基准无关的优越性声明,而是对透明后期融合在简单延续启发式之外增加价值的条件的可重复说明。在语义丰富、非饱和的人才历史机制中,三个分支相互增强;在持久性主导机制中,相同的架构通过其协同骨干保持竞争力,而自适应分支正确处于非活跃状态。

英文摘要

Effective skills-aware talent recommendation must balance behavioral transition patterns, trajectory-sensitive adaptation, and inspectable occupation-level criteria. Evidence from public benchmarks on how these signals interact, however, remains limited. This study proposes CF-RL-TOPSIS, an interpretable late-fusion model that integrates a transition-aware collaborative branch, a compact reinforcement-style occupation-family bandit, and an entropy-weighted TOPSIS branch constructed from six semantic proxies; the validation-selected fusion coefficients remain auditable. The model is evaluated on two frozen public ICT talent-history benchmarks, JobHop and Karrierewege, using repeated chronological top-5 ranking and paired Wilcoxon tests. On JobHop the full hybrid attains NDCG@5 = 0.3040 +/- 0.0073 and significantly surpasses repeat-last, item Markov, transition-aware collaborative filtering, the CF+TOPSIS hybrid, GRU4Rec, and SASRec (p <= 0.0039 across planned comparisons). On Karrierewege the hybrid remains competitive but does not significantly exceed the strongest Markov baseline, revealing a persistence-dominated setting in which the bandit branch appropriately shrinks to near-zero weight. Proxy-sensitivity, family-level deep Q-network, and runtime checks support this interpretation, and a worked user-level case shows how branch scores, criterion weights, and rank shifts can be inspected for an individual recommendation. The contribution is not a benchmark-agnostic superiority claim, but a reproducible account of the conditions under which transparent late fusion adds value beyond simple continuation heuristics. In semantically rich, non-saturating talent-history regimes the three branches reinforce one another; in persistence-dominated regimes the same architecture remains competitive through its collaborative backbone, with the adaptive branch correctly inactive.

2605.24144 2026-05-26 cs.AR cs.LG

EVA: Accelerating LLM Decoding via an Efficient Vector Quantization Architecture

EVA:通过高效向量量化架构加速大语言模型解码

Bowen Duan, Cong Guo, Chiyue Wei, Haoxuan Shan, Yuzhe Fu, Xinhua Chen, Yifan Xu, Ziyue Zhang, Changchun Zhou, Hai Li, Yiran Chen

发表机构 * Duke University(杜克大学)

AI总结 针对大语言模型解码阶段内存瓶颈和计算利用率低的问题,提出EVA架构,将GEMV计算转化为GEMM并消除内存冲突,实现最高11.17倍加速和7.17倍能效提升。

Comments 17 pages. Accepted to ISCA 2026

详情
AI中文摘要

大型语言模型(LLM)在多个领域取得了令人印象深刻的性能,但在自回归解码阶段仍然效率低下。与采用计算密集型GEMM操作的预填充阶段不同,解码执行一系列小型类GEMV计算,这些计算受内存限制且未能充分利用现代加速器。仅权重的向量量化(VQ)已成为一种有效的压缩技术,它将模型权重聚类到共享码本中,并用低精度索引替换原始权重矩阵,实现2位级别的权重压缩。虽然这种方法显著减少了模型大小和内存带宽,但它仍然存在两个关键的低效问题:GEMV计算利用率低以及码本查找期间频繁的内存冲突。本文提出了EVA,一种高效的基于向量量化的架构,解决了LLM解码中的计算和内存瓶颈。EVA基于一个简单而有效的见解,即结合输入-码本计算与无冲突内存访问。EVA不是从索引重建量化权重,而是直接在输入向量和权重码本之间执行点积,将LLM解码从GEMV计算转变为GEMM计算。然后,它从中间输出缓冲区执行结构化查找,消除了内存组冲突。我们进一步设计了一种硬件-软件协同优化的架构,专门用于LLM解码,同时保持与传统预填充执行的兼容性。评估表明,与最先进的基于查找的架构相比,EVA实现了最高11.17倍的加速和7.17倍的能效提升,同时保持了向量量化后的算术精度。我们的代码可在https://github.com/dbw6/Eva.git获取。

英文摘要

Large Language Models (LLMs) have achieved impressive performance across diverse domains but remain inefficient during the autoregressive decoding phase. Unlike the prefill stage, which employs compute-bound GEMM operations, decoding executes a sequence of small GEMV-like computations that are memory-bound and underutilize modern accelerators. Weight-only vector quantization (VQ) has emerged as an effective compression technique that clusters model weights into a shared codebook and replaces the original weight matrix with low-precision indices, enabling 2-bit-level weight compression. While this approach substantially reduces model size and memory bandwidth, it still suffers from two critical inefficiencies: the low utilization of GEMV computation and frequent memory conflicts during codebook lookups. This paper presents EVA, an efficient vector-quantization-based architecture that addresses both computational and memory bottlenecks in LLM decoding. EVA builds on a simple yet effective insight that combines input-codebook computation with conflict-free memory access. Instead of reconstructing quantized weights from indices, EVA directly performs dot products between input vectors and the weight codebook, transforming LLM decoding from GEMV to GEMM computation. It then performs structured lookups from an intermediate output buffer, eliminating memory bank conflicts. We further design a hardware-software co-optimized architecture specialized for LLM decoding while remaining compatible with conventional prefill execution. Evaluations show that EVA achieves up to 11.17$\times$ speedup and 7.17$\times$ higher energy efficiency compared with the SOTA lookup-based architecture, while preserving arithmetic precision after vector quantization. Our code is available at https://github.com/dbw6/Eva.git.

2605.24138 2026-05-26 cs.SE cs.AI

Understanding Conversational Patterns in Multi-agent Programming: A Case Study on Fibonacci Game Development

多智能体编程中的对话模式理解:以斐波那契游戏开发为例

Srijita Basu, Viktor Kjellberg, Simin Sun, Bengt Haraldsson, Md. Abu Ahammed Babu, Wilhelm Meding, Farnaz Fotrousi, Miroslaw Staron

发表机构 * Chalmers University of Technology University of Gothenburg Gothenburg Sweden Research \& Development, Volvo Car Corporation Gothenburg Sweden University of Gothenburg Research \& Development, Volvo Car Corporation

AI总结 本文通过分析12种开源LLM组合中设计者与程序员智能体的对话,揭示了多智能体交互的效率、一致性和有效性三个关键维度,发现DeepSeek-R1:DeepSeek-R1对能从首次迭代起稳定收敛到正确解,而其他组合则存在发散或错误共识问题。

Comments 10 pages, 7 figures, AIware, FSE 2026

详情
AI中文摘要

大型语言模型(LLM)越来越多地应用于软件工程(SE),但它们在自主、面向角色的协作方面的潜力仍远未得到充分探索。理解多个基于LLM的智能体如何协调、保持角色对齐并收敛到解决方案对SE至关重要,因为简单地让智能体交互并不能可靠地产生正确或稳定的结果。最近的实证研究表明,非结构化或理解不足的交互动态可能导致错误传播、对错误解决方案的过早共识,或阻止收敛的长期分歧,即使在交互早期存在正确的部分解决方案。作为解决这一未被充分探索领域的初步步骤,我们对两个智能体(设计者和程序员)之间的对话进行了系统分析,涉及来自7个开源LLM(Gemma 2、Gemma 3、LLaMA 3.2、LLaMA 3.3、DeepSeek-R1、MiniCPM和Qwen3)的12种模型组合。我们的系统方法揭示了多智能体交互的三个关键维度:效率(收敛的速度和稳定性)、一致性(通过BLEU和ROUGE可视化的角色对齐程度)和有效性(编译成功和错误解决的程度)。结果表明,DeepSeek-R1:DeepSeek-R1对从第一次迭代起就独特地收敛到正确解,并一致地保持到最终迭代,而LLaMA 3.2:LLaMA 3.2和Qwen3:Qwen3对尽管偏离了正确解,但表现出强烈的设计者:程序员角色对齐。其他对偏离了任务,从未收敛到结果。这些发现推进了对智能体编程的理解,并强调了进一步研究理解和校准收敛及停止条件的必要性,这对于未来的自主SE至关重要。

英文摘要

Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role-oriented collaboration remains largely underexplored. Understanding how multiple LLM-based agents coordinate, maintain role alignment, and converge on solutions is critical for SE, as naively allowing agents to interact does not reliably lead to correct or stable outcomes. Recent empirical studies show that unstructured or poorly understood interaction dynamics can result in error propagation, premature consensus on incorrect solutions, or prolonged disagreement that prevents convergence, even when correct partial solutions are present early in the interaction. As an initial step towards addressing this underexplored area, we undertake a systematic analysis of conversations between two agents, a Designer and a Programmer across 12 model combinations from 7 open-source LLMs (Gemma 2, Gemma 3, LLaMA 3.2, LLaMA 3.3, DeepSeek-R1, MiniCPM, and Qwen3). Our systematic approach reveals three key dimensions of multi-agent interaction: efficiency (the speed and stability of convergence), consistency (the degree of role alignment visualized by BLEU and ROUGE), and effectiveness (the extent of compilation success and error resolution). Results show that the DeepSeek-R1:DeepSeek-R1 pair was unique in converging to the correct solution from the very first iteration and sustaining it consistently to the final iteration, while LLaMA 3.2:LLaMA 3.2 and Qwen3:Qwen3 demonstrated strong Designer:Programmer role alignment despite of diverging from the correct solution. The other pairs deviated from the task, never to converge to a result. These findings advance understanding of agentic programming and highlight the need for further research on understanding and calibrating convergence and stop conditions essential for future autonomous SE.

2605.24137 2026-05-26 cs.SE cs.AI

Empirical Analysis and Detection of Hallucinations in LLM-Generated Bug Report Summaries

LLM生成的错误报告摘要中幻觉的经验分析与检测

Hinduja Nirujan, Shreyas Patil, Abdallah Ayoub, Ahmad Abdel Latif, Gouri Ginde

发表机构 * Electrical and Software Engineering(电气与软件工程学院)

AI总结 本研究从章节感知角度经验性地调查了LLM生成的错误报告摘要中的幻觉,提出了联合预测幻觉内容、识别受影响章节和分类幻觉类型的检测方法,并在BugsRepo数据集上取得了良好性能。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用于生成软件错误报告的摘要,包括诸如重现步骤(S2R)、实际行为(AB)和预期行为(EB)等章节。然而,这些模型经常产生看似可信但缺乏源报告支持的幻觉,这可能会误导开发者并降低对自动化维护工具的信任。现有的幻觉检测方法通常在完整响应级别评估输出,并未考虑技术文档的结构。一项对80个结构化错误报告摘要的初步探索性研究发现,约47.9%包含缺失信息,而12.3%包含捏造内容,凸显了在错误报告摘要中进行系统性幻觉分析的必要性。在这项工作中,我们从章节感知的角度经验性地调查了LLM生成的错误报告摘要中的幻觉。利用源自Mozilla OSS项目的BugsRepo数据集,我们引入了受控的合成幻觉注入,以构建用于训练和评估的基准。我们提出了一种章节感知的幻觉检测方法,该方法联合预测摘要是否包含幻觉内容、识别受影响的章节,并对幻觉类型进行分类。在多个预训练语言模型上的实验结果表明,所提出的方法在所有任务上均取得了强劲性能,最佳模型在报告级别上获得了0.89的Macro-F1,在章节级别上获得了0.83的Macro-F1,在幻觉类型上获得了0.84的Macro-F1。我们进一步分析了常见的幻觉模式和模型失败模式,以更好地理解当前LLM生成的错误报告摘要的局限性。研究结果强调了章节感知的幻觉分析对于提高软件维护工作流中LLM辅助错误报告摘要可靠性的重要性。

英文摘要

Large Language Models (LLMs) are increasingly used to generate summaries of software bug reports, including sections such as Steps-to-Reproduce (S2R), Actual Behavior (AB), and Expected Behavior (EB). However, these models frequently produce hallucinations that can be convincing but unsupported by the source report. This can mislead developers and reduce trust in automated maintenance tools. Existing hallucination detection approaches typically evaluate outputs at the full-response level and do not consider the structure of technical documents. An initial exploratory study on 80 structured bug report summaries found that approximately 47.9% contained missing information, while 12.3% included fabricated content, highlighting the need for systematic hallucination analysis in bug report summarization. In this work, we empirically investigate hallucinations in LLM-generated bug report summaries from a section-aware perspective. Using the BugsRepo dataset, derived from Mozilla OSS projects, we introduce controlled synthetic hallucination injection to construct a benchmark for training and evaluation. We propose a section-aware hallucination detection approach that jointly predicts whether a summary contains hallucinated content, identifies affected sections, and classifies hallucination types. Experimental results across multiple pretrained language models show that the proposed approach achieves strong performance across all tasks, with the best model obtaining 0.89 report-level Macro-F1, 0.83 section-level Macro-F1, and 0.84 hallucination-type Macro-F1. We further analyze common hallucination patterns and model failure modes to better understand limitations of current LLM-generated bug report summaries. The findings highlight the importance of section-aware hallucination analysis for improving the reliability of LLM-assisted bug report summarization in software maintenance workflows.

2605.24096 2026-05-26 cs.DB cs.AI cs.DC cs.SE

The Time is Here for Just-in-Time Systems: Challenges and Opportunities

即时系统的时代已到来:挑战与机遇

Shu Liu, Alexander Krentsel, Shubham Agarwal, Mert Cemri, Ziming Mao, Soujanya Ponnapalli, Alexandros G. Dimakis, Sylvia Ratnasamy, Matei Zaharia, Aditya Parameswaran, Ion Stoica

发表机构 * UC Berkeley(加州大学伯克利分校) Bespoke Labs

AI总结 本文提出基于LLM的即时系统合成方法Jitskit,通过从零开始合成专用键值存储系统,在18个规格上性能超越现有系统最高达4.6倍。

Comments preprint

详情
AI中文摘要

像键值存储这样的核心系统历史上需要数年时间构建,并且设计为通用型以分摊跨部署的成本,但付出了显著的性能代价。我们认为基于LLM的编码代理现在使一种不同的方法变得可行:即时系统,其中整个系统从零开始合成,专门针对环境、工作负载和所需的系统属性。我们提出了一个即时系统合成流水线Jitskit,并探索了其在从跨不同YCSB工作负载、部署约束(如计算资源)和系统属性(如一致性和持久性)的规格卡中合成键值存储的有效性。Jitskit迭代地改进系统实现,以匹配针对不断演化的评估测试套件的规格。生成的合成系统性能优越,在尝试的18个规格中的18个上击败了可比的最新系统,在最有利的规格上比最佳现成基线高出4.6倍。直接运行Claude Code要么奖励黑客,要么性能比Jitskit差高达5.4倍。我们讨论了构建Jitskit过程中克服的挑战和关键收获。

英文摘要

Core systems like key-value stores have historically taken years to build, and are designed to be general so as to amortize cost across deployments, paying a significant performance cost. We argue that LLM-based coding agents now make a different approach tractable: Just-in-Time Systems, in which the entire system is synthesized from scratch, specialized to the environment, workload, and required system properties. We present a JIT system synthesis pipeline, Jitskit, and explore its effectiveness in synthesizing key-value stores from spec cards that span different YCSB workloads, deployment constraints (e.g., compute resources), and system properties (e.g., consistency and durability). Jitskit iteratively refines a system implementation to match the specification against an evolving evaluation test suite. The resulting synthesized systems are performant, beating comparable state-of-the-art systems on 18 of 18 specs tried, by up to 4.6x over the best off-the-shelf baseline on the most favorable spec. Naively running Claude Code either reward-hacks or underperforms Jitskit by up to 5.4x. We discuss the challenges we overcame in building Jitskit and our key takeaways.

2605.24079 2026-05-26 cs.SE cs.AI cs.CL

TRACER: A Semantic-Aware Framework for Fine-Grained Contamination Detection in Code LLMs

TRACER: 一种用于代码大语言模型中细粒度污染检测的语义感知框架

Yifeng Di, Xuliang Huang, Tianyi Zhang

发表机构 * Purdue University West Lafayette, IN(帕克大学韦斯特拉法叶分校) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出TRACER框架,通过三级语义重叠和粗到细流水线检测代码LLM中的细粒度数据污染,在基准测试中F1达0.91。

Comments 21 pages, 2 figures, 15 tables

详情
AI中文摘要

数据污染是对模型评估可靠性的已知威胁。然而,在代码大语言模型(LLM)中,污染往往超出精确重复,这一问题仍未得到充分探索。我们提出了TRACER,一种用于细粒度代码污染检测的语义感知框架。TRACER使用三级语义重叠——功能相同、几乎相同和共享逻辑——对污染进行建模,并通过粗到细的流水线进行检测。我们还引入了首个细粒度代码污染检测基准,涵盖三个广泛使用的基准和三个具有代表性的后训练数据集。TRACER在多个LLM骨干网络上取得了强大且一致的性能,其中GPT-5在细粒度检测中F1分数达到0.91。在二分类设置中,TRACER的F1达到0.92,比现有方法高出42%-217%。我们进一步进行了消融研究和错误分析,以评估TRACER中各个组件的贡献。

英文摘要

Data contamination is a known threat to the reliability of model evaluation. However, it remains underexplored in code large language models (LLMs), where contamination often goes beyond exact duplication. We present TRACER, a semantic-aware framework for fine-grained code contamination detection. TRACER models contamination using three levels of semantic overlap - Functionally Identical, Nearly Identical, and Shared Logic - and detects them through a coarse-to-fine pipeline. We also introduce the first benchmark for fine-grained code contamination detection, spanning three widely used benchmarks and three representative post-training datasets. TRACER achieves strong and consistent performance across multiple LLM backbones, with GPT-5 reaching an F1 score of 0.91 in fine-grained detection. In the binary setting, TRACER attains an F1 of 0.92, outperforming existing methods by 42%-217%. We further conduct ablation studies and error analysis to assess the contributions of individual components in TRACER.

2605.24077 2026-05-26 eess.SP cs.LG

LWM-CDE: A Representation Space for Wireless Data Reasoning and Transferability

LWM-CDE:无线数据推理与可迁移性的表示空间

Sadjad Alikhani, Akshay Malhotra, Shahab Hamidi-Rad, Ahmed Alkhateeb

发表机构 * School of Electrical, Computer and Energy Engineering, Arizona State University(电气、计算机与能源工程学院,亚利桑那州立大学) InterDigital, Inc.(InterDigital公司)

AI总结 提出基于预训练无线基础模型特征空间的数据集相似性框架LWM-CDE,通过对比学习和几何形状损失微调数据集嵌入,构建距离可靠指示可迁移性的结构化流形,在无线基准测试中比现有指标更高效且与经验迁移性能相关性更强。

Comments The model and relevant scripts are available on the WILab Hugging Face page: https://huggingface.co/wi-lab

详情
AI中文摘要

机器学习在真实世界无线通信任务中的部署面临显著的泛化挑战,原因包括信号结构对位置和环境的依赖性、不同部署场景下数据的高度多样性以及真实世界数据的有限可用性。当前用于评估训练分布与推理(部署)分布之间数据相似性以及模型可迁移性的方法存在计算成本高和性能不一致的问题,导致关键的模型部署和模型生命周期管理决策缺乏原则性基础。为了解决这一问题,我们引入了一个基于预训练无线基础模型特征空间的数据集相似性框架。我们的方法LWM-CDE(对比学习数据集嵌入)通过结合对比损失和几何形状损失对基础模型的数据集嵌入进行微调,构建了一个结构化流形,其中距离可靠地指示可迁移性。在无线基准测试上的大量实验表明,LWM-CDE与经验迁移性能的相关性比现有指标更强,同时计算效率更高。学习到的表示空间支持更有效且数据高效的决策,例如源数据集选择、标签感知增强和预算预训练,展示了其在各种无线通信应用中的广泛实用性。

英文摘要

Machine learning deployments in real-world wireless communication tasks face significant generalization challenges due to location and environment-specific signal structure, high diversity in data across different deployments, and limited availability of real-world data. Current approaches for assessing data similarity between training and inference (deployment) distributions, as well as evaluating model transferability, suffer from high computational costs and inconsistent performance, leaving critical model deployment and model life cycle management decisions without a principled foundation. To address this, we introduce a dataset similarity framework built upon the feature space of a pretrained wireless foundation model. Our method, LWM-CDE (Contrastive learning of Dataset Embedding), fine-tunes the dataset embeddings of the foundation model using a combination of contrastive and geometry-shaping losses, creating a structured manifold where distance reliably indicates transferability. Extensive experiments on wireless benchmarks show that LWM-CDE achieves stronger correlation with empirical transfer performance than existing metrics while being more computationally efficient. The learned representation space supports more effective and data-efficient decision-making for tasks like source dataset selection, label-aware augmentation, and budgeted pretraining, demonstrating its broader utility across different wireless communication applications.

2605.24076 2026-05-26 stat.ML cs.LG

Causality as the Statistical Conscience of Artificial Intelligence: From Pearl's Ladder to Trustworthy Machines

因果关系作为人工智能的统计良知:从珀尔的阶梯到可信机器

Ernest Fokoué

发表机构 * School of Mathematics and Statistics, College of Science(数学与统计学学院,科学学院)

AI总结 本文论证因果推断是AI不可或缺的统计良知,通过统计必要性定理、统一因果统计估计框架以及三种AI失败模式的因果盲区分析,提出可信AI本质上是因果统计问题。

Comments 18 pages, 4 figures, 1 table

详情
AI中文摘要

现代人工智能通过优化大规模语料库上的统计风险函数实现了卓越的预测能力。然而,这与其正智能之间存在差距:无法区分相关性与因果关系。本文认为,因果推断(识别干预下不变的机制)是人工智能不可或缺的统计良知。没有因果基础,AI系统只是相关机器:在熟悉领域强大,在分布偏移下脆弱,在高风险场景中存在偏见。三个贡献发展了这一论点。首先,因果泛化的统计必要性定理:任何实现分布外泛化的算法必须编码因果结构,形式化了预测P(Y|X)与智能P(Y|do(X))之间的区别。其次,一个统一框架将珀尔的do演算、潜在结果框架、双机器学习以及不变风险最小化连接为一系列因果统计估计量,每个估计量在不同假设下识别干预分布。第三,三种AI失败模式(大语言模型中的幻觉、基于人类反馈的强化学习中的奖励黑客以及分布偏移下的退化)是因果盲区的表现,每种都有原则性的统计补救措施。可信AI的核心是一个因果统计问题。统计界不仅有能力解决它——而且是唯一拥有严格解决所需基础工具的群体。

英文摘要

Modern Artificial Intelligence achieves remarkable predictive power by optimizing statistical risk functionals over vast corpora. Yet a gap separates this from genuine intelligence: the inability to distinguish correlation from causation. This paper argues that causal inference (identifying mechanisms invariant under intervention) is AI's indispensable statistical conscience. Without causal grounding, AI systems are correlation machines: powerful in familiar domains, brittle under distribution shift, and biased in high-stakes settings. Three contributions develop this argument. First, a Statistical Necessity Theorem for Causal Generalization: any algorithm achieving out-of-distribution generalization must encode causal structure, formalizing the distinction between prediction P(Y|X) and intelligence P(Y|do(X)). Second, a unified framework connects Pearl's do-calculus, the Potential Outcomes framework, Double Machine Learning, and Invariant Risk Minimization as a family of Causal Statistical Estimators, each identifying interventional distributions under different assumptions. Third, three AI failure modes (hallucination in large language models, reward hacking in reinforcement learning from human feedback, and degradation under distribution shift) are manifestations of causal blindness, each admitting a principled statistical remedy. Trustworthy AI is, at its core, a problem of causal statistics. The statistical community is not merely equipped to solve it -- it is the only community with the foundational tools to do so rigorously.

2605.24073 2026-05-26 physics.chem-ph cs.LG

Multitask learning with semiempirical orbital charges enables sample-efficient MLIPs

基于半经验轨道电荷的多任务学习实现样本高效的机器学习原子间势

Ihor Neporozhnii, Sjoerd Hoogland, Oleksandr Voznyy

发表机构 * Department of Physical and Environmental Sciences(物理与环境科学系) Department of Electrical and Computer Engineering(电气与计算机工程系) The Alliance for AI-Accelerated Materials Discovery(人工智能加速材料发现联盟)

AI总结 提出利用半经验轨道电荷进行多任务学习,通过等变模型预测轨道电荷,显著提升机器学习原子间势的样本效率和精度,减少46%能量误差并降低五倍数据需求。

Comments 16 pages, 6 figures

详情
AI中文摘要

机器学习原子间势(MLIPs)需要生成计算昂贵的大规模训练数据集,以准确模拟材料和分子。利用多任务学习结合电子结构信息可提高样本效率,然而,在全哈密顿矩阵(随原子数二次缩放)上训练对于大数据集是难以处理的。在这项工作中,我们表明利用轨道分辨的半经验电荷进行多任务学习显著提高了MLIPs的样本效率和精度。为了高效预测轨道电荷,我们实现了一个专门的等变模型,与不变基线相比降低了电荷预测误差。通过使用计算成本低、随原子数线性缩放的GFN1-xTB轨道电荷增强训练,我们的模型实现了能量平均绝对误差降低46%,并且仅需五分之一的数据即可达到仅能量模型的性能。此外,我们的方法优于在昂贵的密度泛函理论(DFT)原子电荷上训练的模型,捕捉了轨道分辨的电子复杂性,并迫使网络学习一个物理准确的潜在空间,该空间根据共享化学性质自发地对金属进行聚类。由于轨道电荷仅在训练期间需要,这种方法保持了推理效率,为开发复杂化学系统的准确、数据高效的基础模型提供了可扩展的方案。

英文摘要

Machine learning interatomic potentials (MLIPs) require generating computationally expensive, large-scale training datasets to accurately simulate materials and molecules. Incorporating electronic structure information using multitask learning improves sample efficiency, however, training on full Hamiltonian matrices, which scale quadratically with the number of atoms, is intractable for large datasets. In this work, we show that multitask learning utilizing orbitally resolved semiempirical charges significantly improves sample efficiency and accuracy in MLIPs. To efficiently predict orbital charges, we implement a specialized equivariant model, reducing charge prediction error compared to an invariant baseline. By augmenting training with computationally inexpensive GFN1-xTB orbital charges, which scale linearly with the number of atoms, our model achieves a 46\% reduction in energy mean absolute error and requires five times less data to match the performance of energy-only models. Furthermore, our approach outperforms models trained on expensive density functional theory (DFT) atomic charges, capturing orbitally resolved electronic complexity and forcing the network to learn a physically accurate latent space that spontaneously clusters metals by shared chemical properties. Because orbital charges are only required during training, this approach preserves inference efficiency, providing a scalable recipe for developing accurate, data-efficient foundation models for complex chemical systems.

2605.24072 2026-05-26 stat.ML cs.LG math.PR

Optimal Non-Asymptotic Edgeworth Expansions for Multivariate Neural Network Outputs

多元神经网络输出的最优非渐近 Edgeworth 展开

Lucia Celli

发表机构 * Department of Mathematics, University of Luxembourg(卢森堡大学数学系)

AI总结 针对有限宽度全连接神经网络输出,利用任意阶 Edgeworth 展开逼近其与高斯极限的偏差,并给出总变差距离的上下界。

Comments 34 pages, 2 figures

详情
AI中文摘要

具有高斯初始化权重的有限宽度全连接神经网络偏离其无限宽度高斯极限,表现出非消失的高阶累积量。我们针对在有限个输入上评估的神经网络,使用任意阶 $4m-1$($m\in\mathbb{N}$)的多维 Edgeworth 展开来逼近这些偏差。假设相应的高斯极限具有可逆协方差矩阵且激活函数为多项式有界,我们在真实网络输出分布与其 Edgeworth 逼近之间的总变差距离上建立了 $n^{-m}$ 阶的界,并给出了匹配的下界。作为一个应用,我们量化了当先验被其 Edgeworth 展开替代时贝叶斯后验分布的误差。我们的结果更具一般性,也适用于收敛到具有可逆协方差的高斯向量的条件高斯向量序列。

英文摘要

Finite-width fully connected neural networks with Gaussian-initialized weights deviate from their infinite-width Gaussian limit, exhibiting non-vanishing higher-order cumulants. We approximate these deviations, for a neural network evaluated in a finite number of inputs, using multidimensional Edgeworth expansions of arbitrary order $4m-1$, with $m\in\mathbb{N}$. Assuming that the corresponding Gaussian limit has an invertible covariance matrix and that the activation function is polynomially bounded, we establish a bound of order $n^{-m}$ on the total variation distance between the law of the true network output and its Edgeworth approximation, with matching lower bounds. As an application, we quantify the error in Bayesian posterior distributions when the prior is replaced by its Edgeworth expansion. Our results are more general and also apply to sequences of conditionally Gaussian vectors converging to a Gaussian vector with invertible covariance.

2605.24069 2026-05-26 cs.CR cs.AI

When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents

当手册撒谎:评估LLM智能体MCP投毒攻击的现实基准

Shi Liu, Xuehai Tang, Xikang Yang, Liang Lin, Biyu Zhou, Wenjie Xiao, Wantao Liu

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学网络安全学院)

AI总结 针对LLM智能体通过模型上下文协议(MCP)集成外部工具时面临的工具描述投毒(TDP)攻击,提出MCP-TDP安全基准,包含32个真实测试用例,评估8种主流LLM发现严重漏洞,并提出反应性自我纠正防御机制。

详情
AI中文摘要

使用工具的大型语言模型(LLM)智能体的兴起,通过模型上下文协议(MCP)等协议标准化,通过集成外部开放领域知识和工具,为LLM智能体解锁了前所未有的自主执行能力。然而,这种互操作性引入了一个针对智能体认知规划层的隐蔽攻击面。本文系统性地研究了工具描述投毒(TDP),一种新颖的语义攻击。在TDP中,恶意指令并非嵌入工具的可执行代码,而是隐蔽地注入其描述性元数据——即智能体依赖进行安全规划和决策的“手册”。为了严格系统地评估这一新兴威胁,我们引入了MCP-TDP安全基准。这个高保真沙箱环境包含32个跨越6个不同风险类别的真实测试用例。我们对8种主流LLM的评估揭示了严重漏洞,领先模型如GPT-4o在六个高风险场景中表现出近100%的攻击成功率(ASR)。此外,我们的发现表明,常见的提示护栏防御基本无效,并且可能适得其反(我们称之为“防火墙谬误”)。关键的是,我们还提出了一种防御机制:“反应性自我纠正”,即智能体在执行后自主检测并撤销其恶意行为。这项工作为TDP提供了第一个专门的安全基准,为保护高级智能体系统的认知和规划层提供了重要见解。

英文摘要

The rise of tool-using Large Language Model (LLM) agents, standardized by protocols like the Model Context Protocol (MCP), has unlocked unprecedented autonomous execution capabilities for LLM Agents by integrating external open-domain knowledge and tools. However, this interoperability introduces a covert attack surface targeting the agent's cognitive planning layer. This paper systematically investigates Tool Description Poisoning (TDP), a novel semantic attack. In TDP, malicious instructions are not embedded in a tool's executable code, but rather covertly injected into its descriptive metadata, the very "manual" an agent relies on for secure planning and decision-making. To rigorously and systematically evaluate this emerging threat, we introduce the MCP-TDP Security Benchmark. This high-fidelity sandbox environment comprises 32 realistic, real-world test cases spanning 6 distinct risk categories. Our evaluation of 8 mainstream LLMs reveals severe vulnerabilities, with leading models like GPT-4o exhibiting a nearly 100% Attack Success Rate (ASR) in six high-risk scenarios. Furthermore, our findings demonstrate that common prompt-guardrail defenses are largely ineffective and can, counterintuitively, even be counterproductive (a phenomenon which we term the "Firewall Fallacy"). Crucially, we also propose a defense mechanism: "Reactive Self-Correction," where an agent autonomously detects and reverts its own malicious actions post-execution. This work provides the first specialized security benchmark tailored for TDP, offering essential insights for securing the cognitive and planning layers of advanced agentic systems.

2605.24067 2026-05-26 physics.ao-ph cs.LG

Seeing Inside the Storm: Improving Nowcasting by Integrating Meteorological Drivers

洞察风暴内部:通过整合气象驱动因子改进临近预报

Minghui Qiu, Jun Chen, Lin Chen, Weifeng Chen, Shuxin Zhong, Zhidan Liu, Yu Zhang, Kaishun Wu

发表机构 * Guangzhou Meteorological Observatory(广州气象局)

AI总结 提出MetroLogist框架,通过物理定制的编码器、时间相位对齐器和跨场空间聚合器,整合热力学、动力学和微物理驱动因子,实现风暴生命周期的完整建模,显著提升临近预报性能。

详情
AI中文摘要

大多数基于雷达反射率的临近预报系统关注当前降水,忽略了大气前兆——如低层辐合、湍流涡旋和潜热加热——这些为预见风暴诞生提供了短暂窗口。我们提出了MetroLogist,一个受物理启发的雷达智能框架,模拟从风暴前兆到组织化演变的完整对流生命周期。然而,利用这些前兆并非易事:它们源自多个气象驱动因子——热力学、动力学和微物理——这些因子异步演化(C1)且在空间上分散(C2)。为此,MetroLogist设计了三个紧密集成的组件。物理定制编码器根据雷达回波的内在物理尺度和语义进行处理,形成热力学、动力学和微物理流,捕捉不同的动力机制。时间相位对齐器通过利用因果时间注意力来捕捉不同驱动因子何时以及如何相互作用和激活,从而解决C1。跨场空间聚合器通过跨区域融合,对齐相邻单元中微弱且分散的前兆,以暴露上游触发因素并强制空间一致性,从而解决C2。在3D-NEXRAD(2020-2022,全美范围)上的评估显示,MetroLogist在高影响检测(CSI40)上比强基线提升了+9.7%,并在风暴发展阶段实现了37.67%的显著增益——展示了在风暴出现之前感知它们的真正预见能力。代码可在补充材料中找到。

英文摘要

Most nowcasting systems, built on radar reflectivity, focus on current precipitation, ignoring the atmospheric precursors -- such as low-level convergence, turbulent eddies, and latent heating -- that offer a fleeting window to foresee storm birth. We introduce MeteoLogist, a physics-inspired radar intelligence framework that models the full life cycle of convection -- from its precursors to organized storm evolution. However, exploiting these precursors is non-trivial: they originate from multiple meteorological drivers -- thermodynamic, kinematic, and microphysical -- that evolve asynchronously (C1) and remain spatially fragmented (C2). To this end, MeteoLogist designs three tightly integrated components. The Physics-Tailored Encoders process radar echoes according to their intrinsic physical scales and semantics, forming thermodynamic, kinematic, and microphysical streams that capture distinct dynamical regimes. The Temporal-Phase Aligner addresses C1 by leveraging causal temporal attention to capture when and how different drivers interact and activate. The Cross-Field Spatial Aggregator addresses C2 through cross-regional fusion, aligning weak and scattered precursors across neighboring cells to expose upstream triggers and enforce spatial coherence. Evaluated on 3D-NEXRAD (2020--2022, US-wide), MeteoLogist boosts high-impact detection (CSI40) by +9.7% over strong baselines, and achieves a remarkable 37.67% gain during the storm-developing stage -- demonstrating true foresight in sensing storms before they appear. The code can be found in the supplementary material.

2605.24050 2026-05-26 cs.SE cs.AI stat.AP

More Skills, Worse Agents? Skill Shadowing Degrades Performance When Expanding Skill Libraries

更多技能,更差智能体?扩展技能库时技能遮蔽降低性能

Hongwen Song, Song, Wei

发表机构 * Databricks Inc.(Databricks公司)

AI总结 本文研究LLM智能体技能库扩展导致性能下降的现象,提出将性能下降分解为技能遮蔽和上下文开销两种效应,并通过实验证明技能遮蔽是主要瓶颈。

详情
AI中文摘要

技能库允许LLM智能体按需加载任务特定指令,使非专家用户能够通过自然语言解决领域特定任务,而无需知道存在哪些技能或它们如何工作。然而,随着技能库的增长,性能会下降——当从一组已知有用的小技能扩展到包含202个技能的库时,性能下降高达21%。在这项工作中,我们将这种性能下降定义为从加载已知有用技能库到加载完整技能库之间的通过率下降。此外,我们提出通过条件化技能调用——即智能体在轨迹中选择哪些技能——将通过率下降分解为两种效应:\emph{技能遮蔽},即随着技能库扩展,智能体更频繁地选择错误技能;以及\emph{上下文开销},即即使选择正确,扩大的上下文也会降低执行性能。我们推导了这两种效应的上界,以表征它们对通过率下降的影响程度。我们对效应及其上界的经验估计均表明,\emph{技能遮蔽}效应随技能库大小增长,并对性能下降有显著贡献,而\emph{上下文开销}效应仍然很小且与零无显著差异。这种观察到的非对称性表明,技能选择失败(而非上下文扩大)是扩展技能库时的主要瓶颈。

英文摘要

Skill libraries allow LLM agents to load task-specific instructions on demand, letting non-expert users solve domain-specific tasks through natural language without knowing which skills exist or how they work. However, performance degrades as libraries grow -- by up to 21\% when scaling from a small set of helpful skills to a 202-skill library. In this work, we formulate this performance degradation as the pass rate drop between loading a library of known-helpful skills and the full library. Moreover, we propose to decompose the pass rate drop by conditioning on the skill(s) invocation -- which skills the agent selects during a trajectory -- into two effects: \emph{skill shadowing}, where the agent selects wrong skills more often as the library expands, and \emph{context overhead}, where the enlarged context degrades execution even when selection is correct. We derive upper bounds on both effects to characterize their magnitudes of impacts to the pass rate drop. Our empirical estimates of the effects and their upper bounds both show that the \emph{skill shadowing} effect grows with library size and significantly contributes to the performance degradation, whereas the \emph{context overhead} effect remains small and indistinguishable from zero. This observed asymmetry establishes that the skill selection failure, not the enlarged context, is the primary bottleneck when expanding the skill libraries.

2605.24031 2026-05-26 q-fin.CP cs.LG

Volatility Surface Reconstruction using Deep Learning under No-Arbitrage Constraints

无套利约束下使用深度学习进行波动率曲面重建

Pablo Rodriguez Manzi

发表机构 * Universidad de Buenos Aires(布宜诺斯艾利斯大学)

AI总结 研究使用深度学习模型在无套利约束下从稀疏噪声期权报价重建隐含波动率曲面,比较多种神经网络架构与经典SVI参数化方法,发现Transformer和U-Net在稀疏观测下重建精度高,软套利惩罚有效减少套利违规。

Comments MSc thesis, Universidad de Buenos Aires, 2026. 94 pages, 27 figures

详情
AI中文摘要

我们研究了使用深度学习模型在无套利约束下从稀疏和噪声期权报价重建隐含波动率曲面。我们比较了多种神经架构,包括多层感知机、卷积网络、U-Net、变分自编码器和基于Transformer的模型与经典SVI参数化方法在期权市场数据上的表现。结果表明,Transformer和U-Net架构实现了较强的重建精度,特别是在稀疏观测情况下,而软套利惩罚显著减少了套利违规,对重建误差影响适中。我们进一步分析了不同架构和正则化强度下精度与套利一致性之间的权衡。

英文摘要

We study the reconstruction of implied volatility surfaces from sparse and noisy option quotes using deep learning models under no-arbitrage constraints. We compare multiple neural architectures, including multilayer perceptrons, convolutional networks, U-Nets, variational autoencoders, and Transformer-based models against classical SVI parameterizations on option market data. Results show that Transformer and U-Net architectures achieve strong reconstruction accuracy, particularly under sparse observation regimes, while soft arbitrage penalties significantly reduce arbitrage violations with moderate impact on reconstruction error. We further analyze the trade-off between accuracy and arbitrage consistency across architectures and regularization strengths.

2605.24016 2026-05-26 cs.AR cs.AI

SA-Kura: An Energy-Efficient Systolic Array Accelerator for Locally-Coupled Kuramoto Drift in Diffusion Sampling

SA-Kura: 用于扩散采样中局部耦合Kuramoto漂移的节能脉动阵列加速器

Jeongmin Jin, Kyeongwon Lee, Mundo Jeong, Jongin Choi, Woojoo Lee

发表机构 * National Research Foundation of Korea(韩国国家研究基金会) Institute of Information & communications Technology Planning & Evaluation(信息通信技术规划与评估院)

AI总结 针对扩散采样中局部耦合Kuramoto漂移的计算瓶颈,提出首个专用数字脉动阵列加速器SA-Kura,通过重新公式化耦合计算实现高效脉动执行,相比软件和GPU分别实现193倍和6.57倍加速。

Comments 8 pages, 6 figures, 1 table; ACM/IEEE ISLPED 2026 accepted paper

详情
AI中文摘要

扩散推理在边缘部署中仍然成本高昂,但现有加速器几乎完全专注于分数网络,因为标准漂移仅仅是微不足道的线性缩放。Kuramoto定向扩散用局部耦合的相位相互作用取代了这种微不足道的漂移,提高了采样效率,但引入了新的硬件瓶颈:在每个反向步骤中评估的中心依赖非线性5x5模板。该内核难以映射到传统的CNN加速器和面向矩阵的引擎。我们提出了SA-Kura,据我们所知,这是第一个专用于局部耦合Kuramoto漂移的数字脉动阵列加速器。通过将成对正弦耦合重新表述为独立于中心相位的邻居累加,然后进行单个中心依赖的乘减组合,SA-Kura消除了PE内的超越函数单元,并实现了具有寄存器级复用的规则脉动执行。SA-Kura以可综合RTL实现,集成到基于RISC-V的轻量级SoC中,在FPGA上原型验证,并通过45nm CMOS综合和功耗分析进行评估。仅对于漂移内核,与同一SoC平台上处理器内核上相同内核的软件执行相比,SA-Kura分别将延迟和能耗降低了193倍和69.4倍。与独立的Jetson Orin Nano CUDA实现相同内核相比,它快6.57倍,并且每像素能耗降低约46.0倍。

英文摘要

Diffusion inference remains costly for edge deployment, yet existing accelerators focus almost exclusively on score networks because standard drift is merely a trivial linear scaling. Kuramoto orientation diffusion replaces this trivial drift with locally coupled phase interactions, improving sampling efficiency but introducing a new hardware bottleneck: a center-dependent nonlinear 5 x 5 stencil evaluated at every reverse step. This kernel maps poorly to conventional CNN accelerators and matrix-oriented engines. We present SA-Kura, to our knowledge the first digital systolic-array accelerator dedicated to locally coupled Kuramoto drift. By reformulating pair-wise sinusoidal coupling into neighbor accumulation independent of the center phase followed by a single center-dependent multiply-subtract combination, SA-Kura eliminates in-PE transcendental units and enables regular systolic execution with register-level reuse. SA-Kura was implemented in synthesizable RTL, integrated into a lightweight RISC-V-based SoC, prototyped on FPGA, and evaluated through 45 nm CMOS synthesis and power analysis. For the drift kernel only, compared with software execution of the same kernel on the processor core in the same SoC platform, SA-Kura reduces latency and energy by 193x and 69.4x, respectively. Compared with a standalone Jetson Orin Nano CUDA implementation of the same kernel, it is 6.57x faster and achieves approximately 46.0x lower energy per pixel.