arXivDaily: arXiv daily academic digest, updated Monday to Friday
2509.12266 2026-05-18 q-bio.GN cs.LG

Genome-Factory: A Library for Tuning, Deploying, and Interpreting Genomic Foundation Models

Weimin Wu, Xuefeng Song, Yibo Wen, Qinjie Lin, Zhihan Zhou, Jerry Yao-Chieh Hu, Zhong Wang, Han Liu

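Affiliations: Center for Foundation Models and Generative AI & Department of Computer Science, Northwestern University, USA; School of Natural Sciences, University of California at Merced, USA; Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, USA; Systems Biology Division, Lawrence Berkeley National Laboratory, USA; Department of Statistics and Data Science, Northwestern University, USA

AI summary: This paper introduces Genome-Factory, the first integrated Python library for tuning, deploying, and interpreting genomic foundation models. The library simplifies genomic model development by unifying data collection, model tuning, inference, benchmarking, and interpretability analysis in one workflow. Its core contributions include automated data preprocessing, support for several model-tuning methods, embedding extraction and sequence generation, and a biological interpreter based on a sparse autoencoder, which together make genomic models more practical for real-world analysis.

Abstract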

We introduce Genome-Factory, the first integrated Python library for tuning, deploying, and interpreting genomic foundation models. Our core contribution is to simplify and unify the workflow for genomic model development: data collection, model tuning, inference, benchmarking, and interpretability. For data collection, Genome-Factory offers an automated pipeline to download genomic sequences and preprocess them. For model tuning, Genome-Factory supports both full and parameter-efficient fine-tuning across diverse genomic models. For inference, Genome-Factory enables both embedding extraction and DNA sequence generation. For benchmarking, we include two existing benchmarks and provide a flexible interface to incorporate additional benchmarks. For interpretability, Genome-Factory introduces an open-source biological interpreter based on a sparse auto-encoder. We validate the utility of Genome-Factory across three dimensions: (i) Compatibility with diverse models and fine-tuning methods; (ii) Benchmarking downstream performance using two open-source benchmarks; (iii) Biological interpretation of learned representations with DNABERT-2. These results highlight its practical value for real-world genomic analysis. GitHub: https://github.com/WeiminWu2000/Genome_Factory.

2509.01685 2026-05-18 stat.ML cs.LG math.OC stat.CO

Preconditioned Regularized Wasserstein Proximal Sampling

Hong Ye Tan, Stanley Osher, Wuchen Li

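Affiliations: Department of Mathematics, University of California, Los Angeles; Department of Mathematics, University of South Carolina

AI summary: This paper studies sampling from a Gibbs distribution by evolving finitely many particles and proposes a preconditioned regularized Wasserstein proximal sampling method. The method approximates the score function with the numerically tractable score of a regularized Wasserstein proximal operator, and its kernel form is derived through a Cole-Hopf transformation on anisotropic heat equations. Experiments show acceleration and stability advantages on various log-concave and non-log-concave distributions, on Bayesian image deconvolution, and on neural network training.

Abstract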

We consider sampling from a Gibbs distribution by evolving finitely many particles. We propose a preconditioned version of a recently proposed noise-free sampling method, which approximates the score function with the numerically tractable score of a regularized Wasserstein proximal operator. This is derived by a Cole-Hopf transformation on coupled anisotropic heat equations, yielding a kernel formulation for the preconditioned regularized Wasserstein proximal. The diffusion component of the proposed method can also be interpreted as a modified self-attention block, as in transformer architectures. For quadratic potentials, we provide a discrete-time non-asymptotic convergence analysis and explicitly characterize the bias, which depends on the regularization and is independent of the step size. Experiments demonstrate acceleration and particle-level stability on examples ranging from log-concave and non-log-concave toy problems to Bayesian total-variation regularized image deconvolution, and competitive or better performance on non-convex Bayesian neural network training when utilizing variable preconditioning matrices.

2508.16114 2026-05-18 astro-ph.GA astro-ph.IM astro-ph.SR cs.LG

Neural-Network Chemical Emulator for First-Star Formation: Robust Iterative Predictions over a Wide Density Range

Sojun Ono, Kazuyuki Sugimura

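Affiliations: Department of Astronomy, Kyoto University; Faculty of Science, Hokkaido University

AI summary: This paper presents a neural-network chemical emulator for studying thermal and chemical evolution during the formation of the first stars (Population III). The emulator covers a density range spanning 21 orders of magnitude (10$^{-3}$-10$^{18}$ cm$^{-3}$) and accurately tracks the evolution of six primordial species. To make predictions more robust and efficient, the study introduces a timescale-based update method and trains separate deep operator networks for different density intervals, substantially increasing computation speed while preserving accuracy over many iterations.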

Comments 19 pages, 7 figures, Accepted for publication in ApJ

Journal ref ApJ, 996, 9 (2026)

Abstract

We present a neural-network emulator for the thermal and chemical evolution in Population III star formation. The emulator accurately reproduces the thermochemical evolution over a wide density range spanning 21 orders of magnitude (10$^{-3}$-10$^{18}$ cm$^{-3}$), tracking six primordial species: H, H$_2$, e$^{-}$, H$^{+}$, H$^{-}$, and H$_2^{+}$. To handle the broad dynamic range, we partition the density range into five subregions and train separate deep operator networks (DeepONets) in each region. When applied to randomly sampled thermochemical states, the emulator achieves relative errors below 10% in over 90% of cases for both temperature and chemical abundances (except for the rare species H$_2^{+}$). The emulator is roughly ten times faster on a CPU and more than 1000 times faster for batched predictions on a GPU, compared with conventional numerical integration. Furthermore, to ensure robust predictions under many iterations, we introduce a novel timescale-based update method, where a short-timestep update of each variable is computed by rescaling the predicted change over a longer timestep equal to its characteristic variation timescale. In one-zone collapse calculations, the results from the timescale-based method agree well with traditional numerical integration even with many iterations at a timestep as short as 10$^{-4}$ of the free-fall time. This proof-of-concept study suggests the potential for neural network-based chemical emulators to accelerate hydrodynamic simulations of star formation.
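
The timescale-based update described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_emulator` is a hypothetical stand-in for a trained network (it returns the exact solution of a simple decay equation), the characteristic timescale `tau` is simply assumed to be known, and the linear rescaling by `dt / tau` is one plausible reading of "rescaling the predicted change".

```python
import math

def toy_emulator(x, dt, rate=2.0):
    """Hypothetical stand-in for a trained emulator: exact solution of dx/dt = -rate * x."""
    return x * math.exp(-rate * dt)

def timescale_update(x, dt, tau, emulator=toy_emulator):
    """Advance x by a short step dt using a prediction made over the longer step tau."""
    predicted_change = emulator(x, tau) - x       # change predicted over the long step tau
    return x + (dt / tau) * predicted_change      # linearly rescaled to the short step dt

x = 1.0
tau = 0.5    # assumed characteristic variation timescale (here 1 / rate)
dt = 1e-3    # short timestep, much smaller than tau
for _ in range(100):
    x = timescale_update(x, dt, tau)
```

In this sketch the emulator is only ever queried at the step `tau`, never at the much shorter `dt`, which is the property the abstract relies on for robustness over many iterations. When `dt` equals `tau`, the update reduces to a plain emulator call.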

2508.03810 2026-05-18 hep-th cs.LG

Viability of perturbative expansion for quantum field theories on neurons

Srimoyee Sen, Varun Vaidya

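Affiliations: Department of Physics and Astronomy, Iowa State University, Ames, Iowa 50011, USA; Department of Physics, University of South Dakota, Vermillion, SD 57069, USA

AI summary: This paper examines whether neural network architectures can be used for perturbative calculations in local quantum field theories at finite neuron number, using scalar $\phi^4$ theory in $d$ Euclidean dimensions as an example. It finds that the renormalized $O(1/N)$ corrections to the two- and four-point correlators yield perturbative series that are sensitive to the ultraviolet cut-off and converge weakly. The authors therefore propose a modification to the architecture and discuss how the theory's parameters and the neuron number must scale to extract accurate field theory results.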

Comments Published version

Abstract

Neural Network (NN) architectures that break statistical independence of parameters have been proposed as a new approach for simulating local quantum field theories (QFTs). In the infinite neuron number limit, single-layer NNs can exactly reproduce QFT results. This paper examines the viability of this architecture for perturbative calculations of local QFTs for finite neuron number $N$ using scalar $\phi^4$ theory in $d$ Euclidean dimensions as an example. We find that the renormalized $O(1/N)$ corrections to two- and four-point correlators yield perturbative series which are sensitive to the ultraviolet cut-off and therefore converge weakly. We propose a modification to the architecture to improve this convergence and discuss constraints on the parameters of the theory and the scaling of $N$ which allow us to extract accurate field theory results.

2506.14829 2026-05-18 cs.HC cs.AI cs.LG

The Hardness of Achieving Impact in AI for Social Impact Research: A Ground-Level View of Challenges & Opportunities

Aditya Majumdar, Wenbo Zhang, Kashvi Prawal, Amulya Yadav

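Affiliations: The Pennsylvania State University

AI summary: This paper examines the main challenges and opportunities that AI for Social Impact (AI4SI) research faces in real-world deployment. Drawing on interviews with 26 AI4SI researchers, it analyzes the structural, organizational, communication, and collaboration barriers that keep AI4SI work from reaching deployment, and distills workable collaboration strategies and practical lessons. The study offers practical guidance for AI researchers and organizations seeking social impact.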

Comments To be published in FAccT'26

Abstract

AI for Social Impact (AI4SI) is an emergent field harnessing interdisciplinary connections between artificial intelligence (AI), machine learning (ML), and the social sciences to address societal issues aligned with the United Nations Sustainable Development Goals (UN SDGs), such as universal healthcare and climate action. Despite AI4SI's rising popularity, achieving tangible, on-the-ground impact remains a significant challenge. In particular, identifying collaborators open to co-designing and deploying AI4SI-based solutions in real-world settings is often difficult. Thus, many projects stall at the proof-of-concept stage, unable to scale to production-level deployment. Drawing on interviews with twenty-six AI4SI researchers, primarily from academic institutions though also including some industry researchers and practitioners, and on the authors' own lived experiences, this paper employs thematic analysis to highlight structural, organizational, communication, collaboration, and operational challenges hindering socially impactful AI4SI deployments. While there are no easy fixes, the authors synthesize best practices and actionable strategies from interviews and personal experiences, positioning this paper as a practical guide for AI4SI researchers and organizations pursuing socially impactful collaborations. Footnote: We note that our findings are most directly applicable to academic research groups in the global north, as the perspectives of governmental, startup, and global south researchers are underrepresented in our sample.

2506.00182 2026-05-18 stat.ML cs.IT cs.LG math.IT math.ST stat.TH

Overfitting has a limitation: a model-independent generalization gap bound based on Rényi entropy

Atsushi Suzuki, Jing Wang

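Affiliations: Department of Mathematics, Faculty of Science, The University of Hong Kong, Hong Kong SAR; School of Computing and Mathematical Sciences, Faculty of Engineering and Science, University of Greenwich, London, United Kingdom

AI summary: This paper studies limits on the generalization ability of machine learning models and proposes a model-independent upper bound on the generalization gap that depends only on the Rényi entropy of the data-generating distribution. It shows that even with arbitrarily large models, the generalization gap can stay small as long as the amount of data is sufficient relative to the Rényi entropy. The framework explains why injecting noise into data degrades performance and also extends the no-free-lunch theorem, underscoring the key role of the data distribution's entropy in successful learning.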

Abstract

Will further scaling up of machine learning models continue to bring success? A significant challenge in answering this question lies in understanding the generalization gap, which measures the impact of overfitting. Understanding how the generalization gap behaves for increasingly large-scale machine learning models remains a significant area of investigation, as conventional analyses often link error bounds to model complexity and thus fail to fully explain the success of extremely large architectures. This research introduces a novel perspective by establishing a model-independent upper bound on the generalization gap, applicable to algorithms whose outputs are determined solely by the data's histogram, such as empirical risk minimization or gradient-based methods. Crucially, this bound is shown to depend only on the Rényi entropy of the data-generating distribution, suggesting that a small generalization gap can be maintained even with arbitrarily large models, provided the data quantity is sufficient relative to this entropy. This framework offers a direct explanation for the phenomenon where generalization performance degrades significantly upon injecting random noise into data: the degradation is attributed to the consequent increase in the data distribution's Rényi entropy. Furthermore, we adapt the no-free-lunch theorem to be data-distribution-dependent, demonstrating that an amount of data corresponding to the Rényi entropy is indeed essential for successful learning, thereby highlighting the tightness of our proposed generalization bound.
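
As a small illustration of the quantity the bound depends on, the sketch below computes the Rényi entropy of a discrete distribution and shows that it grows when noise is injected. The order `alpha = 0.5`, the example distribution, and the noise model (mixing with the uniform distribution) are assumptions chosen for illustration; the abstract does not specify which order or noise model the paper uses.

```python
import math

def renyi_entropy(p, alpha):
    """Rényi entropy of order alpha (alpha > 0, alpha != 1) in nats."""
    return math.log(sum(q ** alpha for q in p if q > 0)) / (1.0 - alpha)

def mix_with_uniform(p, eps):
    """Assumed noise model: with probability eps, replace a sample by a uniform draw."""
    n = len(p)
    return [(1 - eps) * q + eps / n for q in p]

clean = [0.7, 0.2, 0.05, 0.05]          # illustrative data-generating distribution
noisy = mix_with_uniform(clean, 0.3)    # same distribution after noise injection

h_clean = renyi_entropy(clean, 0.5)
h_noisy = renyi_entropy(noisy, 0.5)
print(h_clean, h_noisy)                 # the noisy entropy is the larger of the two
```

Mixing with the uniform distribution flattens the probabilities, so the entropy rises; under the abstract's bound, more data would then be needed to keep the generalization gap small.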

2505.11708 2026-05-18 cs.CR cs.LG

Unveiling the Black Box: A Multi-Layer Framework for Explaining Reinforcement Learning-Based Cyber Agents

Diksha Goel, Kristen Moore, Jeff Wang, Minjune Kim, Thanh Thi Nguyen

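Affiliations: Monash University

AI summary: As reinforcement learning (RL) is increasingly used to simulate sophisticated cyberattacks, the opacity of its decision-making has become a key obstacle to trust, debugging, and defensive preparedness. This paper proposes a unified, multi-layer explainability framework that exposes the decision logic of RL-based attacker agents at the strategic (MDP) level and the tactical (policy) level, modelling cyberattacks as a Partially Observable Markov Decision Process (POMDP) and analysing how Q-values change over time to explain how attack behaviour evolves. The framework is agent- and environment-agnostic and provides interpretable behavioural insights for red-team simulation, policy debugging, threat modelling, and anticipatory defence.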

Abstract

Reinforcement Learning (RL) agents are increasingly used to simulate sophisticated cyberattacks, but their decision-making processes remain opaque, hindering trust, debugging, and defensive preparedness. In high-stakes cybersecurity contexts, explainability is essential for understanding how adversarial strategies are formed and evolve over time. In this paper, we propose a unified, multi-layer explainability framework for RL-based attacker agents that reveals both strategic (Markov Decision Process (MDP)-level) and tactical (policy-level) reasoning. At the MDP-level, we model cyberattacks as a Partially Observable Markov Decision Process (POMDP) to expose exploration-exploitation dynamics and phase-aware behavioural shifts. At the policy-level, we analyse the temporal evolution of Q-values and use Prioritised Experience Replay (PER) to surface critical learning transitions and evolving action preferences. Evaluated across CyberBattleSim environments of increasing complexity, our framework offers interpretable insights into agent behaviour at scale. Unlike previous explainable RL methods, which are predominantly post-hoc, domain-specific, or limited in depth, our approach is both agent- and environment-agnostic, supporting use cases such as red-team simulation, RL policy debugging, phase-aware threat modelling and anticipatory defence planning. By transforming black-box learning into actionable behavioural intelligence, our framework enables both defenders and developers to better anticipate, analyse, and respond to autonomous cyber threats.

2504.13850 2026-05-18 cs.DC cs.LG

FedOptima: Optimizing Resource Utilization in Federated Learning

Zihan Zhang, Leon Wong, Blesson Varghese

Affiliations: School of Computer Science, University of St Andrews, Jack Cole Building, St Andrews KY16 9SX, Fife, Scotland, United Kingdom; Autonomous Networking Research & Innovation Department, Rakuten Mobile, Inc., Rakuten Crimson House, 1-14-1 Tamagawa, Setagaya-ku, Tokyo 158-0094, Japan

AI summary: This paper presents FedOptima, a system that optimizes resource utilization in federated learning and targets the low utilization of server and device resources. Using asynchronous aggregation, auxiliary networks, and centralized task scheduling, it simultaneously reduces the idle time caused by task dependency and by stragglers among heterogeneous devices, improving training efficiency and model accuracy. Experiments show that FedOptima substantially increases training speed and system throughput while maintaining high accuracy.

Comments Accepted for publication in Future Generation Computer Systems

Journal ref Future Generation Computer Systems, Volume 183, October 2026, 108551

Abstract

Federated learning (FL) systems facilitate distributed machine learning across a server and multiple devices. However, FL systems have low resource utilization on servers and devices, limiting their practical use in the real world. This inefficiency primarily arises from two types of idle time: (i) task dependency between the server and devices, and (ii) stragglers among heterogeneous devices. This paper introduces FedOptima, a resource-optimized FL system designed to simultaneously minimize both types of idle time; existing systems do not eliminate or reduce both at the same time. FedOptima offloads the training of certain layers of a neural network from a device to a server using three innovations. First, devices operate independently of each other using asynchronous aggregation to eliminate straggler effects, and independently of the server by utilizing auxiliary networks to minimize idle time caused by task dependency. Second, the server performs centralized training using a task scheduler that ensures balanced contributions from all devices, improving model accuracy. Third, an efficient memory management mechanism on the server increases the scalability of the number of participating devices. Extensive experiments are conducted on multiple lab-based testbeds, evaluated on image classification and sentiment analysis tasks with CNNs and Transformers. Compared to four state-of-the-art offloading-based and asynchronous FL baselines, FedOptima (i) achieves higher or comparable accuracy, (ii) accelerates training by 1.9x to 21.8x, (iii) reduces server and device idle time by up to 93.9% and 81.8%, respectively, and (iv) increases throughput by 1.1x to 2.0x.

2501.13188 2026-05-18 cond-mat.stat-mech cs.LG nlin.AO q-bio.CB

Topological constraints on self-organisation in locally interacting systems

Francesco Sacco, Dalton A R Sakthivadivel, Michael Levin

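Affiliations: Allen Discovery Center at Tufts University, Medford, MA 02155; Department of Mathematics, CUNY Graduate Center, New York, NY 10016; Department of Biology, Tufts University; Wyss Institute for Biologically Inspired Engineering, Harvard University

AI summary: This paper studies topological constraints on self-organisation in locally interacting systems, examining necessary conditions for a system on a planar graph to form an ordered phase. By analysing how free energy scales under the formation of domain walls in three model systems (the Potts model, autoregressive models, and hierarchical networks), it shows how the combinatorics of interactions on a graph affect spontaneous ordering. The results offer a theoretical account of why multiscale biological systems can form complex patterns while rudimentary language models struggle with long sequences.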

Comments 11+3 pages, four figures, four tikzpictures. This version to appear in Philos Trans R Soc A

Journal ref Philosophical Transactions A, 384(2320), 2026

Abstract

All intelligence is collective intelligence, in the sense that it is made of parts which must align with respect to system-level goals. Understanding the dynamics which facilitate or limit navigation of problem spaces by aligned parts thus impacts many fields ranging across life sciences and engineering. To that end, consider a system on the vertices of a planar graph, with pairwise interactions prescribed by the edges of the graph. Such systems can sometimes exhibit long-range order, distinguishing one phase of macroscopic behaviour from another. In networks of interacting systems we may view spontaneous ordering as a form of self-organisation, modelling neural and basal forms of cognition. Here, we discuss necessary conditions on the topology of the graph for an ordered phase to exist, with an eye towards finding constraints on the ability of a system with local interactions to maintain an ordered target state. By studying the scaling of free energy under the formation of domain walls in three model systems -- the Potts model, autoregressive models, and hierarchical networks -- we show how the combinatorics of interactions on a graph prevent or allow spontaneous ordering. As an application we are able to analyse why multiscale systems like those prevalent in biology are capable of organising into complex patterns, whereas rudimentary language models are challenged by long sequences of outputs.

2412.12636 2026-05-18 cs.DC cs.AI cs.LG cs.PF

TrainMover: An Interruption-Resilient Runtime for ML Training

ChonLam Lao, Jiaqi Gao, Jiamin Cao, Zhipeng Zhang, Pengcheng Zhang, Jiangfei Duan, Zhilong Zheng, Yu Guan, Yichi Xu, Yong Li, Zhengping Qian, Aditya Akella, Minlan Yu, Ennan Zhai, Dennis Cai, Jingren Zhou

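Affiliations: Harvard University; Alibaba Group; UT Austin

AI summary: Large-scale machine learning training jobs are often interrupted by hardware or software failures and management events, and existing approaches such as checkpoint-restart or runtime reconfiguration tend to incur long downtimes and degraded performance. This paper presents TrainMover, a resilient LLM training runtime that uses elastic and standby machines to handle interruptions with minimal downtime and zero memory overhead. TrainMover introduces two-phase, delta-based communication group setup, communication-free sandboxed warmup, and a general standby design. Experiments show that downtime when handling interruptions at the thousand-GPU scale stays at around 20 seconds, and wasted GPU hours are projected to fall by 55% compared with the best alternative.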

Comments 14 pages body, 19 pages total

Abstract

Large-scale ML training jobs are frequently interrupted by hardware and software anomalies, failures, and management events. Existing solutions like checkpoint-restart or runtime reconfiguration suffer from long downtimes and degraded performance. We present TrainMover, a resilient LLM training runtime that leverages elastic and standby machines to handle interruptions with minimal downtime and zero memory overhead. To achieve these goals, TrainMover introduces three key techniques: two-phase, delta-based communication group setup; communication-free sandboxed warmup; and general standby design that enables failure recovery from any role. Our evaluation shows that TrainMover consistently achieves around 20 seconds of downtime when handling various interruptions at the 1024-GPU scale. TrainMover is projected to reduce wasted GPU hours by 55% compared to the best alternative, saving 1.4 million GPU-hours per week at the 64K-GPU scale.

2410.02832 2026-05-18 cs.CR cs.AI

FlipAttack: Jailbreak LLMs via Flipping

Yue Liu, Xiaoxin He, Miao Xiong, Jinlan Fu, Shumin Deng, Yingwei Ma, Jiaheng Zhang, Bryan Hooi

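Affiliations: Integrative Sciences and Engineering Programme, NUS Graduate School, National University of Singapore; Institute of Data Science (IDS), National University of Singapore; Department of Computer Science, School of Computing, National University of Singapore

AI summary: This paper proposes FlipAttack, a simple yet effective jailbreak attack against black-box LLMs. Exploiting the tendency of LLMs to read text from left to right, the method disguises a harmful prompt by adding noise to its left side and generalizes the idea to four flipping modes. Experiments show that FlipAttack is universal, stealthy, and simple, needs only one query, and reaches an attack success rate of about 98% on GPT-4o.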

Comments 43 pages, 31 figures

Abstract

This paper proposes a simple yet effective jailbreak attack named FlipAttack against black-box LLMs. First, from the autoregressive nature, we reveal that LLMs tend to understand the text from left to right and find that they struggle to comprehend the text when noise is added to the left side. Motivated by these insights, we propose to disguise the harmful prompt by constructing left-side noise merely based on the prompt itself, then generalize this idea to 4 flipping modes. Second, we verify the strong ability of LLMs to perform the text-flipping task, and then develop 4 variants to guide LLMs to denoise, understand, and execute harmful behaviors accurately. These designs keep FlipAttack universal, stealthy, and simple, allowing it to jailbreak black-box LLMs within only 1 query. Experiments on 8 LLMs demonstrate the superiority of FlipAttack. Remarkably, it achieves $\sim$98% attack success rate on GPT-4o, and $\sim$98% bypass rate against 5 guardrail models on average. The code is available on GitHub: https://github.com/yueliu1999/FlipAttack.

2407.08094 2026-05-18 stat.ML cs.LG physics.chem-ph physics.data-an

Density Estimation via Binless Multidimensional Integration

Matteo Carli, Alex Rodriguez, Alessandro Laio, Aldo Glielmo

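Affiliations: SISSA; Harvard University; University of Trieste; ICTP; Banca d’Italia

AI summary: This paper proposes Binless Multidimensional Thermodynamic Integration (BMTI), a nonparametric method for efficient and robust density estimation on high-dimensional data. The method computes log-density differences between neighbouring data points and integrates them, weighted within a maximum-likelihood formulation, to estimate the logarithm of the density. BMTI requires no binning or space partitioning; instead it builds a neighbourhood graph through adaptive bandwidth selection and, relying on the manifold hypothesis, estimates quantities on the intrinsic data manifold. This mitigates the limitations of traditional nonparametric density estimators and gives strong performance in high-dimensional spaces.

Abstract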

We introduce the Binless Multidimensional Thermodynamic Integration (BMTI) method for nonparametric, robust, and data-efficient density estimation. BMTI estimates the logarithm of the density by initially computing log-density differences between neighbouring data points. Subsequently, such differences are integrated, weighted by their associated uncertainties, using a maximum-likelihood formulation. This procedure can be seen as an extension to a multidimensional setting of the thermodynamic integration, a technique developed in statistical physics. The method leverages the manifold hypothesis, estimating quantities within the intrinsic data manifold without defining an explicit coordinate map. It does not rely on any binning or space partitioning, but rather on the construction of a neighbourhood graph based on an adaptive bandwidth selection procedure. BMTI mitigates the limitations commonly associated with traditional nonparametric density estimators, effectively reconstructing smooth profiles even in high-dimensional embedding spaces. The method is tested on a variety of complex synthetic high-dimensional datasets, where it is shown to outperform traditional estimators, and is benchmarked on realistic datasets from the chemical physics literature.

2605.16153 2026-05-18 cs.AI

An Algebraic Exposition of the Theory of Dyadic Morality

Kush R. Varshney

AI summary: This paper gives an algebraic exposition of the theory of dyadic morality, proposes three psychological operators that extend structural causal models, addresses the scalability problem arising from the dyadic limitation, and applies the framework to AI policy design, describing how moral cognition handles multi-node scenarios through node collapse and sequential processing.

Abstract

This paper provides an algebraic exposition of the theory of dyadic morality (TDM), a psychological model of moral judgment grounded in a simple two-node template: an intentional agent causing harm to a vulnerable patient. We formalize TDM using structural causal modeling (SCM) notation and identify three psychological operators (typecasting operator, completion operator, and valence-dependent inference mechanism) that extend standard SCM to capture how people compute moral judgments under constraints. We address scalability challenges arising from TDM's dyadic limitation, showing how moral cognition compresses multi-node scenarios through node collapse and sequential processing. Drawing on this algebraic framework, we demonstrate concrete applications to AI policy design: detecting conflicting obligations, structuring helpfulness policies to preserve user agency, and designing post-failure communication as causal interventions. Finally, we recommend scoped, contextual measurement of mind perception over universal averaging to operationalize the theory empirically. This algebraic formalization enables neurosymbolic AI systems to compute morality in a way that is both mathematically rigorous and faithful to human moral cognition.

2605.10100 2026-05-18 cs.CV cs.AI

HYPERPOSE: Hyperbolic Kinematic Phase-Space Attention for 3D Human Pose Estimation

Vinduja Thekkath, Ashish Musale, Ajay Waghumbare, Upasna Singh

AI summary: HYPERPOSE is a 3D human pose estimation framework that performs spatio-temporal reasoning in hyperbolic space, using Hyperbolic Kinematic Phase-Space Attention to preserve the tree structure of the human skeleton and to improve geometric fidelity and the modelling of temporal dynamics.

Abstract

We introduce HYPERPOSE, a novel 3D human pose estimation framework that performs spatio-temporal reasoning entirely within the Lorentz model of hyperbolic space $\mathbb{H}^d$ to natively preserve the hierarchical tree topology of the human skeleton. Current state-of-the-art pose estimators aim to capture complex joint dynamics by relying on transformers and graph convolutional networks. Since these architectures operate exclusively in Euclidean space which fundamentally mismatches the inherent tree structure of the human body, these methods inevitably suffer from exponential volume distortion and struggle to maintain structural coherence. To this end, we depart from flat spaces and aim to improve geometric fidelity with Hyperbolic Kinematic Phase-Space Attention (HKPSA), natively embedding complex joint relationships without distortion, alongside a multi-scale windowed hyperbolic attention mechanism that efficiently models temporal dynamics in $O(TW)$ complexity. Furthermore, to overcome the well-known instability of training non-Euclidean manifolds, HYPERPOSE introduces a novel Riemannian loss suite and an uncertainty-weighted curriculum, enforcing physical geodesic constraints like bone length and velocity consistency. Extensive evaluations on the Human3.6M and MPI-INF-3DHP datasets demonstrate that HYPERPOSE achieves state-of-the-art structural and temporal coherence, significantly reducing both volume distortion and velocity error, while establishing new state-of-the-art benchmarks in overall positional accuracy.
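
For readers unfamiliar with the Lorentz model mentioned in the abstract above, the sketch below shows its standard geodesic distance. It assumes curvature -1 and uses hypothetical helper names (`lift`, `lorentz_inner`, `lorentz_distance`); it illustrates the underlying geometry only and is not HYPERPOSE's attention mechanism or loss.

```python
import math

def lift(v):
    """Map a Euclidean vector v onto the hyperboloid: x0 = sqrt(1 + |v|^2)."""
    return [math.sqrt(1.0 + sum(c * c for c in v))] + list(v)

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_distance(x, y):
    """Geodesic distance on the hyperboloid (curvature -1)."""
    # The clamp guards against rounding pushing the acosh argument below 1.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

origin = lift([0.0, 0.0])
point = lift([math.sinh(1.0), 0.0])   # a point at hyperbolic distance 1 from the origin
d = lorentz_distance(origin, point)
```

Because volume in hyperbolic space grows exponentially with radius, tree-like structures such as a skeleton hierarchy can be embedded with little distortion, which is the motivation the abstract gives for leaving Euclidean space.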

2603.13452 2026-05-18 cs.AI cs.CY cs.LG

MESD: A Risk-Sensitive Metric for Explanation Fairness Across Intersectional Subgroups

Gideon Popoola, John Sheppard

AI summary: This paper proposes MESD, a procedural fairness metric that measures differences in explanation quality across intersectional subgroups. It combines label-aware aggregation, empirical-Bayes shrinkage, and CVaR weighting, and is embedded in the multi-objective optimization framework UEF, which jointly optimizes utility, outcome fairness, and procedural fairness.

Abstract

Fairness in machine learning is predominantly evaluated through outcome-oriented metrics, such as demographic parity, which measure whether predictions are statistically consistent across protected groups. However, these metrics cannot detect whether a model uses systematically different reasoning for different demographic groups, which violates procedural fairness principles. This problem is compounded by intersectionality, where models may appear fair on individual attributes (e.g., race) while exhibiting significant disparities for intersectional subgroups (e.g., race $\times$ gender), a phenomenon known as fairness gerrymandering. In this work, we introduce Multi-category Explanation Stability Disparity (MESD), a procedural fairness metric that quantifies disparities in explanation quality across intersectional subgroups formed by the Cartesian product of multiple protected attributes. MESD integrates three components: label-aware aggregation aligned with outcome-conditional fairness, empirical-Bayes shrinkage to stabilize estimates for small intersectional groups, and Conditional Value-at-Risk (CVaR) weighting to emphasize worst-case subgroup disparities. We integrate MESD within a multi-objective optimization framework (UEF) that jointly optimizes utility, outcome fairness, and procedural fairness using NSGA-II. We evaluate MESD and UEF on three benchmark datasets alongside four state-of-the-art methods in several experiments, and we demonstrate that MESD reveals procedural disparities invisible to outcome metrics alone. We position our contribution within procedural justice theory and discuss implications for regulatory compliance and intersectional equity.
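
Two of the components named in the abstract above, empirical-Bayes shrinkage and CVaR weighting, can be illustrated generically. The sketch below is not MESD's actual formula: the function names, the `prior_strength` pseudo-count, the tail level `alpha`, and the example disparities are all assumptions made for illustration.

```python
import math

def shrink(group_mean, n, global_mean, prior_strength=10.0):
    """Pull a small group's estimate toward the global mean (shrinkage-style weighting)."""
    w = n / (n + prior_strength)           # small n means more weight on the global mean
    return w * group_mean + (1 - w) * global_mean

def cvar(values, alpha=0.8):
    """Mean of the worst (largest) (1 - alpha) fraction of the values."""
    ordered = sorted(values, reverse=True)
    k = max(1, math.ceil((1 - alpha) * len(ordered)))
    return sum(ordered[:k]) / k

# Hypothetical per-subgroup disparities (e.g. one value per race x gender cell).
disparities = [0.02, 0.05, 0.01, 0.30, 0.04]
average = sum(disparities) / len(disparities)
tail = cvar(disparities, alpha=0.8)
```

The plain average hides the one badly served subgroup, while the CVaR summary is driven by it; shrinkage keeps a subgroup with only a handful of members from dominating the tail through estimation noise alone.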

2602.06932 2026-05-18 cs.LG

When RL Meets Adaptive Speculative Training: A Unified Training-Serving System

Junxiong Wang, Fengxiang Bie, Jisen Li, Zhongzhu Zhou, Zelei Shao, Yubo Wang, Yinghui Liu, Qingyang Wu, Avner May, Sri Yanamandra, Ce Zhang, Tri Dao, Percy Liang, Ben Athiwaratkun, Shuaiwen Leon Song, Chenfeng Xu, Xiaoxia Wu

AI summary: This paper presents Aurora, a system that learns a speculator online through reinforcement learning, addressing the deployment lag and domain drift of conventional approaches; experiments show notable speedups on several models.

Abstract

Speculative decoding can significantly accelerate LLM serving, yet most deployments today disentangle speculator training from serving, treating speculator training as a standalone offline modeling problem. We show that this decoupled formulation introduces substantial deployment and adaptation lag: (1) high time-to-serve, since a speculator must be trained offline for a considerable period before deployment; (2) delayed utility feedback, since the true end-to-end decoding speedup is only known after training and cannot be inferred reliably from acceptance rate alone due to model-architecture and system-level overheads; and (3) domain-drift degradation, as the target model is repurposed to new domains and the speculator becomes stale and less effective. To address these issues, we present Aurora, a unified training-serving system that closes the loop by continuously learning a speculator directly from live inference traces. Aurora reframes online speculator learning as an asynchronous reinforcement-learning problem: accepted tokens provide positive feedback, while rejected speculator proposals provide implicit negative feedback that we exploit to improve sample efficiency. Our design integrates an SGLang-based inference server with an asynchronous training server, enabling hot-swapped speculator updates without service interruption. Crucially, Aurora supports day-0 deployment: a speculator can be served immediately and rapidly adapted to live traffic, improving system performance while providing immediate utility feedback. Across experiments, Aurora achieves a 1.5x day-0 speedup on recently released frontier models (e.g., MiniMax M2.1 229B and Qwen3-Coder-Next 80B). Aurora also adapts effectively to distribution shifts in user traffic, delivering an additional 1.25x speedup over a well-trained but static speculator on widely used models (e.g., Qwen3 and Llama3).
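
The feedback signal described in the abstract above (accepted tokens as positive examples, rejected proposals as implicit negatives) can be sketched as follows. This is a simplified illustration assuming greedy prefix verification, where a draft token is accepted only if it equals the target model's token at that position; the function name is hypothetical and Aurora's actual acceptance rule and training objective are not given in the abstract.

```python
def split_feedback(draft_tokens, target_tokens):
    """Return (accepted, rejected) draft tokens under greedy prefix verification."""
    accepted = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            # The first mismatch ends verification; this proposal is the negative signal.
            return accepted, [draft]
        accepted.append(draft)
    return accepted, []

# Hypothetical trace: the speculator proposes three tokens, the target agrees on two.
accepted, rejected = split_feedback(["the", "cat", "sat"], ["the", "cat", "slept"])
```

Every verification step in a serving system already produces both lists as a by-product, which is why live inference traces can serve as training data without extra target-model calls.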

2511.15887 2026-05-18 cs.CL

Mind the Motions: Benchmarking Theory-of-Mind in Everyday Body Language

Seungbeen Lee, Jinhong Jeong, Donghyun Kim, Yejin Son, Youngjae Yu

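AI summary This paper proposes the Motion2Mind framework, which uses an expert-curated body-language reference to evaluate how well machines interpret nonverbal cues, and finds that existing AI systems show a significant gap in nonverbal interpretation.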

Comments The authors identified issues in the current version and would like to withdraw the manuscript for substantial revision

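Details
AI abstract (translated from Chinese)

Our ability to interpret others' mental states through nonverbal cues (NVCs) is vital to survival and social cohesion. Although existing Theory of Mind (ToM) benchmarks focus mainly on false-belief tasks and reasoning with asymmetric information, they overlook mental states other than belief and the rich landscape of human nonverbal communication. We propose Motion2Mind, a framework for evaluating the ToM ability of machines to interpret NVCs. Using an expert-curated body-language reference as a proxy knowledge base, we build Motion2Mind, a carefully curated video dataset with fine-grained nonverbal cue annotations and manually verified psychological interpretations. It covers 222 types of nonverbal cues and 397 mental states. Our evaluation finds that current AI systems struggle significantly with NVC interpretation, showing not only a large performance gap in detection but also more over-interpretation in explanation than human annotators.

English abstract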

Our ability to interpret others' mental states through nonverbal cues (NVCs) is fundamental to our survival and social cohesion. While existing Theory of Mind (ToM) benchmarks have primarily focused on false-belief tasks and reasoning with asymmetric information, they overlook other mental states beyond belief and the rich tapestry of human nonverbal communication. We present Motion2Mind, a framework for evaluating the ToM capabilities of machines in interpreting NVCs. Leveraging an expert-curated body-language reference as a proxy knowledge base, we build Motion2Mind, a carefully curated video dataset with fine-grained nonverbal cue annotations paired with manually verified psychological interpretations. It encompasses 222 types of nonverbal cues and 397 mind states. Our evaluation reveals that current AI systems struggle significantly with NVC interpretation, exhibiting not only a substantial performance gap in Detection but also patterns of over-interpretation in Explanation compared to human annotators.

2605.15708 2026-05-18 cs.CV

3D Segmentation Using Viewpoint-Dependent Spatial Relationships

Ayaka Nanri, Klara Reichard, Mert Kiray, Federico Tombari, Benjamin Busam, Asako Kanezaki

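AI summary This paper presents a 3D referring segmentation dataset with 220k samples, scalable to tens of millions of samples through dense viewpoint sampling. It studies how viewpoint-dependent spatial relations affect 3D large multimodal models and improves segmentation accuracy, raising mIoU from 0.30 to 0.47.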

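Details
AI abstract (translated from Chinese)

Recent advances in 3D datasets and multimodal models have significantly improved natural-language 3D scene understanding. However, most 3D referring segmentation methods do not explicitly represent the observer's viewpoint, which makes spatial relations such as "left," "right," "front," and "behind" ambiguous and difficult to evaluate.

English abstract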

Recent advances in 3D datasets and multimodal models have greatly improved natural language 3D scene understanding. However, most 3D referring segmentation methods do not explicitly represent the observer viewpoint, making spatial relations such as "left," "right," "front," and "behind" ambiguous and difficult to evaluate. We introduce a viewpoint-aware 3D referring segmentation dataset containing 220k benchmark samples, and scalable to tens of millions of viewpoint-conditioned samples through dense viewpoint sampling. In this dataset, target objects can only be identified through observer-centric spatial relations, making viewpoint-conditioned grounding necessary. We construct the benchmark by leveraging camera poses to automatically annotate observer-centric relations (left/right, front/behind) together with viewpoint-independent relations (above/under). Using this benchmark, we evaluate several existing 3D large multimodal models in a zero-shot setting and find that current models struggle with viewpoint-dependent spatial instructions. We further study how explicit viewpoint information can be incorporated into 3D large multimodal models. We introduce a viewpoint representation that encodes camera poses and conditions the model on the observation viewpoint, improving segmentation accuracy on viewpoint-dependent relations and increasing mIoU from 0.30 to 0.47 compared to a model without viewpoint conditioning. The dataset, code, and trained models will be made publicly available upon acceptance.
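
The abstract says that camera poses are used to annotate observer-centric relations automatically. The following Python sketch shows one way such a rule could look. It assumes the camera pose is given as a position plus unit "right" and "forward" vectors in world coordinates, and that "in front of" means closer to the observer along the viewing direction. The function name `observer_relation` and these conventions are illustrative assumptions, not the authors' annotation pipeline.

```python
from typing import Sequence, Tuple

Vec3 = Sequence[float]


def dot(a: Vec3, b: Vec3) -> float:
    return sum(x * y for x, y in zip(a, b))


def sub(a: Vec3, b: Vec3) -> list:
    return [x - y for x, y in zip(a, b)]


def observer_relation(target: Vec3, anchor: Vec3, cam_pos: Vec3,
                      cam_right: Vec3, cam_forward: Vec3) -> Tuple[str, str]:
    """Relation of `target` to `anchor` as seen from the camera (hypothetical rule).

    Ties fall to "right" and "behind" in this sketch.
    """
    # Horizontal relation: project the offset between objects onto the camera's right axis.
    lateral = dot(sub(target, anchor), cam_right)
    # Depth relation: compare distances from the camera along its forward axis.
    depth_target = dot(sub(target, cam_pos), cam_forward)
    depth_anchor = dot(sub(anchor, cam_pos), cam_forward)
    horizontal = "left" if lateral < 0 else "right"
    depth = "front" if depth_target < depth_anchor else "behind"
    return horizontal, depth
```

With the same two objects, a camera at the origin looking along +z and a camera on the opposite side looking along -z give opposite labels, which is the ambiguity the dataset is built to expose.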

2605.15680 2026-05-18 cs.CL cs.LG q-bio.QM

Few-Shot Large Language Models for Actionable Triage Categorization of Online Patient Inquiries

Liqi Zhou, Jiafu Li

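AI summary This paper studies few-shot large language models for triaging online patient inquiries. It builds separate evaluation, training, and few-shot sets and compares TF-IDF and BioBERT baselines with six LLMs under 0-shot, 4-shot, and 12-shot conditions, finding that Claude Haiku 4.5 reaches a macro-F1 of 0.475 in the 12-shot setting, above the best supervised baseline on point estimate.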

Comments 4 figures, 19 tables, 23 pages (including appendix and reference)

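Details
AI abstract (translated from Chinese)

Online patient inquiries are usually informal, incomplete, and written before professional assessment, yet they still need to be routed to an appropriate level of clinical follow-up. We define this as a four-class actionable triage task (self-care, schedule-visit, urgent-clinician-review, or emergency-referral) and examine whether prompted large language models (LLMs) can support such routing under low-resource labeling conditions. Using the public HealthCareMagic-100K corpus, we construct a 300-example human-calibrated gold evaluation set, a 700-example auto-labeled silver training set, and a 40-example few-shot pool. We compare TF-IDF and BioBERT baselines trained on silver labels against six prompted LLMs under 0-shot, 4-shot, and 12-shot conditions. We evaluate with macro-F1 and safety-aware metrics, including emergency recall, under-triage rate, and severe under-triage rate. The strongest LLM (Claude Haiku 4.5, 12-shot) reaches a macro-F1 of 0.475, exceeding the best supervised baseline (BioBERT, 0.378) on point estimate, with overlapping confidence intervals. Few-shot prompting and two-model agreement help in label-dependent ways: agreement is reliable for self-care but not for urgent-clinician-review. We conclude that LLMs can support triage prioritization and selective human review but cannot be deployed autonomously.

English abstract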

Online patient inquiries are often informal, incomplete, and written before professional assessment, yet they must still be routed to an appropriate level of clinical follow-up. We study this as a four-class actionable triage task (self-care, schedule-visit, urgent-clinician-review, or emergency-referral) and ask whether prompted large language models (LLMs) can support such routing under low-resource labeling conditions. Using the public HealthCareMagic-100K corpus, we construct a 300-example human-calibrated gold evaluation set, a 700-example auto-labeled silver training set, and a 40-example few-shot pool. We compare Term Frequency-Inverse Document Frequency (TF-IDF) and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) baselines trained on silver labels against six prompted LLMs under 0-shot, 4-shot, and 12-shot conditions. We evaluate with macro-$F_1$ alongside safety-aware metrics, including emergency recall, under-triage rate, and severe under-triage rate. The strongest LLM (Claude Haiku 4.5, 12-shot) reaches macro-$F_1$ 0.475, exceeding the best supervised baseline (BioBERT, 0.378) on point estimate, with overlapping confidence intervals. Few-shot prompting and two-model agreement help in label-dependent ways: self-care agreement is reliable, whereas urgent-clinician-review agreement is not. We conclude that LLMs can support triage prioritization and selective human review, but not autonomous deployment.
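
The metrics named in the abstract can be sketched in plain Python. The sketch assumes the four labels are encoded as integers ordered by urgency (0 = self-care, 3 = emergency-referral), that under-triage means predicting a less urgent level than the true one, and that "severe" means at least two levels too low. The abstract does not give these definitions, so the encoding, the `min_gap` parameter, and the function names are assumptions for illustration.

```python
from typing import List

# Assumed urgency order for illustration (index = severity level).
LEVELS = ["self-care", "schedule-visit", "urgent-clinician-review", "emergency-referral"]


def macro_f1(y_true: List[int], y_pred: List[int], n_classes: int = 4) -> float:
    """Unweighted mean of per-class F1 scores (a class with no support or predictions scores 0)."""
    scores = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes


def under_triage_rate(y_true: List[int], y_pred: List[int], min_gap: int = 1) -> float:
    """Share of cases predicted at least `min_gap` levels less urgent than the truth."""
    return sum(t - p >= min_gap for t, p in zip(y_true, y_pred)) / len(y_true)


def emergency_recall(y_true: List[int], y_pred: List[int], emergency: int = 3) -> float:
    """Recall on the most urgent class."""
    preds_for_emergencies = [p for t, p in zip(y_true, y_pred) if t == emergency]
    if not preds_for_emergencies:
        return 0.0
    return sum(p == emergency for p in preds_for_emergencies) / len(preds_for_emergencies)
```

Calling `under_triage_rate(y_true, y_pred, min_gap=2)` gives the severe variant under the assumption stated above.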

2605.15665 2026-05-18 cs.AI

PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI

Keshava Chaitanya, Jahnavi Gundakaram

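AI summary PRISM treats prompt engineering as a reliability engineering problem: through continuous simulation and monitoring, it improves the reliability of enterprise conversational AI, reduces prompt development time, and repairs regressions in production.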

Comments 12 pages, 1 figure, 5 tables. arXiv preprint

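Details
AI abstract (translated from Chinese)

Deploying large language model (LLM)-based conversational agents in enterprise settings requires prompts that are both correct and resilient to non-deterministic behavioral drift. Existing prompt optimization frameworks treat prompt quality as a one-time compile-time problem and do not address how to detect and repair prompt regressions caused by changes in LLM behavior over time. We propose PRISM (Prompt Reliability via Iterative Simulation and Monitoring), a closed-loop framework that treats prompt engineering as a continuous reliability engineering problem rather than a one-time authoring task. PRISM takes as input natural-language agent requirements, a set of configured tools and memory variables, and an initial draft prompt. It automatically generates test cases from the requirements, simulates full multi-turn conversations against a platform-faithful LLM environment, evaluates pass/fail using an LLM-as-judge, diagnoses the root causes of failures, and then surgically repairs the prompt, iterating until all tests pass. Crucially, PRISM is designed to run on a schedule (daily), treating LLM behavioral drift as a first-class reliability concern. We evaluate PRISM on 35 enterprise conversational agents on the Yellow.ai V3 platform over a three-week deployment. PRISM reduces median prompt development time from 2 days to under 30 minutes, achieves 99% production reliability across all evaluated agents, and identifies and repairs production regressions caused by LLM behavioral drift within 24 hours. Our results suggest that continuous, simulation-based prompt optimization is both feasible and necessary for reliable enterprise conversational AI at scale.

English abstract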

Deploying large language model (LLM)-driven conversational agents in enterprise settings requires prompts that are simultaneously correct at launch and resilient to the non-deterministic behavioral drift that characterizes production LLM deployments. Existing prompt optimization frameworks address prompt quality as a one-time compile-time problem, leaving open the equally critical question of how to detect and repair prompt regressions caused by silent LLM behavior changes over time. We present PRISM (Prompt Reliability via Iterative Simulation and Monitoring), a closed-loop framework that treats prompt engineering as a continuous reliability engineering problem rather than a one-time authorship task. PRISM takes as input plain-language agent requirements, a set of configured tools and memory variables, and an initial draft prompt. It automatically generates test cases from requirements, simulates full multi-turn conversations against a platform-faithful LLM environment, evaluates pass/fail using an LLM-as-judge, diagnoses root causes of failures, and surgically repairs the prompt -- iterating until all tests pass. Critically, PRISM is designed to run on a scheduled basis (daily), treating LLM behavioral drift as a first-class reliability concern. We evaluate PRISM across 35 enterprise conversational agents over a three-week deployment period on the Yellow.ai V3 platform. PRISM reduces median prompt authoring time from 2 days to under 30 minutes, achieves 99% production reliability across all evaluated agents, and successfully identifies and repairs production regressions caused by LLM behavioral drift within a 24-hour detection window. Our results suggest that continuous, simulation-driven prompt optimization is both tractable and necessary for reliable enterprise conversational AI at scale.

2605.15341 2026-05-18 cs.LG cs.AI

LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design

Marilyn Zhang, Tianfeng Chen, Fabián Barzuna, Ankita Rathod, Mark E. Whiting

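AI summary This paper proposes the LEAPBench framework, which uses trajectory-level evaluation to measure the learning efficiency of LLMs in iterative scientific design. It finds that conventional outcome-based evaluation can be misleading and that trajectory metrics reflect efficiency gains more accurately.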

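Details
AI abstract (translated from Chinese)

LLMs are increasingly deployed in autonomous laboratories on the assumption that domain priors and iterative feedback let them converge on good designs in fewer iterations. However, current benchmarks for iterative scientific design evaluate only outcome snapshots at fixed horizons and ignore the learning trajectory. This paper therefore examines three evaluation choices: what to measure, which baseline to compare against, and what to ground against. It introduces LEAPBench, a 55-task framework that combines a best-so-far area under the curve (AUC) trajectory metric, a classical Bayesian-optimization baseline, and an audit grounded in published literature. Applied to eight contemporary LLMs, switching from final-outcome to trajectory scoring changes the best-model decision on 53% of tasks at matched horizons and reveals efficiency gains overlooked by conventional scoring. LLMs do not outperform the classical Bayesian baseline. On 16 biology tasks where the oracle's reward signal is aligned with the published-best design configuration, domain-aware prompting leads LLM choices to match the published best about 10 percentage points less often than domain-agnostic prompting. This pattern is most pronounced on 6 tasks, and domain-agnostic prompting matches the published best more often on all 6. The trajectory metric also serves as a trainable objective: offline reinforcement learning with the trajectory metric as the reward improves performance on 14 of 21 held-out tasks.

English abstract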

LLMs are increasingly deployed in autonomous laboratories, under the assumption that their domain priors and reasoning over iterative feedback let them converge on good designs in fewer iterations than feedback-only baselines. Current iterative scientific design benchmarks, however, score only outcome snapshots at fixed horizons. This leaves the learning trajectory unmeasured, even though the trajectory is what captures learning efficiency, where each iteration saved is a real saving in cost and time. Motivated by this, we examine three evaluation choices that change the conclusions one draws about LLM learning efficiency in iterative scientific design: what to measure, what baseline to compare against, and what to ground against. We introduce LEAPBench, Learning Efficiency in Adaptive Processes, a 55-task framework that pairs a best-so-far area under the curve (AUC) trajectory metric with a classical Bayesian-optimization reference and an audit grounded in published literature. Applied to eight contemporary LLMs, switching from final-outcome to trajectory scoring changes the best-model decision on 53% of tasks at matched horizons, and exposes efficiency gains overlooked by outcome-based scoring. LLMs do not outperform a classical Bayesian baseline. On 16 biology tasks where the oracle's reward signal is aligned with configurations from the published-best design, domain-aware prompting leads to LLM choices that match the published-best design approximately 10 percentage points less often than domain-agnostic prompting at iteration 30. The pattern is sharpest on 6 tasks where the literature-typical and published-best configurations diverge, and domain-agnostic prompting matches the published-best more often on all 6. The trajectory metric also doubles as a tractable training target. Offline reinforcement learning with the metric as a reward improves performance on 14 of 21 held-out tasks.
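
A best-so-far AUC metric of the kind named in the abstract can be sketched as follows. The sketch assumes that higher rewards are better and that the area is approximated by the mean of the running maximum over the horizon. LEAPBench's exact normalization is not given in the abstract, so the function names and this averaging rule are assumptions for illustration.

```python
from typing import List


def best_so_far(rewards: List[float]) -> List[float]:
    """Running maximum of the rewards observed up to each iteration."""
    curve = []
    best = float("-inf")
    for r in rewards:
        best = max(best, r)
        curve.append(best)
    return curve


def best_so_far_auc(rewards: List[float]) -> float:
    """Mean height of the best-so-far curve (assumed normalization, higher is better)."""
    curve = best_so_far(rewards)
    return sum(curve) / len(curve)
```

Two runs that end with the same best value can score very differently: `[0.9, 0.2, 0.3, 0.1]` finds its best design at the first iteration and scores higher than `[0.1, 0.2, 0.3, 0.9]`, which reaches it only at the last. Final-outcome scoring would rate the two runs as equal.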

2605.15228 2026-05-18 cs.AI cs.LG

Verifiable Agentic Infrastructure: Proof-Derived Authorization for Sovereign AI Systems

Jun He, Deying Yu

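AI summary This paper studies how to verify authorization when autonomous agents execute operations in sovereign AI systems and proposes a Distributed Trust Framework (DTF) built on verifiable proofs. The framework derives execution authority from structured, verifiable proof objects, requires every high-stakes operation to rest on a consensus-verified proof, and binds it to an evidence chain, which makes agent behavior controllable, auditable, and traceable. The approach provides a secure, decentralized authorization infrastructure for autonomous AI systems in cloud-native environments.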

Comments 19 pages, 2 figures, 4 tables

Details
English abstract

Modern cloud and enterprise systems rely on identity-centric authorization, assuming that callers possessing valid credentials are safe to execute commands. The emergence of autonomous AI agents invalidates this assumption: agents can generate syntactically valid but semantically unsafe actions, making standing privileges a significant operational risk. This risk becomes especially acute in sovereign AI systems, where autonomous agents may interact with cloud infrastructure, regulated data, financial workflows, and national-scale digital services. Governed mutation substrates reduce this risk by interposing on agent actions: agents submit intents, infrastructure evaluates context and policy, and execution is mediated. However, this shifts the trust boundary: how can the decision to authorize an intent be made verifiable, distributed, and replayable? We introduce a Distributed Trust Framework (DTF), a verification framework for governed mutation systems that computes execution authority from structured, verifiable artifacts. DTF introduces a Justification Proof to encode the admissibility basis of an action, a consensus model for independent evaluation, an ephemeral Execution Identity derived from the approved proof, and an append-only Evidence Chain that preserves the authorization lifecycle. Under stated substrate assumptions, this architecture enforces a compact authorization invariant: no high-stakes execution without a proof object, no derived authority without consensus, and no valid mutation detached from evidence. We define the model, instantiate it over an OpenKedge-based governed mutation substrate, and show how it maps onto cloud-native environments. By shifting authorization from standing identity to proof-derived authority, DTF provides an infrastructure foundation for making agentic execution governable, auditable, and bounded in sovereign AI deployments.

2605.14978 2026-05-18 cs.CL

Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing

Jie Jiang, Xing Sun, Ruotian Chen, Jianan Su, Kaixin Shen

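AI summary This paper studies how performance-driven policy optimization can improve the efficiency of speculative decoding and proposes PPOW, a reinforcement learning framework that shifts draft-model optimization from conventional token-level imitation to window-level optimization. PPOW combines a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing, which prioritizes informative windows with high confidence-weighted draft-target divergence. Experiments report average acceptance lengths of 6.29-6.52 and speedups of 3.39-4.36x across multiple model families and benchmarks.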

Details
English abstract

Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an early mismatch truncates the accepted prefix and invalidates the rest of the speculative window. Most learning-based drafters are still optimized with token-level supervised objectives, even though speculative utility is inherently window-level and prefix-sensitive. We propose PPOW (Performance-Driven Policy Optimization with Adaptive Windowing), a reinforcement learning framework that shifts drafter optimization from token-level imitation to window-level optimization. PPOW combines a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing, which prioritizes informative windows with high confidence-weighted draft-target divergence. PPOW achieves average acceptance lengths of 6.29-6.52 and speedups of 3.39-4.36$\times$ across multiple model families and benchmarks under a unified decoding protocol. These results show that performance-driven window-level optimization is a practical approach to improving speculative decoding efficiency.

2602.18801 2026-05-18 cs.LG

SGNO: Spectral Generator Neural Operators for Stable Long Horizon PDE Rollouts

Jiayi Li, Penghao Jiang, Hira Saleem, Zhaonan Wang, Piotr Koniusz, Flora D. Salim

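AI summary This paper proposes the Spectral Generator Neural Operator (SGNO) to address error accumulation in long-horizon forecasting of time-dependent partial differential equations (PDEs). SGNO structures each step as a spectral evolution update, combining a real-valued nonpositive diagonal generator with a learned correction pathway that uses complex-valued spectral mixing, to give stable long-horizon predictions for dissipative, dispersive, transport-dominated, and nonlinear PDEs. Experiments show that SGNO outperforms strong single-step autoregressive baselines on ten mechanism-matched APEBench tasks, with the largest gains on dispersive and transport-dominated tasks and on tasks involving nonlinear closure and mode coupling.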

Details
English abstract

Autoregressive neural PDE surrogates predict future states by repeatedly applying a learned one-step operator. This is a simple and widely used method, but small one-step errors can accumulate during long rollouts. The resulting drift often appears as spectral amplitude distortion, phase misalignment, and nonlinear mode-interaction error. These effects are especially important for time-dependent PDEs with clear Fourier structure. We introduce the Spectral Generator Neural Operator (SGNO), a structured autoregressive neural operator for long-horizon PDE forecasting. SGNO organizes each learned one-step map as a structured spectral evolution update. A real-valued nonpositive diagonal generator provides a gain-controlled spectral backbone, while a learned correction pathway with complex-valued spectral mixing completes the residual evolution. This design gives the autoregressive step an evolution-like structure while retaining the flexibility needed for dissipative, dispersive, transport-dominated, and nonlinear PDEs. SGNO is designed for periodic linear and semilinear evolution PDEs with Fourier multiplier linear dynamics. Across ten mechanism-matched APEBench tasks spanning this regime, SGNO consistently outperforms strong single-step autoregressive baselines in long-horizon rollout accuracy, reducing GMean100 by a median of 74.8% relative to the strongest available non-SGNO baseline, with per-task reductions ranging from 13.6% to 92.9%. The gains are strongest on dispersive and transport-dominated tasks, as well as tasks involving nonlinear closure and mode coupling. Spectral diagnostics show lower spectral energy error and improved rollout-level phase fidelity. Ablations show that the constrained generator, the structured update, and the learned correction pathway each contribute to performance. The code is available at https://github.com/cruiseresearchgroup/SGNO.
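
The role of a real-valued nonpositive diagonal generator can be illustrated with a small Python sketch. Each Fourier mode k is multiplied by a gain exp(lambda_k * dt) with lambda_k <= 0, so no mode can grow from one step to the next. The softplus parameterization used here to enforce nonpositivity, and the function names, are assumptions for illustration. The learned correction pathway from the abstract is omitted, and this is not the SGNO implementation.

```python
import math
from typing import List


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def generator_gains(raw_params: List[float], dt: float) -> List[float]:
    """Per-mode gains exp(lambda_k * dt) with lambda_k = -softplus(raw_k) <= 0."""
    return [math.exp(-softplus(raw) * dt) for raw in raw_params]


def spectral_step(coeffs: List[complex], raw_params: List[float], dt: float) -> List[complex]:
    """Apply the diagonal generator to Fourier coefficients for one time step."""
    return [g * c for g, c in zip(generator_gains(raw_params, dt), coeffs)]
```

Because every gain is at most 1, repeated application of `spectral_step` cannot amplify any mode. That bound is the sense in which such a backbone is gain-controlled over long rollouts.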

2605.12509 2026-05-18 cs.SI cs.AI cs.CE math.CO

Representing Higher-Order Networks: A Survey of Graph-Based Frameworks

Takaaki Fujita, Florentin Smarandache

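AI summary This work surveys graph-based frameworks for representing higher-order networks, covering multiway, hierarchical, temporal, multilayer, recursive, and tensor-based interactions, and aims to provide a unified perspective for comparing models and identifying suitable tools.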

Comments 170 pages. Peer-Reviewed Book. Publisher: Neutrosophic Science International Association (NSIA) Publishing House. ISBN: 978-1-59973-881-9

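Details
AI abstract (translated from Chinese)

Many real-world phenomena are naturally modeled by graphs and networks. However, classical graph models are usually limited to pairwise interactions and may not adequately capture the richer structures found in practice. Higher-order graph formalisms extend this framework by incorporating multiway, hierarchical, temporal, multilayer, recursive, and tensor-based interactions, and so provide richer representations of complex systems. This book gives a comprehensive overview of mathematical notions that can be used to model higher-order networks. It reviews foundational concepts, extended frameworks, and newly introduced formalisms, with an emphasis on their structural principles, relationships, and modeling roles. The aim is to provide a unified perspective that helps readers compare different higher-order network models and identify suitable tools for theoretical study and practical applications. This book is Edition 2.0; it mainly adds new concepts and corrects and improves typographical errors and explanations.

English abstract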

Many real-world phenomena are naturally modeled by graphs and networks. However, classical graph models are often limited to pairwise interactions and may not adequately capture the richer structures that arise in practice. Higher-order graph formalisms extend this framework by incorporating multiway, hierarchical, temporal, multilayer, recursive, and tensor-based interactions, thereby providing more expressive representations of complex systems. This book presents a comprehensive overview of mathematical notions that can be used to model higher-order networks. It surveys foundational concepts, extensional frameworks, and newly introduced formalisms, with an emphasis on their structural principles, relationships, and modeling roles. The aim is to provide a unified perspective that helps readers compare diverse higher-order network models and identify appropriate tools for theoretical study and practical applications. This book is Edition 2.0. It mainly includes the addition of several concepts, as well as corrections and improvements of typographical errors and explanations.

2605.10867 2026-05-18 cs.CR cs.AI cs.CV cs.LG cs.NI

BEACON: A Multimodal Dataset for Learning Behavioral Fingerprints from Gameplay Data

Ishpuneet Singh, Gursmeep Kaur, Uday Pratap Singh Atwal, Guramrit Singh, Gurjot Singh, Maninder Singh

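AI summary The BEACON dataset draws on the high-precision motor skills and high cognitive load of tactical shooters to provide a rigorous stress test for the robustness of behavioral biometrics, supporting research on continuous authentication, behavioral modeling, and multimodal learning.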

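Details
AI abstract (translated from Chinese)

Continuous authentication in high-stakes digital environments requires high-quality datasets with fine-grained behavioral signals, but existing benchmarks are often limited by small scale, unimodal sensing, or a lack of synchronized environmental context. To address this, the paper introduces BEACON (Behavioral Engine for Authentication & Continuous Monitoring), a large-scale multimodal dataset that captures diverse skill tiers in competitive Valorant gameplay. BEACON contains approximately 430 GB of synchronized multimodal data (461 GB of total storage, including auxiliary Valorant configuration captures) from 79 sessions across 28 distinct players, an estimated 102.51 hours of active gameplay. The data include high-frequency mouse dynamics, keystroke events, network packet captures, screen recordings, hardware metadata, and in-game configuration context. BEACON exploits the high-precision motor skills and high cognitive load inherent in tactical shooters, which makes it a rigorous stress test for the robustness of behavioral biometrics. The dataset supports the study of continuous authentication, behavioral modeling, user drift, and multimodal representation learning in a high-fidelity esports setting. The authors release the dataset and code on Hugging Face and GitHub to create a reproducible benchmark for evaluating next-generation behavioral fingerprinting and security models.

English abstract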

Continuous authentication in high-stakes digital environments requires datasets with fine-grained behavioral signals under realistic cognitive and motor demands. But current benchmarks are often limited by small scale, unimodal sensing or lack of synchronised environmental context. To address this gap, this paper introduces BEACON (Behavioral Engine for Authentication & Continuous Monitoring), a large-scale multimodal dataset that captures diverse skill tiers in competitive Valorant gameplay. BEACON contains approximately 430 GB of synchronised modality data (461 GB total on-disk including auxiliary Valorant configuration captures) from 79 sessions across 28 distinct players, estimated at 102.51 hours of active gameplay, including high-frequency mouse dynamics, keystroke events, network packet captures, screen recordings, hardware metadata, and in-game configuration context. BEACON leverages the high precision motor skills and high cognitive load that are inherent to tactical shooters, making it a rigorous stress test for the robustness of behavioral biometrics. The dataset allows for the study of continuous authentication, behavioral profiling, user drift and multimodal representation learning in a high-fidelity esports setting. The authors release the dataset and code on Hugging Face and GitHub to create a reproducible benchmark for evaluating next-generation behavioral fingerprinting and security models.

2503.15107 2026-05-18 stat.ML cs.LG

Interpretability of Graph Neural Networks to Assess Effects of Global Change Drivers on Ecological Networks

Emre Anakok, Pierre Barbillon, Colin Fontaine, Elisa Thebault

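AI summary This study uses graph neural networks to analyze how global change drivers affect the connectivity of pollination networks, examines interactions between environmental variables and plant genera, and tests whether debiasing techniques affect the estimated effects.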

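Details
AI abstract (translated from Chinese)

Pollinators play a key role in plant reproduction, in both natural ecosystems and human-modified landscapes. Global change drivers, such as climate change or land-use modification, can alter plant-pollinator interactions. Assessing the effect of global change drivers on pollination requires large-scale interaction, climate, and land-use data. Although recent machine learning methods such as graph neural networks (GNNs) make it possible to analyze such datasets, interpreting their results is challenging. We explore existing GNN interpretation methods to highlight the effects of various environmental covariates on pollination network connectivity. An extensive simulation study is carried out to confirm whether these methods can detect the interaction between a covariate and a plant genus, and whether applying debiasing techniques affects the estimation of these effects. An application to the Spipoll dataset, with and without accounting for sampling effects, highlights the potential impact of land use on network connectivity and shows that accounting for sampling effects partially changes the estimated effects.

English abstract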

Pollinators play a crucial role in plant reproduction, in both natural ecosystems and human-modified landscapes. Global change drivers, including climate change and land-use modification, can alter plant-pollinator interactions. To assess the potential influence of global change drivers on pollination, large-scale interaction, climate, and land-use data are required. While recent machine learning methods, such as graph neural networks (GNNs), allow the analysis of such datasets, interpreting their results can be challenging. We explore existing methods for interpreting GNNs in order to highlight the effects of various environmental covariates on pollination network connectivity. An extensive simulation study is performed to confirm whether these methods can detect the interactive effect between a covariate and a genus of plant on connectivity, and whether the application of debiasing techniques influences the estimation of these effects. An application to the Spipoll dataset, with and without accounting for sampling effects, highlights the potential impact of land use on network connectivity and shows that accounting for sampling effects partially alters the estimation of these effects.

2605.15620 2026-05-18 stat.ML cs.LG

Pessimistic Risk-Aware Policy Learning in Contextual Bandits

Yilong Wan, Yuqiang Li, Xianyi Wu

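AI summary This paper proposes a unified framework for optimizing Lipschitz-continuous risk functionals, covering mean-variance, entropic risk, and others. Using novel empirical concentration inequalities, it derives data-dependent suboptimality bounds without strong overlap assumptions and attains the minimax-optimal rate.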

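Details
AI abstract (translated from Chinese)

We study risk-aware offline policy learning, which aims to learn from logged data a decision rule that is optimal under general risk criteria. This matters in high-stakes domains where online interaction is infeasible and adverse outcomes must be strictly controlled. Existing work on offline contextual bandits either focuses on expected-reward criteria or is limited to policy evaluation rather than optimization. This paper proposes a unified distributional framework for optimizing Lipschitz-continuous risk functionals, covering mean-variance, entropic risk, conditional value-at-risk, and others. By developing novel empirical concentration inequalities for importance-sampling-based distributional estimators, the analysis derives data-dependent suboptimality bounds without strong overlap assumptions. The rate is minimax optimal and matches that of risk-neutral offline policy optimization, indicating that optimizing general Lipschitz risk criteria incurs no additional statistical cost.

English abstract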

We study risk-aware offline policy learning, aiming to learn a decision rule from logged data that is optimal under general risk criteria. This problem is crucial in high-stakes domains where online interaction is infeasible and adverse outcomes must be carefully controlled. However, existing literature on offline contextual bandits either centers on expected-reward criteria or restricts risk considerations to policy evaluation instead of optimization. In this work, we propose a unified distributional framework for optimizing Lipschitz-continuous risk functionals, a broad class of risk measures encompassing mean-variance, entropic risk, and conditional value-at-risk, among others. By developing novel empirical concentration inequalities for importance sampling-based distributional estimators, our analysis derives data-dependent suboptimality bounds with an $\tilde{\mathcal{O}}(1/\sqrt{n})$ rate, without relying on restrictive uniform overlap assumptions. This rate is minimax optimal and matches that of risk-neutral offline policy optimization, indicating that optimizing general Lipschitz risk criteria incurs no additional statistical cost relative to the expected-reward.
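
The three risk measures named in the abstract can be sketched on an importance-weighted empirical reward distribution. The sketch assumes rewards where higher is better, self-normalized importance weights, the reward-side conventions mean minus lambda times variance, -(1/beta) log E[exp(-beta R)], and the mean of the worst alpha fraction of probability mass for CVaR. The function names and these conventions are assumptions for illustration. The paper's estimators and its pessimistic adjustment are not reproduced here.

```python
import math
from typing import List, Sequence


def normalize(weights: Sequence[float]) -> List[float]:
    """Self-normalized importance weights, used as probabilities on the logged rewards."""
    total = sum(weights)
    return [w / total for w in weights]


def mean_variance(rewards: Sequence[float], probs: Sequence[float], lam: float) -> float:
    mean = sum(p * r for r, p in zip(rewards, probs))
    var = sum(p * (r - mean) ** 2 for r, p in zip(rewards, probs))
    return mean - lam * var


def entropic_risk(rewards: Sequence[float], probs: Sequence[float], beta: float) -> float:
    return -math.log(sum(p * math.exp(-beta * r) for r, p in zip(rewards, probs))) / beta


def cvar_lower(rewards: Sequence[float], probs: Sequence[float], alpha: float) -> float:
    """Mean reward over the worst `alpha` fraction of probability mass."""
    remaining = alpha
    total = 0.0
    for r, p in sorted(zip(rewards, probs)):  # lowest rewards first
        take = min(p, remaining)
        total += take * r
        remaining -= take
        if remaining <= 0:
            break
    return total / alpha
```

With uniform weights on rewards `[1, 2, 3, 4]`, the worst half of the mass sits on 1 and 2, so `cvar_lower(..., alpha=0.5)` averages those two values. Changing the importance weights shifts the mass and therefore the risk value, which is why the quality of the distributional estimate matters for the learned policy.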

2605.15312 2026-05-18 cs.CY cs.CV

Beyond Performance Disparities: A Three-Level Audit of Representational Harm in CelebA

Sieun Park, Yuanmo He

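AI summary Through a three-level audit, this paper shows how gendered standards of age and beauty in the CelebA dataset are reproduced in data and models, and identifies representational harms in which women are subjected to hyper-scrutiny while older men are excluded.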

Comments 15 pages, 8 figures

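Details
AI abstract (translated from Chinese)

Large-scale facial datasets such as CelebA are widely used in computer vision, but the cultural biases in their labels remain overlooked. Fairness research distinguishes representational from allocational harms, yet audits of computer vision datasets have mostly examined categorical labels without exploring how such harms appear in learned features and model attention. This paper analyzes CelebA at three levels (dataset structure, learned feature weights, and spatial attention), focusing on how gendered standards of ageing and beauty are encoded in the data and reproduced in model behavior. First, hierarchical clustering of 202,599 images shows that the 39 attributes organize into latent trait bundles consistent with cultural archetypes: performative femininity (youth, makeup, adornment) and professional masculinity (ageing, facial hair, formal attire). Although women are more often rated attractive overall, they incur steep penalties when assigned to ageing or masculine-coded clusters. Second, XGBoost combined with SHAP analysis reveals gender-specific effects, such as adiposity reducing attractiveness only for women. Third, Grad-CAM finds that predictions for female and younger male subgroups concentrate on mid-face cues, whereas predictions for older men shift toward peripheral cues such as hair and clothing. Older men attain the highest accuracy but the lowest average precision, indicating that they are excluded by the dataset's evaluative templates. Cultural double standards thus pass from media representation into dataset labels, feature weights, and model attention, producing two representational harms: hyper-scrutiny of women under a narrow evaluative template, and the complete exclusion of older men. Fairness metrics focused on performance disparities mask both harms, underscoring the need to address representational harm in fairness research.

English abstract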

Large-scale facial datasets like CelebA are widely used in computer vision, yet the cultural biases embedded in their labels remain underexplored. Fairness research has distinguished representational from allocational harms, but audits of computer vision datasets have mostly examined categorical labels, leaving open how such harms appear in learned features and model attention. This paper examines CelebA at three levels: dataset structure, learned feature weights, and spatial attention, focusing on how gendered double standards of ageing and beauty are encoded in the data and reproduced in model behaviour. First, hierarchical clustering of 202,599 images shows that the 39 attributes organise into latent trait bundles aligned with cultural archetypes: performative femininity (youth, makeup, adornment) and professional masculinity (ageing, facial hair, formal attire). Female faces, though more often rated attractive overall, incur steep penalties when assigned to ageing or masculine-coded clusters. Second, XGBoost with SHAP analysis reveals gender-specific effects, such as adiposity reducing attractiveness only for females. Third, Grad-CAM finds that predictions for female and younger male subgroups concentrate on mid-face cues, whereas predictions for older males drift toward peripheral cues such as hair and clothing. Older males attain the highest accuracy but the lowest average precision, indicating categorical exclusion of groups outside the dataset's evaluative templates. Cultural double standards thus pass from media representation into dataset labels, feature weights, and model attention, producing two representational harms: hyper-scrutiny of women under a narrow evaluative template, and exclusion of older men from the scheme entirely. Fairness metrics focused on performance disparities mask both, underscoring the need to address representational harm in fairness research.

2605.15225 2026-05-18 q-bio.QM cs.AI

Do Biological Structural Guarantees Earn Their Complexity?

Bogdan Banu

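AI summary This paper asks whether biological structural guarantees earn their complexity. It builds three deep benchmarks that compare biologically grounded implementations (metabolic priority gating, autoinducer-based quorum sensing, and Bayesian stagnation detection) against naive non-biological alternatives and ablated controls over thousands of trials, to assess the practical reliability benefits and costs of biological structure.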

Details
English abstract

Biologically-inspired AI agent frameworks claim reliability benefits through structural guarantees adapted from gene regulatory networks, immune systems, and metabolic control. These claims are rarely tested empirically against simpler alternatives. We present three deep benchmarks: metabolic priority gating, autoinducer-based quorum sensing, and Bayesian stagnation detection, each comparing a biologically-grounded implementation against a naive non-biological alternative and an ablated control, across 1,000 trials per seed and 10 seeds (10M+ data points total).